
#887 HTML parsing error with Р (ER) character

Milestone: v1.0_(example)
Status: closed
Priority: 1
Updated: 2014-02-14
Created: 2014-02-05
Private: No

Hi,
I encountered an issue when parsing HTML with writeHTML.

Example to reproduce:

require_once 'tcpdf/tcpdf.php';

$pdf = new TCPDF(PDF_PAGE_ORIENTATION, PDF_UNIT, PDF_PAGE_FORMAT, true, 'UTF-8', false);

$pdf->AddPage();
$pdf->SetFont('freeserif', '', 12);

$content = '<div>Test: Р 123</div>';

$pdf->writeHTML($content);
$pdf->Write(0, $content);

$pdf->Output('test.pdf', 'F');

This works fine and outputs 2 lines (HTML and plain) to the PDF.

But if I change $content to:

$content = '<div>Test: Р</div>';

then TCPDF outputs only the second (plain text) line.

The problem appears to be the character "Р" (Unicode: U+0420, UTF-8: D0A0, CYRILLIC CAPITAL LETTER ER) when it is the last character before the closing </div>.
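
The code point and byte values quoted above can be checked with a short script. This is only a sketch for verifying the encoding, not part of the original report; it assumes PHP 7.2 or later with the mbstring extension, and the variable names are illustrative. The second byte, 0xA0, is also the non-breaking space in ISO-8859-1. Whether byte-oriented whitespace handling is involved here is only a guess and is not confirmed anywhere in this ticket.

```php
<?php
// U+0420 CYRILLIC CAPITAL LETTER ER, written as an escape so the result
// does not depend on the encoding this file is saved in.
$er = "\u{0420}";

$utf8Hex       = strtoupper(bin2hex($er));                  // raw UTF-8 bytes as hex
$codePoint     = mb_ord($er, 'UTF-8');                      // numeric code point
$decimalEntity = '&#' . $codePoint . ';';                   // decimal character reference
$hexEntity     = '&#x' . strtoupper(dechex($codePoint)) . ';'; // hexadecimal character reference

printf("UTF-8 bytes:    %s\n", $utf8Hex);       // D0A0
printf("Code point:     %d\n", $codePoint);     // 1056
printf("Decimal entity: %s\n", $decimalEntity); // &#1056;
printf("Hex entity:     %s\n", $hexEntity);     // &#x420;
```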

Discussion

  • Nicola Asuni

    Nicola Asuni - 2014-02-05

    The following code seems to work fine with the freesans font:

    $content = '<div>Test: Ƥ 123</div>';
    $content = '<div>Test: Ƥ</div>';

    However, the freeserif font seems to have some problems rendering in Firefox when subsetting is turned on, so you can turn it off for now:

    $pdf->SetFont('freeserif', '', 12, '', false);

    I'll investigate if there is a font issue or not.

     
  • Nicola Asuni

    Nicola Asuni - 2014-02-05
    • status: open --> closed
     
  • Alexander Vasilyev

    Unfortunately it does not help.

    Check the code below:

    require_once 'tcpdf/tcpdf.php';

    $pdf = new TCPDF(PDF_PAGE_ORIENTATION, PDF_UNIT, PDF_PAGE_FORMAT, true, 'UTF-8', false);
    $pdf->setFontSubsetting(false);

    $pdf->AddPage();

    // Default font
    $pdf->Write(0, 'Test 1: Р');
    $pdf->writeHTML('Test 2: Р');
    $pdf->Write(0, '<div>Test 3: Р</div>');
    $pdf->writeHTML('<div>Test 4: Р</div>'); // No output

    $pdf->SetFont('FreeSerif', '', 12, '', false);
    $pdf->Write(0, 'Test 5: Р');
    $pdf->writeHTML('Test 6: Р');
    $pdf->Write(0, '<div>Test 7: Р</div>');
    $pdf->writeHTML('<div>Test 8: Р</div>'); // No output

    $pdf->SetFont('DejaVuSans', '', 12, '', false);
    $pdf->Write(0, '<div>Test 9: Р</div>');
    $pdf->writeHTML('<div>Test 10: Р</div>'); // No output
    $pdf->writeHTML('<div>Test 11: М</div>'); // Outputs "Test 11: М"

    $pdf->Output('test.pdf', 'F');

    Screenshot from Adobe Reader 11 (TCPDF v.6.0.058) attached. No Firefox used.

     
  • Nicola Asuni

    Nicola Asuni - 2014-02-06

    I am unable to reproduce your problem.
    Maybe your letter is not properly encoded?
    Try to use the html entity instead for the Ƥ letter:
    &#x01A4;

     
  • Alexander Vasilyev

    According to http://htmlentities.net, the letter "Р" is &#1056;, not Ƥ. It is an ordinary Cyrillic capital letter "Р" (pronounced "ER"). Other Cyrillic letters placed before </div> are output normally, but "Р" fails.

    $pdf->writeHTML('<div>Test A: &#1056;</div>'); // Outputs: Test A: Р
    $pdf->writeHTML('<div>Test B: Р</div>'); // No output

    Very sad.

     
  • Alexander Vasilyev

    "Try to use the html entity instead for the Ƥ letter:"

    Did you try it yourself?

    $pdf->writeHTML('<div>Test A: &#x01A4;</div>');

    Outputs: Test A: &#x01A4;

    HTML entity is not Unicode.

    See http://www.fileformat.info/info/unicode/char/420/index.htm — Section HTML Entity (decimal) = &#1056;

    Then:

    $pdf->writeHTML('<div>Test B: &#1056;</div>');

    Outputs: Test B: Р (the desired result)

    But:

    $pdf->writeHTML('<div>Test B: Р</div>');

    Outputs nothing.

    See the workaround below:

    class MyTCPDF extends TCPDF
    {
        public function writeHTML ($html, $ln = true, $fill = false, $reseth = false, $cell = false, $align = '')
        {
            // Bug hack
            // @see: https://sourceforge.net/p/tcpdf/bugs/887/
            $html = str_replace('Р</div>', '&#1056;</div>', $html); // Replace "Р" (CYRILLIC CAPITAL LETTER ER) for HTML Entity (decimal)
            parent::writeHTML($html, $ln, $fill, $reseth, $cell, $align);
        }
    }

    Maybe it will save somebody some time.

     

    Last edit: Alexander Vasilyev 2014-02-14
  • Alexander Vasilyev

    I just corrected the replacement code to support any closing tag:

    $html = str_replace('Р</', '&#1056;</', $html); // Replace "Р" (CYRILLIC CAPITAL LETTER ER) for HTML Entity (decimal)
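
A more general variant of the same idea is sketched below. Instead of replacing only "Р" before a closing tag, it converts every non-ASCII character to a decimal numeric entity before the HTML is handed to writeHTML. This is an illustration, not a fix from the TCPDF author: the helper name encodeNonAsciiAsEntities is hypothetical, and the sketch assumes the input is valid UTF-8 and that the mbstring extension is available.

```php
<?php
/**
 * Hypothetical helper (not part of TCPDF): replaces every character above
 * U+007F with its decimal numeric entity. ASCII characters, including
 * markup and any entities already present, are left unchanged.
 */
function encodeNonAsciiAsEntities(string $html): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $convmap, 'UTF-8');
}

// "\u{0420}" is CYRILLIC CAPITAL LETTER ER.
echo encodeNonAsciiAsEntities("<div>Test B: \u{0420}</div>"), "\n";
// prints: <div>Test B: &#1056;</div>
```

In the MyTCPDF subclass above, the str_replace line could call this helper instead. Since the decimal entity form was reported earlier in this thread to render correctly, this may sidestep the problem for other characters as well, at the cost of a longer HTML string.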

     
  • Nicola Asuni

    Nicola Asuni - 2014-02-14

    Please do NOT reply on a closed ticket; use the Help forum instead.
    Your document seems to be badly encoded, since the HTML entities are accepted.
    &#1056; and &#x01A4; are equivalent since one is in decimal and the other in hexadecimal representation.
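
The two references discussed in this thread can be compared numerically. The sketch below is only an arithmetic check with illustrative variable names. Decimal 1056 is hexadecimal 420, whereas hexadecimal 01A4 is decimal 420, so the two references name different code points: U+0420 (Р) and U+01A4 (Ƥ). The hexadecimal form of &#1056; would be &#x420;.

```php
<?php
$fromDecimal = 1056;            // value inside &#1056;
$fromHex     = hexdec('01A4');  // value inside &#x01A4;

printf("&#1056;  -> %d = U+%04X\n", $fromDecimal, $fromDecimal); // 1056 = U+0420
printf("&#x01A4; -> %d = U+%04X\n", $fromHex, $fromHex);         // 420 = U+01A4

// Decode both references to see which characters they stand for.
$decodedDecimal = html_entity_decode('&#1056;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
$decodedHex     = html_entity_decode('&#x01A4;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($fromDecimal === $fromHex);       // bool(false)
var_dump($decodedDecimal === "\u{0420}");  // bool(true): CYRILLIC CAPITAL LETTER ER
var_dump($decodedHex === "\u{01A4}");      // bool(true): LATIN CAPITAL LETTER P WITH HOOK
```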

     
