#115 Tidy.NET: funky characters appending to the text

closed-out-of-date
5
2015-01-26
2007-08-31
Perry Lee
No

Below are my code and the config file. For some reason, Tidy appends funky characters to my text after fixing it.
The weirdest thing is that the funky characters disappear if I add or remove a single character in the text. I'd appreciate it if anyone can help.

Thanks in advance.

Perry

/* Config File */
clean: yes
bare: yes
input-encoding: utf8
output-encoding: utf8
word-2000: yes
tidy-mark: no
DocType: omit
output-html: yes
show-body-only: yes

/* Source Code */
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using Tidy;

namespace Tidy.net
{
    class Program
    {
        static void Main(string[] args)
        {
string badHTML = "<span>GENERAL SUMMARY Working under limited direction and within general practices, provides technical expertise by independently determining and developing approaches to solutions for a wide range of complex software engineering problems. Understands company goals, practices and product strategies and applies them when resolving a variety of problems. Uses judgement and creativity and sound technical knowledge to obtain and recommend solutions. Assignments may include new products as well as upgrades and enhancements, or fixes to existing products. PRINCIPAL DUTIES AND RESPONSIBILITIES Develops new software engineering methods or processes, re-evaluate existing processes; designs simulation and test criteria and verifies functionality and performance. Works on the overall design and development of new ideas and products, and develops project plans. Represents the organization on project teams and may perform technical project leadership roles; contributes to the development and achievement of organizational goals and objectives. Duties may include research, evaluation, development and application of new process and methods into products. Sphere of influence is likely to extend outside of work group or department. Work may encompass one or more areas of engineering including mechanical systems, equipment and packages, electronic design, production techniques, product definition and planning, or other related fields. May be responsible for establishing and conducting testing routines, developing or executing project plans, and contributing to budgets and schedules. Provides documentation of work and results; reviews projects against goals and provides status reports. Understands and adheres to cost targets established during the program design phase. SKILLS General knowledge and application of engineering concepts. Problem solving skills. Ability to work independently. Communication skills. Problem solving skills. Ability to multi-task. 
Ability to work in a team environmant.";
            string fixedHTML = "";
            fixedHTML = tidy(badHTML);

        }

        static public string tidy(string HTMLtext)
        {
            string well_formed_HTML_text = "";
            try
            {
                FileInfo optFile = new FileInfo("foo.tidy");

                string errFile = @"foo.errors.txt";
                int status = 0;

                Document tdoc = new Document();

                if (status >= 0)
                {
                    status = tdoc.LoadConfig(optFile.FullName);
                }

                if (status >= 0)
                {
                    status = tdoc.SetErrorFile(errFile);
                }

                if (status >= 0)
                {
                    status = tdoc.ParseString(HTMLtext);
                }

                if (status >= 0)
                {
                    status = tdoc.CleanAndRepair();
                }

                if (status >= 0)
                {
                    status = tdoc.RunDiagnostics();
                }

                if (status > 1)
                {
                    tdoc.SetOptBool(TidyOptionId.TidyForceOutput, 1);
                }

                if (status >= 0)
                {
                    well_formed_HTML_text = tdoc.SaveString();
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }

            return well_formed_HTML_text;
        }
    }
}

Discussion

  • Perry Lee

    Perry Lee - 2007-08-31
     
  • Arnaud Desitter

    Arnaud Desitter - 2007-09-05
    • assigned_to: nobody --> creitzel
     
  • Arnaud Desitter

    Arnaud Desitter - 2007-09-05

    Logged In: YES
    user_id=566665
    Originator: NO

    Please provide a C version of your test case if you want some help. Otherwise report this problem to tidy.net
    (http://users.rcn.com/creitzel/tidy.html). See tidy/include/tidy.h for a C example.

     
  • Charles Reitzel

    Charles Reitzel - 2007-09-05
    • status: open --> open-out-of-date
     
  • Charles Reitzel

    Charles Reitzel - 2007-09-05

    Logged In: YES
    user_id=101393
    Originator: NO

    If you are using my Tidy.NET wrapper, be aware that it is a wrapper around the TidyATL COM library (which is, finally, a wrapper around TidyLib). Bottom line, available TidyATL binaries are linked statically with a very old version of TidyLib. There are some TidyATL bugs and (long fixed, thanks Arnaud) TidyLib bugs that interact. Sometimes just adding blanks to the input markup will fix the problem.

    Better, download TidyATL source from the tidywrap SF project. You need to use anonymous CVS to get the sources, but it is all there. http://sourceforge.net/projects/tidywrap

     
  • Charles Reitzel

    Charles Reitzel - 2007-09-05
    • summary: funky characters appending to the text --> Tidy.NET: funky characters appending to the text
     
  • Nobody/Anonymous

    Logged In: NO

    Thank you guys for the tips. Apparently, the wrapper I am using is an old one, version 1.0.0.0. I will try the one provided by creitzel.

     
  • Arnaud Desitter

    Arnaud Desitter - 2007-09-12
    • status: open-out-of-date --> pending-out-of-date
     
  • Geoff

    Geoff - 2007-09-17

    Logged In: YES
    user_id=1408861
    Originator: NO

    Monday, September 17, 2007.

    Hi Charles,

    Thank you for the pointer to tidywrap. I CVS downloaded this, and had some _FUN_ re-doing the TidyATL.DLL in MSVC8, with the latest libtidy CVS ...

    After some trying false starts, I got it working with both C# v.8 and VB v.8, except that the 'callback' failed. I ran out of time to explore this further, since I could not get the JIT debugger working ... but will try to get back to it ...

    If you have time, you can read all the GORY details on :-

    http://geoffair.net/tidy/tidy_06.htm

    I was not able to duplicate this 'funky character' thing, so maybe the many subsequent source fixes, thanks to Arnaud's tireless work, removed this problem ;=))

    I have provided some downloads, and a diff file on the above tidy_06.htm page, but give me a shout if you want anything more to update your sourceforge source ...

    I am going to try to find the time to also do the Perl package ... In some previous Perl scripts I had run Tidy 'externally', but it would be nice to have it like 'built-in' ...

    Regards,

    Geoff.

    EOF - Wrap-01.doc

     
  • Arnaud Desitter

    Arnaud Desitter - 2007-09-17

    Logged In: YES
    user_id=566665
    Originator: NO

    Geoff & Charlie,

    Looking at http://geoffair.net/tidy/tidy_06.htm, I see:
    TidyBuffer buf = {0};
    This has not been strictly correct since January 2007: tidyBufInit or one of its friends should be called explicitly, although I added some code to try to detect when that was not done properly.
    So TidyATL should be updated accordingly.

    Last, __stdcall is the default for tidylib on 32-bit Windows, although TIDY_CALL can be used to override it.

    Regards,

     
  • Geoff

    Geoff - 2007-09-19

    Logged In: YES
    user_id=1408861
    Originator: NO

    Wednesday, September 19, 2007.

    RE: http://tidy.sf.net/issue/1786061

    Hi Arnaud and Charlie,

    Thank you both for your inputs ...

    Arnaud, I have now added tidyBufInit() - see -
    http://geoffair.net/tidy/tidy_06.htm#fixbuf
    I had noted this before, but had also noted that tidyBufAttach() had 'protective' coding if this was not done ...

    And I have subsequently found that I only needed /Gz (__stdcall) on the TidyATL code, but in the solution files have left it on both since it does no harm to have the whole library using __stdcall, not only those functions declared with TIDY_CALL ...

    Charlie, the callback problem was just a 'name' problem - see -
    http://geoffair.net/tidy/tidy_06.htm#callback
    Simple, when you find it ...

    And the 'funky' characters re-appear using your 2003 TidyATL.DLL, and are 'gone' with my 2007 TidyATL.DLL - see -
    http://geoffair.net/tidy/tidy_06.htm#proof

    On debugging, I guess I was just voicing that I do not find the VB nor C# debugging quite as 'powerful' as the C++, but quite frequently now, it is not possible to 'trace' into DLL code anyway, even when the DLL is a Debug configuration ...

    I have also re-done several other parts of the tidy_06.htm page, and re-did the downloads yesterday ...

    It ALL appears to now be working fine. I have moved onto the Perl package, but have already run into many warnings and errors, so this may take a bit of time ;=))

    Thanks again for your respective feedback ...

    Regards,

    Geoff.

    EOF - Wrap-03.doc

     
  • SourceForge Robot

    • status: pending-out-of-date --> closed-out-of-date
     
  • SourceForge Robot

    Logged In: YES
    user_id=1312539
    Originator: NO

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 30 days (the time period specified by
    the administrator of this Tracker).

     
  • Geoff

    Geoff - 2007-12-01

    Logged In: YES
    user_id=1408861
    Originator: NO

    Saturday, December 01, 2007.

    Hi Perry,

    Thanks for your email ...

    I should have bumped the assembly version number, and not left it at 1.0.0.0, but I was not really changing anything - just recompiling it against a more recent library source - maybe next time ...

    I don't think you missed anything. I can confirm, using my compiled assembly, and using the configuration, in a foo.tidy file with just :-

    show-body-only: yes

    the 'funky' characters still get appended in the output file, templog2.txt, when using my testTidyCOMcs.EXE ...

    The problem is that these additional 'funky' characters do NOT seem to appear when using a WIN32 Tidy.exe compiled from the CVS source, version -
    HTML Tidy for Windows released on 6 November 2007

    Maybe you could do some additional experimentation with a Tidy.exe to try to find a scenario where they appear using the library directly like this ... If that could be done, then it would be easier to track it down ...

    Or maybe there is an error, after the string is parsed, in the -
    STDMETHODIMP CTidyDocument::SaveString(BSTR *putHere)
    service in TidyDocument.cpp, but it certainly 'looks' fine ...

    A Tidy buffer is allocated and initialized:
    TidyBuffer outbuf;
    tidyBufInit( &outbuf );

    Regardless of what 'encoding' was in the configuration, the current encoding is saved, and the output encoding is set to UTF16LE with -
    ctmbstr saveEnc = tidyOptGetCurrPick( _tdoc, TidyCharEncoding );
    tidySetCharEncoding( _tdoc, _T("UTF16LE") );

    Then the Tidy library is used to fill that buffer, from the document tree, according to the configuration parameters -
    int status = tidySaveBuffer( _tdoc, &outbuf );

    Then, if there is no error, that is status >= 0, the contents of outbuf are copied into a newly allocated string that is returned through the putHere pointer -
    *putHere = ::SysAllocStringLen( (BSTR) outbuf.bp, outbuf.size / sizeof(OLECHAR) );

    The outbuf is discarded, and the encoding returned to what it was -
    tidyBufFree( &outbuf );
    tidySetCharEncoding( _tdoc, saveEnc );

    Since these 'funky' characters appear APPENDED I thought it may be something to do with the outbuf.size, since the MSDN HELP on this function clearly states :-

    "BSTR SysAllocStringLen(const OLECHAR *pch, unsigned int cch );
    Allocates a new string, copies cch characters from the passed string into it, and then appends a null character."
    And in the Comments -
    "The pch string can contain embedded null characters and does not need to end with a NULL.".

    Everything works like a charm, in that the main block of text is returned correctly. Just that there is an additional line appended with hex values of 27 C8 81 E1 80 80 0D 0A. Where does this come from???

    I will find the time soon to get back to check this again. Now that it has been isolated down to the single configuration item of 'show-body-only', I may be able to track down a problem in the Tidy library, or in the interface ... but the fact that it is only when 'show-body-only' is used _does_ suggest the library ...

    The problem with all this ATL/COM stuff is that it is not possible to exactly trace into the functions to observe what happens ... As stated, if it is in Tidy library, and you can find another scenario where it happens, then the fix should be quick ... otherwise ...

    I have posted this rather than replying directly, since then others will see it and maybe have a 'bright' idea or 2, and to hopefully open back up the tracker item ;=))

    Hope to be back soon ...

    Regards,

    Geoff.

    Perry said:

    From: Nobody (nobody@sc8-sf-web21.sourceforge.net) on behalf of Perry Lee (perryism@users.sourceforge.net)
    Sent: Wed 28/11/07 19:07
    To: geoffmc@users.sourceforge.net
    Cc: perryism@users.sourceforge.net
    RE: Thank you for fixing #1786061

    Hi Geoffmc,

    I posted a tracker regarding "funky characters".
    http://tidy.sf.net/issue/1786061
    Thank you very much for your efforts to fix it. I
    appreciate it very much. However, I am still producing the
    same error. I hope that you might be able to help me. What
    I did was to download the project from
    http://geoffair.net/tidy/zips/testTidyCOMcs.zip
    I ran it and the output was good. Then I realized the tidy
    configuration file(foo.tidy) wasn't there and I added it
    there in the /bin/debug as follows:

    clean: yes
    bare: yes
    input-encoding: utf8
    output-encoding: utf8
    word-2000: yes
    tidy-mark: no
    DocType: omit
    output-html: yes
    show-body-only: yes

    and the "funky characters" happened again. Did I miss
    something? I assumed I was using your dll. I couldn't
    really tell because assembly version of both the old one and
    yours are 1.0.0.0. Or I've been using the old one. I'm not
    quite sure, but the date the assembly created is in September.

    Many Thanks.

    Perry

    EOF - Wrap-08.doc

     
