Hi Lee,
I have logged a lot of hours the last couple days on Tidy, so this will be
my last response on this issue for a while. I would _much_ rather discuss
details of pluggable transcoding. Once that is working, I think we will
all have a much more concrete understanding of the impact - or not - of
Björn's proposed code.
Specific responses below.
take it easy,
Charlie
At 10:55 AM 2/4/2003 -0700, Lee Passey wrote:
>Charles Reitzel wrote:
>
>>Predictably, I will just repeat all my previous comments about accessing
>>the raw buffer. Although you have done a good job encapsulating the
>>buffer details in the public API, the encoding issues remain. I.e. the
>>results will be downright weird with the Asian or raw encodings.
>>
>>I think you can get the same results in a safe and sane way by setting
>>the output encoding to UTF-8 (perhaps temporarily) and calling
>>tidyNodeGetText(). Note, even at parse time, the whitespace has already
>>been collapsed.
>
>
>I think you may be missing the point, Charlie. tidyNodeGetText() is
>misnamed; it should be tidyNodePrintText(), because what it really does is
>pretty print the tree, starting with the referenced node, to the current
>output sink. Now it may be possible to retrieve the text using this
>mechanism, but it would require creating a memory output sink and
>attaching that sink to the document. The output from this method will not
>necessarily be the text contents of the node, but its "prettified"
>representation.
No, I understood all that from the get go. Point taken about the
name. But, be aware, the input is already mangled. For example, all
whitespace is condensed (most newlines removed), end tags are dropped. All
by itself, parsing does considerable violence to the input markup. And, if
we add a new node type for entities (which we should do as part of the
encoding overhaul), it will only get worse.
If you want access to the intact input text, much more work will be
necessary. Your best bet, I think, would be to write a new pretty printer
according to your tastes. Or, maybe better, use a different HTML parser -
if there is one that preserves the input better.
Just to be clear, I have no problem with Björn's intent here. It makes
sense. I just don't think Tidy will do that. If we do a better job of
separating parsing and tree-repair, it will get better. In any case, this
issue is better revisited after we implement pluggable character
encodings. If you want less whitespace, you can always set --indent no
(the default). You can also manipulate --indent-spaces and --wrap options
to your liking. For the simple cases (where Björn's code will work fine),
it does about the same thing. So I really don't see what the problem is
with tidyNodeGetText().
Also, a minor correction, tidyNodeGetText() takes a buffer as an argument
and handles all the details of setting the output sink, etc. for you.
>I suspect that what Bjoern is trying to do is manipulate the DOM tree
>before it is output, which is something I do a lot. I even wrote a
>function amazingly similar to Bjoern's for my local copy; mine was named
>getTextFromNode().
I presume you are using internal functions for update. I agree some
manipulation functions should be wrapped for public use. As always, adding
ASCII or UTF-8 text will be simple enough. But we will want to support the
same encodings as the parser. I also think it will be desirable to copy
nodes between trees.
In all likelihood, however, we will need to support fragment parsing (which
is why I didn't expose any update functions in the first design). Not
terribly difficult, I think, but out of scope at the time. Again,
pluggable transcoding comes first, imo.
>The big difference between the two functions is that in Bjoern's case, if
>the text is larger than the buffer, his function fails but returns the
>size of the required buffer.
Björn's approach is a well understood pattern. In essence, it allows the
app to query how big a buffer to allocate.
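For reference, here is a minimal sketch of that pattern, independent of Tidy. The name copy_or_report_size() is made up for illustration and is not a Tidy function; the source is just a byte range, and the return convention (0 plus the required size on a short buffer, bytes written including the terminator on success) is my reading of the quoted draft, not a settled API.

```c
#include <string.h>

/* Hypothetical helper, not part of the Tidy API.
 * Copies `len` bytes of `src` into `buffer` as a NUL-terminated string.
 * If the buffer cannot hold len + 1 bytes, nothing is copied, the
 * required size is stored in *bufsize, and 0 is returned.
 * On success, *bufsize and the return value are both len + 1.
 * Returns -1 on bad arguments. */
int copy_or_report_size(const char *src, int len, char *buffer, int *bufsize)
{
    if (!src || !buffer || !bufsize || len < 0)
        return -1;

    if (*bufsize <= len) {
        *bufsize = len + 1;   /* tell the caller how much to allocate */
        return 0;
    }

    memcpy(buffer, src, (size_t)len);
    buffer[len] = '\0';       /* index is relative to the caller's buffer */
    *bufsize = len + 1;
    return len + 1;
}
```

The app calls once with whatever buffer it has, and if it gets 0 back, allocates *bufsize bytes and calls again. Note that the terminator is written at index len of the caller's buffer, not at an offset into the source.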
>In my case, I return the number of bytes actually copied into the buffer,
>not to exceed the maximum buffer size, even if the copy was
>incomplete. That way, if I want to look at the text in an H3 element, for
>example, and I'm only interested in the first 16 characters, I can pass in
>a 16 byte buffer and get back the first 16 characters, regardless of the
>actual text length.
I have no problem with a fixed output buffer type of GetText()
function. Probably a good thing. I object only to shoveling raw Lexer
buffer contents.
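A fixed-buffer variant along those lines might look like the sketch below. The name copy_truncated() is hypothetical (I have not seen Lee's getTextFromNode()), and it assumes the source bytes are valid UTF-8. One wrinkle it tries to show: 16 bytes is not the same as 16 characters in UTF-8, so when the copy is cut short, the sketch backs up to the start of a character rather than splitting a multi-byte sequence.

```c
#include <string.h>

/* Hypothetical helper, not part of the Tidy API.
 * Copies at most `bufsize` bytes of `src` (length `len`) into `buffer`
 * and returns the number of bytes copied. No NUL terminator is written.
 * Assumes `src` is valid UTF-8: if the copy must be truncated, the cut
 * point is moved back so it never lands inside a multi-byte character.
 * Returns -1 on bad arguments. */
int copy_truncated(const char *src, int len, char *buffer, int bufsize)
{
    int n;

    if (!src || !buffer || len < 0 || bufsize < 0)
        return -1;

    n = len < bufsize ? len : bufsize;

    if (n < len) {
        /* UTF-8 continuation bytes have the bit pattern 10xxxxxx.
         * If the first byte left out is one, we are mid-character. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            --n;
    }

    memcpy(buffer, src, (size_t)n);
    return n;
}
```

The caller gets back a byte count that may be smaller than the buffer size, which is the price of not handing out half a character.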
>If anyone is interested in looking at my code (as well as some other tree
>traversal and query routines I have written) I would be happy to post
>them, or even check them into CVS, perhaps in an experimental section?
I'd like to see them. Don't get me wrong. I think this is a fruitful area
for development. The "contrib" forum on SF might be a good place to post
these. Or maybe just put them up on SF with an "experimental routines"
cover page.
It might also be helpful to cook up some more test cases. In fact, our
testing approach probably needs some more thought. I have created a small
test suite to exercise library features based on my Perl wrapper. My GUI
and COM wrappers also serve as useful tests for library features (e.g.
UTF-16 support).
>In any case, the output from Bjoern's routine ought to be sane, because
>until I can convince the rest of the group that there ought to be an
>option to put off encoding conversions until the pretty print phase, the
>data in lexbuf[] has already been converted to UTF-8 as the buffer is
>being filled. So long as the function is well-documented that the
>contents returned are UTF-8 encoded (or UTF-16 when that happens) there
>should be no problems.
Not so fast ... you need to define what happens to mixed content model
elements?! Once we introduce entity nodes, what happens to them?!
You will quickly end up re-implementing PPrintTree(). I am just trying to
save you the trip. Also, I would really like to address the transcoding
issue _before_ protracted haggling over unformatted vs. "pretty" text
output.
>>Your point about comments and PIs is well taken. You should have said
>>something <g>. Jeff Pohlmeyer submitted a patch for this a while ago and
>>I thought it had already gone in. In tidylib.c, tidyNodeGetText(),
>>change the test for "nodeHasText(nimp)" to just "nimp". It should do
>>what you want.
In your experiments, did you do anything like this?
>>At 01:21 PM 2/4/2003 +0100, Bjoern Hoehrmann wrote:
>>
>>>* Bjoern Hoehrmann wrote:
>>> > I was unable to locate a function in tidy.h which
>>> > allows me to access the
>>> > lexer->lexbuf[ node->start .. node->end ] which I'd
>>> > need to get the content of text / comment / processing
>>> > instruction / etc. nodes.
>>>
>>> int tidyNodeGetLength(TidyNode node)
>>> {
>>>     Node* n = tidyNodeToImpl(node);
>>>     return n ? n->end - n->start : 0;
>>> }
>>>
>>> int tidyNodeGetValue(TidyDoc tdoc,
>>>                      TidyNode tnode,
>>>                      char buffer[],
>>>                      int bufsize[])
>>> {
>>>     Node *node;
>>>     TidyDocImpl *doc;
>>>     int i, len;
>>>
>>>     if (!buffer || !tdoc || !tnode || !bufsize)
>>>         return -1;
>>>
>>>     node = tidyNodeToImpl(tnode);
>>>     doc = tidyDocToImpl(tdoc);
>>>     len = node->end - node->start;
>>>
>>>     if (*bufsize <= len)
>>>     {
>>>         *bufsize = len + 1;
>>>         return 0;
>>>     }
>>>
>>>     for (i = 0; i < len; ++i)
>>>         buffer[i] = doc->lexer->lexbuf[node->start + i];
>>>
>>>     buffer[node->start + i++] = 0;
>>>     *bufsize = i;
>>>
>>>     return i;
>>> }
>>>
>>>Comments?