
#21 It could be possible to specify the set of element types for which attributes are parsed

General
closed
nobody
None
5
2014-02-24
2014-02-19
Sebastiano Vigna
No

In some parsing applications (in our case, a crawler) one is often interested only in the attributes of a small subset of element types. The attributes of all other elements generate a large amount of garbage (strings, CharBuffers, etc.) that must be collected.

It would be nice to be able to "register" with a parser a set of element types, and then only attributes associated with such element types would be parsed and put into an AttributeList. For the other element types, an empty list (possibly a singleton) could be returned.
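A minimal sketch of the requested behaviour, to make the idea concrete. It is not part of the parser's API: the class `SelectiveAttributes`, its method `attributesOf`, and the caller-supplied parsing function are all hypothetical names, and a plain `Map` stands in for the attribute list.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: parse attributes only for registered element types. */
final class SelectiveAttributes {
    private final Set<String> registered;                        // lower-case element names
    private final Function<String, Map<String, String>> parser;  // caller-supplied attribute parser

    SelectiveAttributes(Set<String> registered, Function<String, Map<String, String>> parser) {
        this.registered = registered;
        this.parser = parser;
    }

    /** Returns parsed attributes for registered types, otherwise a shared empty map. */
    Map<String, String> attributesOf(String elementName, String rawAttributeText) {
        if (!registered.contains(elementName.toLowerCase(Locale.ROOT))) {
            // Shared immutable instance: no new map and no attribute strings are created.
            return Collections.emptyMap();
        }
        return parser.apply(rawAttributeText);
    }
}
```

With `registered` set to, say, `Set.of("a")`, the parsing function is only ever invoked for `a` elements; every other element type gets the same empty instance back.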

Discussion

  • Martin Jericho
    2014-02-19

    Hi Sebastiano,
    This behaviour is too obscure to implement as a standard feature. Although the code would be fairly simple, the documentation would be a little time-consuming.
    I'd suggest you implement a patch yourself. It would probably only require a couple of lines in the Attributes.construct method.
    Cheers
    Martin

     
  • Martin Jericho
    2014-02-19

    • status: open --> closed
     
  • Well, I tried to have a look, but it's not as simple as it seems. The problem is that Attributes.construct(), besides building the attribute list, also finds the end of the attributes, and that end is used by the caller. Thus, just skipping that part of the code won't work.

    Some of the code internally uses objects to propagate values. E.g., valueSegmentIncludingQuotes is used to update attributesEnd. So if we want to avoid building those objects, we have to update attributesEnd in a different way, etc.
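To illustrate the point, here is a rough sketch of the scan that would still be needed even when no attribute objects are built: finding where the attributes end. This is not the code in Attributes.construct(); the class and method names are hypothetical, and it assumes a quote character always opens a quoted value, which is stricter than a tolerant HTML parser would be.

```java
/** Hypothetical sketch: locate the end of a start tag's attributes without building objects. */
final class AttributeEndScanner {
    /**
     * @param source the document text
     * @param pos    index just after the tag name
     * @return index of the '>' that closes the start tag, or -1 if none is found
     */
    static int findAttributesEnd(CharSequence source, int pos) {
        char quote = 0; // 0 = outside a quoted value, otherwise the opening quote character
        for (int i = pos; i < source.length(); i++) {
            char c = source.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;   // closing quote found
            } else if (c == '"' || c == '\'') {
                quote = c;                    // entering a quoted value
            } else if (c == '>') {
                return i;                     // first '>' outside quotes ends the tag
            }
        }
        return -1;
    }
}
```

The scan only tracks an index and the current quote character, so it allocates nothing, but it has to duplicate the quote-handling logic that the attribute-building code already contains.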

    I don't think the feature is obscure, though: no more so than deregistering, say, server tags.

    We will try an implementation and let you know what happens...

     
  • Martin Jericho
    2014-02-24

    Yes, sorry, I misunderstood the problem. On first read I thought you were having a problem with the memory being used up storing all the attribute objects. Some users have had issues with this due to Java runtime speed optimisations resulting in excessive memory use.

    As you discovered, there isn't really a foolproof way of parsing HTML without parsing all of the attributes.

    I'm very surprised though that you would be having performance issues with the garbage collector. They are super efficient these days even when needing to collect a large number of objects. You must be parsing a huge volume of pages for it to be an issue.

    The only thing I can suggest is that you register a custom tag type that is the same as a normal tag type but doesn't look for attributes, then you can use the Segment.parseAttributes method to parse attributes only in the relevant start tags, or use a regular expression parser if even that's not efficient enough. Just keep in mind this will fail if any of the attribute values contain a right angle bracket.
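As a rough illustration of the regular-expression variant, the sketch below uses only java.util.regex and extracts attributes from start tags whose names are in a wanted set. The class name, method name, and both patterns are assumptions made for this example, not part of the library; attributes without a value are ignored for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: regex-based attribute extraction for selected start tags only. */
final class RegexStartTagAttributes {
    // Naive start tag: a name, then everything up to the first '>'.
    private static final Pattern START_TAG =
        Pattern.compile("<([A-Za-z][A-Za-z0-9]*)([^>]*)>");
    // name="value", name='value' or name=value (valueless attributes are skipped).
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([^\\s=\"'/>]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    /** Returns one attribute map per start tag whose name is in {@code wanted}. */
    static List<Map<String, String>> attributesOf(CharSequence html, Set<String> wanted) {
        List<Map<String, String>> result = new ArrayList<>();
        Matcher tag = START_TAG.matcher(html);
        while (tag.find()) {
            if (!wanted.contains(tag.group(1).toLowerCase(Locale.ROOT))) {
                continue; // attributes of other element types are never examined
            }
            Map<String, String> attributes = new LinkedHashMap<>();
            Matcher attr = ATTRIBUTE.matcher(tag.group(2));
            while (attr.find()) {
                String value = attr.group(2) != null ? attr.group(2)
                             : attr.group(3) != null ? attr.group(3)
                             : attr.group(4);
                attributes.put(attr.group(1).toLowerCase(Locale.ROOT), value);
            }
            result.add(attributes);
        }
        return result;
    }
}
```

The caveat about right angle brackets shows up directly here: for `<a href="x>y">` the tag pattern stops at the `>` inside the quoted value, so the href attribute is lost.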

    May I ask what sort of project you're working on that has to parse so many documents?

    Cheers
    Martin

     
  • So, actually the problem was with the memory used, but after some heap profiling it turned out that our problem is more with strings than attributes, and more with strings generated by URI.create() than anything else, so we are OK.

    We are developing a high-speed crawler (BUbiNG, you can get it from http://law.di.unimi.it/software/). We parse several thousand pages per second (in tests with a local proxy, more than 10,000 pages/s) so yes, we generate a lot of garbage :).

     
  • Martin Jericho
    2014-02-24

    Thanks for the link. It looks like interesting research you're doing with BUbiNG, and I'm impressed with the performance you're squeezing out of the parser. That's probably thanks to all the improvements you've suggested over the last year!