From: Fernando G. <fer...@gm...> - 2007-03-22 09:27:25
Well, I don't need much of XPath. All the XPath expressions I expect to use have the same form: get the nth child of an element. I thought XPath evaluation would perform worse with the chunk approach, but the results were better than I expected, and they are good enough for me. Keep in mind, though, that my benchmark is not thorough: I tested only one XPath expression against a single 100 MB XML file.

On 3/21/07, Jimmy Zhang <cra...@co...> wrote:
> Not a problem :-), one reason we are having this discussion is that the
> indexing feature (and VTD-XML itself) is so new and we have yet to
> understand the possibilities and design trade-offs... Yes, I can see why
> tuning the optimum buffer size can potentially improve performance...
> but in general, do you see any issue with the XPath evaluation?
>
> ----- Original Message -----
> From: Fernando Gonzalez <fer...@gm...>
> To: vtd...@li...
> Sent: Wednesday, March 21, 2007 3:14 AM
> Subject: Re: [Vtd-xml-users] Storing parsing info
>
> Sorry Jimmy, I misunderstood you. Please forget the first paragraph of
> my last mail.
>
> Yes, you're right: when you ask for a byte, it may not be in the loaded
> chunk, so you have to load another chunk. Of course that is slower than
> the current solution. But I think there must be a buffer size that
> optimizes performance, so that the solution is only a bit slower while
> never loading more than 20-30 MB into memory. Don't you think so?
>
> Anyway, the proposed changes don't force anyone to use that approach.
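The buffer-size tuning mentioned above could be probed empirically with a small harness along these lines. This is an illustrative sketch using only the standard library; `ChunkSizeProbe` and its method names are hypothetical and not part of VTD-XML.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical harness: time how long it takes to stream a document
// with different chunk (buffer) sizes, to look for a sweet spot between
// memory use and speed. Illustrative only, not VTD-XML API.
public class ChunkSizeProbe {

    // Read the whole stream using a fixed-size chunk buffer; return elapsed nanos.
    public static long timeRead(InputStream in, int chunkSize) throws IOException {
        byte[] chunk = new byte[chunkSize];
        long start = System.nanoTime();
        while (in.read(chunk) != -1) {
            // a real harness would hand each chunk to the parser here
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = new byte[8 * 1024 * 1024]; // stand-in for a large GML file
        for (int chunkSize : new int[] {4 * 1024, 64 * 1024, 1024 * 1024}) {
            long nanos = timeRead(new ByteArrayInputStream(doc), chunkSize);
            System.out.println(chunkSize + " bytes/chunk: " + nanos + " ns");
        }
    }
}
```

On a real file the timings would be dominated by disk and OS caching, so each size would need several warm and cold runs to be meaningful.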
> It's still possible to load the whole XML into memory, so the only side
> effects of the proposal are that the library becomes a bit more complex
> (there are some exception-handling issues, and the user has to provide
> an implementation of an interface instead of a byte[]) and a bit slower
> (because of the added level of indirection).
>
> Fernando
>
> On 3/21/07, Fernando Gonzalez <fer...@gm...> wrote:
> >
> > "if the chunks don't have what one is looking for, you will have to
> > load in another chunk... then another chunk.."
> >
> > Not exactly. If something asks for byte number 'x', I work out which
> > chunk that byte is in and load only that chunk. Only if the
> > information asked for by the user is spread across two or more
> > contiguous chunks will it be necessary to load more than one. The
> > implementation can be seen in the "org.ChunkByteBuffer" class, in the
> > "public byte byteAt(int streamIndex)" method.
> >
> > About the alternative you propose: as I said before, removing or
> > archiving the original GML is not a good solution for me. Splitting
> > wouldn't be as bad as removal or archiving, but it would add some
> > complexity for the user, who would have to keep track of the split GML
> > files that make up the original one. The splitting could be
> > implemented in a way the user doesn't notice... until he tries to
> > access the file with another application.
> >
> > Indeed, I think that, in the end, I'm doing something similar to
> > splitting, since I logically split the GML file into several GML
> > chunks.
> >
> > Fernando
> >
> > On 3/20/07, Jimmy Zhang <cra...@co...> wrote:
> > >
> > > Ok, I see... it seems that you can be sure that the "chunks" of GML
> > > files contain what the user would need...
> > > But in general, if the chunks don't have what one is looking for,
> > > you will have to load in another chunk... then
> > > another chunk..
> > > That could mean a lot of disk activity.
> > > As an alternative, would it be possible to split the GML into little
> > > chunks of well-formed GML files, then index them individually?
> > > So instead of dealing with 10 big GML files, split them into 100
> > > smaller GML files, and the algorithm you describe may still work..
> > >
> > > ----- Original Message -----
> > > From: Fernando Gonzalez <fer...@gm...>
> > > To: vtd...@li...
> > > Sent: Tuesday, March 20, 2007 2:39 AM
> > > Subject: Re: [Vtd-xml-users] Storing parsing info
> > >
> > > On 3/20/07, Jimmy Zhang <cra...@co...> wrote:
> > > >
> > > > So what you are trying to accomplish is to load all the GML
> > > > docs into memory at once...
> > > > I guess you can simply index all those files to avoid parsing...
> > > > but I still don't seem to understand the benefit of reading the
> > > > parse info and a chunk of the XML file..
> > > >
> > >
> > > Quite close. What I need is to access a random feature at any time,
> > > at as low a cost as possible. That would be possible by loading all
> > > the GML docs into memory, but the GML files are very big, so I
> > > cannot do it.
> > >
> > > As that solution wasn't suitable for my problem, I thought of
> > > opening one file at a time (using buffer reuse), and then it
> > > occurred to me that I could save parsing time by storing the parse
> > > info. As I said before, I cannot delete the GML, and storing the GML
> > > twice would waste disk space. I'm talking about an environment where
> > > the user can have a lot of digital cartography on his computer; disk
> > > space is quite a bottleneck. Duplication could still be a valid
> > > option, but storing only the parse info was so easy that I did it,
> > > and I obtained a better solution (for my environment).
> > >
> > > There is a use case where the user doesn't work with the files
> > > directly, but with a spatial region.
> > > In this case, the GML files and other spatial data are "layers", so
> > > the user can work with a lot of files at the same time. These files
> > > can be in formats other than GML: satellite images, different raster
> > > or vector formats; and those can put the system in an even more
> > > memory-constrained situation. That's what led me to load chunks of
> > > the GML file.
> > >
> > > The workflow is the following:
> > > * I open a file with the chunk approach
> > > * I parse the file (loading it with the chunk approach takes a
> > >   while, but that's no problem)
> > > * I store the parse info
> > > When the user asks for information:
> > > * I load the parse info
> > > * I load the chunk
> > > * I return the requested information
> > >
> > > I want to speed up the information requests because the user can ask
> > > for a map image built from 20 GML files, and the map code is
> > > something like this:
> > >
> > > for each gml file
> > >     work out which "features" fall inside the map bounds (the GML
> > >     was spatially indexed beforehand)
> > >     get those features from the GML (random access: load parse info
> > >     + load chunk + return info)
> > >     draw the features on an image
> > > next gml file
> > >
> > > Maybe this will make things a bit clearer. This screenshot
> > > (http://www.gvsig.gva.es/fileadmin/conselleria/images/Documentacion/capturas/raster_shp_dgn_750.gif)
> > > shows a program that uses the library. You can see on the left all
> > > the loaded (from the user's point of view) files: four "dgn" files,
> > > one "shp" and seven "ecw" files. Many of the operations done on the
> > > map are performed over *every* file listed on the left, so I don't
> > > care how much time it takes to put all those files on the left
> > > (generating parse info, etc.). I care how much time it takes to read
> > > the information after they are loaded (again, from the user's point
> > > of view).
> > >
> > > Well, I hope it's clear enough.
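The chunk lookup described in this thread (work out which chunk byte 'x' falls in, and load only that chunk) can be sketched roughly as follows. This is a hypothetical, in-memory reimplementation for illustration, not the actual org.ChunkByteBuffer from the proposed patch; a real version would read each chunk from disk rather than copying it out of an array.

```java
// Sketch of a chunked random-access byte buffer, in the spirit of the
// org.ChunkByteBuffer discussed in this thread (illustrative, not the real code).
public class ChunkedBuffer {
    private final byte[] source;   // stands in for the GML file on disk
    private final int chunkSize;
    private byte[] chunk;          // the one chunk currently held in memory
    private int loadedChunkIndex = -1;
    int loads = 0;                 // number of chunk loads, for inspection

    public ChunkedBuffer(byte[] source, int chunkSize) {
        this.source = source;
        this.chunkSize = chunkSize;
    }

    // Return the byte at an absolute stream offset, loading its chunk if needed.
    public byte byteAt(int streamIndex) {
        int chunkIndex = streamIndex / chunkSize;
        if (chunkIndex != loadedChunkIndex) {
            int from = chunkIndex * chunkSize;
            int to = Math.min(from + chunkSize, source.length);
            // in a real implementation this copy would be a disk read
            chunk = java.util.Arrays.copyOfRange(source, from, to);
            loadedChunkIndex = chunkIndex;
            loads++;
        }
        return chunk[streamIndex - loadedChunkIndex * chunkSize];
    }
}
```

Reads that stay inside the loaded chunk cost nothing extra; only crossing a chunk boundary triggers another load, which is the disk-activity concern raised earlier in the thread.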
> > > Notice that I'm not proposing to change the way VTD-XML works; I'm
> > > proposing to add new ways.
> > >
> > > greetings,
> > > Fernando
> > >
> > > ----- Original Message -----
> > > > From: Fernando Gonzalez <fer...@gm...>
> > > > To: vtd...@li...
> > > > Sent: Monday, March 19, 2007 2:56 AM
> > > > Subject: Re: [Vtd-xml-users] Storing parsing info
> > > >
> > > > Well, heh, the computer is new, but I don't think my disk is that
> > > > fast. I think Java or the operating system must be caching
> > > > something, because the first time I load the file it takes a bit
> > > > more than 2 seconds, and after the first load it only takes 300 ms
> > > > to read the file...
> > > > I have no experience doing benchmarks, and maybe I am missing
> > > > something. That's why I attached the program.
> > > >
> > > > "So if you can't delete the original XML files, can you compress
> > > > them and store them away (archiving)?"
> > > > I can neither delete nor archive the GML file, because in this
> > > > context it won't be rare for it to be read by two different
> > > > programs at the same time... It's difficult to find an open-source
> > > > program that does everything you need. For example, in a
> > > > development context, a map server may be serving a map image based
> > > > on a GML file while you are opening the same file to look at some
> > > > of its data.
> > > >
> > > > "The other issue you raised is buffer reuse. To reuse internal
> > > > buffers of VTDGen, you can call setDoc_BR(...). But there is more
> > > > you can do... you can in fact reuse the byte array containing the
> > > > XML document."
> > > > Buffer reuse absolutely solves my memory constraints. But the
> > > > problem I see with buffer reuse is that it will force me to read
> > > > and parse the whole XML file every time the user asks for
> > > > information from another XML file, won't it?
> > > > If I read the XML file in chunks and store/read the parse
> > > > information, then each time the user asks for information from
> > > > another XML file I only have to read the parse info and a chunk of
> > > > the XML file.
> > > >
> > > > To show you my point of view: the "user asking for another XML
> > > > file" may be a map server that reads some big GML files and draws
> > > > their spatial information into a map image. If, each time the map
> > > > server finishes drawing a GML file and "changes" to the next one,
> > > > it takes 2 seconds or so, then drawing the map (all the GML files)
> > > > takes too long.
> > > >
> > > > best regards,
> > > > Fernando
> > > >
> > > > On 3/19/07, Jimmy Zhang <cra...@co...> wrote:
> > > > >
> > > > > What intrigues me about Fernando's test results is that it only
> > > > > takes 300 ms to read a 100 MB file? He's got a super fast
> > > > > disk...
> > > > >
> > > > > ----- Original Message -----
> > > > > From: Rodrigo Cunha <rn...@gm...>
> > > > > To: Jimmy Zhang <cra...@co...>
> > > > > Cc: Fernando Gonzalez <fer...@gm...>; vtd...@li...
> > > > > Sent: Sunday, March 18, 2007 8:40 PM
> > > > > Subject: Re: [Vtd-xml-users] Storing parsing info
> > > > >
> > > > > In fact the idea occurred to me in the past as well... but VTD
> > > > > is so fast at reading large files anyway! With a fast processor
> > > > > I think we might be disk-limited rather than processor-limited.
> > > > > Still, if the code is already written, the option seems cute
> > > > > enough to keep :-)
> > > > >
> > > > > Since I mainly deal with large files requiring a lot of
> > > > > processing, this has not been an issue. Others, in different
> > > > > environments, might disagree.
> > > > >
> > > > > Jimmy Zhang wrote:
> > > > >
> > > > > Fernando, the option for storing VTD in a separate file is open.
> > > > > I attached the technical document from your last email, and am
> > > > > also interested in the suggestions/comments from the mailing
> > > > > list ...

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

_______________________________________________
Vtd-xml-users mailing list
Vtd...@li...
https://lists.sourceforge.net/lists/listinfo/vtd-xml-users
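Storing the parse info in a separate file, as discussed throughout the thread, could in principle be as simple as flattening the token records to an array of longs and serializing them. The following is a generic sketch under that assumption; `ParseInfoStore` is a hypothetical name, and this is not VTD-XML's actual index format.

```java
import java.io.*;

// Generic sketch: save and reload an array of token records (e.g. VTD-style
// 64-bit records) so a later run can skip parsing. Not VTD-XML's real format.
public class ParseInfoStore {

    // Write the record count followed by each 64-bit record.
    public static void save(long[] records, File f) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(records.length);
            for (long r : records) out.writeLong(r);
        }
    }

    // Read the records back in the same order they were written.
    public static long[] load(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            long[] records = new long[in.readInt()];
            for (int i = 0; i < records.length; i++) records[i] = in.readLong();
            return records;
        }
    }
}
```

The point of the thread is that this file is small relative to the GML, so "load parse info + load one chunk" stays far cheaper than reparsing a 100 MB document on every access.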