From: Fernando G. <fer...@gm...> - 2007-03-21 10:14:10
Sorry Jimmy, I misunderstood you. Please forget the first paragraph of my
last mail. Yes, you're right: when you ask for a byte, it may not be in the
loaded chunk, so you have to load another chunk. Of course that's quite a
bit slower than the current solution. But I think there must be a buffer
size that optimizes performance, so the solution would be only slightly
slower without loading more than 20-30 MB into memory. Don't you think so?

Anyway, the proposed changes don't force anyone to use that approach. It's
still possible to load the whole XML into memory, so the only side effects
of the proposal are that the library becomes a bit more complex (there are
some exception-handling issues, and the user has to provide an
implementation of an interface instead of a byte[]) and a bit slower
(because of the added level of indirection).

Fernando

On 3/21/07, Fernando Gonzalez <fer...@gm...> wrote:
>
> "if the chunks don't have what one is looking for, you will have to load
> in another chunk... then another chunk.."
>
> Not exactly. If something asks for byte number 'x', I guess which chunk
> the byte is in and load only that chunk. Only if the information asked
> for by the user spans two or more contiguous chunks is it necessary to
> load more than one. The implementation can be seen in the
> "org.ChunkByteBuffer" class, in the "public byte byteAt(int streamIndex)"
> method.
>
> About the alternative you propose: as I said before, it's not a good
> solution to remove or archive the original GML. Splitting wouldn't be as
> bad as removal or archiving, but it would add some complexity for the
> user. The user would have to keep track of the split GML files that form
> the original GML file. The splitting could be implemented in a way the
> user doesn't notice... until he tries to access the file with another
> application.
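The chunk-lookup idea described above can be sketched roughly as follows. This is a minimal illustration only: the class and field names here are hypothetical, not the actual org.ChunkByteBuffer implementation.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch of the chunked byteAt(...) lookup described in the
// thread; the real org.ChunkByteBuffer may differ in details.
public class ChunkedBuffer {
    private final RandomAccessFile file;
    private final int chunkSize;
    private byte[] chunk;               // the currently loaded chunk
    private int loadedChunkIndex = -1;  // -1 means nothing loaded yet

    public ChunkedBuffer(RandomAccessFile file, int chunkSize) {
        this.file = file;
        this.chunkSize = chunkSize;
    }

    // Guess which chunk the byte is in; load only that chunk if it is
    // not the one already in memory.
    public byte byteAt(int streamIndex) throws IOException {
        int chunkIndex = streamIndex / chunkSize;
        if (chunkIndex != loadedChunkIndex) {
            long offset = (long) chunkIndex * chunkSize;
            int len = (int) Math.min(chunkSize, file.length() - offset);
            chunk = new byte[len];
            file.seek(offset);
            file.readFully(chunk);
            loadedChunkIndex = chunkIndex;
        }
        return chunk[streamIndex % chunkSize];
    }
}
```

Only one chunk lives in memory at a time, which is the trade-off discussed above: repeated accesses that ping-pong between chunks cost extra disk reads.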
> Indeed, I think that, in the end, I'm doing something similar to
> splitting, since I logically split the GML file into several GML chunks.
>
> Fernando
>
> On 3/20/07, Jimmy Zhang <cra...@co...> wrote:
> >
> > Ok, I see... it seems that you can be sure that the "chunks" of GML
> > files contain what the user would need...
> > But in general, if the chunks don't have what one is looking for, you
> > will have to load in another chunk... then another chunk... that could
> > mean a lot of disk activity.
> > As an alternative, would it be possible to split the GML into little
> > chunks of well-formed GML files, then index them individually?
> > So instead of dealing with 10 big GML files, split them into 100
> > smaller GML files, and the algorithm you describe may still work.
> >
> > ----- Original Message -----
> > *From:* Fernando Gonzalez <fer...@gm...>
> > *To:* vtd...@li...
> > *Sent:* Tuesday, March 20, 2007 2:39 AM
> > *Subject:* Re: [Vtd-xml-users] Storing parsing info
> >
> > On 3/20/07, Jimmy Zhang <cra...@co...> wrote:
> > >
> > > So what you are trying to accomplish is to load all the GML docs into
> > > memory at once...
> > > I guess you can simply index all those files to avoid parsing...
> > > but I still don't seem to understand the benefit of reading the
> > > parse info and a chunk of the XML file.
> >
> > Quite close. What I need is to access a random feature at any time at
> > as low a cost as possible. That would be possible by loading all the
> > GML docs into memory, but the GML files are very big, so I cannot do
> > that.
> >
> > As that solution wasn't suitable for my problem, I thought of opening
> > one file at a time (using buffer reuse), and then it came to my mind
> > that I could save parsing time by storing the parse info. As I said
> > before, I cannot delete the GML, and storing the GML twice would waste
> > disk space. I'm talking about an environment where the user can have a
> > lot of digital cartography on his computer.
> > Disk space is quite a bottleneck. It might still have been valid, but
> > storing only the parse info was so easy that I did it, and I obtained
> > a better solution (for my environment).
> >
> > There is a use case where the user doesn't work with the files
> > directly, but with a spatial region. In this case, the GML files and
> > other spatial data are "layers", so the user can work with a lot of
> > files at the same time. These files can be in formats other than GML:
> > satellite images, different raster or vector formats; and these can
> > bring the system to an even more memory-constrained situation. That's
> > what led me to load chunks of the GML file.
> >
> > The workflow is the following:
> > * I open a file with the chunk approach
> > * I parse the file (loading it with the chunk approach takes a while,
> >   but that's no problem)
> > * I store the parse info
> > Then, when the user asks for information:
> > * I load the parse info
> > * I load the chunk
> > * I return the requested information
> >
> > I want to speed up the retrieval of information because the user can
> > ask for a map image built from 20 GML files, and the map code is
> > something like this:
> >
> > for each gml file
> >     guess what "features" are inside the map bounds (the GML is
> >     spatially indexed beforehand)
> >     get those features from the GML (random access: load parse info +
> >     load chunk + return info)
> >     draw the features on an image
> > next gml file
> >
> > Maybe this will make things a bit clearer. This screenshot
> > (http://www.gvsig.gva.es/fileadmin/conselleria/images/Documentacion/capturas/raster_shp_dgn_750.gif)
> > shows a program that uses the library. You can see on the left all the
> > loaded (from the user's point of view) files: four "dgn" files, one
> > "shp" and seven "ecw" files. A lot of operations done on the map are
> > done over *every* file listed on the left, so I don't care how much
> > time it takes to put all those files on the left (generating parse
> > info, etc).
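That per-file loop could look roughly like this in Java. All type and method names below are hypothetical illustrations of the described workflow, not the actual gvSIG or VTD-XML API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the map-drawing loop described above.
// None of these names come from the real gvSIG or VTD-XML APIs.
public class MapSketch {

    // A feature with an integer bounding box and an identifier.
    static class Feature {
        final int minX, minY, maxX, maxY;
        final String id;
        Feature(String id, int minX, int minY, int maxX, int maxY) {
            this.id = id; this.minX = minX; this.minY = minY;
            this.maxX = maxX; this.maxY = maxY;
        }
        boolean intersects(int x1, int y1, int x2, int y2) {
            return minX <= x2 && maxX >= x1 && minY <= y2 && maxY >= y1;
        }
    }

    // Stands in for one GML file plus its stored parse info and
    // spatial index: given map bounds, it returns the matching features
    // (load parse info + load chunk + return info).
    interface GmlFile {
        List<Feature> featuresInside(int x1, int y1, int x2, int y2);
    }

    // for each gml file: guess the features inside the map bounds,
    // fetch them by random access, then "draw" them on the image.
    static List<String> drawMap(List<GmlFile> files,
                                int x1, int y1, int x2, int y2) {
        List<String> drawn = new ArrayList<>();
        for (GmlFile f : files) {
            for (Feature feat : f.featuresInside(x1, y1, x2, y2)) {
                drawn.add(feat.id); // collecting ids stands in for rendering
            }
        }
        return drawn;
    }
}
```

The point of the design is that `featuresInside` only touches the stored parse info and the chunks that actually contain matching features, instead of reparsing the whole file per map request.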
> > I care how much time it takes to read the information after they are
> > loaded (again, from the user's point of view).
> >
> > Well, I hope that's clear enough. Notice that I'm not proposing to
> > change the way VTD-XML works; I'm proposing to add new ways.
> >
> > greetings,
> > Fernando
> >
> > > ----- Original Message -----
> > > *From:* Fernando Gonzalez <fer...@gm...>
> > > *To:* vtd...@li...
> > > *Sent:* Monday, March 19, 2007 2:56 AM
> > > *Subject:* Re: [Vtd-xml-users] Storing parsing info
> > >
> > > Well, hehe, the computer is new, but I don't think my disk is that
> > > fast. I think Java or the operating system must be caching
> > > something, because the first time I load the file it takes a bit
> > > more than 2 seconds, and after the first load it only takes 300 ms
> > > to read the file... I have no experience doing benchmarks and maybe
> > > I'm missing something. That's why I attached the program.
> > >
> > > "So if you can't delete the original XML files, can you compress
> > > them and store them away (archiving)?"
> > > I cannot delete or archive the GML file, because in this context it
> > > won't be rare for two different programs to be reading it at the
> > > same time... It's difficult to find an open source program that does
> > > everything you need. For example, in a development context, there
> > > may be a map server serving a map image based on a GML file while
> > > you are opening the same file to look at some data in it.
> > >
> > > "The other issue you raised is buffer reuse. To reuse the internal
> > > buffers of VTDGen, you can call setDoc_BR(...). But there is more
> > > you can do... you can in fact reuse the byte array containing the
> > > XML document."
> > > Buffer reuse absolutely solves my memory constraints. But the
> > > problem I see with buffer reuse is that it will force me to read and
> > > parse the whole XML file every time the user asks for information on
> > > another XML file, won't it?
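For reference, the buffer-reuse idea mentioned above (in VTD-XML the entry point is VTDGen's setDoc_BR) boils down to recycling one byte array across documents. The sketch below illustrates only that generic pattern; the class and method names are hypothetical and not part of the VTD-XML API.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Generic illustration of the buffer-reuse pattern discussed above:
// one byte array is recycled across files, growing only when a bigger
// file arrives. (Hypothetical names; not the VTD-XML API itself.)
public class ReusableBuffer {
    private byte[] buf = new byte[0];

    // Read a whole file into the shared buffer. The returned array may
    // be longer than the file; the caller must track the real length.
    public byte[] read(RandomAccessFile file) throws IOException {
        int len = (int) file.length();
        if (buf.length < len) {
            buf = new byte[len];   // grow once; smaller files reuse it
        }
        file.seek(0);
        file.readFully(buf, 0, len);
        return buf;
    }
}
```

This avoids per-file allocation, but, as the mail points out, it still means reading and reparsing the whole document each time; the stored-parse-info approach is meant to remove that reparse step.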
> > > If I read the XML file in chunks and store/read the parse
> > > information, then each time the user asks for information on another
> > > XML file I only have to read the parse info and a chunk of the XML
> > > file.
> > >
> > > To show you my point of view: the "user asking for another XML file"
> > > may be a map server that reads some big GML files and draws their
> > > spatial information on a map image. If each time the map server
> > > draws a GML file and "changes" to the next one it takes 2 seconds or
> > > so, drawing the map (all the GML files) takes too much time.
> > >
> > > best regards,
> > > Fernando
> > >
> > > On 3/19/07, Jimmy Zhang <cra...@co...> wrote:
> > > >
> > > > What intrigues me about Fernando's test results is that it only
> > > > takes 300 ms to read a 100 MB file? He's got a super fast disk...
> > > >
> > > > ----- Original Message -----
> > > > *From:* Rodrigo Cunha <rn...@gm...>
> > > > *To:* Jimmy Zhang <cra...@co...>
> > > > *Cc:* Fernando Gonzalez <fer...@gm...>; vtd...@li...
> > > > *Sent:* Sunday, March 18, 2007 8:40 PM
> > > > *Subject:* Re: [Vtd-xml-users] Storing parsing info
> > > >
> > > > In fact the idea occurred to me in the past also... but VTD is so
> > > > fast at reading large files anyway! With a fast processor I think
> > > > we might be disk-limited rather than processor-limited. Still, if
> > > > the code is already written, the option seems cute enough to
> > > > keep :-)
> > > >
> > > > Since I mainly deal with large files requiring a lot of
> > > > processing, this has not been an issue. Others, in different
> > > > environments, might disagree.
> > > >
> > > > Jimmy Zhang wrote:
> > > >
> > > > Fernando, the option of storing the VTD in a separate file is
> > > > open. I attached the technical document from your last email, and
> > > > am also interested in suggestions/comments from the mailing
> > > > list...