From: Jimmy Z. <cra...@co...> - 2007-03-21 17:50:20
Not a problem :-). One reason we are having this discussion is that the indexing feature (and VTD-XML itself) is so new, and we have yet to understand the possibilities and design trade-offs... Yes, I can see why tuning the optimum buffer size can potentially improve performance... but in general, do you see any issue with the XPath evaluation?

----- Original Message -----
From: Fernando Gonzalez
To: vtd...@li...
Sent: Wednesday, March 21, 2007 3:14 AM
Subject: Re: [Vtd-xml-users] Storing parsing info

Sorry Jimmy, I misunderstood you. Please forget the first paragraph of my last mail.

Yes, you're right: when you ask for a byte, it may not be in the loaded chunk... so you have to load another chunk. Of course that is quite a bit slower than the current solution. But I think there must be a buffer size that optimizes performance, so that the solution is only a bit slower without loading more than 20-30 MB into memory. Don't you think so?

Anyway, the proposed changes don't force anyone to use that approach. It's still possible to load the whole XML into memory, so the only side effects of the proposal are that the library is a bit more complex (there are some exception-handling issues, and the user has to provide an implementation of an interface instead of a byte[]) and that the library is a bit slower (because of the added level of indirection).

Fernando

On 3/21/07, Fernando Gonzalez <fer...@gm...> wrote:

"if the chunks don't have what one is looking for, you will have to load in another chunk... then another chunk.."

Not exactly. If something asks for byte number 'x', I work out which chunk that byte is in and load only that chunk. Only if the information asked for by the user is spread over two or more contiguous chunks will it be necessary to load more than one. The implementation can be seen in the "public byte byteAt(int streamIndex)" method of the "org.ChunkByteBuffer" class.

About the alternative you propose.
As I said before, it's not a good solution to remove or archive the original GML. Splitting wouldn't be as bad as removal or archiving, but it would add some complexity for the user, who would have to keep track of the split GML files that form the original GML file. The splitting could be implemented in a way the user doesn't notice... until he tries to access the file with another application.

Indeed, I think that, in the end, I'm doing something similar to splitting, since I logically split the GML file into several GML chunks.

Fernando

On 3/20/07, Jimmy Zhang <cra...@co...> wrote:

Ok, I see... it seems that you can be sure that the "chunks" of GML files contain what the user would need... But in general, if the chunks don't have what one is looking for, you will have to load in another chunk... then another chunk... and that could mean a lot of disk activity.

As an alternative, would it be possible to split the GML into little chunks of well-formed GML files, then index them individually? So instead of dealing with 10 big GML files, split them into 100 smaller GML files, and the algorithm you describe may still work...

----- Original Message -----
From: Fernando Gonzalez
To: vtd...@li...
Sent: Tuesday, March 20, 2007 2:39 AM
Subject: Re: [Vtd-xml-users] Storing parsing info

On 3/20/07, Jimmy Zhang <cra...@co...> wrote:

So what you are trying to accomplish is to load all the GML docs into memory at once... I guess you can simply index all those files to avoid parsing... but I still don't seem to understand the benefit of reading the parse info and a chunk of the XML file...

Quite near. What I need is to access a random feature at any time at as low a cost as possible. That would be possible by loading all the GML docs into memory, but the GML files are very big, so I cannot do it.
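(For illustration only: the chunked random access discussed earlier in this thread - the real code is the "public byte byteAt(int streamIndex)" method of org.ChunkByteBuffer - could be sketched roughly like this. The class and field names below are hypothetical, not the actual implementation.)

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Illustrative sketch: keeps one chunk of a big file in memory at a time. */
public class ChunkedByteSource {
    private final RandomAccessFile file;
    private final int chunkSize;
    private byte[] chunk;          // the single chunk currently loaded
    private long chunkStart = -1;  // stream offset of chunk[0]; -1 = nothing loaded

    public ChunkedByteSource(RandomAccessFile file, int chunkSize) {
        this.file = file;
        this.chunkSize = chunkSize;
    }

    /** Returns the byte at streamIndex, loading only the containing chunk on a miss. */
    public byte byteAt(long streamIndex) throws IOException {
        long start = (streamIndex / chunkSize) * chunkSize; // chunk the byte falls in
        if (start != chunkStart) {                          // miss: load that chunk only
            int len = (int) Math.min(chunkSize, file.length() - start);
            chunk = new byte[len];
            file.seek(start);
            file.readFully(chunk);
            chunkStart = start;
        }
        return chunk[(int) (streamIndex - chunkStart)];
    }
}
```

Only when a request crosses a chunk boundary does a second chunk load become necessary, which is the trade-off being discussed above.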
As that solution wasn't suitable for my problem, I thought of opening one file at a time (using buffer reuse), and then it came to my mind that I could save parsing time by storing the parse info. As I said before, I cannot delete the GML, and storing the GML twice would waste disk space. I'm talking about an environment where the user can have a lot of digital cartography on his computer; disk space is quite a bottleneck. Buffer reuse could be valid, but storing only the parse info was so easy that I did it, and I obtained a better solution (for my environment).

There is a use case where the user doesn't work with the files directly, but with a spatial region. In this case, the GML files and other spatial data are "layers", so the user can work with a lot of files at the same time. These files can be in formats other than GML: satellite images, different raster or vector formats; and these can bring the system to an even more memory-constrained situation. That's what led me to load chunks of the GML file.

The workflow is the following:

* I open a file with the chunk approach
* I parse the file (loading it with the chunk approach takes a while, but that's no problem)
* I store the parse info

The user asks for information:

* I load the parse info
* I load the chunk
* I return the requested information

I want to speed up the information requests because the user can ask for a map image with 20 GML files, and the map code is something like this:

    for each gml file
        guess what "features" are inside the map bounds (the GML is spatially indexed beforehand)
        get those features from the GML (random access: load parse info + load chunk + return info)
        draw the features on an image
    next gml file

Maybe this will make things a bit clearer. This screenshot (http://www.gvsig.gva.es/fileadmin/conselleria/images/Documentacion/capturas/raster_shp_dgn_750.gif) shows a program that uses the library.
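(As an aside: the "guess what features are inside the map bounds" step in the loop above amounts to an axis-aligned bounding-box intersection test against the spatial index. A minimal, purely illustrative sketch - none of these names come from gvSIG or VTD-XML:)

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the spatial pre-filter in the drawing loop. */
public class MapSketch {
    /** Axis-aligned bounding boxes as {minX, minY, maxX, maxY}. */
    static boolean intersects(double[] a, double[] b) {
        return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
    }

    /** Returns the indices of the features whose boxes touch the map bounds. */
    static List<Integer> featuresInBounds(double[][] featureBoxes, double[] mapBounds) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < featureBoxes.length; i++)
            if (intersects(featureBoxes[i], mapBounds))
                hits.add(i);   // only these features need parse info + chunk loads
        return hits;
    }
}
```

The point is that only the features passing this cheap test pay the cost of loading parse info and a chunk.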
You can see on the left all the loaded files (from the user's point of view): four "dgn" files, one "shp" and seven "ecw" files. Many of the operations done on the map are done over *every* file listed on the left, so I don't care how much time it takes to put all those files on the left (generating parse info, etc.). I care how much time it takes to read the information after they are loaded (again, from the user's point of view).

Well, I hope it's clear enough. Notice that I'm not proposing to change the way VTD-XML works; I'm proposing to add new ways.

greetings,
Fernando

----- Original Message -----
From: Fernando Gonzalez
To: vtd...@li...
Sent: Monday, March 19, 2007 2:56 AM
Subject: Re: [Vtd-xml-users] Storing parsing info

Well, heh, the computer is new, but I don't think my disk is that fast. I think Java or the operating system must be caching something, because the first time I load the file it takes a bit more than 2 seconds, and after the first load it only takes 300 ms to read the file...

I have no experience doing benchmarks, and maybe I'm missing something. That's why I attached the program.

"So if you can't delete the orginal XML files, can you compress them and store them away (archiving)?"

I can neither delete nor archive the GML file because, in this context, it won't be rare for it to be read by two different programs at the same time... It's difficult to find an open source program that does everything you need. For example, in a development context, there may be a map server serving a map image based on a GML file while you are opening the same file to look at some data in it.

"The other issue you raised is buffer reuse. To reuse internal buffers of VTDGen, you can call setDoc_BR(...). But there is more you can do... you can in fact reuse the byte array containing the XML document."

Buffer reuse absolutely solves my memory constraints.
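(For reference, the byte-array reuse idea - what setDoc_BR makes possible inside VTD-XML - can be illustrated independently of the library. This sketch, with hypothetical names, just shows feeding many files through one reusable array instead of allocating a fresh byte[] per document:)

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative sketch: one byte[] reused across many XML documents. */
public class ReusableBuffer {
    private byte[] buf = new byte[0];

    /** Loads the file into the internal buffer, reallocating only when it must grow. */
    public int load(Path p) throws IOException {
        int size = (int) Files.size(p);
        if (buf.length < size)
            buf = new byte[size];            // grow once; later files reuse the array
        try (FileInputStream in = new FileInputStream(p.toFile())) {
            int off = 0, n;
            while (off < size && (n = in.read(buf, off, size - off)) > 0)
                off += n;
            return off;                       // bytes actually read
        }
    }

    public byte[] bytes() { return buf; }     // may be longer than the last file
}
```

The returned length matters because the array can be larger than the current document, which is why APIs taking a reused buffer also take an offset/length.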
But the problem I see with buffer reuse is that it will force me to read and parse the whole XML file every time the user asks for information on another XML file, won't it? If I read the XML file in chunks and store/read the parse information, then each time the user asks for information on another XML file I only have to read the parse info and a chunk of the XML file.

To show you my point of view: the "user asking for another XML file" may be a map server that reads some big GML files and draws their spatial information on a map image. If, each time the map server draws a GML file and "changes" to the next, it takes 2 seconds or so, then drawing the map (all the GML files) takes too long.

best regards,
Fernando

On 3/19/07, Jimmy Zhang <cra...@co...> wrote:

What intrigues me about Fernando's test results is that it only takes 300 ms to read a 100 MB file? He's got a super fast disk...

----- Original Message -----
From: Rodrigo Cunha
To: Jimmy Zhang
Cc: Fernando Gonzalez; vtd...@li...
Sent: Sunday, March 18, 2007 8:40 PM
Subject: Re: [Vtd-xml-users] Storing parsing info

In fact the idea occurred to me in the past as well... but VTD is so fast reading large files anyway! With a fast processor I think we might be disk-limited rather than processor-limited. Still, if the code is already written, the option seems cute enough to keep :-)

Since I mainly deal with large files requiring a lot of processing, this has not been an issue for me. Others, in different environments, might disagree.

Jimmy Zhang wrote:

Fernando,

The option for storing VTD in a separate file is open. I attached the technical document from your last email, and am also interested in suggestions/comments from the mailing list...

--------------------------------------------------------------------
Take Surveys. Earn Cash.
Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
--------------------------------------------------------------------
_______________________________________________
Vtd-xml-users mailing list
Vtd...@li...
https://lists.sourceforge.net/lists/listinfo/vtd-xml-users