From: Jimmy Z. <cra...@co...> - 2007-03-21 17:50:20
Not a problem :-). One reason we are having this discussion is that the indexing feature (and VTD-XML itself) is so new, and we have yet to understand the possibilities and design trade-offs... Yes, I can see why tuning the optimum buffer size can potentially improve performance... but in general, do you see any issue with the XPath evaluation?

----- Original Message -----
From: Fernando Gonzalez
To: vtd...@li...
Sent: Wednesday, March 21, 2007 3:14 AM
Subject: Re: [Vtd-xml-users] Storing parsing info

Sorry Jimmy, I misunderstood you. Please forget the first paragraph of my last mail.

Yes, you're right: when you ask for a byte, it may not be in the loaded chunk... so you have to load another chunk. Of course that is quite a bit slower than the current solution. But I think there must be a buffer size that optimizes performance, so that the solution is only a bit slower without loading more than 20-30 MB into memory. Don't you think so?

Anyway, the proposed changes don't force anyone to use that approach. It's still possible to load the whole XML into memory, so the only side effects of the proposal are that the library is a bit more complex (there are some exception-handling issues, and the user has to provide an implementation of an interface instead of a byte[]) and that the library is a bit slower (because of the added level of indirection).

Fernando

On 3/21/07, Fernando Gonzalez <fer...@gm...> wrote:

"if the chunks don't have what one is looking for, you will have to load in another chunk... then another chunk.."

Not exactly. If something asks for byte number 'x', I work out which chunk that byte is in and load only that chunk. Only if the information asked for by the user is spread over two or more contiguous chunks will it be necessary to load more than one. The implementation can be seen in the "public byte byteAt(int streamIndex)" method of the "org.ChunkByteBuffer" class.

About the alternative you propose.
As I said before, it's not a good solution to remove or archive the original GML. Splitting wouldn't be as bad as removal or archiving, but it would add some complexity for the user, who would have to keep track of the split GML files that form the original GML file. The splitting could be implemented in a way the user doesn't notice... until he tries to access the file with another application.

Indeed, I think that, in the end, I'm doing something similar to splitting, since I logically split the GML file into several GML chunks.

Fernando

On 3/20/07, Jimmy Zhang <cra...@co...> wrote:

Ok, I see... it seems that you can be sure that the "chunks" of GML files contain what the user would need... But in general, if the chunks don't have what one is looking for, you will have to load in another chunk... then another chunk... and that could mean a lot of disk activity.

As an alternative, would it be possible to split the GML into little chunks of well-formed GML files, then index them individually? So instead of dealing with 10 big GML files, split them into 100 smaller GML files, and the algorithm you describe may still work...

----- Original Message -----
From: Fernando Gonzalez
To: vtd...@li...
Sent: Tuesday, March 20, 2007 2:39 AM
Subject: Re: [Vtd-xml-users] Storing parsing info

On 3/20/07, Jimmy Zhang <cra...@co...> wrote:

So what you are trying to accomplish is to load all the GML docs into memory at once... I guess you can simply index all those files to avoid parsing... but I still don't seem to understand the benefit of reading the parse info and a chunk of the XML file...

Quite near. What I need is to access a random feature at any time at as low a cost as possible. That would be possible by loading all the GML docs into memory, but the GML files are very big, so I cannot do it.
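(For illustration only: the chunked random access discussed earlier in this thread - the real code is the "public byte byteAt(int streamIndex)" method of org.ChunkByteBuffer - could be sketched roughly like this. The class and field names below are hypothetical, not the actual implementation.)

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Illustrative sketch: keeps one chunk of a big file in memory at a time. */
public class ChunkedByteSource {
    private final RandomAccessFile file;
    private final int chunkSize;
    private byte[] chunk;          // the single chunk currently loaded
    private long chunkStart = -1;  // stream offset of chunk[0]; -1 = nothing loaded

    public ChunkedByteSource(RandomAccessFile file, int chunkSize) {
        this.file = file;
        this.chunkSize = chunkSize;
    }

    /** Returns the byte at streamIndex, loading only the containing chunk on a miss. */
    public byte byteAt(long streamIndex) throws IOException {
        long start = (streamIndex / chunkSize) * chunkSize; // chunk the byte falls in
        if (start != chunkStart) {                          // miss: load that chunk only
            int len = (int) Math.min(chunkSize, file.length() - start);
            chunk = new byte[len];
            file.seek(start);
            file.readFully(chunk);
            chunkStart = start;
        }
        return chunk[(int) (streamIndex - chunkStart)];
    }
}
```

Only when a request crosses a chunk boundary does a second chunk load become necessary, which is the trade-off being discussed above.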
As that solution wasn't suitable for my problem, I thought of opening one file at a time (using buffer reuse), and then it came to my mind that I could save parsing time by storing the parse info. As I said before, I cannot delete the GML, and storing the GML twice would waste disk space. I'm talking about an environment where the user can have a lot of digital cartography on his computer; disk space is quite a bottleneck. Buffer reuse could be valid, but storing only the parse info was so easy that I did it, and I obtained a better solution (for my environment).

There is a use case where the user doesn't work with the files directly, but with a spatial region. In this case, the GML files and other spatial data are "layers", so the user can work with a lot of files at the same time. These files can be in formats other than GML: satellite images, different raster or vector formats; and these can bring the system to an even more memory-constrained situation. That's what led me to load chunks of the GML file.

The workflow is the following:

* I open a file with the chunk approach
* I parse the file (loading it with the chunk approach takes a while, but that's no problem)
* I store the parse info

The user asks for information:

* I load the parse info
* I load the chunk
* I return the requested information

I want to speed up the information requests because the user can ask for a map image with 20 GML files, and the map code is something like this:

    for each gml file
        guess what "features" are inside the map bounds (the GML is spatially indexed beforehand)
        get those features from the GML (random access: load parse info + load chunk + return info)
        draw the features on an image
    next gml file

Maybe this will make things a bit clearer. This screenshot (http://www.gvsig.gva.es/fileadmin/conselleria/images/Documentacion/capturas/raster_shp_dgn_750.gif) shows a program that uses the library.
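(As an aside: the "guess what features are inside the map bounds" step in the loop above amounts to an axis-aligned bounding-box intersection test against the spatial index. A minimal, purely illustrative sketch - none of these names come from gvSIG or VTD-XML:)

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the spatial pre-filter in the drawing loop. */
public class MapSketch {
    /** Axis-aligned bounding boxes as {minX, minY, maxX, maxY}. */
    static boolean intersects(double[] a, double[] b) {
        return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
    }

    /** Returns the indices of the features whose boxes touch the map bounds. */
    static List<Integer> featuresInBounds(double[][] featureBoxes, double[] mapBounds) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < featureBoxes.length; i++)
            if (intersects(featureBoxes[i], mapBounds))
                hits.add(i);   // only these features need parse info + chunk loads
        return hits;
    }
}
```

The point is that only the features passing this cheap test pay the cost of loading parse info and a chunk.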
You can see on the left all the loaded files (from the user's point of view): four "dgn" files, one "shp" and seven "ecw" files. Many of the operations done on the map are done over *every* file listed on the left, so I don't care how much time it takes to put all those files on the left (generating parse info, etc.). I care how much time it takes to read the information after they are loaded (again, from the user's point of view).

Well, I hope it's clear enough. Notice that I'm not proposing to change the way VTD-XML works; I'm proposing to add new ways.

greetings,
Fernando

----- Original Message -----
From: Fernando Gonzalez
To: vtd...@li...
Sent: Monday, March 19, 2007 2:56 AM
Subject: Re: [Vtd-xml-users] Storing parsing info

Well, heh, the computer is new, but I don't think my disk is that fast. I think Java or the operating system must be caching something, because the first time I load the file it takes a bit more than 2 seconds, and after the first load it only takes 300 ms to read the file...

I have no experience doing benchmarks, and maybe I'm missing something. That's why I attached the program.

"So if you can't delete the orginal XML files, can you compress them and store them away (archiving)?"

I can neither delete nor archive the GML file because, in this context, it won't be rare for it to be read by two different programs at the same time... It's difficult to find an open source program that does everything you need. For example, in a development context, there may be a map server serving a map image based on a GML file while you are opening the same file to look at some data in it.

"The other issue you raised is buffer reuse. To reuse internal buffers of VTDGen, you can call setDoc_BR(...). But there is more you can do... you can in fact reuse the byte array containing the XML document."

Buffer reuse absolutely solves my memory constraints.
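(For reference, the byte-array reuse idea - what setDoc_BR makes possible inside VTD-XML - can be illustrated independently of the library. This sketch, with hypothetical names, just shows feeding many files through one reusable array instead of allocating a fresh byte[] per document:)

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative sketch: one byte[] reused across many XML documents. */
public class ReusableBuffer {
    private byte[] buf = new byte[0];

    /** Loads the file into the internal buffer, reallocating only when it must grow. */
    public int load(Path p) throws IOException {
        int size = (int) Files.size(p);
        if (buf.length < size)
            buf = new byte[size];            // grow once; later files reuse the array
        try (FileInputStream in = new FileInputStream(p.toFile())) {
            int off = 0, n;
            while (off < size && (n = in.read(buf, off, size - off)) > 0)
                off += n;
            return off;                       // bytes actually read
        }
    }

    public byte[] bytes() { return buf; }     // may be longer than the last file
}
```

The returned length matters because the array can be larger than the current document, which is why APIs taking a reused buffer also take an offset/length.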
But the problem I see with buffer reuse is that it will force me to read and parse the whole XML file every time the user asks for information on another XML file, won't it? If I read the XML file in chunks and store/read the parse information, then each time the user asks for information on another XML file I only have to read the parse info and a chunk of the XML file.

To show you my point of view: the "user asking for another XML file" may be a map server that reads some big GML files and draws their spatial information on a map image. If, each time the map server draws a GML file and "changes" to the next, it takes 2 seconds or so, then drawing the map (all the GML files) takes too long.

best regards,
Fernando

On 3/19/07, Jimmy Zhang <cra...@co...> wrote:

What intrigues me about Fernando's test results is that it only takes 300 ms to read a 100 MB file? He's got a super fast disk...

----- Original Message -----
From: Rodrigo Cunha
To: Jimmy Zhang
Cc: Fernando Gonzalez; vtd...@li...
Sent: Sunday, March 18, 2007 8:40 PM
Subject: Re: [Vtd-xml-users] Storing parsing info

In fact the idea occurred to me in the past as well... but VTD is so fast reading large files anyway! With a fast processor I think we might be disk-limited rather than processor-limited. Still, if the code is already written, the option seems cute enough to keep :-)

Since I mainly deal with large files requiring a lot of processing, this has not been an issue for me. Others, in different environments, might disagree.

Jimmy Zhang wrote:

Fernando,

The option for storing VTD in a separate file is open. I attached the technical document from your last email, and am also interested in suggestions/comments from the mailing list...

--------------------------------------------------------------------
Take Surveys. Earn Cash.
Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
--------------------------------------------------------------------
_______________________________________________
Vtd-xml-users mailing list
Vtd...@li...
https://lists.sourceforge.net/lists/listinfo/vtd-xml-users