From: Gabriel R. <gr...@op...> - 2009-05-29 17:20:12
Hey, that's awesome. Looking forward to the patch; this interests me very much, since I have a couple of ideas myself about how we can keep improving the encode/decode process. For instance, the pluggable EncoderDelegate takes me halfway down the path to a pluggable underlying XML I/O API! My heart jumps. On the radar are StAX and BXML as alternatives to JAXP. Tell me when the patch is in place; I'd like a conversation with you about this, and I'm sure you'll be interested.

A couple of questions:
- What would be needed for the improvements to work with complex features?
- Do you think a similar approach can be applied to parsing? (StreamingParser being the main target.)

The reason being (hello, gt-wfs co-maintainer) that I would very much like to replace the legacy GML parser used in gt-wfs 1.1 with StreamingParser, but so far SP needs some love. As you know, the parser in gt-wfs 1.1 is already pluggable, and IMHO the steps needed to make it rock are:
- chase and fix the memory leak;
- the ability to use a StAX-like parsing API. This will increase performance a great deal because StAX, being a pull API, allows you to stop processing and skip content at your leisure. But I want an abstraction over the underlying XML I/O API that's ready for direct primitive type handling, as an extension to the ContentHandler.characters method. This will also allow for an improved, truly streaming geometry encoding.

Sorry for going off-topic, I just got excited about this. Good work, and tell me if you'd like to pair up to discuss these ideas.

Cheers,
Gabriel

Justin Deoliveira wrote:
> Hi all,
>
> A while back I implemented some hacks on the xsd/xml encoder to
> improve GML encoding performance. I finally got around to benchmarking.
> Here are the results. What I actually did is described afterward.
>
> Test 1: 100,000 multi polygons
> ------------------------------
>
> The polygons are fairly big, with lots of points. Basically the
> topp:states layer duplicated ~ 2000 times.
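To illustrate the kind of StAX skipping mentioned above, here is a minimal sketch using plain `javax.xml.stream` (nothing GeoTools-specific; the class and method names are made up for the example). The point is that a pull API lets you ignore everything until the element you care about, then stop:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxSkipDemo {

    // Pull-parse and return the text of the first gml:coordinates element,
    // skipping all other content; no tree is built and parsing stops early.
    public static String firstCoordinates(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "coordinates".equals(r.getLocalName())) {
                    // We have what we need; the rest of the document is never touched.
                    return r.getElementText().trim();
                }
            }
            return null;
        } catch (XMLStreamException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String xml = "<gml:Point xmlns:gml=\"http://www.opengis.net/gml\">"
                + "<gml:coordinates>1.0,2.0</gml:coordinates></gml:Point>";
        System.out.println(firstCoordinates(xml)); // prints 1.0,2.0
    }
}
```

Contrast with SAX, where the ContentHandler is pushed every event whether it wants it or not.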
>
> First step was the baseline, using FeatureTransformer:
>
> * GML2 Transformer: 540 M, 4.4 M/s, 124 s
>
> - The first number is the total amount of data encoded.
> - The second is the average encoding rate.
> - The third is the total encoding time.
>
> Next step was using the encoder as is, no optimizations:
>
> * GML2 Normal: 528 M, 2.4 M/s, 255 s
>
> Hmmm... twice as slow.
>
> And finally with the optimizations:
>
> * GML2 Optimized: 528 M, 4.3 M/s, 126 s
>
> Much better. Still a bit slower, but not by much.
>
> The last test I did was GML3 with the optimizations, with similar results:
>
> * GML3 Optimized: 518 M, 4.2 M/s, 126 s
>
> Test 2: 500,000 line strings
> ----------------------------
>
> The second test was encoding 500,000 line strings from TIGER, so not
> many coordinates, just two-point line strings. The numbers:
>
> * GML2 Transformer: 466 M, 8.5 M/s, 56 s
> * GML2 Normal: 365 M, 1.1 M/s, 345 s
> * GML2 Optimized: 391 M, 6.2 M/s, 64 s
> * GML3 Optimized: 379 M, 5.4 M/s, 72 s
>
> Yikes, the non-optimized encoder is almost 7 times as slow. The
> optimized encoder is still slower, but again not by much.
>
> So, all in all, good results with the optimizations. The two encoders are
> now comparable for GML. I also ran the optimizations through the WFS
> CITE tests to ensure that with the optimizations the GML being produced
> is still "correct".
>
> What I did
> ----------
>
> * A custom FeatureEncoderDelegate for feature collections
>
> A while back I came up with an interface, EncoderDelegate. The original
> purpose of this interface was to allow other XML encoders to be embedded in
> the encoder. When the main encoding routine encounters one of these
> objects, it fully delegates all encoding to it, rather than continuing on
> with the stack-based, schema-assisted encoding.
>
> So my idea for optimization was to make one of these implementations for
> FeatureCollections.
> This would totally remove the walking up and down
> the encoding stack that the encoder does for each feature that is encoded.
>
> The problem is that the walking up and down the stack is what looks up
> the bindings based on type, uses the correct binding to encode
> attributes, etc. So what I did was basically simulate this inside the
> encoder delegate. It grabs the feature type, figures out which
> bindings would be used to encode each attribute, and rolls them into a
> list. Then for each feature it looks up the binding directly and encodes.
>
> * A custom EncoderDelegate for geometries
>
> The above gave quite a speed-up, but not exactly what I was hoping for.
> Initial benchmarks still came back about twice as slow. A bit of
> profiling pointed to the geometry encoding bindings. The above strategy
> of rolling the bindings into a list only works for simple content;
> geometries still go through the main encoding routine.
>
> So the next step was to break out EncoderDelegates for geometries as
> well, and have them used directly. And it helped. After this the numbers
> were closer, with the optimized encoder coming back just a bit slower.
>
> * Respecting the number of decimals
>
> Analyzing the above results, I noticed that the optimized xsd encoder was
> delivering substantially more data than the transformer. Which puzzled
> me, since based on my optimizations it should actually be producing less.
> After analyzing data from both, the answer was clear: the number of
> decimals being encoded.
>
> GML from the xsd encoder was not respecting a limited number of decimals
> at all, which resulted in quite a bit more data being encoded than necessary.
>
> Cutting off the decimals brought the amount of data coming back down
> considerably, at the cost of a small increase in total time, giving
> final results that are quite close in the polygon case (lots of
> coordinates).
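The decimal capping described above can be sketched with `java.text.DecimalFormat`. This is an illustration only, not the actual patch; the class and method names here are hypothetical:

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class CoordFormatDemo {

    // Format an ordinate with at most maxDecimals decimal digits, in the
    // spirit of a numDecimals setting. Trailing zeros are dropped, so
    // fewer digits per ordinate translates directly into less GML.
    static String format(double ordinate, int maxDecimals) {
        StringBuilder pattern = new StringBuilder("0");
        if (maxDecimals > 0) {
            pattern.append('.');
            for (int i = 0; i < maxDecimals; i++) {
                pattern.append('#'); // '#' = optional digit, no trailing zeros
            }
        }
        // Fix the locale so the decimal separator is always '.' as GML expects.
        DecimalFormat fmt = new DecimalFormat(pattern.toString(),
                DecimalFormatSymbols.getInstance(Locale.US));
        return fmt.format(ordinate);
    }

    public static void main(String[] args) {
        System.out.println(format(-122.41941673487, 4)); // prints -122.4194
        System.out.println(format(37.0, 4));             // prints 37
    }
}
```

In a real encoder one would build the DecimalFormat once and reuse it per coordinate sequence, since constructing it per ordinate would eat the savings.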
>
> Things to note
> --------------
>
> * This only works for simple feature data (sorry, Ben)
> * These speeds are only for GML, not for general encoding
> * The optimizations are engaged by explicitly setting a property, so if
>   you don't ask for them you won't get them
>
> I have a bit of cleanup to do with the patches, but I plan to commit soon.
>
> -Justin

-- 
Gabriel Roldan
OpenGeo - http://opengeo.org
Expert service straight from the developers.