Thread: [Geotools-gt2-users] Efficiently writing a GeoJSON stream into a Shapefile

[Geotools-gt2-users] Efficiently writing a GeoJSON stream into a Shapefile

From: William V. <wil...@gm...> - 2013-09-05 10:16:27

Dear All,

I've been trying a few solutions to efficiently convert GeoJSON into a
shapefile without having to store all features in memory. I'm using
GeoTools 9.2.

The problem is not so much in how to stream the JSON but how to
efficiently write the features into the shapefile. I use
FeatureJSON#streamFeatureCollection to obtain an iterator. After some
googling, I found 3 different ways of writing a shapefile, namely:

1. Repeatedly calling FeatureStore#addFeatures with a collection
containing say 1000 features, within a transaction.
      -----
      ListFeatureCollection coll = new ListFeatureCollection(type, features);
      Transaction transaction = new DefaultTransaction("create");
      featureStore.setTransaction(transaction);
      try {
        featureStore.addFeatures(coll);
        transaction.commit();
      } catch (IOException e) {
        transaction.rollback();
        throw new IllegalStateException(
            "Could not write some features to shapefile. Aborting process", e);
      } finally {
        transaction.close();
      }
      -----


This option is extremely slow. By profiling a few runs, I noticed that
about 50% of CPU time is spent on the method
ContentFeatureStore#getWriterAppend, presumably in order to reach the
end of the file before each transaction commit.

2. Obtaining an append writer directly from ShapefileDataStore and
writing 1000 features at a time within a transaction.

This option suffers from the same problem as option 1.

3. Obtaining a feature writer from ShapefileDataStore and writing one
feature at a time using Transaction.AUTO_COMMIT.

     -----
     FeatureWriter<SimpleFeatureType, SimpleFeature> writer = shpDataStore
         .getFeatureWriter(shpDataStore.getTypeNames()[0],
             Transaction.AUTO_COMMIT);
     try {
       while (jsonIt.hasNext()) {
         SimpleFeature feature = jsonIt.next();
         // writer.next() hands back the feature to fill in
         SimpleFeature toWrite = writer.next();
         // copy attributes by name
         for (int i = 0; i < toWrite.getType().getAttributeCount(); i++) {
           String name = toWrite.getType().getDescriptor(i).getLocalName();
           toWrite.setAttribute(name, feature.getAttribute(name));
         }
         writer.write();
       }
     } finally {
       // close the writer even if a write fails
       writer.close();
     }
     -----


Option 3 is the fastest, but I feel there should be a way of efficiently
adding a larger number of features at a time to the shapefile within
a transaction. On the other hand, a previous comment on this list
noted:

> The above would work for mid-sized data transfers, for massive ones against
> databases it's better to adopt some sort of batching to avoid having a single
> transaction with one million inserts, e.g., insert 1000, commit the transaction,
> insert another 1000, and so on.
> This would work better against databases and against WFS servers,
> but not against shapefiles, which instead work better with the massive insert...
> to each his own.

Does this mean that the most efficient way of writing to a shapefile
is having all features in memory, rather than being able to append
features?
I would appreciate it if someone could suggest a better way of achieving
this or point to any documentation that would help me.

Best regards,

Will

Re: [Geotools-gt2-users] Efficiently writing a GeoJSON stream into a Shapefile

From: Andrea A. <and...@ge...> - 2013-09-05 10:22:06

On Thu, Sep 5, 2013 at 12:15 PM, William Voorsluys <wil...@gm...> wrote:

> Option 3 is the fastest, but I feel there should be a way of efficiently
> adding a larger number of features at a time to the shapefile within
> a transaction. On the other hand, a previous comment on this list
> noted:
>
> > The above would work for mid-sized data transfers, for massive ones
> > against databases it's better to adopt some sort of batching to avoid
> > having a single transaction with one million inserts, e.g., insert 1000,
> > commit the transaction, insert another 1000, and so on.
> > This would work better against databases and against WFS servers,
> > but not against shapefiles, which instead work better with the massive
> > insert... to each his own.

Indeed, skipping transactions and getting an append writer is the fastest
option.
Shapefiles have no notion of a transaction. It is emulated by writing a
second shapefile and swapping the files on commit, which is why using
transactions there is slow.

Cheers
Andrea

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.

-------------------------------------------------------
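
To put a rough number on why committing in batches is slow when every
commit rewrites the shapefile, as described in the reply above, here is a
toy model. It is a sketch under assumptions, not GeoTools code: it assumes
each commit writes every row already in the file plus the new batch, and
the class and method names are hypothetical.

```java
/** Toy cost model (hypothetical names, no GeoTools calls). */
public class ShapefileWriteCost {

    /**
     * Assumption: an emulated transaction rewrites the whole file on every
     * commit, so each commit costs "rows already stored + rows in the batch".
     */
    public static long rowsWrittenWithCommitPerBatch(int totalRows, int batchSize) {
        long written = 0;
        int rowsInFile = 0;
        while (rowsInFile < totalRows) {
            int batch = Math.min(batchSize, totalRows - rowsInFile);
            rowsInFile += batch;
            // the complete file is written again, then swapped into place
            written += rowsInFile;
        }
        return written;
    }

    /** A single append writer touches each row exactly once. */
    public static long rowsWrittenWithAppend(int totalRows) {
        return totalRows;
    }

    public static void main(String[] args) {
        int total = 100_000;
        int batch = 1_000;
        System.out.println("commit per batch: "
                + rowsWrittenWithCommitPerBatch(total, batch)); // 5050000
        System.out.println("append writer:    "
                + rowsWrittenWithAppend(total));                // 100000
    }
}
```

Under this model, 100,000 features committed in batches of 1000 cost
5,050,000 row writes, against 100,000 for one streaming append writer. The
real ratio depends on file sizes and I/O behaviour, so treat the numbers as
an order-of-magnitude illustration only.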

Re: [Geotools-gt2-users] Efficiently writing a GeoJSON stream into a Shapefile

From: Jody G. <jod...@gm...> - 2013-09-06 03:55:53

Yes.

That code is very similar to how TransactionStateDiff works;
TransactionStateDiff just has to take some extra steps, as it is also used
to update existing content.

The code you copied from the content datastore is used by JDBCDataStore to
allow implementations that support FeatureID to have it passed in as a
user property. Shapefile FeatureIDs are based on "row number", so you do not
need all of that code.

(You can see the docs for details
http://docs.geotools.org/latest/userguide/library/data/featuresource.html#adding-data)

Q: If I put a feature writer example at that location, would you have found
it? Or did you look at the shapefile datastore page first?

Jody


On Fri, Sep 6, 2013 at 1:20 PM, William Voorsluys <wil...@gm...> wrote:

> Thanks Jody,
>
> So, I came up with this code, which gets an append writer and doesn't
> use transaction. Can you confirm if that's what you meant to indicate?
>
>     ...
>     try {
>       writer = shpDataStore.getFeatureWriterAppend(
>           shpDataStore.getTypeNames()[0], null);
>
>       while (jsonIt.hasNext()) {
>
>         SimpleFeature feature = jsonIt.next();
>         addFeature(feature, writer, featureStore);
>       }
>     } finally {
>       if (writer != null) {
>         writer.close();
>       }
>     }
>     ...
>
>   /**
>    * Copied over from {@link ContentFeatureStore} as a way of writing
>    * features directly into a {@link FeatureWriter}
>    */
>   private static FeatureId addFeature(SimpleFeature feature,
>       FeatureWriter<SimpleFeatureType, SimpleFeature> writer,
>       SimpleFeatureStore featureStore) throws IOException {
>
>     SimpleFeature toWrite = writer.next();
>     for (int i = 0; i < toWrite.getType().getAttributeCount(); i++) {
>       String name = toWrite.getType().getDescriptor(i).getLocalName();
>       toWrite.setAttribute(name, feature.getAttribute(name));
>     }
>
>     // copy over the user data
>     if (feature.getUserData().size() > 0) {
>       toWrite.getUserData().putAll(feature.getUserData());
>     }
>
>     // pass through the fid if the user asked so
>     boolean useExisting = Boolean.TRUE.equals(feature.getUserData().get(
>         Hints.USE_PROVIDED_FID));
>     if (featureStore.getQueryCapabilities().isUseProvidedFIDSupported()
>         && useExisting) {
>       ((FeatureIdImpl) toWrite.getIdentifier()).setID(feature.getID());
>     }
>
>     // perform the write
>     writer.write();
>
>     // copy any metadata from the feature that was actually written
>     feature.getUserData().putAll(toWrite.getUserData());
>
>     // add the id to the set of inserted
>     FeatureId id = toWrite.getIdentifier();
>     return id;
>   }
>
> On Fri, Sep 6, 2013 at 12:12 PM, Jody Garnett <jod...@gm...> wrote:
> > It is more that shapefile does not offer a database session, so we are
> > faking it to make the editing story easier for desktop clients.
> >
> > Using AUTO_COMMIT is a terrible idea as it will involve writing out your
> > file many times (i.e. each time you add a feature).
> >
> > I tried to indicate a better way in my email, and in the docs, but it is
> > not coming through.
> >
> >
> > On Fri, Sep 6, 2013 at 11:29 AM, William Voorsluys <wil...@gm...> wrote:
> >>
> >> Hi Jody,
> >>
> >> Did you mean to send this reply to the list?
> >>
> >> It seems clear that transactions cannot be used efficiently on
> >> shapefiles. I'm settling on using AUTO_COMMIT and writing one feature
> >> at a time using a writer. Do you mean there is no more efficient way
> >> of writing features in bulk to the file, say 1000 at a time?
> >> It seems that getting an append writer is the greatest bottleneck
> >> in the operation.
> >>
> >> Will
> >>
> >> On Fri, Sep 6, 2013 at 1:28 AM, Jody Garnett <jod...@gm...> wrote:
> >> > I really need a better way to communicate this one, or a special case
> >> > when the shapefile is empty or something.
> >> >
> >> > The goal here is to use an append feature writer, directly, and write
> >> > the content out as you go in a streaming fashion.
> >> >
> >> > This is what ShapefileDataSource does internally when you call
> >> > transaction.commit(). It goes through the changes that it has
> >> > collected in memory and writes out a new file. It then renames the old
> >> > file out of the way, renames the new file into the correct place, and
> >> > deletes the old file.
>
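
The commit sequence described in the quoted message above (write a new
file, rename the old one out of the way, rename the new one into place,
delete the old one) can be sketched with plain java.nio. This is only an
illustration under assumptions: a single text file stands in for the
shapefile's set of files, and the class name SwapOnCommit, the commit
method and the ".bak" suffix are hypothetical, not taken from GeoTools.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy "write a new file, then swap" commit (hypothetical, not GeoTools). */
public class SwapOnCommit {

    /** Rewrites target with its existing lines plus the pending ones. */
    public static void commit(Path target, List<String> pending) throws IOException {
        List<String> all = new ArrayList<>();
        if (Files.exists(target)) {
            all.addAll(Files.readAllLines(target, StandardCharsets.UTF_8));
        }
        all.addAll(pending);

        // 1. write a complete new file next to the old one
        Path dir = target.toAbsolutePath().getParent();
        Path fresh = Files.createTempFile(dir, "new-", ".tmp");
        Files.write(fresh, all, StandardCharsets.UTF_8);

        // 2. rename the old file out of the way
        Path backup = target.resolveSibling(target.getFileName() + ".bak");
        if (Files.exists(target)) {
            Files.move(target, backup, StandardCopyOption.REPLACE_EXISTING);
        }

        // 3. rename the new file into the correct place
        Files.move(fresh, target);

        // 4. delete the old file
        Files.deleteIfExists(backup);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("swap-demo");
        Path data = dir.resolve("features.txt");
        commit(data, Arrays.asList("a", "b"));
        commit(data, Arrays.asList("c"));
        System.out.println(Files.readAllLines(data)); // prints [a, b, c]
    }
}
```

Note that the second commit reads and rewrites the two lines stored by the
first one. That repeated full rewrite is the cost the thread attributes to
committing in small batches.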

Re: [Geotools-gt2-users] Efficiently writing a GeoJSON stream into a Shapefile

From: William V. <wil...@gm...> - 2013-09-06 04:20:45

On Fri, Sep 6, 2013 at 1:55 PM, Jody Garnett <jod...@gm...> wrote:
> Yes.
>
> That code is very similar to how TransactionStateDiff works, it just has to
> take some extra steps as it is also used to update existing content.
>
> The work you copied from content datastore is used by JDBCDataStore to allow
> implementations that support FeatureID to have that passed in as a user
> property. Shapefile FeatureIDs are based on "row number" so you do not need
> all of that code.
>
> (You can see the docs for details
> http://docs.geotools.org/latest/userguide/library/data/featuresource.html#adding-data
> )
>
> Q: If I put a feature writer example at that location would you have found it?
> Or did you look at the shapefile datastore page first?

I did look at that and other examples, like the CSV2SHP example in the
documentation. IMO, these examples give the impression that
transactions are a good way to approach the problem.

In this somewhat outdated page
(http://docs.codehaus.org/display/GEOTOOLS/Data+Writing) I did find
examples of using FeatureWriter directly. So, an example in the main
documentation would certainly be helpful.

Thanks,

Will

