Re: [JPP-Devel] FeatureCache - Request for suggestions...

Michael,

You wrote: "I don't know how you want to use the featureCache. As I imagined
it until now, it was just used to keep a part of your features in memory,
most of the data staying in the original file. But another solution
(yours ?) is to entirely copy the original data file in your own
independent format and to make direct access to this last one. I'll
consider it attentively."

This is correct.

You wrote: "At the moment, I have only one FeatureOnDemand class, so I need
an id and an address, but if you keep 2 objects in memory, FeatureOnDemand
and FeatureCache, I suppose it is about the same (you will need an
identifier, a reference to the FeatureCache, and a file address in your
FeatureCache to make direct access to your data)"

The FeatureCache is an object that implements the FeatureCollection
interface. It also manages the movement of features to and from storage on
disk. For example, in my system the FeatureCache will have an in-memory
buffer and a simple spatial index. Many FeatureOnDemand objects can be
associated with a single FeatureCache. I would imagine in most cases you
would wrap a FeatureCache with a Layer object, as you do with a normal
FeatureCollection.
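
To make the buffer idea concrete, here is a minimal sketch in plain Java. It
is only an illustration under my own assumptions: the class name
FeatureBuffer is hypothetical, the "disk" is a HashMap standing in for real
permanent storage, and eviction simply drops the oldest entry.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded buffer: keeps at most maxSize features in memory. */
final class FeatureBuffer<F> {
    // Stand-in for permanent storage; a real cache would write to a file.
    private final Map<Long, F> disk = new HashMap<>();
    private final LinkedHashMap<Long, F> buffer;
    private int diskLoads = 0;

    FeatureBuffer(final int maxSize) {
        // Insertion-ordered map; removeEldestEntry runs after every put.
        this.buffer = new LinkedHashMap<Long, F>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, F> eldest) {
                if (size() > maxSize) {
                    // Store the oldest feature back before dropping it.
                    disk.put(eldest.getKey(), eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    void put(long id, F feature) {
        buffer.put(id, feature);
    }

    /** Returns the feature, reloading it from "disk" if it was evicted. */
    F get(long id) {
        F feature = buffer.get(id);
        if (feature == null && disk.containsKey(id)) {
            feature = disk.get(id);
            diskLoads++;
            buffer.put(id, feature); // may evict the oldest buffered feature
        }
        return feature;
    }

    boolean isBuffered(long id) {
        return buffer.containsKey(id);
    }

    int diskLoads() {
        return diskLoads;
    }
}
```

With a maximum size of 2, putting three features leaves the first one only
in the backing store until it is requested again.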

In this system I only need to store 2 values in my FeatureOnDemand objects.
One is a long that uniquely identifies the Feature from other Features
stored in the Cache, the other is an int that identifies the FeatureCache
itself.
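
A rough sketch of what I mean, in plain Java. The names AttributeSource and
CacheRegistry are hypothetical and only there so the example stands on its
own; only the two stored values (a long and an int) come from the
description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the cache; only attribute lookup is sketched. */
interface AttributeSource {
    Object getAttribute(long featureId, String name);
}

/** Hypothetical registry that maps an int cache id to a cache instance. */
final class CacheRegistry {
    private static final Map<Integer, AttributeSource> CACHES = new HashMap<>();

    private CacheRegistry() { }

    static void register(int cacheId, AttributeSource cache) {
        CACHES.put(cacheId, cache);
    }

    static AttributeSource lookup(int cacheId) {
        AttributeSource cache = CACHES.get(cacheId);
        if (cache == null) {
            throw new IllegalStateException("Unknown cache id: " + cacheId);
        }
        return cache;
    }
}

/** Light-weight handle: one long and one int per feature, nothing else. */
final class FeatureOnDemand {
    private final long featureId; // unique within its cache
    private final int cacheId;    // identifies the owning cache

    FeatureOnDemand(long featureId, int cacheId) {
        this.featureId = featureId;
        this.cacheId = cacheId;
    }

    long getFeatureId() {
        return featureId;
    }

    int getCacheId() {
        return cacheId;
    }

    /** Forwards the call to the owning cache instead of holding data. */
    Object getAttribute(String name) {
        return CacheRegistry.lookup(cacheId).getAttribute(featureId, name);
    }
}
```

Storing an int id rather than an object reference is a choice made for this
sketch; a direct reference to the cache would work too, at the cost of a
few more bytes per handle on most JVMs.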

In reality this system will require another object, something like a
FeatureCacheManager. This object will allow the user and other developers to
manipulate FeatureCaches and configure their behavior.

You wrote: "Why do you think we need to put every feature into memory ? If
so, we'll never be able to load a huge data file. Until someone explains to
me why we have to keep all the features in memory, I want to think we have
not :-) If we keep only references and bounding boxes in memory, it can save
disk access operations for all the features whose bb does not intersect
the OJ window (which is very very interesting when you read a large
dataset but only need to zoom on a small part of the dataset)."

This is a problem with the design of OpenJUMP. JUMP was designed to
manipulate features in memory, and I don't think its designers ever planned
on using Features read from permanent storage on demand. The
FeatureCollection interface defines a getFeatures() method which returns a
List of Features. You can't return a list of Features that aren't in memory
because the List interface extends the Collection interface. The Collection
interface defines a toArray() method, which, once again, requires all
features to be in memory. I found all of the methods that require this
using Eclipse, and if I remember correctly there were over 20 places where
this facet of the FeatureCollection interface was required in OpenJUMP's
code. I figured it would be easier to design a FeatureCache system than it
would be to refactor these portions of the code to use an Iterator instead
of an array or List implementation.

If you can figure out a way to override the toArray() method of the
Collection interface then we might not have to use a FeatureCache at all.
But to my knowledge this can't be done. That is why I designed my system to
use a light-weight "in-memory" representation of the heavy-weight Feature.
When I need to return a List or array of all the Features in a
FeatureCollection backed by a FeatureCache I fill it with these light-weight
Features.
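
Roughly, and only as a sketch with hypothetical names (LightFeature and
HandleOnlyCache are not existing OpenJUMP classes), the idea is that the
returned List is fully in memory but each element is just a small handle:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical light-weight stand-in for a Feature: two primitives only. */
final class LightFeature {
    final long featureId;
    final int cacheId;

    LightFeature(long featureId, int cacheId) {
        this.featureId = featureId;
        this.cacheId = cacheId;
    }
}

/** Hypothetical cache that only knows the ids of the features it stores. */
final class HandleOnlyCache {
    private final int cacheId;
    private final long[] storedIds;

    HandleOnlyCache(int cacheId, long... storedIds) {
        this.cacheId = cacheId;
        this.storedIds = storedIds.clone();
    }

    /** Builds a complete List, so toArray() works, without loading any data. */
    List<LightFeature> getFeatures() {
        List<LightFeature> result = new ArrayList<>(storedIds.length);
        for (long id : storedIds) {
            result.add(new LightFeature(id, cacheId));
        }
        return result;
    }
}
```

Code that calls toArray() on the result gets an array of handles, so the
memory cost grows with the number of features but not with the size of
their geometries or attributes.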

I hope this makes sense.

You wrote: "If we keep only references and bounding boxes in memory, it can
save disk access operations for all the features whose bb does not intersect
the OJ window (which is very very interesting when you read a large
dataset but only need to zoom on a small part of the dataset)."

That is why the FeatureCache will include a simple spatial index.
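
As a sketch of how simple that index could be (my assumption here is a
plain linear scan over in-memory bounding boxes; BoundingBox and
SimpleSpatialIndex are hypothetical names, and in OpenJUMP the JTS Envelope
class would be the natural choice instead of BoundingBox):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Axis-aligned bounding box stored as 4 doubles. */
final class BoundingBox {
    final double minX, minY, maxX, maxY;

    BoundingBox(double minX, double minY, double maxX, double maxY) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    /** True when the two boxes overlap or touch. */
    boolean intersects(BoundingBox other) {
        return minX <= other.maxX && other.minX <= maxX
            && minY <= other.maxY && other.minY <= maxY;
    }
}

/** Hypothetical index: feature id to bounding box, scanned linearly. */
final class SimpleSpatialIndex {
    private final Map<Long, BoundingBox> boxes = new LinkedHashMap<>();

    void insert(long featureId, BoundingBox box) {
        boxes.put(featureId, box);
    }

    /** Returns ids of features whose box intersects the view window. */
    List<Long> query(BoundingBox window) {
        List<Long> hits = new ArrayList<>();
        for (Map.Entry<Long, BoundingBox> entry : boxes.entrySet()) {
            if (entry.getValue().intersects(window)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Only the ids returned by query() would need to be read from disk for
rendering; everything outside the window is skipped without any disk
access.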

You wrote: "I also need to have more thought about data access. WKT is
interesting because it is human readable, but as soon as performance is
concerned, WKB offers a big advantage. As I said in the previous mail, I
don't know if serialization is a good solution from the performance point
of view, but I'm not sure it will save you much work as JTS has a WKB
reader/writer which is simple to use."

I think you are right. I will work the next week or two on a parser for
features stored in a binary format. The FeatureCache will use this parser
for Features and the WKB reader for Geometries. I've developed a rough
outline for a binary format for Feature storage here:

http://thejumppilotproject.pbwiki.com/OpenJUMP-Binary-Feature-Format

Perhaps you can take a look and tell me what you think. I will release all
of my code for the FeatureCache and the binary Feature parser through the
SurveyOS SourceForge Project.
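
To give an idea of the general shape, here is a small round-trip sketch
using only java.io. The record layout (id, attribute count, UTF strings,
then a length-prefixed geometry blob) is purely an assumption for
illustration and is NOT the format described on the wiki page. The class
name FeatureRecord is hypothetical, and the geometry bytes are treated as an
opaque array that would in practice come from the JTS WKB writer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical binary record: id, string attributes, opaque WKB bytes. */
final class FeatureRecord {
    final long id;
    final String[] attributes;
    final byte[] geometryWkb;

    FeatureRecord(long id, String[] attributes, byte[] geometryWkb) {
        this.id = id;
        this.attributes = attributes.clone();
        this.geometryWkb = geometryWkb.clone();
    }

    /** Writes the record using the assumed layout. */
    byte[] toBytes() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(id);
            out.writeInt(attributes.length);
            for (String attribute : attributes) {
                out.writeUTF(attribute);
            }
            out.writeInt(geometryWkb.length); // length prefix for the blob
            out.write(geometryWkb);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads a record back in the same field order it was written. */
    static FeatureRecord fromBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            long id = in.readLong();
            String[] attributes = new String[in.readInt()];
            for (int i = 0; i < attributes.length; i++) {
                attributes[i] = in.readUTF();
            }
            byte[] wkb = new byte[in.readInt()];
            in.readFully(wkb);
            return new FeatureRecord(id, attributes, wkb);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real format would also need typed attribute values and some kind of
version marker, which this sketch leaves out.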

The Sunburned Surveyor

P.S. - Perhaps Erwan will have some comments for us on this as well.

On 3/30/07, Michaël Michaud <mic...@fr...> wrote:
>
> Sunburned Surveyor,
>
> > I'm glad you think so. I like the term "FeatureOnDemand". Do you mind
> > if I use it as the name of the light-weight feature class?
>
> I guess I read this term in Agile's code. I don't mind who uses it and
> hope Alvaro Zabala, the original developer of Agile, doesn't mind either.
>
> >  The FeatureCache will be writable as well. The advantage over the
> > scalable shapefile driver that is used by Agile, UDig and (maybe)
> > Kosmo is that we'll be able to use the FeatureCache with any data
> > source that can provide Features. For example, after I get the
> > FeatureCache working with GeoTools Shapefile drivers I want to get it
> > working for AutoDesk's DXF format as well. The other benefit is that
> > we can support storage of data not currently supported in the ESRI
> > Shapefile format if we choose to do so in the future.
>
> I don't know how you want to use the featureCache. As I imagined it
> until now, it was just used to keep a part of your features in memory,
> most of the data staying in the original file. But another solution
> (yours ?) is to entirely copy the original data file in your own
> independent format and to make direct access to this last one. I'll
> consider it attentively.
>
> > Almost, but not quite. I was only going to store a numeric identifier
> > for the Feature, like a serial number, which I would probably store in
> > an integer or a long. The only other item I would store is perhaps a
> > string with the name of the FeatureCache containing the Feature. I
> > think this is about as light weight as you can get.
>
> At the moment, I have only one FeatureOnDemand class, so I need an id
> and an address, but if you keep 2 objects in memory, FeatureOnDemand
> and FeatureCache, I suppose it is about the same (you will need an
> identifier, a reference to the FeatureCache, and a file address in your
> FeatureCache to make direct access to your data)
>
> > You wrote: "but imo the bounding box has also to be in-memory for
> > performance reasons (just wonder if it is worth trying to store the bb
> > in a structure smaller than 4 doubles)"
> > I didn't think about this. Could you please tell me why you think it
> > will be important to keep the bounding box of the feature in memory?
> > Is this for rendering purposes? Remember we will need to put every
> > feature into memory for rendering anyways, so I don't know if this
> > will save us anything. Unless the bounding box is used for another
> > frequent operation.
>
> Why do you think we need to put every feature into memory ? If so, we'll
> never be able to load a huge data file. Until someone explains to me why
> we have to keep all the features in memory, I want to think we have not :-)
> If we keep only references and bounding boxes in memory, it can save
> disk access operations for all the features whose bb does not intersect
> the OJ window (which is very very interesting when you read a large
> dataset but only need to zoom on a small part of the dataset).
>
> > After looking at your tests of JTS reading in WKT and WKB formats I
> > can see that using text as the storage format really isn't a good
> > option. The binary storage format is so much faster!
> > I'll have to give this problem a lot more thought. Perhaps I can get a
> > temporary FeatureCache system running with Java's standard object
> > serialization, and work on the custom binary format after that.
> > I'll have to take a look at WKB format. Maybe we can base a binary
> > format for Feature attribute values on a similar system.
>
> I also need to have more thought about data access. WKT is interesting
> because it is human readable, but as soon as performance is concerned,
> WKB offers a big advantage. As I said in the previous mail, I don't know
> if serialization is a good solution from the performance point of view,
> but I'm not sure it will save you much work as JTS has a WKB
> reader/writer which is simple to use.
>
> > Thanks again for your comments. They were very helpful.
> > Thanks to Erwan as well.
>
> Thanks
>
> Michaël
>
> >
> > The Sunburned Surveyor
> >
> >
> >
> >
> > On 3/29/07, *Michaël Michaud* <mic...@fr...
> > <mailto:mic...@fr...>> wrote:
> >
> >     Hi sunburned,
> >
> >     I think that a light-weight feature class or FeatureOnDemand is a
> >     good solution, as well as a FeatureCache.
> >     I already tested Agile's scalable shapefile driver, and I'm
> >     currently implementing something similar for the GeoConcept format
> >     (a commercial GIS).
> >     It can save a lot of memory (but as you guess, is not very good for
> >     performance unless we find very well designed solutions).
> >     I've not yet seen how Kosmo implemented their scalable shapefile
> >     driver, but I'll have to, because it is not only scalable, it is
> >     also writable!
> >     Some questions are:
> >     - What must the in-memory representation of the light-weight
> >       feature include? The minimum is an identifier and a file address
> >       for disk access (unless you store data in a database), but imo
> >       the bounding box has also to be in-memory for performance reasons
> >       (just wonder if it is worth trying to store the bb in a
> >       structure smaller than 4 doubles).
> >     - Another question you ask is about data format. The Sigle project
> >       is exploring GML format storage for direct access. I think you
> >       can also keep the data in the original file format (this is the
> >       way the scalable shapefile works, and the way I am exploring with
> >       the GeoConcept format). But storing data in JUMP's own format may
> >       be useful to solve performance issues, or to solve the data
> >       access problem in a more independent way.
> >     For this issue, I made some tests to compare WKB and WKT reading
> >     (and also writing). Sorry, I did not test serialization which, I
> >     think, is not very performant. Here are my results with JTS 1.8
> >     (every test made on my personal laptop computer):
> >
> >     Test                                                       Size (bytes)  Time (sec)
> >     Reading 100 complex WKT polygons (about 7000 points each)      26590267      15.073
> >     Reading 1 000 000 WKT points sequentially                      64489511      47.874
> >     Reading 100 complex WKB polygons (about 7000 points each)      26590267       1.313
> >     Reading 1 000 000 WKB points sequentially                      64489511       2.542
> >
> >     Some more tests for database access (binary geometry)
> >     postgreSQL, sequential access :    10 000 pts 0.3 sec
> >     postgreSQL, random access :       10 000 pts 7 sec
> >     H2, sequential or random access : 10 000 pts 0.4 sec
> >
> >     Michaël
> >
> >     Sunburned Surveyor wrote:
> >
> >     > I've been working at home for the past couple of weeks on a
> >     > solution to the problem of working with very large datasets in
> >     > OpenJUMP. (For those of you that don't know, OpenJUMP reads all
> >     > features from a data source into memory. This isn't a problem
> >     > until you start working with some very large datasets. For
> >     > example, OpenJUMP runs out of memory before it can open the
> >     > shapefile with all of the parcels in my county. The size of the
> >     > data source OpenJUMP can work with is limited by the RAM of the
> >     > computer OpenJUMP is running on.) I'd like to give a brief
> >     > explanation of how this system will work, and then ask for some
> >     > suggestions on an aspect of the design.
> >     >
> >     >
> >     >
> >     > This system uses a very light-weight in-memory representation of
> >     > the Feature class. (This is required because portions of
> >     > OpenJUMP's code require the ability to manipulate individual
> >     > features or all the features in a feature collection
> >     > "in-memory".) Objects of this light-weight Feature class are
> >     > really a façade and forward all method calls to a FeatureCache
> >     > object. A FeatureCache is an implementation of the
> >     > FeatureCollection interface that actually manages the data
> >     > behind the light-weight Feature objects.
> >     >
> >     >
> >     >
> >     > The FeatureCache maintains a "buffer". In this buffer it stores
> >     > in-memory representations of regular OpenJUMP Feature objects.
> >     > This buffer will only grow to a maximum size that can be set by
> >     > the user, based on the balance between speed/performance and
> >     > memory usage. When a method call is made to the light-weight
> >     > Feature object it is forwarded to the FeatureCache. The
> >     > FeatureCache passes this call to the regular Feature object if
> >     > it is in the buffer. If it is not in the buffer the Feature
> >     > object is created in memory from information in permanent
> >     > storage or "on-disk". The method call is then processed and the
> >     > newly created Feature is placed in the buffer. If the buffer is
> >     > already at its limit the oldest Feature in the buffer is stored
> >     > back in permanent storage and removed from the buffer.
> >     >
> >     >
> >     >
> >     > There should be no major distinction between Features and a
> >     > FeatureCollection implemented by a FeatureCache and normal
> >     > Features and FeatureCollections that are stored entirely in
> >     > memory. The only significant difference will be the speed of
> >     > operations and rendering. This will be slower with this system
> >     > than it is with Features and FeatureCollections stored entirely
> >     > in memory. However, it will make it possible to work with very
> >     > large datasets.
> >     >
> >     >
> >     >
> >     > Here is the part of the system that I would like to get some
> >     > suggestions on. I need to decide on a storage format for the
> >     > features placed in permanent storage, or on disk. I think I have
> >     > 3 choices.
> >     >
> >     >
> >     >
> >     > [1] Java's Standard Object Serialization Format
> >     >
> >     > [2] A custom binary storage format.
> >     >
> >     > [3] A text based format.
> >     >
> >     >
> >     >
> >     > I believe the first two formats will be much quicker than the
> >     > third. I don't really think the second format is something I
> >     > want to do, because I think cooking up a custom binary format
> >     > will be a real pain in the neck. So I need to decide between the
> >     > first format listed and the third format listed.
> >     >
> >     >
> >     >
> >     > If I use a text-based format external tools will be able to
> >     > easily work with the FeatureCache, and I won't have to worry
> >     > about versioning issues. It will also be slower. If I use Java's
> >     > standard object serialization format I'll have better
> >     > performance, but I'll have to worry about versioning issues that
> >     > might come up if we change the interface definition for the
> >     > Feature interface. It will also make it difficult for external
> >     > tools, especially those that aren't written in Java, to work
> >     > with the data in the FeatureCache.
> >     >
> >     >
> >     >
> >     > I'd like to know what storage format the other developers would
> >     > recommend and why.
> >     >
> >     > Thanks,
> >     >
> >     > The Sunburned Surveyor
> >     >
> >
> >     >_______________________________________________
> >     >Jump-pilot-devel mailing list
> >     > Jum...@li...
> >     <mailto:Jum...@li...>
> >     >https://lists.sourceforge.net/lists/listinfo/jump-pilot-devel