I have quite large polygon datasets, on the order of 1e6 polygons, which can be clipped in one go but then take (too much) time. Multi-threading is possible, but does not help enough, even with 16 threads on a Ryzen 2700X.
Working with numbers this large, the expected N^2 complexity would have to be, well, clipped, not to say reduced.
Is this assumption of mine correct?
My reason for believing that a serious speedup for large datasets is possible comes from running the desired dissolves in QGIS and, more to the point, ArcGIS in the latest versions (Pro 2.4.2, Desktop 10.7.1). The latter are one, if not two, orders of magnitude faster than what I did with the Clipper library, simply dumping all the rings in as paths. After quite a bit of work, I felt, well, silly.
If my idea about a binning/two-step approach is correct, the basic approach could be to use a spatial index to group subsets, clip those, and subsequently test for overlaps to do a second round, while leaving the non-overlapping results well alone.
I've not quite thought this out and wonder if someone has, and/or more people have this need.
When filling the paths, binning could be done without the need for a spatial index, perhaps through a service provided by the library.
Please let me know whether you agree, have more developed ideas, etc.
Thanks,
Jan
Dear,
What rendering do you require? Winding numbers? Parallel edge bundling?
There are specialised versions of Vatti for odd/even (alternate) winding which run hundreds of times faster if many, many overlapping edges are input for clipping.
Last edit: Ignorant 2019-11-02
Hi,
Thanks for the attention! I require winding number dissolving for this specific case.
My basic code simply dumps all the rings in as paths and unions them in one go (parts being polygons which may have holes etc., as in a shapefile). Should I google up on specialised Vatti algorithms?
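For illustration only (this is not the original listing): a minimal sketch of that one-shot dissolve, assuming the Clipper 6.x C++ API, with every ring added as a closed subject and unioned under the NonZero (winding-number) fill rule; the function name is made up.

```cpp
#include "clipper.hpp"   // ClipperLib, Clipper 6.x
using namespace ClipperLib;

// Dump all rings (outer rings and holes alike) in as closed subject paths and
// union them in a single Execute call with NonZero (winding-number) filling.
Paths DissolveAllAtOnce(const Paths& rings)
{
    Clipper c;
    c.AddPaths(rings, ptSubject, true);                    // true = closed paths
    Paths solution;
    c.Execute(ctUnion, solution, pftNonZero, pftNonZero);  // winding-number dissolve
    return solution;
}
```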
The only easily available one (with source code) is Angus's Clipper, if you want windings other than alternate.
Specialised Vatti includes Angus's code. The other one (with overfed bugs) is Murta's General Polygon Clipper; an attempt to remove some noticed bugs is available on GitHub. Code for Vatti is also available from NCAR, which was the first Vatti implementation available with proper horizontal edge handling. Alan Murta combined all four operations in a single source and called it the General Polygon Clipper -- he added 3 bugs in the process. That code has a couple of innovations which might help you. It has horrible input polygon partitioning code, so rewriting is necessary for a large number of input vertices.
Angus's code, Clipper v2, is much neater but does NOT handle overlapping bundles, which GPC does.
Hi Jan.
No. IIRC Clipper's algorithm complexity is worse than N^2, hence the problem you're encountering with very large datasets. (For performance comparisons with other clipping libraries see https://rogue-modron.blogspot.com/2011/04/polygon-clipping-wrapper-benchmark.html )
The simplest way to address this might be to union subsets of your polygons before performing a final union of your subset results.
There are other, much more complex ways to address this, including streaming polygons as they're encountered in the sweep algorithm, but if the above suggestion doesn't work for you, I'd suggest using another library that's perhaps better suited to very large datasets.
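A minimal sketch of that subset-then-final union, again assuming the Clipper 6.x C++ API; GroupedUnion and the default group size are illustrative, not library code.

```cpp
#include <algorithm>
#include <cstddef>
#include "clipper.hpp"
using namespace ClipperLib;

// Union the input in fixed-size groups first, then union the (far smaller)
// group results in one final pass.
Paths GroupedUnion(const Paths& rings, std::size_t groupSize = 1000)
{
    Paths partials;
    for (std::size_t i = 0; i < rings.size(); i += groupSize)
    {
        Clipper c;
        std::size_t end = std::min(rings.size(), i + groupSize);
        for (std::size_t j = i; j < end; ++j)
            c.AddPath(rings[j], ptSubject, true);
        Paths sub;
        c.Execute(ctUnion, sub, pftNonZero, pftNonZero);
        partials.insert(partials.end(), sub.begin(), sub.end());
    }
    Clipper finalClip;                                  // final union of the group results
    finalClip.AddPaths(partials, ptSubject, true);
    Paths solution;
    finalClip.Execute(ctUnion, solution, pftNonZero, pftNonZero);
    return solution;
}
```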
Cheers
You guys helped me on my way, thanks.
I have one remaining question because the code improvements and the speed increase with preview/beta version 2.0 are attractive. Do you have any reason to believe that the preview version is less correct than the preceding version 4.6.2? (I'm not asking for guarantees, mind you)
I'll certainly get to work on pre-grouping/input partitioning, probably using the fact that my polygons have calculated stats by the time of clipping, so I can at least use the extents to determine potential proximity, which should not be that hard if I use a tiling approach. Maybe I'll try to look for concentrations; I'll have to try out what will work without overly complicating things and inviting hard-to-find bugs :) It's usually easier said than done, of course.
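For what such an extent-based tiling approach might look like (purely a sketch; Extent, TileGroups and tileSize are hypothetical names, not anything from the library): bucket each polygon by the grid tile containing the centre of its extent, so nearby polygons tend to land in the same group.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical extent-based tiling: group polygons by the grid tile that
// contains the centre of their bounding box (extent).
struct Extent { double xmin, ymin, xmax, ymax; };

std::map<std::pair<long, long>, std::vector<std::size_t>>
TileGroups(const std::vector<Extent>& extents, double tileSize)
{
    std::map<std::pair<long, long>, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < extents.size(); ++i)
    {
        const Extent& e = extents[i];
        long tx = static_cast<long>(std::floor(0.5 * (e.xmin + e.xmax) / tileSize));
        long ty = static_cast<long>(std::floor(0.5 * (e.ymin + e.ymax) / tileSize));
        groups[std::make_pair(tx, ty)].push_back(i);   // i indexes the original polygon set
    }
    return groups;
}
```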
In the meantime I collected some papers. Probably you have them already, but anyway: https://www.dropbox.com/sh/h8le08p5rlnq3wn/AABjpwtOdEo8JkloCB8j2rZka?dl=0
GPC is also in there. Nice old-style C code, many capitals; it reminds me just a little of Fortran :)
GPC does not appear to admit as many bugs as I expected.
Cheers!
Do you have any reason to believe that the preview version is less correct than the preceding version 4.6.2? (I'm not asking for guarantees, mind you)
I'm confident it's still just as robust as version 4.6.2. However, it's still only beta because there are additional features I hope to address such as removing micro self-intersections related to rounding (which also occur in the 'stable' version) and a more complete solution to joining polygons in solutions that have touching edges. (Neither of these things are easy to fix, and I've needed a break from Clipper to focus on other things, hence the prolonged delay there.)
so I can at least use the extents to determine potential proximity
I don't think that's necessary and would probably only slow things down. Just arbitrarily divide your polygons into smaller groups (eg 1,000) and union these before a final union of the results.
Edit: I'd also be very interested to know how well this works for you.
Last edit: Angus Johnson 2019-11-02
Also, given that performance is critical, are you sure you need to use the PolyTree structure instead of Paths?
Hi Angus,
Was caught up in some socialities ;)
I used the PolyTree approach years ago because it gave me the holes.
If you say I shouldn't need to do this, I'll have another look.
Now I'm converting to 2.0 and have a few loose ends:
ReverseSolution(), SimplifyPolygon(), CleanPolygon() and StrictlySimple() are no longer there;
I'm trying to find out whether I really need them, but I'm not comfortable scrapping them.
Less important, but potentially so: ArcTolerance is no longer available, and tuning variables like delta and stepsperrad may need exposing (too).
I've implemented subdivision and tweaked it a bit. In my task the effect is substantial, from about 8 hours to 40 minutes (a factor of 12). Subdivision is by a factor of the number of items (per class), which varies quite a bit. That worked better than using fixed numbers of polys per subtask, say 1000. I have not preordered the polygons, but they have some ordering stemming from a set of shapefiles covering regular tiles of the whole area. That may also explain why a fractional size worked better than a fixed size.
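For concreteness, a tiny sketch of such fractional sizing (the function name and the divisor are purely illustrative; the actual factor isn't stated above):

```cpp
#include <algorithm>
#include <cstddef>

// Group size as a fraction of the class's item count rather than a fixed 1000;
// the divisor of 16 is illustrative only.
std::size_t GroupSize(std::size_t itemsInClass, std::size_t divisor = 16)
{
    return std::max<std::size_t>(1, itemsInClass / divisor);
}
```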
I first implemented the Clipper 2.0 version in my library, but reverted when I did not get correct results, because I need the library to be functional. When I have time, I'll try to replay my scenarios directly against the V2.0 objects to find out whether I missed something or not.
I've implemented subdivision and tweaked it a bit. In my task the effect is substantial, from about 8 hours to 40 minutes
Excellent. Also, I'm wondering now if groups of 100 or 200 may even be a little faster. It'd be interesting to find the sweet spot.
I first implemented the Clipper 2.0 version in my library, but reverted when I did not get correct results ... When I have time, I'll try to replay my scenarios directly against the V2.0 objects to find out whether I missed something or not.
Obviously I'd be very interested to see examples of any problems (preferably with smallish datasets :)).
Hi Angus,
I tried smaller groups, but they gave significantly worse results. 1000 was a good start but, also to my surprise, higher was better; 5000 was too much.
My multithreading does not cover this part; it would have to be nested multithreading. I'm looking into that, as it would help with the largest datasets.
Also, I haven't spent energy finding out whether the clipping of the groups individually or the clipping of the group results takes more time.
I hope I can select a smallish sample, but first I have to set up the test without using my own library as a potentially confusing influence.
Thanks,
Jan
Curiosity ...!
Is the speedup because of multithreading?
Please, can you share your polygon subgrouping method?
On the advice of Angus I did not (yet) put any effort into grouping spatially. My data are already organised (spatially) fairly well. So, I'll give a few details first which have to do with my library/organisation. I use a format for a collection of polygons with holes.
In my lingo that is a MultiComplexGon.
The data comes as shapefiles, which are read into a single object as before, and then gets sent to the dissolve routine, which puts the data into Path objects, calculates, and puts the result back. dissolveloadfactor is an estimate of the average complexity of the records in the shapefile.
Translating twice is a bit of a hassle, but it is hardly significant timewise.
tshpi is a struct that helps handling shapefile input.
The multithreading is done using OpenMP (I have to get around to using a more C++-style thread pool), which explains the #pragma omp critical for the load.
So hoping it's not too disappointing:
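(Not the original listing, which isn't shown here: only a rough sketch of the approach described, assuming the Clipper 6.x C++ API and OpenMP. ParallelDissolve, the pre-split groups and the placement of the critical section are all illustrative; as noted above, the actual multithreading sits at a higher level, over classes/shapefiles.)

```cpp
#include <cstddef>
#include <vector>
#include "clipper.hpp"
using namespace ClipperLib;

// Illustrative only: union each pre-split group of paths on its own thread,
// guard the shared result list with a critical section, then union the
// group results in a final pass.
Paths ParallelDissolve(const std::vector<Paths>& groups)
{
    Paths partials;
    #pragma omp parallel for
    for (int g = 0; g < static_cast<int>(groups.size()); ++g)
    {
        Clipper c;
        c.AddPaths(groups[g], ptSubject, true);
        Paths sub;
        c.Execute(ctUnion, sub, pftNonZero, pftNonZero);
        #pragma omp critical
        partials.insert(partials.end(), sub.begin(), sub.end());
    }
    Clipper finalClip;
    finalClip.AddPaths(partials, ptSubject, true);
    Paths solution;
    finalClip.Execute(ctUnion, solution, pftNonZero, pftNonZero);
    return solution;
}
```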
I could use a spatial index or merely use the centroids/extents for more organisation of the feed, but having a fairly useful result, it's not a priority for me right now. I suspect that I would need to 'grow' the groups instead of simply hacking the input into parts.
Last edit: Jan 2019-11-07