Sakari,
I'm in the process of testing osm2postgis rev. 77 with a medium-size data set (the U.S. state of Maryland, chosen at random for its size). I have come across a couple of questions:
1) My test run has been in progress for nearly four days. I've noticed that in the current phase entries are being created simultaneously for the tables navigation_motor and navigation_foot. This may not be a concern, but I thought I'd ask just in case. Below is the latest log output.
2) Performance-wise, my empirical observations indicate that run time grows as O(n^2) or O(n^3) as I move up in data set size. While one should expect things to take longer because of the need to minimize the VM footprint, any efficiencies from multi-threading (parallelism) or other algorithmic tricks would help greatly. From what I can tell, disk access rates appear to be the limiting factor, but there may still be room to take advantage of multi-core CPUs.
best regards,
Bruce
08:55 osm2postgis.core.Monitor run CONFIG: JVM 84.7/233.5 MiB (63.7 % used).
08:55 osm2postgis.core.Monitor run INFO: Time elapsed 3 d 20:01:52; Committed
to line 22920836; Throughput 2.0 entities/s.
08:55 osm2postgis.core.Monitor run INFO: Cumulative: public.anomaly:created=3
public.human_built:created=275288 public.human_landuse:created=187 public.hum
political:created=16 public.navigation_aero:created=87 public.navigation_foot
eated=601473 public.navigation_motor:created=605475 public.navigation_nautica
reated=1 public.osm_nodes:created=7553768 public.osm_relations:created=2210 p
ic.osm_ways:created=534564 public.physiography_surface:created=1 public.physi
aphy_water:created=5 public.topology_border:created=151
08:55 osm2postgis.core.Monitor run CONFIG: JVM 83.4/233.5 MiB (64.3 % used).
08:55 osm2postgis.core.Monitor run INFO: Time elapsed 3 d 20:02:07; Committed
to line 22920836; Throughput 3.1 entities/s.
08:55 osm2postgis.core.Monitor run INFO: Cumulative: public.anomaly:created=3
public.human_built:created=275296 public.human_landuse:created=187 public.hum
political:created=16 public.navigation_aero:created=87 public.navigation_foot
eated=601492 public.navigation_motor:created=605494 public.navigation_nautica
reated=1 public.osm_nodes:created=7553768 public.osm_relations:created=2210 p
ic.osm_ways:created=534564 public.physiography_surface:created=1 public.physi
aphy_water:created=5 public.topology_border:created=151
08:55 osm2postgis.core.Monitor run CONFIG: JVM 83.4/233.5 MiB (64.3 % used).
I've been a bit busy lately, so there has been a slight pause in my work on OSM2PostGIS. I will continue in a few days.
See the features.json file (the feature specification) for the map features that end up in multiple routing topologies. It is entirely intentional that the same feature can appear in several routing networks. For example, you can walk or ride a bicycle on many of the same roads where you can drive your car. You can fine-tune features.json to choose which map features go into which routing networks.
When you are moving from small to medium data sets, the O(n^x) behaviour is likely caused by the machine no longer being able to make good use of its I/O caches. The algorithms should be roughly O(n log n), unless there is a bug with indexing or something similar. I have been testing with the full planet.osm, and these runs take a few weeks on a powerful machine. To really sample the algorithmic performance, we should measure with several data sets of tens of gigabytes and draw the conclusions from there. For example, 5 GB, 10 GB, 15 GB, 20 GB, or even more.
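As a rough way to turn such measurements into a growth estimate, one could fit a straight line to log(run time) against log(input size); the slope approximates the exponent x in O(n^x). The sketch below is only an illustration, not part of OSM2PostGIS: the function name estimate_exponent is made up here, and the numbers are synthetic values generated from a known quadratic, not real timings.

```python
import math

def estimate_exponent(sizes, times):
    """Return the least-squares slope of log(time) versus log(size).

    A slope near 1 suggests roughly linear (or n log n) growth,
    a slope near 2 suggests quadratic growth, and so on.
    """
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (size, time) pairs")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic demonstration only: times generated from 3 * n^2,
# so the fitted slope should come out close to 2.
sizes_gb = [5, 10, 15, 20]
synthetic_times = [3 * s ** 2 for s in sizes_gb]
print(estimate_exponent(sizes_gb, synthetic_times))
```

With real runs, you would replace the synthetic list with measured wall-clock times for each data set. Note that an O(n log n) algorithm shows up as a slope slightly above 1 in this kind of fit, so only a slope well above that would point to a real problem.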
Like you said, the performance is mainly I/O bound, so using multiple cores could only reduce some of the delays between writes. Of course, if someone has a very powerful striped SSD array or the like, they would definitely benefit from multithreading. On the other hand, writing multithreaded code is a nice challenge and kind of fun, so I might do that anyway.
I have also had some problems rendering some map features with GeoServer, so I will also spend some time trying to figure out where the problematic geometries are.
I forgot to mention that the tool already reports the current throughput every 15 seconds (by default). The number of entities is the number of OSM changesets, nodes, ways or relations, depending on which of those the system is processing. Changesets and nodes are naturally processed faster because they are simpler.
Also note that the throughput will drop dramatically at times when PostgreSQL is doing something (like vacuuming) in the background.
If you know the number of lines in your input XML file, you can monitor the parsing progress from the log, where it says how many lines have been committed. Currently, there is less progress information for the later phases: the generation of renderable map features and of routing topologies. These can be monitored by looking at the cumulative reports of rows added to the various target tables, also found in the logs.
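The line-count check described above could be scripted along these lines. This is a small sketch under assumptions, not a feature of the tool: the helper names are made up, the regular expression is based only on the "Committed to line N" wording visible in the log excerpt earlier in this thread, and the total line count is something you would have to obtain yourself (for example with wc -l on the input file).

```python
import re

# Matches "Committed to line 22920836" even when the log wraps
# between words, as it does in the pasted excerpt.
COMMITTED_RE = re.compile(r"Committed\s+to\s+line\s+(\d+)")

def last_committed_line(log_text):
    """Return the most recent committed line number, or None if absent."""
    matches = COMMITTED_RE.findall(log_text)
    return int(matches[-1]) if matches else None

def parse_progress(log_text, total_lines):
    """Return parsing progress as a fraction between 0 and 1, or None."""
    committed = last_committed_line(log_text)
    if committed is None or total_lines <= 0:
        return None
    return min(1.0, committed / total_lines)

sample = (
    "08:55 osm2postgis.core.Monitor run INFO: Time elapsed 3 d 20:01:52; Committed\n"
    "to line 22920836; Throughput 2.0 entities/s.\n"
)
# The total below is a made-up figure for illustration only.
hypothetical_total_lines = 45841672
print(parse_progress(sample, hypothetical_total_lines))
```

Once the committed line number stops changing while rows are still being created in the target tables, the run has most likely moved past XML parsing into the later phases, where only the cumulative table counts show progress.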
If the throughput falls to zero for an extended period (several minutes), it indicates that something is wrong, and maybe some of the processing threads have died or been blocked. There should be some exception messages also, if this happens.
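A simple watchdog for this condition might look like the following. Again this is only a sketch with assumptions spelled out: the function name is_stalled is invented, the pattern relies on the "Throughput N entities/s" wording from the log excerpt in this thread, the 15 second interval is the default mentioned above, and the 300 second threshold is just one reading of "several minutes".

```python
import re

# Based on the "Throughput 2.0 entities/s." wording in the pasted log.
THROUGHPUT_RE = re.compile(r"Throughput\s+([0-9]+(?:\.[0-9]+)?)\s+entities/s")

def is_stalled(log_text, report_interval_s=15, stall_after_s=300):
    """Return True if every recent throughput report is zero.

    "Recent" means the last stall_after_s seconds' worth of reports,
    assuming one report per report_interval_s seconds. Both values are
    assumptions that should be adjusted to the actual configuration.
    """
    values = [float(v) for v in THROUGHPUT_RE.findall(log_text)]
    needed = max(1, stall_after_s // report_interval_s)
    if len(values) < needed:
        return False  # not enough history to judge
    return all(v == 0.0 for v in values[-needed:])

healthy = "Committed to line 22920836; Throughput 3.1 entities/s.\n" * 20
print(is_stalled(healthy))
```

Keep in mind that a short dip to zero can also happen while PostgreSQL is busy in the background, which is why the check looks at a window of reports rather than a single one. If it does fire, the exception messages in the log are the next place to look.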
One more quick comment: you should tune your PostgreSQL server for better performance. Search for "PostgreSQL performance tuning"; there is also a link on the front page of the OSM2PostGIS project website. Or maybe you have already done that...
Good feedback, thanks.