Ideas for improved geometry processing #1663

Open
@joto

At the core of osm2pgsql's mission is the processing of geometries from OSM data into some useful format in a PostgreSQL/PostGIS database. Osm2pgsql is one step in a larger toolchain transforming the geometries into pixels on your screen showing a map (or helping to find places in the world, or whatever the final use of the data is).

Conceptually, processing in osm2pgsql has these steps:

  1. The OSM objects (nodes, ways, and relations) are assembled into geometries.
  2. The geometries are optionally transformed in some way to make them more useful or easier or quicker to access.
  3. The geometries are loaded into the database.

You might think that osm2pgsql is not doing much in the second step, but there are several (optional) operations which fall into that category:

  • Transform the geometry into the target projection (usually web mercator)
  • Split up long linestrings
  • Split up multipolygons into polygons
  • Generate expire lists (which, if you think about it, are just another type of geometry) (see Ideas for improved expire mechanism #1662 for ideas about expire lists)
  • Generate the bounding box for use in the flex config file
  • Calculate the area of (multi)polygons
  • Check validity of geometries (with the help of PostGIS, this really happens partially in step 1 and partially after step 3)
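To make two of these operations concrete, here is a minimal sketch in plain Python of projecting lon/lat coordinates to web mercator and computing a bounding box. This is only an illustration of the kind of work done in step 2, not osm2pgsql's code; the function names `to_web_mercator` and `bounding_box` are made up for this example.

```python
import math

EARTH_RADIUS = 6378137.0  # sphere radius used by web mercator (EPSG:3857), in meters


def to_web_mercator(lon, lat):
    """Project a lon/lat pair (degrees) to web mercator x/y (meters)."""
    x = EARTH_RADIUS * math.radians(lon)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for a non-empty list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


# A short way given as lon/lat, projected and then boxed.
way = [(13.40, 52.52), (13.41, 52.53), (13.39, 52.51)]
projected = [to_web_mercator(lon, lat) for lon, lat in way]
bbox = bounding_box(projected)
```

Both operations only need the single geometry at hand, which is why they fit naturally into osm2pgsql rather than the database.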

Geoprocessing in the database

All other geometry processing is currently delegated to the database. After all, one of the major reasons we are using the PostgreSQL/PostGIS database system is its powerful geometry processing capabilities. Users of osm2pgsql use the database to calculate labelling points, simplify linestrings and polygons, merge multiple smaller objects into larger ones, and many more things.

But there are some costs involved with doing all those things in the database:

  • Sometimes you do the geometry processing when accessing the data (for instance when rendering a tile), which means you might repeat a lot of work that could be done once.
  • When you want to do the work whenever the original OSM data changes, you have to use the rather coarse expire mechanism to trigger geometry processing or you have to set something up with database triggers etc.
  • If all you need is some piece of smaller data (like the center of a polygon, not the whole polygon), sending the big geometry to the database first (and maybe even committing it to disk) and reducing it there is wasteful compared to processing in osm2pgsql.
  • The Lua config file doesn't have access to any data that's only created in the database.
  • Writing code in the database can be quite complex, especially if triggers and materialized views etc. are involved.

So it makes sense to add some more geometry processing capabilities to osm2pgsql. The PR #1636 is where we are testing some of these.

But adding these kinds of capabilities only gets you so far. Because of the way osm2pgsql operates, you can usually only work on a single feature at a time. So we can calculate the centroid of a polygon or simplify the geometry of a single way. But whenever we want to operate on multiple features, we really need to go to the database. Again, that's why we have the database, because it can easily find a bunch of objects and do some geometry processing on them. So that kind of processing is not going to go away.
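As an example of single-feature processing, the centroid of a simple polygon can be computed from its outer ring alone, so only a point would have to be sent to the database. The following is a rough sketch in plain Python using the standard shoelace-based centroid formula. It is not the API from PR #1636, it ignores inner rings, and the name `polygon_centroid` is hypothetical.

```python
def polygon_centroid(ring):
    """Return the centroid (x, y) of a simple polygon given as a list of (x, y).

    The ring may be open or closed. Inner rings (holes) are not handled.
    """
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]  # close the ring without modifying the input

    area2 = 0.0  # twice the signed area
    cx = 0.0
    cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross

    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")

    return cx / (3 * area2), cy / (3 * area2)


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
center = polygon_centroid(square)
```

Anything that needs neighbouring features, such as merging adjacent polygons, cannot be expressed this way and still belongs in the database.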

Working with updates

If we only do a one-off import of OSM data, we can easily run an SQL script afterwards that does any kind of processing we can imagine. Many people do that already. But if we want to be able to update the database, this becomes more tricky. We need mechanisms to track what changes need to be made and to trigger those changes. There are several options for how such tracking and triggering could be done:

  • Split the world into pieces and keep track of which pieces need re-processing. That's really what an expire list is. See Ideas for improved expire mechanism #1662 for ideas about expire lists that would make this much more flexible and useful.
  • Store the OSM id with each object in the database and re-process them when a change for that id comes in. This is what osm2pgsql does to decide which features to re-create in the database. You can piggyback on that with database triggers. This is not easy to do because you need to handle new objects, deletions, and changed objects (which osm2pgsql simply treats as a deletion plus a new object). It is possible, but not that easy, to join, say, all linestrings of a longer street this way, keep track of all the constituent geometries, and keep the result up-to-date. It would be great if we could make this kind of thing easier to do.
  • Somehow keep track of attributes and use them to trigger re-processing. If we join all the linestrings of motorway M17, we can find the one geometry with ref=M17 again and update it if needed. This isn't easy because you need to take any old and new tags on OSM objects into account, but it is something we could add some support for in osm2pgsql.
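The first option, splitting the world into pieces, can be sketched with the usual slippy-map tile scheme: every changed geometry marks the tiles its bounding box touches as dirty, and a later job re-processes only those tiles. This is a simplified illustration in plain Python, not osm2pgsql's expire code; `lonlat_to_tile` and `tiles_for_bbox` are names invented for this sketch.

```python
import math

MAX_LAT = 85.0511  # web mercator is cut off near this latitude


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) tile containing the given lon/lat at the given zoom."""
    n = 2 ** zoom
    lat = max(-MAX_LAT, min(MAX_LAT, lat))
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lon=180 or extreme latitudes stay inside the tile grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(min_lon, min_lat, max_lon, max_lat, zoom):
    """Return the set of tiles touched by a lon/lat bounding box."""
    x_min, y_max = lonlat_to_tile(min_lon, min_lat, zoom)  # south-west corner
    x_max, y_min = lonlat_to_tile(max_lon, max_lat, zoom)  # north-east corner
    return {(x, y)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)}


# Collect dirty tiles for two changed objects; duplicates collapse in the set.
dirty = set()
dirty |= tiles_for_bbox(13.39, 52.51, 13.41, 52.53, 14)
dirty |= tiles_for_bbox(13.40, 52.52, 13.42, 52.54, 14)
```

The coarseness mentioned above is visible here: a tiny change marks a whole tile, and everything in that tile gets re-processed.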

So where do we go from here?

  • There is already some work underway to allow for more geometry processing inside osm2pgsql (see PR Work-in-progress: Add geometry functions in Lua #1636). The goal here is to make some processing much easier to do and more accessible to the casual user.
  • There are ideas for improving expire handling (see Ideas for improved expire mechanism #1662)
  • We need to rethink the way we do updates when OSM objects change. Currently, basically a DELETE/INSERT is done. To use this in the database you need to add a trigger on the DELETE to basically ignore it, and then a trigger on the INSERT that does any updates needed. Maybe we can add some support to osm2pgsql to make this easier and more straightforward. At a minimum we should document the way this is supposed to work so non-power-users can set up something like this.
  • We should think about ways of doing attribute-based "expire" and re-processing. One option would be to add some kind of hash to each database entry. That hash must be calculated in the Lua config based on the tags used in an object. And osm2pgsql would keep track of those hashes and make sure objects with the same hash are re-processed whenever an object with that hash changes.
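The hash idea in the last point could look roughly like the sketch below. It is plain Python rather than Lua, and nothing in it exists in osm2pgsql today: `group_hash`, `HashTracker`, and the choice of SHA-256 are all assumptions made for illustration. The tracker takes both the old and the new tags of an object into account, so a way that changes from ref=M17 to ref=M18 marks both groups for re-processing.

```python
import hashlib
from collections import defaultdict


def group_hash(tags, keys):
    """Hash the tags that define a group; return None if none of the keys is set."""
    relevant = sorted((k, tags[k]) for k in keys if k in tags)
    if not relevant:
        return None
    payload = "\x1f".join(f"{k}={v}" for k, v in relevant)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


class HashTracker:
    """Remember the group hash of each OSM id and report groups needing re-processing."""

    def __init__(self, keys):
        self.keys = keys
        self.hash_by_id = {}
        self.ids_by_hash = defaultdict(set)

    def update(self, osm_id, tags):
        """Record new tags for an object (None means deleted); return dirty hashes."""
        dirty = set()
        old = self.hash_by_id.pop(osm_id, None)
        if old is not None:
            self.ids_by_hash[old].discard(osm_id)
            dirty.add(old)  # the group the object used to belong to
        new = group_hash(tags, self.keys) if tags is not None else None
        if new is not None:
            self.hash_by_id[osm_id] = new
            self.ids_by_hash[new].add(osm_id)
            dirty.add(new)  # the group the object belongs to now
        return dirty


tracker = HashTracker(["highway", "ref"])
tracker.update(1, {"highway": "motorway", "ref": "M17"})
tracker.update(2, {"highway": "motorway", "ref": "M17", "name": "Example"})
changed = tracker.update(2, {"highway": "motorway", "ref": "M18"})
```

After the last call, `changed` contains the hashes of both the M17 and the M18 group, and `tracker.ids_by_hash` tells a re-processing job which objects still belong to each of them.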

I am sure there are more ideas.
