Sunday 8 June 2014

Between the data and the map

I have long felt that one of the easiest ways to prepare OSM data for use in a map is to create an intermediate database, holding separate tables for each of the different features. Rather than working directly from the original tagging and geometry, these tables can contain a set of values that have been adapted to make the data easier to work with.

Such an approach won't suit all, but I find that it allows me to work from a fairly large extract (typically the whole of the British Isles), and to quickly produce maps interactively. I can easily assess the way that contributors actually use tags, then pick what transformations I want, without worrying much about how quickly these will be performed. I can progressively refine the data as I learn more about the way that tags have actually been used, and I can make local edits on my own data without worrying about corrupting the base data. I tend to do this stuff in bursts of activity, with long gaps between, so there are some administrative advantages in having a clear trail which documents the different transformations I've tried, and keeping these scripts separate from work on individual maps. I don't need a very fast turnaround on changes to the data, so periodically I do another extract, and rebuild everything.

Transforming the original data into an intermediate form seems to be the norm for routing, but I'm a bit surprised there isn't more discussion of similar techniques for other applications. So this is how I handle data on linear roads, in the hope that it will be of use to others.

Some highway features are plotted as points (mini-roundabouts, turning circles) and some are plotted as areas. I do stuff with these as well, but for now, let's concentrate on how I transform data on highways that are represented by lines.

I download a planet extract covering the British Isles, and load this into a PostGIS database. I then extract all the linear highways into a separate table (anything tagged "highway" but not "area=yes"). I retain the geometry, the original identification (so I can backtrack and fix any problems later), and I keep the original tagging (as an hstore). To be more precise, I simplify the geometry slightly at this stage, since the base data is sometimes unnecessarily detailed for my purposes. This doesn't seem to make a big difference for roads, but it does for some other features. For ways that represent lines I do no validation on the geometry. For other features, the geometry of polygons and multi-polygons can be checked at this stage.
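
As a rough illustration of that extraction step (not the scripts actually used here, which run inside the database), the same logic can be sketched in plain Python. The input format, the function names and the tolerance are all assumptions made for the example. The post does not say which simplification method is used, so Douglas-Peucker stands in for it below.

```python
from math import hypot

def is_linear_highway(tags):
    """True for anything tagged highway=* that is not also area=yes."""
    return "highway" in tags and tags.get("area") != "yes"

def _distance_to_segment(p, a, b):
    """Distance from point p to the line segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def simplify(points, tolerance):
    """Douglas-Peucker simplification (an assumed stand-in, see above)."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, worst = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _distance_to_segment(points[i], first, last)
        if d > worst:
            index, worst = i, d
    if worst <= tolerance:
        return [first, last]
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right  # drop the duplicated split point

def extract_highways(ways, tolerance=1.0):
    """Keep id, original tags and simplified geometry for linear highways.

    `ways` is a hypothetical list of dicts with "id", "tags" and "points";
    the tolerance is in the units of the coordinates (metres in OSGB).
    """
    return [
        {"osm_id": w["id"], "tags": dict(w["tags"]),
         "geom": simplify(w["points"], tolerance)}
        for w in ways if is_linear_highway(w["tags"])
    ]
```

Keeping the original id and the untouched tags alongside the simplified geometry is what makes it possible to trace a problem back to the source data later.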

I then add separate columns for each of the different characteristics I might want to represent on a map.

Based on the original tags, I categorise each segment as either a "road", "track", or "path". I hold the data on paths and tracks in the same table as roads because there are areas of overlap, and keeping them together makes it easier to faff around in the grey areas between road and track or track and path. However, there are different considerations involved in classifying different types of path and track. There is enough to worry about right now on roads, so dealing with the others is best left aside for a while.

I then re-interpret the original highway tagging for roads, using three different schemes, based on different ways that I might want to use the data.

  • For mapping the topology of the road network I want styles that indicate the official classification of the road (and where the road is unclassified, a rough indication of importance). So I use the most common values of the "highway" tag: Motorway, Trunk, Primary, Secondary, Tertiary, and the associated Link roads; Unclassified, Residential, and Service, etc. Tagging of roads is mature and heavily used, so it is pretty robust. A core set of a dozen common values covers 99.9% of the road network in Britain, and does so quite consistently. I impose these values with an "enum" data type. When I'm working on the data there's a huge temptation to use a more flexible data type, but I've found that forcing a closed set of values at this stage is a useful discipline. A dozen categories seems a lot, so I've tried to reduce the number, but found this doesn't work out well in practice. I do, however, merge "living_street" and "residential". In principle these have different meanings, but in practice I reckon they are often used as synonyms, so I reclassify any roads tagged "living_street" as "residential" in my own database. Everything else is marked as "other". The remaining 60 or so values of the "highway" tag account for less than 0.1% of British roads. Some of these values are well-defined, but more specialised than I am ever likely to need ("bus_guideway"). Some look like sensible tagging that hasn't yet been widely adopted ("passing_place"); even if I wanted to use these, the data would be too sparse to be of much use. Others are presumably errors ("bridalway"). In total these represent such a small proportion of the network that they are hardly worth dealing with systematically. Even so, one occurrence of a rare value can affect a map of a small area, and may need to be dealt with. By applying a contrasting style to "other" I can easily spot such problems, and deal with them manually.
  • Within a map covering a wide sweep of landscape, the roads are not really a dominant feature. Along with water, woods, the contour of the land, etc. they are one of a number of features that map readers will use as a guide to location. For mapping these cases I only need a limited number of road styles - enough to indicate that a road exists, and give some idea of its importance. Beyond that I can't be bothered with managing many different styles for different types of road. So I also re-classify the different values of "highway" to just "wide", "standard" or "narrow". When I have no need for more detailed classification this small set of values makes it very easy to manage appropriate styles.
  • In the two previous cases the width of the road as rendered is an indication of its importance, and is commonly exaggerated to improve visibility. However, across a small section of townscape roads will naturally be represented at a relatively large scale. They will need to relate to buildings and other features that are positioned at their actual location. Over-exaggerating the width can cause problems, and I find it helps if roads are drawn broadly in line with actual width. There is a well-defined tagging scheme to specify road width, but it is not widely used. So what I do here is estimate the width of the road. I extract the width value where one is available, where it is easily parsed, and where the result is wide enough to be reasonably visible (in practice this captures almost all values that are presented in metres). If no suitable value is available, I fall back to calculating an approximate width based on the number of lanes (where this is provided). Otherwise I use a default value for each different type of road. There are a few cases where this doesn't work well, but on the whole it looks fine. It seems to produce an estimate of width that is close enough for my purposes, and it is easy enough to render a width in metres when working in OSGB projection.
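
The three schemes above could be sketched in Python along the following lines. This is only an illustration under stated assumptions: the exact list of core values, the mapping to "wide", "standard" and "narrow", the lane width, the minimum visible width and the default widths are all invented for the example, and the defaults are keyed by the simplified class rather than by each road type.

```python
import re
from enum import Enum

# Assumed core set; the post names most of these but not the full list.
CORE = ["motorway", "motorway_link", "trunk", "trunk_link", "primary",
        "primary_link", "secondary", "secondary_link", "tertiary",
        "tertiary_link", "unclassified", "residential", "service", "other"]
RoadClass = Enum("RoadClass", {value.upper(): value for value in CORE})

def classify(highway):
    """Scheme 1: closed set of values, anything unrecognised becomes OTHER."""
    if highway == "living_street":
        highway = "residential"  # treated as a synonym
    try:
        return RoadClass(highway)
    except ValueError:
        return RoadClass.OTHER

# Scheme 2: hypothetical grouping, chosen only for illustration.
WIDE = {"motorway", "trunk", "primary"}
NARROW = {"service", "other"}

def simple_class(road_class):
    if road_class.value in WIDE:
        return "wide"
    if road_class.value in NARROW:
        return "narrow"
    return "standard"

# Scheme 3: assumed default widths in metres, per simplified class.
DEFAULT_WIDTH = {"wide": 10.0, "standard": 6.0, "narrow": 3.0}
_METRES = re.compile(r"\s*([0-9]+(?:\.[0-9]+)?)\s*(?:m|metres?|meters?)?\s*")

def estimate_width(tags, road_class, min_visible=2.0, lane_width=3.0):
    """Use width=* if parseable and visible, else lanes=*, else a default."""
    match = _METRES.fullmatch(tags.get("width", ""))
    if match and float(match.group(1)) >= min_visible:
        return float(match.group(1))
    lanes = tags.get("lanes", "")
    if re.fullmatch(r"[0-9]+", lanes) and int(lanes) > 0:
        return int(lanes) * lane_width
    return DEFAULT_WIDTH[simple_class(road_class)]
```

For example, `classify("bridalway")` falls through to `RoadClass.OTHER`, which is the value that would get the contrasting style, and a way tagged only with `lanes=2` gets an estimated width of two assumed lane widths.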

For labelling I extract separate "name" and "reference" columns from the relevant tags. Having these values in separate columns isn't really necessary - I could extract the data from the tags whenever I needed it. However, it makes life a little bit easier. In theory it would also allow me to do some fancy editing of the content (e.g. abbreviations, line breaks, hyphenation, ...). At present, though, I just retain the original values. I also add a "label" column which contains the name, if one exists, and the reference number if not. I did this because I liked the idea of standardising on always having a "label" column across a variety of different features (waterways, woods, whatever...). I thought it would be handy to always know where to find a label, so I set things up that way. In practice it didn't turn out to be as useful as I expected.
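
A minimal sketch of those three columns, assuming the values come from the standard "name" and "ref" keys (the function name is made up for the example):

```python
def labelling_columns(tags):
    """Return name, reference and a label that prefers the name."""
    name = tags.get("name") or None   # treat an empty string as missing
    reference = tags.get("ref") or None
    return {
        "name": name,
        "reference": reference,
        "label": name if name is not None else reference,
    }
```

A road with both a name and a reference is labelled by name, one with only a reference is labelled by that, and one with neither gets no label at all.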

Experimenting with standardising the type of road surface hasn't proved very successful. Across Britain, more than 300 different values are used to describe the surface of a highway. These range from "asphalt" to "very_horrible". So it is a bit of a challenge to interpret this reliably. Differentiating between "metalled" and "unmetalled" is straightforward enough, though, based on common surface tags, and defaults for different types of road. So I've done that, but not used it much in practice.
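
The metalled/unmetalled split might look something like this. The two sets of surface values and the per-category defaults are assumptions chosen for illustration, not the lists actually used; "category" is the road/track/path value described earlier.

```python
# Assumed lists of common surface values; anything else uses the default.
METALLED = {"asphalt", "paved", "concrete", "paving_stones", "sett"}
UNMETALLED = {"unpaved", "gravel", "dirt", "grass", "ground", "mud", "sand"}

def is_metalled(tags, category):
    """Classify by common surface=* values, else default by category.

    Assumed defaults: roads are metalled, tracks and paths are not.
    """
    surface = tags.get("surface", "").strip().lower()
    if surface in METALLED:
        return True
    if surface in UNMETALLED:
        return False
    return category == "road"
```

Rare values such as "very_horrible" are simply ignored here, so the result depends on the default for that kind of highway.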

There are other aspects of the data that might be worth more investigation. For example,

  • Is the current data consistent enough to differentiate between the different types of "service" roads: driveways, parking_aisles, alleys, etc.?
  • Is there a practical way of dealing with pavements ("sidewalk=...")?
  • Many roads are commonly split into short segments because of the need to tag variations in detailed characteristics that change over short stretches. Where these characteristics are irrelevant it would be possible to merge the segments without loss of meaning. Is there any value in doing so - even if only for labelling?

Does anyone else have experience of these?
