Tuesday 24 June 2014

More on rivers

I've been looking into a couple of questions about the OSM data on rivers.

The straightforward point is about naming of rivers. The guidance is to use the complete name on the waterway=river line. So "River Tyne" rather than "Tyne". However this guidance isn't followed consistently. About 12% of UK rivers (waterway=river) haven't been tagged with a name at all, roughly half are tagged with a name in the form "River xxxx", and the rest are tagged with a different form of name. A lot of the time that doesn't matter, and in any case some rivers have widely recognised names that don't include the "River..." prefix. To label "Afon Teifi" as "River Afon Teifi" would just be silly.

The bigger challenge is that there isn't consistency within naming of different segments of the same river. Rivers are long, and normally mapped in pieces. There isn't a widely adopted reference system for rivers, as far as I can see. So (alongside the geometry) names are the best way of associating different pieces of the same river.

Most of the larger UK rivers have segments tagged with a mix of different forms of the same name - including the River Avon (8% of length = "Avon"), River Thames (1% = "Thames"), River Derwent (11% = "Derwent"), River Trent (4% = "Trent") and River Don (6% = "Don"). Tagging with different forms of name gets in the way of other processing, so I try to get round this by standardising on a stripped form of name (i.e. removing "^River " with a regular expression). This doesn't work all the time, but for most purposes it works well enough.
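The stripping step is a one-line substitution. A minimal sketch in Python (my own illustration; the post doesn't prescribe a language):

```python
import re

def strip_river_prefix(name):
    """Standardise river names by removing a leading "River " prefix."""
    return re.sub(r"^River ", "", name)

# Different forms of the same name reduce to one key for matching segments.
assert strip_river_prefix("River Tyne") == "Tyne"
assert strip_river_prefix("Tyne") == "Tyne"
# Names without the prefix, such as Welsh ones, are left untouched.
assert strip_river_prefix("Afon Teifi") == "Afon Teifi"
```

Anchoring the pattern at the start of the name keeps names that never carried the prefix, or that contain "River" elsewhere, unchanged.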

The next issue is a side-effect of mapping rivers as a mix of areas and lines. The map below shows the point just west of Hexham where the North Tyne and the South Tyne come together. Parts have been mapped as areas (waterway=riverbank) and parts have been mapped as lines (waterway=river). There's a gap between two of the areas that were drawn as riverbanks.

These look particularly odd here, because I've emphasised the problem by removing all but the river. In the default map the issues don't show up to the same extent. Bridges mask two of the three problems, and while the third is visible on the standard map, it tends to get lost amid other details.




But the default map isn't the only way this data is used. I'm interested in exploring the limits of using the same data for detailed maps, so I've been pondering on how best to handle this mix of lines and areas.

Of course the long-term answer is to add more detailed mapping. I'm doing that where I can, which tends to shift the problem further upstream. Meanwhile, in the short-term, varying the width at which I render waterway=river lines provides a partial solution. The issue is that different widths work for different parts of the river system. For example, further north, tributaries of the River Tweed have been mapped in quite a lot of detail, and at the point where the areas meet the lines the rivers tend to be narrower. Here's an example where exactly the same style for waterway=river works well enough when it is mixed with areas tagged waterway=riverbank.


Choosing a suitable width for the waterway=river line can look OK, but it's a balancing act. A width that is too narrow leaves step changes like my first example, while (unless I start chopping up overlapping lines and areas) a width that is too large will obscure more detailed mapping of riverbanks.

Although there is a tagging scheme for explicitly recording the width of a river (width=n) it is rarely used in the UK, and very rare around here: so it is not a great deal of help to me in practice. Usage is patchy though, so there are places where others may find it more useful (see darker lines below for some idea of where width has been applied to waterway=river). For what it's worth, where it is specified the average width given for a river is just under 5 metres.



In theory it might be possible to estimate a different width for each river segment by analysing adjoining lines and any overlapping areas of riverbank. However, I suspect this would turn out to be too complicated to be of any practical use.

While I wait for more complete mapping, what I think I need is a reasonably sensible default river width. River widths vary, so there are always going to be cases where a single default is either too wide, or too narrow. But because I want to retain mapping of detailed data wherever possible, I would prefer to err on the side of choosing a default width that is more often too narrow, rather than one which is more often too wide.

As a rule of thumb, I reckon that a width equivalent to 9 metres on the ground works quite well around here, particularly with rivers that have been mapped in a fair amount of detail. Where it doesn't work so well there is an incentive to improve the mapping. As far as I can tell this is also about the right default width for rivers in the rest of the UK.

In more than 90% of cases where the width of a river is specified it is less than 10 metres. I'm not doing any special processing where a width is specified, because such cases don't occur around here, but there may be a case for honouring the width tag where it is present.

I don't have any evidence for the right thresholds for width elsewhere in the world, but consider this. Whether by accident or design, in Northern European latitudes, nine metres seems to be roughly the width that a default waterway=river line represents in the standard map. So on the standard map a mix of river lines and areas happens to turn out reasonably well at the point where a river is nine metres wide. I imagine that contributors who are influenced by the standard map will tend to stop drawing rivers as areas once the river width reduces to the equivalent of the default line width on the standard map. They may not realise they are doing this, but if they notice that riverbank areas don't add value to narrower rivers then this would be a sensible point for them to stop adding areas. If this is a global effect, then perhaps there could be some appropriate guidelines for data contributors.

As supporting evidence, I've sampled about 130,000 points along UK riverbanks from the OSM database. The most common width between banks is 9-10 metres. This is well below the average width, or even the median, because there are many sections of river which are wider. But more than 10% of my sample showed a river width of 7-11 metres. The number of samples falls off quite slowly for wider sections of river, but quite quickly where a river area is narrower. Only 2% of my sample showed a river width less than 5 metres.
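The binning behind that observation can be sketched as follows. The sample figures here are invented for illustration, not the real 130,000-point survey:

```python
from collections import Counter

def modal_width_bin(widths, bin_size=1.0):
    """Bin sampled bank-to-bank widths and return the most common bin
    as a (lower, upper) range in metres."""
    bins = Counter(int(w // bin_size) for w in widths)
    lower, _ = bins.most_common(1)[0]
    return (lower * bin_size, (lower + 1) * bin_size)

# Illustrative sample only -- not the real survey data.
sample = [9.2, 9.7, 8.4, 9.9, 12.3, 25.0, 9.5, 14.1, 9.8, 6.7]
assert modal_width_bin(sample) == (9.0, 10.0)
```

The modal bin is deliberately reported as a range, matching the "9-10 metres" form used above.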


I haven't done a similar measurement of mapped width for streams. There are some streams mapped as areas, so it should be possible, but I have doubts about how accurate, and therefore how useful, the result would be.

The standard definition of the difference between a river and a stream is that a stream "can be jumped across by an active, able-bodied person". Intriguingly, the world record for a running long jump is just short of 9 metres. So if we wanted to be silly, we could argue that a default river width of 9 metres is consistent with one interpretation of the transition point between a river and a stream.

More realistically, several fitness measures suggest that the kind of distance a fit adult can achieve with a standing jump is in the region of 2-2.5 metres. I wouldn't try to jump a stream that wide. But the figure of 2.5 metres suits my purposes as a default width for a stream.

There are about 3,000 sections of stream with a specified width in the OSM data for the British Isles. The average width is just over 1 metre. About 85% of stream segments with a specified width are less than 2 metres, and almost half are less than 1 metre. So my chosen default is higher than the average, and higher than most specified stream widths. I may have set it too high. However, it seems only slightly wider than the standard map rendering of a stream. On larger scale maps I reckon that a stream shown as 2.5 metres wide is about the minimum that renders reasonably clearly, and there is scope to increase the width without streams becoming too dominant across a map of open countryside. It still leaves me with a bit of a big step between an effective minimum river width of 9 metres, and a stream width of 2.5 metres, but the quality of stream tracing looks like being more of an issue.

I'm going to leave further tuning of river widths as a problem for another day. However, with an eye on the long-term possibilities, it might help detailed mapping if contributors were encouraged to map rivers that are more than 10 metres wide as areas, and add a width tag to rivers that are less than 10 metres wide.  

Friday 13 June 2014

Pre-processing water

Like roads, water can be represented as either a line or an area. Roads represented as lines tend to dominate those represented as areas, but with water it is the areas that play the more important role. Lines of water (at least in my part of the world) are almost all small streams. They add detail, but don't dominate. Dealing with the sea is a special case of dealing with water. It needs to be handled differently, but that's not difficult.

I find the quality of data on coastal landscape (beaches, cliffs, rocky outcrops) is a bit patchy. It's important to me that I get the big areas of water right (rivers, lakes and the coastline). I'm less concerned with the finer detail of streams etc. I spot the occasional oddity, but on the whole I find that the current data on inland water is as complete and as accurate as I need - if not more so. As a result I've not done any systematic examination of the quality of data on inland water. The only way that I've used to examine data on coastal landscape is by eye.

Some ways that describe water are assumed to represent areas by default; others are assumed to represent lines. To collect areas of water I extract all the ways marked 'natural=water', 'waterway=riverbank', 'pond', 'dock', 'boatyard', 'lake', 'reservoir', or 'estuary'. All of these are assumed to represent areas. I also extract anything tagged 'area=yes' alongside 'waterway=...whatever'. Because I'm selecting from an inclusive list of values I end up dropping some possible areas of water with this approach. That isn't very satisfactory. However, the list of potential "waterway" values is quite long, and varies across features that should represent areas of water ("marina"), those that could represent either a closed line or an area ("moat"), those that probably don't represent water ("pumping-station"), and those that probably need individual treatment ("weir", "slipway"). I can't see a simple approach that would handle all these cases robustly. Users with a particular interest in water features would have to deal with them, but (so far) I don't need to.
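The selection rule can be expressed as a small predicate over a way's tags. A sketch under my own assumptions (the tag dictionary is hypothetical, and the value list simply follows the post rather than any definitive scheme):

```python
# Waterway values treated as areas by default -- the list from the post.
AREA_WATER_VALUES = {"riverbank", "pond", "dock", "boatyard",
                     "lake", "reservoir", "estuary"}

def is_water_area(tags):
    """Decide whether a way's tags should be treated as an area of water."""
    if tags.get("natural") == "water":
        return True
    if tags.get("waterway") in AREA_WATER_VALUES:
        return True
    # Any other waterway explicitly marked as an area is included too.
    if "waterway" in tags and tags.get("area") == "yes":
        return True
    return False

assert is_water_area({"natural": "water"})
assert is_water_area({"waterway": "riverbank"})
assert is_water_area({"waterway": "moat", "area": "yes"})
assert not is_water_area({"waterway": "stream"})
```

The last check is what makes the list inclusive rather than exhaustive: a "moat" drawn as an area only gets picked up if the contributor added area=yes.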

In practice a few of the ways I select are not valid representations of an area, so I throw away anything that isn't a closed loop with more than three nodes and no self-intersection. What's left is easily converted to polygons, the geometry simplified, and the results added to a table of water areas.
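The three checks can be sketched in pure Python; in practice PostGIS's ST_IsValid (or shapely's is_simple) would do the same job more robustly. The quadratic self-intersection test below is tolerable for typical way sizes:

```python
def segments_cross(a, b, c, d):
    """True if segment a-b properly crosses segment c-d."""
    def ccw(p, q, r):
        return (r[1] - p[1]) * (q[0] - p[0]) > (q[1] - p[1]) * (r[0] - p[0])
    return ccw(a, c, d) != ccw(b, c, d) and ccw(a, b, c) != ccw(a, b, d)

def is_valid_ring(nodes):
    """Closed loop, more than three nodes, no self-intersection."""
    if len(nodes) < 4 or nodes[0] != nodes[-1]:
        return False
    edges = list(zip(nodes, nodes[1:]))
    for i, (a, b) in enumerate(edges):
        for c, d in edges[i + 2:]:
            if i == 0 and (c, d) == edges[-1]:
                continue  # first and last edge legitimately share the closing node
            if segments_cross(a, b, c, d):
                return False
    return True

square = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
bowtie = [(0, 0), (4, 4), (4, 0), (0, 4), (0, 0)]
assert is_valid_ring(square)
assert not is_valid_ring(bowtie)       # self-intersecting
assert not is_valid_ring(square[:-1])  # not closed
```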

Multi-polygons of water are assembled in the same way as multi-polygons of buildings (see earlier post), using multi-polygon relations that carry similar water-related tags to the simple polygons described above. They are added to the same table as the simple polygons.

I don't load coastline, or the sea, from the standard OSM extract. It's quicker and easier to pull the shapefiles from here, and import them into postgis. I am only working on UK data so I could then delete most of the world's oceans. I'm not sure whether this would make a significant performance difference. I thought I might run into problems because I'm working off two different data extracts, but of course the coastline is pretty stable, and it hasn't caused me any difficulty in practice.

While I'm dealing with areas of water I also add 'natural=beach' and 'natural=wetland' areas to the same table as I use for water areas. It's a pragmatic approach, which I adopted because I've found it makes life easier later on if I can treat all these areas together.

Because I put water, beach and wetland into the same table I need to classify these differently. However, I've never felt it necessary to make any further distinction between different types of water area. I use the same basic style for all, whether they are tagged as "riverbank", "canal", "dock", "boatyard", "pond", "lake", "reservoir" or whatever. So I don't bother with further classification. The original tags are all retained in case I ever need them.

In principle I treat every other way that is tagged "waterway" as a linear feature. The vast majority of these are marked "stream", "drain", "river", "canal", or "ditch". The rest of the waterway values in the UK are a bit of a ragbag. Most have a very small number of examples. Some data users may need to deal with these more systematically, but I've never found the need. I classify the most common values in a separate column, so I can handle different styles easily. I use an "enum" data type to impose a closed set of values. Everything other than the most common values is classified as "other" so I can spot any important instances and deal with them manually. I've not needed to (so far). 
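A sketch of that closed-set classification, using Python's Enum to stand in for the database enum type (the member names are my own):

```python
from enum import Enum

class WaterwayKind(Enum):
    STREAM = "stream"
    DRAIN = "drain"
    RIVER = "river"
    CANAL = "canal"
    DITCH = "ditch"
    OTHER = "other"   # everything rare, flagged for manual inspection

def classify_waterway(value):
    """Map a raw waterway=* value onto the closed set of categories."""
    try:
        return WaterwayKind(value)
    except ValueError:
        return WaterwayKind.OTHER

assert classify_waterway("stream") is WaterwayKind.STREAM
assert classify_waterway("wadi") is WaterwayKind.OTHER
```

The point of the "other" catch-all is the same as in the post: rare values can't break the schema, but they stay visible enough to be dealt with by hand.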

I add a "label" column to the tables for both areas of water and linear waterways. In principle I could do all sorts of fancy stuff with this, but at present it simply contains the value of any "name" tag. 

There are some questions I haven't addressed when dealing with the water data at a relatively large scale. I'm not very happy with the way that I distinguish between areas and linear features, though it seems to work well enough in practice around here. I haven't found a very satisfactory way of dealing with transition where a river moves from being represented as an explicit area, to be represented symbolically by a line. I've also realised that the data on coastal landscape is a bit patchy around here - so there's some map improvement work to be done.

As I wrote this I noticed some errors in the tests I've been using to extract linear water features. Some of the ways that represent areas of water have ended up included as linear ways as well. I hadn't noticed that earlier, so it obviously hasn't been much of a problem. However, I'm sure I'm not the only data user who is post-processing data in this kind of way, and I think that goes to show that some collective effort on post-processing techniques would benefit us all.

Wednesday 11 June 2014

Pre-processing buildings

Including buildings makes a noticeable difference when mapping a built-up area. In some cases buildings will make sense of a townscape in a way that the road network alone cannot achieve. In a rural area buildings can be important landmarks. So it is nice to be able to include them.

It's not difficult, but nor is it entirely straightforward. When mapping a local area, the most obvious question is whether OSM contains sufficient data on buildings to make it worth the effort. Mapping of buildings isn't as mature as it is for roads. So the answer to the big question of whether it is worth it at all varies between different parts of the UK. Just by eye-balling the standard map it is obvious that some built-up areas have been comprehensively mapped with buildings. In some areas only a few important buildings have been mapped. In many areas a town centre has been thoroughly mapped, but the suburbs are sparse.

There are ways of assessing building coverage programmatically. The Office for National Statistics publishes outlines of the built-up areas in England. It's not hard to assess the degree to which buildings in OSM cover the built-up areas defined by the ONS outlines. As a rule of thumb I reckon that if more than 10% of a built-up area is covered by buildings then it is likely to be fairly comprehensively mapped. If less than 10% is covered then many buildings are likely to be missing. Based on that approach, and a sample of larger UK towns, about one in ten built-up areas are currently blessed with a worthwhile proportion of building data in OSM.
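The coverage test can be approximated as below. This is a simplification of what the database would do for you: it assumes building footprints fall inside the outline and don't overlap, where a real query would clip with PostGIS's ST_Intersection. The coordinates are invented for illustration:

```python
def ring_area(ring):
    """Unsigned polygon area by the shoelace formula (open ring)."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def building_coverage(built_up_outline, building_rings):
    """Fraction of a built-up area covered by building footprints.
    Assumes buildings lie inside the outline and don't overlap."""
    total = sum(ring_area(r) for r in building_rings)
    return total / ring_area(built_up_outline)

town = [(0, 0), (100, 0), (100, 100), (0, 100)]
houses = [[(10, 10), (20, 10), (20, 20), (10, 20)],
          [(30, 30), (50, 30), (50, 50), (30, 50)]]
# 100 + 400 units of building in a 10,000-unit outline = 5% coverage,
# below the 10% rule of thumb, so buildings are probably missing here.
assert abs(building_coverage(town, houses) - 0.05) < 1e-9
```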

Ordnance Survey StreetView data includes representations of buildings. The quality of the outlines isn't always good, but it's still possible to use the data to test the extent to which buildings known to the Ordnance Survey are included in the OSM database. The technique is along the lines of established techniques for assessing coverage of the road network - look for Ordnance Survey buildings that don't overlap an OSM building. Measuring this for the whole country would involve a lot of data collection and crunching, so I've only done it locally. While eye-balling works for big gaps, like a missing suburb, I've found Streetview handy for identifying and fixing the odd building that has gone missing - particularly in more rural areas.

One further glitch is that a fairly significant minority of UK buildings in the OSM database (around 7%) have been sourced from OS Streetview data, rather than being freshly mapped. This data may be adequate for some applications, but these outlines weren't designed to support detailed mapping, and not everyone will want to use them.

So anyone who wants to represent one of the lucky 10% of towns is pretty much ready to go. Those who want to map the other 90% of towns have some more work to do before they can include reasonably comprehensive building data.

The next set of considerations relate to the geometry of buildings. Most of them are represented by polygons. More difficult cases are represented by multi-polygons; by collections of lines that need to be assembled into a polygon; or by a combination of these two. Processing these is quite complex, but once they are processed all valid examples can be placed in the same table as the simple polygons. Virtually all complex (but valid) cases can be reduced to a mix of polygons, and polygons with holes. Postgis treats both of these as the same geometry type. So in practice it's just as straightforward to use complex and simple building geometries. Getting the complex geometries sorted is not so straightforward.

To start with the case of simple polygons. Anything tagged "building" is assumed to represent an area. So it ought to be a closed loop, without self-intersections, and containing at least four nodes. Some don't obey these rules, and life is easier if we dispose of the few exceptions first. In my pre-processing I flag up any exceptions in the diagnostics, so I can fix those I need to.

For more complex geometries I wanted to understand how this was done, rather than use a black-box approach. So this is my DIY method. I collect all the linestrings that are part of a relation tagged "multipolygon". I get postgis to merge linestrings that are connected, then remove any that won't convert to valid polygons, and any where the lines overlap. The remainder are converted to polygons, and marked "inner", or "outer" depending on whether they are contained within another polygon. The inner polygons are then grouped according to the "outer" polygon that they lie within, and each group converted to the intended multi-polygon - with the original tags from the relation transferred. They can then be inserted into the same table of buildings that holds all the simple polygons. Sometimes there is a multipolygon relation tagged "building" where one of the elements is also tagged "building". I don't need both, so I remove any individual elements that would otherwise end up being included twice.
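The grouping step, stripped to its essentials, might look like this. It deliberately ignores contributor-supplied roles, deriving inner/outer from containment instead, and it assumes the rings have already been merged and validated. Nested cases (a hole defined by another relation) would defeat it. Rings are given as open coordinate lists:

```python
def point_in_ring(pt, ring):
    """Ray-casting point-in-polygon test (ring given as an open list)."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def group_rings(rings):
    """Group closed rings into (outer, [inners]) pairs, ignoring any
    'inner'/'outer' roles supplied by the contributor."""
    outers, inners = [], []
    for r in rings:
        hosts = [o for o in rings if o is not r and point_in_ring(r[0], o)]
        (inners if hosts else outers).append(r)
    return [(o, [i for i in inners if point_in_ring(i[0], o)])
            for o in outers]

outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(4, 4), (6, 4), (6, 6), (4, 6)]
island = [(20, 0), (30, 0), (30, 10), (20, 10)]   # a disjoint outer
groups = group_rings([outer, hole, island])
assert len(groups) == 2                     # two separate outer polygons
assert [hole] in [g[1] for g in groups]     # the hole attaches to its host
```

Classifying by containment rather than by role is what lets this cope with mis-tagged relations and with disjoint outers in the same relation.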

This process doesn't rely on contributors correctly tagging "inner" and "outer". It copes with examples where the contributor has included disjoint outer polygons in the same relation, and examples where a multipolygon relation only holds a single polygon. It doesn't fall over when the geometry is flawed - faulty components, or elements that overlap. It isn't completely robust. I've included diagnostics on some of the error cases, but not all. There are some errors that could probably be fixed automatically, but aren't - the whole building is just discarded. At the moment my approach doesn't handle cases where the "hole" in a multipolygon relation is defined by another multipolygon relation. There may be other problems I haven't encountered yet.

Some buildings (imports, I think) are plotted with more precision than I will ever need. So I also get postgis to "simplify" the geometry - i.e. reduce the number of nodes to a more sensible level.
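The simplification postgis provides (ST_Simplify) is the Ramer-Douglas-Peucker algorithm; a self-contained sketch of what it does to an over-noded outline:

```python
import math

def perp_dist(p, a, b):
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / math.hypot(dx, dy)

def simplify(points, tolerance):
    """Ramer-Douglas-Peucker: drop nodes within tolerance of the chord."""
    if len(points) < 3:
        return points
    dists = [perp_dist(p, points[0], points[-1]) for p in points[1:-1]]
    index, dmax = max(enumerate(dists, 1), key=lambda t: t[1])
    if dmax <= tolerance:
        return [points[0], points[-1]]
    return (simplify(points[:index + 1], tolerance)[:-1]
            + simplify(points[index:], tolerance))

# Nodes wobbling within half a metre of a straight wall are dropped.
wall = [(0, 0), (1, 0.1), (2, -0.1), (3, 0.05), (4, 0)]
assert simplify(wall, 0.5) == [(0, 0), (4, 0)]
```

The tolerance is the lever: it should sit comfortably below the rendered line width, so the dropped nodes could never have been visible anyway.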

Most buildings in the database are tagged as such. Some aren't. So I could end up with gaps in the map as a result of inconsistent tagging rather than incomplete mapping. For example, there is the question of how to handle structures that are marked as "man_made" but not "building". If I include something marked as "building=silo" then shouldn't I include something marked "man_made=silo" as well? Similar examples would include a pier, storage_tank, or gasometer. Tagging isn't always consistent. Some such structures marked as "man_made" are also marked as a "building". Some are tagged as a building, instead of "man_made". And sometimes the "man_made" key doesn't refer to a structure. For example "man_made=water_works" is often applied to the grounds in which the works stands, rather than to the structure itself. "man_made=embankment" is fair enough, but it is not a building. So I can't simply treat every feature marked as "man_made" as though it was a structure. War memorials, commemorative columns and the like present similar problems with inconsistent approaches to "historic" tagging. Sometimes a valid area has a "shop" or "office" tag, but no "building" tag.

My compromise solution has been to collect a small selection of key/value pairs that I am fairly confident I can treat as "structures that are a sort of building". I include those in the same table as buildings, even when they don't have a "building" tag, and I classify them differently so I can pick and choose, then handle them manually. It's not a very satisfactory solution, but it captures most cases that I've come across, and until I can find a better way it will have to do.
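The whitelist approach might be sketched like this. The particular pairs listed are my own illustration of the idea, not the post's actual selection:

```python
# Key/value pairs treated as "structures that are a sort of building".
# The selection here is illustrative -- the point is the whitelist,
# not this particular list.
STRUCTURE_TAGS = {
    ("man_made", "silo"),
    ("man_made", "storage_tank"),
    ("man_made", "gasometer"),
}

def building_class(tags):
    """Return 'building', 'structure', or None for a way's tags."""
    if "building" in tags:
        return "building"
    if any((k, tags.get(k)) in STRUCTURE_TAGS for k in tags):
        return "structure"
    return None

assert building_class({"building": "silo"}) == "building"
assert building_class({"man_made": "silo"}) == "structure"
assert building_class({"man_made": "water_works"}) is None  # grounds, not a structure
```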

Finally, I've started to experiment with a limited amount of classification of different building types. Normally I have no need to differentiate between types of building, but I can envisage circumstances where it would be useful to highlight certain types (schools or hospitals, for example). So I have begun experimenting with this. The majority of buildings in OSM are tagged simply as a building. However, many can be re-classified from the value of the building key as being residential, industrial, retail, agricultural, commercial, transport or different types of institution. So buildings marked as "building=house" (or residential, terrace, apartments, semi, detached, flats, bungalow, etc.) can be fairly confidently treated as "residential". Other cases are more ambiguous ("building=store", or "block", for example). More sophisticated assessments based on a range of different tags should be able classify more completely and more reliably, but I haven't explored that approach (yet).
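A sketch of that first-pass classification from the building=* value alone. The value sets are illustrative, not a complete scheme, and the ambiguous values deliberately stay unclassified:

```python
RESIDENTIAL = {"house", "residential", "terrace", "apartments",
               "semi", "detached", "flats", "bungalow"}

def building_type(value):
    """Coarse classification from the building=* value alone.
    Ambiguous values ('store', 'block', 'yes', ...) stay unclassified."""
    if value in RESIDENTIAL:
        return "residential"
    if value in {"industrial", "warehouse", "factory"}:
        return "industrial"
    if value in {"retail", "shop", "supermarket"}:
        return "retail"
    if value in {"barn", "farm_auxiliary", "cowshed", "stable"}:
        return "agricultural"
    return "unclassified"

assert building_type("house") == "residential"
assert building_type("yes") == "unclassified"
```

A more reliable classifier would also look at other tags on the same way (amenity, shop, landuse of the surrounding area), which is the more sophisticated assessment mentioned above.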

Sunday 8 June 2014

Between the data and the map

I have long felt that one of the easiest ways to prepare OSM data for use in a map is to create an intermediate database, holding separate tables for each of the different features. Rather than working directly from the original tagging and geometry these tables can contain a set of values that have been adapted to make it easier to work with the data.

Such an approach won't suit all, but I find that it allows me to work from a fairly large extract (typically the whole of the British Isles), and to quickly produce maps interactively. I can easily assess the way that contributors actually use tags, then pick what transformations I want, without worrying much about how quickly these will be performed. I can progressively refine the data as I learn more about the way that tags have actually been used, and I can make local edits on my own data without worrying about corrupting the base data. I tend to do this stuff in bursts of activity, with long gaps between, so there are some administrative advantages in having a clear trail which documents the different transformations I've tried, and keeping these scripts separate from work on individual maps. I don't need a very fast turnaround on changes to the data, so periodically I do another extract, and rebuild everything.

Transforming the original data into an intermediate form seems to be the norm for routing, but I'm a bit surprised there isn't more discussion of similar techniques for other applications. So this is how I handle data on linear roads, in the hope that it will be of use to others.

Some highway features are plotted as points (mini-roundabouts, turning circles) and some are plotted as areas. I do stuff with these as well, but for now, let's concentrate on how I transform data on highways that are represented by lines.

I download a planet extract covering the British Isles, and load this into a postgis database. I then extract all the linear highways into a separate table (anything tagged "highway" but not "area=yes"). I retain the geometry, the original identification (so I can backtrack and fix any problems later), and I keep the original tagging (as an hstore). To be more precise, I simplify the geometry slightly at this stage, since the base data is sometimes unnecessarily detailed for my purposes. This doesn't seem to make a big difference for roads, but it does for some other features. For ways that represent lines I do no validation on the geometry. For other features the geometry of polygons, and multi-polygons can be checked at this stage.

I then add separate columns for each of the different characteristics I might want to represent on a map.

Based on the original tags, I categorise each segment as either a "road", "track", or "path". I hold the data on paths and tracks in the same table as roads because there are areas of overlap, and keeping them together makes it easier to faff around in the grey areas between road and track, or track and path. However, there are different considerations involved in classifying different types of path and track. There is enough to worry about right now on roads, so dealing with the others is best left aside for a while.

I then re-interpret the original highway tagging for roads, using three different schemes, based on different ways that I might want to use the data.

  • For mapping the topology of the road network I want styles that indicate the official classification of the road (and where the road is unclassified, a rough indication of importance). So I use the most common values of the "highway" tag: Motorway, Trunk, Primary, Secondary, Tertiary, and the associated Link roads; Unclassified, Residential, and Service, etc. Tagging of roads is mature and heavily used, so it is pretty robust. A core set of a dozen common values covers 99.9% of the road network in Britain, and does so quite consistently. I impose these values with an "enum" data type. When I'm working on the data there's a huge temptation to use a more flexible data type, but I've found that forcing a closed set of values at this stage is a useful discipline. A dozen categories seems a lot, so I've tried to reduce the number, but found this doesn't work out well in practice. However, I do merge "living_street" and "residential". In principle these have different meanings, but in practice I reckon they are often used as synonyms. So I reclassify any roads tagged "living_street" as "residential" in my own database. Everything else is marked as "other". The remaining 60 or so values of the "highway" tag account for less than 0.1% of British roads. Some of these values are well-defined, but more specialised than I am ever likely to need ("bus_guideway"). Some look like sensible tagging that hasn't yet been widely adopted ("passing_place"). Even if I wanted to use these, the data would be too sparse to be of much use. Others are presumably errors ("bridalway"). In total these represent such a small proportion of the network that they are hardly worth dealing with systematically. However, one occurrence of a rare value can affect a map of a small area, and may need to be dealt with. So I put all of these values into the category "other". By applying a contrasting style to "other" I can easily spot problems, and deal with them manually.
  • Within a map covering a wide sweep of landscape the roads are not really a dominant feature. Along with water, woods, the contour of the land, etc. they are one of a number of features that map readers will use as a guide to location. For mapping these cases I only need a limited number of road styles - enough to indicate that a road exists, and give some idea of its importance. Beyond that I can't be bothered with managing a variety of many different styles for different types of road. So I also re-classify the different values of "highway" to just "wide", "standard" or "narrow". When I have no need for more detailed classification this small set of values makes it very easy to manage appropriate styles.
  • In the two previous cases the width of the road as rendered is an indication of its importance, and is commonly exaggerated to improve visibility. However, across a small section of townscape roads will naturally be represented at a relatively large scale. They will need to relate to buildings and other features that are positioned at their actual location. Over-exaggerating the width can cause problems, and I find it helps if roads are drawn broadly in line with actual width. There is a well-defined tagging scheme to specify road width, but it is not widely used. So what I do here is estimate the width of the road. I extract the width value where one is available, where it is easily parsed, and where the result is wide enough to be reasonably visible (in practice this captures almost all values that are presented in metres). If no suitable value is available, then I next default to calculating an approximate width based on the number of lanes (where this is provided). Otherwise I use a default value for each different type of road. There are a few cases where this doesn't work well, but on the whole it looks fine. It seems to produce an estimate of width that is close enough for my purposes, and it is easy enough to render a width in metres when working in OSGB projection. 
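The cascade in that third scheme (width tag first, then lanes, then a per-category default) can be sketched like this. The figures are placeholders of my own, not the post's actual defaults:

```python
# Default rendered widths (metres) per road category -- illustrative figures.
DEFAULT_WIDTHS = {"motorway": 12.0, "primary": 9.0, "residential": 6.0,
                  "service": 4.0}
LANE_WIDTH = 3.0      # assumed metres per lane
MIN_VISIBLE = 2.0     # anything narrower would barely render

def estimate_width(tags, highway):
    """Width tag first, then lanes, then a per-category default."""
    try:
        width = float(tags.get("width", ""))  # only easily parsed metre values
        if width >= MIN_VISIBLE:
            return width
    except ValueError:
        pass                                  # e.g. "narrow", "4 m", "12'6"
    if "lanes" in tags:
        try:
            return int(tags["lanes"]) * LANE_WIDTH
        except ValueError:
            pass
    return DEFAULT_WIDTHS.get(highway, 5.0)

assert estimate_width({"width": "7.5"}, "residential") == 7.5
assert estimate_width({"lanes": "2"}, "residential") == 6.0
assert estimate_width({}, "primary") == 9.0
```

Unparseable or implausibly narrow width values simply fall through to the next rule, which matches the "where it is easily parsed, and wide enough to be reasonably visible" condition.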
For labelling I extract separate "name" and "reference" columns from the relevant tags. Having these values in separate columns isn't really necessary - I could extract the data from the tags whenever I needed it. However it makes life a little bit easier. In theory it would also allow me to do some fancy editing of the content (e.g. abbreviations, line breaks, hyphenation, ...). At present, though, I just retain the original values. I also add a "label" column which contains the name, if one exists, and the reference number if not. I did this because I liked the idea of standardising on always having a "label" column across a variety of different features (waterways, woods, whatever...). I thought it would be handy to always know where to find a label - so I set things up that way. In practice it didn't turn out to be as useful as I expected.

Experimenting with standardising the type of road surface hasn't proved very successful. Across Britain, more than 300 different values are used to describe the surface of a highway. These range from "asphalt" to "very_horrible". So it is a bit of a challenge to interpret this reliably. Differentiating between "metalled" and "unmetalled" is straightforward enough, though, based on common surface tags, and defaults for different types of road. So I've done that, but not used it much in practice.
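A sketch of the metalled/unmetalled split from common surface values plus per-road-type defaults. The value lists and defaults are illustrative, not a survey of all 300-odd values:

```python
METALLED = {"asphalt", "paved", "concrete", "paving_stones"}
UNMETALLED = {"unpaved", "gravel", "dirt", "grass", "ground",
              "mud", "sand", "very_horrible"}
# Per-road-type defaults when no recognised surface tag is present.
DEFAULT_METALLED = {"motorway": True, "primary": True,
                    "residential": True, "track": False}

def is_metalled(tags, highway):
    """Classify a highway as metalled or not from surface tag + defaults."""
    surface = tags.get("surface")
    if surface in METALLED:
        return True
    if surface in UNMETALLED:
        return False
    return DEFAULT_METALLED.get(highway, True)

assert is_metalled({"surface": "asphalt"}, "track")
assert not is_metalled({"surface": "very_horrible"}, "track")
assert not is_metalled({}, "track")   # default for tracks
```

Any value outside both sets falls back to the default for the road type, which is what makes the long tail of rare surface values manageable.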

There are other aspects of the data that might be worth more investigation. For example,

  • Is the current data consistent enough to differentiate between the different types of "service" roads: driveways, parking_aisles, alleys, etc. ?
  • Is there a practical way of dealing with pavements ("sidewalk=...")?
  • Many roads are commonly split into short segments because of the need to tag variations in detailed characteristics that change over short stretches. Where these characteristics are irrelevant it would be possible to merge the segments without loss of meaning. Is there any value in doing so - even if only for labelling?
Does anyone else have experience of these?