Wednesday 11 June 2014

Pre-processing buildings

Including buildings makes a noticeable difference when mapping a built-up area. In some cases buildings will make sense of a townscape in a way that the road network alone cannot achieve. In a rural area buildings can be important landmarks. So it is nice to be able to include them.

It's not difficult, but nor is it entirely straightforward. When mapping a local area, the most obvious question is whether OSM contains sufficient data on buildings to make it worth the effort. Mapping of buildings isn't as mature as it is for roads. So the answer to the big question of whether it is worth it at all varies between different parts of the UK. Just by eye-balling the standard map it is obvious that some built-up areas have been comprehensively mapped with buildings. In some areas only a few important buildings have been mapped. In many areas a town centre has been thoroughly mapped, but the suburbs are sparse.

There are ways of assessing building coverage programmatically. The Office for National Statistics publishes outlines of the built-up areas in England. It's not hard to assess the degree to which buildings in OSM cover the built-up areas defined by the ONS outlines. As a rule of thumb I reckon that if more than 10% of a built-up area is covered by buildings then it is likely to be fairly comprehensively mapped. If less than 10% is covered then many buildings are likely to be missing. Based on that approach, and a sample of larger UK towns, about one in ten built-up areas is currently blessed with a worthwhile proportion of building data in OSM.
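As a rough illustration of that rule of thumb, here is a minimal sketch in plain Python. It is not the code behind the figures above: the function names are made up for the example, coordinates are assumed to be in a projected system measured in metres (such as British National Grid), and holes and overlapping buildings are ignored. In practice the same sum would more likely be done in the database.

```python
def ring_area(ring):
    """Area of a simple polygon given as a list of (x, y) tuples (shoelace formula)."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def coverage_fraction(built_up_area, buildings):
    """Fraction of a built-up area outline covered by building footprints."""
    area = ring_area(built_up_area)
    if area == 0:
        return 0.0
    return sum(ring_area(b) for b in buildings) / area


def looks_well_mapped(built_up_area, buildings, threshold=0.10):
    """Apply the 10% rule of thumb (the threshold is the one quoted in the text)."""
    return coverage_fraction(built_up_area, buildings) > threshold
```

For a 100 m by 100 m outline containing two 20 m by 30 m buildings, the fraction comes out at 0.12, so the area would count as reasonably well mapped under this rule.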

Ordnance Survey StreetView data includes representations of buildings. The quality of the outlines isn't always good, but it's still possible to use the data to test the extent to which buildings known to the Ordnance Survey are included in the OSM database. The approach follows established techniques for assessing coverage of the road network - look for Ordnance Survey buildings that don't overlap an OSM building. Measuring this for the whole country would involve a lot of data collection and crunching, so I've only done it locally. While eye-balling works for big gaps, like a missing suburb, I've found StreetView handy for identifying and fixing the odd building that has gone missing - particularly in more rural areas.
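The "find the ones that don't overlap" idea can be sketched in a few lines of Python. To keep it self-contained this version swaps in a much cruder test than a true geometric overlap: it only compares bounding boxes, so it will miss some gaps where buildings sit close together. The function names are hypothetical, and both datasets are assumed to be in the same coordinate system.

```python
def bbox(ring):
    """Bounding box (min_x, min_y, max_x, max_y) of a list of (x, y) tuples."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def bboxes_overlap(a, b):
    """True if two bounding boxes share any area or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def unmatched_os_buildings(os_buildings, osm_buildings):
    """OS outlines whose bounding box overlaps no OSM building's bounding box."""
    osm_boxes = [bbox(ring) for ring in osm_buildings]
    missing = []
    for ring in os_buildings:
        box = bbox(ring)
        if not any(bboxes_overlap(box, other) for other in osm_boxes):
            missing.append(ring)
    return missing
```

With a spatial index in the database the same comparison scales much better than this pairwise loop, which is only practical for a local area.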

One further glitch is that a fairly significant minority of UK buildings in the OSM database (around 7%) have been sourced from OS StreetView data, rather than being freshly mapped. This data may be adequate for some applications, but these outlines weren't designed to support detailed mapping, and not everyone will want to use them.

So anyone who wants to represent one of the lucky 10% of towns is pretty much ready to go. Those who want to map the other 90% of towns have some more work to do before they can include reasonably comprehensive building data.

The next set of considerations relates to the geometry of buildings. Most of them are represented by polygons. More difficult cases are represented by multi-polygons; by collections of lines that need to be assembled into a polygon; or by a combination of these two. Processing these is quite complex, but once they are processed all valid examples can be placed in the same table as the simple polygons. Virtually all complex (but valid) cases can be reduced to a mix of polygons, and polygons with holes. PostGIS treats both of these as the same geometry type. So in practice, once they have been processed, complex geometries are just as straightforward to use as simple ones. It is getting the complex geometries sorted in the first place that is not so straightforward.

To start with the case of simple polygons: anything tagged "building" is assumed to represent an area. So it ought to be a closed loop, without self-intersections, and containing at least four nodes (counting the repeated first node that closes the loop). Some don't obey these rules, and life is easier if we dispose of the few exceptions first. In my pre-processing I flag up any exceptions in the diagnostics, so I can fix those I need.
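Those three rules are simple enough to sketch in plain Python. This is only an illustration, not the diagnostics used here: `ring_problems` is a made-up name, a way is assumed to be a list of (x, y) tuples with the first node repeated at the end, and the self-intersection test only catches segments that properly cross (it ignores ways that merely touch themselves or double back along the same line).

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1 or 0."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a point interior to both."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def ring_problems(nodes):
    """Return a list of reasons why a way is not a usable simple polygon."""
    problems = []
    if len(nodes) < 4:
        problems.append("fewer than four nodes")
    if not nodes or nodes[0] != nodes[-1]:
        problems.append("not closed")
    segments = list(zip(nodes, nodes[1:]))
    for i in range(len(segments)):
        # Adjacent segments share a node, so start comparing at i + 2.
        for j in range(i + 2, len(segments)):
            if _segments_cross(*segments[i], *segments[j]):
                problems.append("self-intersection")
                return problems
    return problems
```

An empty list means the way passed all three checks; anything else can go straight into a diagnostics report.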

For more complex geometries I wanted to understand how this was done, rather than use a black-box approach. So this is my DIY method. I collect all the linestrings that are part of a relation tagged "multipolygon". I get PostGIS to merge linestrings that are connected, then remove any that won't convert to valid polygons, and any where the lines overlap. The remainder are converted to polygons, and marked "inner", or "outer" depending on whether they are contained within another polygon. The inner polygons are then grouped according to the "outer" polygon that they lie within, and each group converted to the intended multi-polygon - with the original tags from the relation transferred. They can then be inserted into the same table of buildings that holds all the simple polygons. Sometimes there is a multipolygon relation tagged "building" where one of the elements is also tagged "building". I don't need both, so I remove any individual elements that would otherwise end up being included twice.
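To make the steps concrete, here is a stripped-down sketch of the same idea in plain Python. It is a stand-in for illustration, not the PostGIS processing described above: all the function names are hypothetical, linestrings are lists of (x, y) tuples, rings are assumed not to touch or overlap (those have already been thrown out by this stage), and tag transfer and de-duplication are left out. A ring nested inside a hole is simply attached to the outermost ring as another "inner".

```python
def merge_into_rings(lines):
    """Join linestrings end to end and keep only those that close into rings."""
    remaining = [list(line) for line in lines]
    rings = []
    while remaining:
        current = remaining.pop()
        extended = True
        while current[0] != current[-1] and extended:
            extended = False
            for i, other in enumerate(remaining):
                if other[0] == current[-1]:
                    current += other[1:]          # joins head to tail
                elif other[-1] == current[-1]:
                    current += other[-2::-1]      # joins tail to tail, so reverse
                else:
                    continue
                remaining.pop(i)
                extended = True
                break
        if current[0] == current[-1] and len(current) >= 4:
            rings.append(current)                 # unclosed chains are discarded
    return rings


def point_in_ring(pt, ring):
    """Ray-casting test: is pt strictly inside the closed ring?"""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def contained(ring, other):
    """True if every node of ring lies inside other (and they are different rings)."""
    return ring is not other and all(point_in_ring(p, other) for p in ring[:-1])


def assemble(lines):
    """Group rings into polygons: each outer ring with the inner rings inside it."""
    rings = merge_into_rings(lines)
    outers = [r for r in rings if not any(contained(r, o) for o in rings)]
    outer_ids = {id(o) for o in outers}
    polygons = [{"outer": o, "inners": []} for o in outers]
    for ring in rings:
        if id(ring) in outer_ids:
            continue
        for polygon in polygons:
            if contained(ring, polygon["outer"]):
                polygon["inners"].append(ring)
                break
    return polygons
```

Because "inner" and "outer" are worked out from containment, the role tags in the relation are never consulted, and disjoint outer rings in one relation simply come out as separate polygons.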

This process doesn't rely on contributors correctly tagging "inner" and "outer". It copes with examples where the contributor has included disjoint outer polygons in the same relation, and examples where a multipolygon relation only holds a single polygon. It doesn't fall over when the geometry is flawed - faulty components, or elements that overlap. It isn't completely robust. I've included diagnostics on some of the error cases, but not all. There are some errors that could probably be fixed automatically, but aren't - the whole building is just discarded. At the moment my approach doesn't handle cases where the "hole" in a multipolygon relation is defined by another multipolygon relation. There may be other problems I haven't encountered yet.

Some buildings (imports, I think) are plotted with more precision than I will ever need. So I also get PostGIS to "simplify" the geometry - i.e. reduce the number of nodes to a more sensible level.
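For anyone curious what "simplify" does under the hood: the PostGIS function ST_Simplify uses the Douglas-Peucker algorithm, and assuming that is the sort of simplification meant here, a bare-bones Python version looks like the sketch below. The names are made up for the example, the tolerance is in the same units as the coordinates, and this variant measures distance to the line through the two end points. On a closed ring, where the first and last points coincide, it falls back to plain distance from that point, so a very small ring can collapse to two points if the tolerance is too generous.

```python
import math


def _distance_to_line(p, a, b):
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        # a and b coincide (e.g. a closed ring): use distance to that point.
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def simplify(points, tolerance):
    """Douglas-Peucker: drop nodes that deviate less than tolerance."""
    if len(points) < 3:
        return list(points)
    index, worst = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _distance_to_line(points[i], points[0], points[-1])
        if d > worst:
            index, worst = i, d
    if worst <= tolerance:
        return [points[0], points[-1]]
    # Keep the farthest node and simplify each side of it separately.
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right
```

A wall with a few centimetres of wobble along it reduces to its two corners, while a genuine corner that sticks out further than the tolerance is kept.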

Most buildings in the database are tagged as such. Some aren't. So I could end up with gaps in the map as a result of inconsistent tagging rather than incomplete mapping. For example, there is the question of how to handle structures that are marked as "man_made" but not "building". If I include something marked as "building=silo" then shouldn't I include something marked "man_made=silo" as well? Similar examples would include a pier, storage_tank, or gasometer. Tagging isn't always consistent. Some such structures marked as "man_made" are also marked as a "building". Some are tagged as a building instead of "man_made". And sometimes the "man_made" key doesn't refer to a structure. For example "man_made=water_works" is often applied to the grounds in which the works stands, rather than to the structure itself. "man_made=embankment" is fair enough, but it is not a building. So I can't simply treat every feature marked as "man_made" as though it was a structure. War memorials, commemorative columns and the like present similar problems with inconsistent approaches to "historic" tagging. Sometimes a valid area has a "shop" or "office" tag, but no "building" tag.

My compromise solution has been to collect a small selection of key/value pairs that I am fairly confident I can treat as "structures that are a sort of building". I include those in the same table as buildings, even when they don't have a "building" tag. I classify them differently so I can pick and choose, then handle them manually. It's not a very satisfactory solution, but it captures most cases that I've come across, and until I can find a better way it will have to do.
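In outline, that kind of whitelist might look like the Python sketch below. The list of pairs is purely illustrative, drawn from the examples mentioned above rather than from the actual selection used, and the function and label names are hypothetical.

```python
# Illustrative whitelist only: key/value pairs treated as building-like structures.
STRUCTURE_TAGS = {
    ("man_made", "silo"),
    ("man_made", "storage_tank"),
    ("man_made", "gasometer"),
    ("man_made", "pier"),
}


def classify_feature(tags):
    """Return 'building', 'structure' or None for a dict of OSM tags."""
    if "building" in tags:
        return "building"
    for key, value in tags.items():
        if (key, value) in STRUCTURE_TAGS:
            return "structure"
    # Anything else (e.g. man_made=water_works or man_made=embankment) is left out.
    return None
```

Keeping the two labels separate means the "structures" can be styled or dropped independently of real buildings when the map is drawn.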

Finally, I've started to experiment with a limited amount of classification of different building types. Normally I have no need to differentiate between types of building, but I can envisage circumstances where it would be useful to highlight certain types (schools or hospitals, for example). The majority of buildings in OSM are tagged simply as a building. However, many can be re-classified from the value of the building key as being residential, industrial, retail, agricultural, commercial, transport or different types of institution. So buildings marked as "building=house" (or residential, terrace, apartments, semi, detached, flats, bungalow, etc.) can be fairly confidently treated as "residential". Other cases are more ambiguous ("building=store", or "block", for example). More sophisticated assessments based on a range of different tags should be able to classify more completely and more reliably, but I haven't explored that approach (yet).
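A lookup of this kind is easy to sketch in Python. The residential values below are the ones listed above; the entries for the other categories are just placeholders to show the shape of the table, and the names `BUILDING_CLASSES` and `classify_building` are invented for the example.

```python
BUILDING_CLASSES = {
    "residential": {"house", "residential", "terrace", "apartments",
                    "semi", "detached", "flats", "bungalow"},
    # Placeholder entries: a real table would list many more values.
    "industrial": {"industrial"},
    "retail": {"retail"},
    "commercial": {"commercial"},
}


def classify_building(value):
    """Map the value of the building key to a broad category."""
    value = value.strip().lower()
    for category, values in BUILDING_CLASSES.items():
        if value in values:
            return category
    # Plain building=yes and ambiguous values such as "store" or "block" end up here.
    return "unclassified"
```

Leaving the ambiguous values unclassified, rather than guessing, keeps the confident categories clean until a multi-tag approach is worth trying.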
