Saturday, 15 November 2014

Bandstands

Bandstands are structures with a wide appeal, and the way they have been recorded in OSM throws a light on how contributors approach features that fall outside the mainstream. The OSM data on bandstands isn't complete, but the database may already contain one of the most comprehensive lists of existing bandstands in the UK.

The most extensive list of UK bandstands that I have found is the list of Vintage Bandstands, here. This has 334 distinct entries, but I don't think they all still exist.

In 2001 the Urban Parks Forum surveyed local authority parks in the UK. Out of the 438 bandstands they could identify, 203 had already been lost, 186 were in use or under repair, and 49 were abandoned or unused. Since then the Heritage Lottery Fund has been investing in the restoration and rebuilding of bandstands. I can't find more recent data, so for now let's assume that there are still more than 200 bandstands in public parks in the UK.

Bandstand in Sefton park, from Wikimedia

Across Britain, there are about 142 bandstands that are listed for architectural or historic interest. I have locations for those in England, and roughly three-quarters of them lie inside public parks, and roughly a quarter outside public parks. Around the same proportion of the bandstands in OSM are within a park, so it is probably fair to assume that by concentrating on parks, the survey by the Urban Parks Forum was primarily concerned with about three-quarters of the bandstands in the UK.

So as a starting point I'm going to assume that there are about 275 bandstands left in Britain, of which about 213 are inside public parks, and 62 outside public parks - fewer than on the Vintage Bandstands list, and more than the Urban Parks forum suggests.

I've managed to find 223 bandstands in the OSM data for Britain, which is about 80% of my estimated total. If these features really are bandstands, then that's an impressive result for a feature that I thought would fall well outside the mainstream.

Our ancestors obviously knew how to build structures that would maintain their appeal, and that's part of the attraction of examining data on bandstands. But I also wanted to look closer at the data because the tagging of bandstands in OSM is particularly inconsistent, and I thought we might learn something from that.

A fairly simple search of the OSM database for anything that mentions "bandstand" (and spelling variations) will pick up 247 features. On inspecting the data we find about a dozen of these are false positives: completely different features with the word "bandstand" in the name. Ten appear to be duplicates. If two different features within a few hundred yards of each other both describe a bandstand then they are probably referring to the same structure in the real world. Some of the duplicates occur because contributors have added and tagged both a way and a node. Some may be because one attempt hasn't rendered and a later contributor thought the feature was missing. In a few cases contributors seem to have been uncertain how to tag the feature, so they have added more than one option.
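The "within a few hundred yards" rule of thumb can be sketched in a few lines of Python. This is only an illustration, not the method used for the figures above: the 300 metre threshold, the function names and the `(id, lat, lon)` input format are all assumptions made for the example.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def candidate_duplicates(features, max_distance_m=300):
    """Return pairs of feature ids lying within max_distance_m of each other.

    `features` is a list of (id, lat, lon) tuples. For a way, pass any
    representative point, such as its centroid. The threshold is an
    assumed stand-in for "a few hundred yards".
    """
    pairs = []
    for (id_a, lat_a, lon_a), (id_b, lat_b, lon_b) in combinations(features, 2):
        if haversine_m(lat_a, lon_a, lat_b, lon_b) <= max_distance_m:
            pairs.append((id_a, id_b))
    return pairs
```

Comparing every pair is perfectly adequate for a couple of hundred features. The pairs it returns are only candidates: two genuinely separate bandstands in the same park would also be flagged, so each pair still needs checking by eye.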

About 50 of the bandstands in the OSM database correspond to the 93 listed bandstands in England, so contributors have added more than half of the bandstands that are listed by English Heritage. Very few have been marked as a structure with listed building protection. If the overall totals are correct, then contributors have located more than 80% of all bandstands, but less than 60% of listed bandstands. I'm not sure what to make of that.

For processing this data we really want to find bandstands based on well-defined attributes. Interestingly, with bandstands we find more variation in the choice of keys than in the choice of values. The most common contents of the value, by far, is "bandstand", with "band_stand" accounting for about one in thirty values. The keys that have been marked as a bandstand include "leisure", "amenity", "building", "historic", "type", "building:use", "tourism", "shelter_type", and "man_made". It's interesting that bandstand contributors choose quite a wide variety of different keys with quite a narrow range of values.
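As a minimal sketch of what "finding bandstands based on well-defined attributes" might look like, the following Python checks a feature's tags against the keys and values listed above. The function name is hypothetical, and normalising case and surrounding whitespace is an assumption that goes slightly beyond the two spellings mentioned.

```python
# Keys observed carrying a bandstand value, in the order listed above.
BANDSTAND_KEYS = (
    "leisure", "amenity", "building", "historic", "type",
    "building:use", "tourism", "shelter_type", "man_made",
)
# The two value spellings mentioned above.
BANDSTAND_VALUES = {"bandstand", "band_stand"}


def bandstand_keys(tags):
    """Return the keys in `tags` whose value marks the feature as a bandstand.

    `tags` is a plain dict of OSM key/value strings. Values are compared
    after trimming and lower-casing, which is an assumption of this sketch.
    """
    return [
        key for key in BANDSTAND_KEYS
        if tags.get(key, "").strip().lower() in BANDSTAND_VALUES
    ]
```

Returning the list of matching keys, rather than a simple yes or no, makes it easy to count how often each key is used and to spot features that carry more than one of them.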

There are 233 features that are recognisable as a bandstand from the data content. This includes 10 that appear to be duplicated, which mucks the numbers up slightly. I've slightly fudged this in the chart - to keep things simple.



The tag "leisure=bandstand" is recommended in the documentation, and accounts for about 24% of all UK bandstands (30% of the ones I found, and almost half of the bandstands that have been coded in a structured way). The "amenity" and "building" tags with a value of bandstand account for another 18% of all bandstands. Less common tags such as "historic", "type", "building:use", "tourism", "shelter_type", and "man_made" account for another 2%. Low usage of "man_made" surprised me, because I thought this would have been seen as more appropriate than "building". Apparently not.

More than a third of bandstands can be identified in the database by the name, but not by other coding. This approach, of course, is risky for automated processing, because the results need to be checked for false positives. However, for some purposes it is still worth looking at how bandstands are named. The form "name=bandstand" is the most common - almost as though the name tag is being used for coding, since this is not capitalised. Less common forms are "name=Band Stand", "name=The Bandstand" and "name=The Band Stand". I suspect contributors have been influenced here by labelling in the standard render. Together, these four values pick up 84% of the named bandstands. The rest mainly use names based on the location - such as "Southsea Bandstand".
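One way to separate the four generic name forms from everything else is a pair of regular expressions. The sketch below is illustrative only: the category labels and the exact patterns are assumptions, and anything that merely mentions a bandstand is deliberately routed to manual review rather than accepted.

```python
import re

# Matches the four generic forms: "bandstand", "Band Stand",
# "The Bandstand" and "The Band Stand", ignoring case.
GENERIC_NAME = re.compile(r"^(the\s+)?band\s?stand$", re.IGNORECASE)
# Matches any mention of a bandstand, including "band_stand" and "band-stand".
MENTIONS_BANDSTAND = re.compile(r"band[\s_-]?stand", re.IGNORECASE)


def classify_name(name):
    """Classify a name tag as 'generic', 'needs_review' or 'no_match'."""
    if not name or not MENTIONS_BANDSTAND.search(name):
        return "no_match"
    if GENERIC_NAME.match(name.strip()):
        return "generic"
    # Could be a location-based name, or a false positive such as a cafe.
    return "needs_review"
```

A name like "Southsea Bandstand" lands in "needs_review", which is the honest answer: a pattern cannot tell it apart from a cafe or theatre that happens to be named after a bandstand.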

The few remaining examples that I found are a mix of more obscure tagging, and spelling variations. These are of little interest, or value to data users.

There are probably about 50 UK bandstands that don't appear in the database (yet).

We have to be wary of false synonyms. I've looked at features in the database that correspond to the location of listed bandstands, and along with tagging variations in the OSM data these suggest that some contributors consider "gazebo" and "pavilion" as synonyms for "bandstand". Some just label the feature as a "shelter". A "gazebo" can look similar to a bandstand, although a bandstand is generally larger, raised higher above ground level, and clearly intended for a different purpose. The term "pavilion" might be acceptable as a technical description of the architecture, but in general data users will not be able to use it because it is so widely used to mark a sports pavilion. And "shelter" is a very general category, that doesn't help somebody who wants to identify bandstands.

Now to draw some conclusions.

Although we know for certain that some are missing, there's a case that the OSM database already contains one of the most complete lists of existing bandstands in the UK. With relatively little effort to plug the gaps, and verify existing data this is information that could be used productively, by anyone who wants to do so.

If somebody wants to find bandstands in the database at present they will probably look for values of "leisure", "amenity" or "building" that contain "bandstand". That will uncover almost half of all examples in the database. It starts to get quite complicated for data users to seek out tagging variations and find the next few percent. Searching the "name" tag for variants of "bandstand" will turn up quite a few likely candidates, and might be appropriate in some circumstances, but it would not be reliable enough for systematic processing. Without manual inspection this approach also picks up theatres, cafes, and the like that have been named this way.

The data on bandstands is fairly comprehensive and suggests that contributors can have quite different perspectives on these features. So this is an area where we could (and probably should) encourage more consistency, while tolerating quite a lot of variation in tagging. For the relatively small number of features in the database, bandstands demonstrate an unusually varied use of tags. Whether they realise it or not, different contributors have been marking bandstands according to their function ("leisure=bandstand" or "amenity=bandstand"), according to their form ("building=bandstand", "man_made=bandstand"), or according to their significance ("historic=bandstand"). These are surely all valid approaches, and they are not incompatible with each other. Indeed, quite often they are used together on the same feature. It is quite conceivable that one bandstand will be notable for its historic significance, but no longer in use as an amenity, while another might be in regular use as a leisure facility, but have no historic significance. Some might have changed function ("shelter_type=bandstand"). We don't know how this data might be used in future, so differences such as these ought to be reflected somehow in the tagging. However, the current documentation doesn't give guidance on such subtleties, so the current data probably doesn't record them accurately. All we can really say for now is that features with any of these tags have been recognised and recorded as bandstands.

If the community wants to improve the current data on bandstands then the following might be priorities:

  • locate the fifty or so missing examples
  • add structured tagging to the hundred or so bandstands which can currently only be found by name
  • fix inconsistencies - such as combining duplicates
  • encourage considered use of the existing tagging options to capture and retain information that can be collected in the field
  • add attributes (such as listed building status) that could be of interest or value to data users
  • find a group of bandstand enthusiasts who might be interested in verifying the data, finding innovative uses for it, and taking things further
  • celebrate with an outdoor concert

Tuesday, 11 November 2014

Kennels and catteries

Kennels and catteries are commercial businesses where owners can leave their cats or dogs while they are away from home. They are part of a wider category of animal boarding that also includes donkey sanctuaries, and organisations that take in domestic strays, pets whose owners are no longer able to cope, and wildlife such as hedgehogs. These are more likely to be run by charities.

In England and Wales animal boarding establishments (including kennels & catteries) are controlled by the Animal Boarding Establishments Act 1963. They have to be licensed by the local authority. The situation is similar in Scotland, but controlled by a different act.

Because they have to be licensed I thought it would be straightforward to find statistics on how many kennels and catteries there are in the UK. I was mistaken. Published government figures on the business population don't go down to that level of detail, local authorities don't seem to publish any statistics, and I can't find figures from trade bodies. However, the Valuation Office Agency does publish figures for different types of property, including the numbers of kennels and catteries there are in Wales and the English regions. These figures are a bit old (2010), but broadly in line with the numbers that come up on a search of Yellow Pages. Unless anyone can come up with a better figure, I think we can be fairly confident that there are just short of 5,000 kennels and catteries in the UK.

Of these I've been able to find just over 200 in the OSM database. Contributors have tagged roughly half of those as a kennel or cattery, and the other half can be identified (with reasonable confidence) as a kennel or cattery by the name.



Mapping these clearly hasn't been a high priority for OSM contributors.

I reckon that the data on kennels and catteries is too incomplete, and the tagging is too inconsistent for it to be of great practical use for rendering or other forms of data presentation (at present). The point of this post isn't to argue that things should be any different. These establishments are not particularly prominent features in the landscape. Dog and cat owners (in my experience, at least) will either have chosen a preferred animal boarding service already, or they will find one through personal recommendation rather than searching a database. This isn't quite the same for animal rescue, where I could see a need for an application to find the nearest hedgehog sanctuary (for example). But it's hard to see how pressure from data users is going to create a surge of interest in data on animal boarding. One day we may see enthusiasts kick off an "Animal Boarding Mapping Project". But my guess is that these are more likely to be added by non-specialist contributors who are working on mapping a wide range of different features within their local area.

In any case, the community will decide on priorities. My interest in looking at this is not to push for action on kennels and catteries. I'm more interested in seeing what we can learn about how contributors approach less commonplace features.

We find three different models for tagging kennels and catteries.
  • The usual approach is not to label these as a kennel or cattery at all. Examples of kennels and catteries can be picked up fairly easily through the contents of the "name" tag. There may be other, similar techniques that I haven't tried. In other words these features have been mapped, and named, but there is no further detail to indicate that they might be of special interest. In the chart they appear as "Name only".
  • The second most common approach is to use simple "amenity" tagging. This makes use of user-specified values of the "amenity" tag: "amenity=kennels" and "amenity=cattery". These aren't documented, but nevertheless, they represent more than half of the examples of kennels and catteries that have appropriate tags attached in the database. In the chart these appear as "Simple".
  • The third approach is more structured, and follows the tagging recommended in the documentation. This is based around "amenity=animal_boarding" with more detail added under "animal_boarding=...". This approach represents almost half of the examples of kennels and catteries that carry specific tagging. Although it is the documented approach it is not yet the most widely used. It appears in the chart as "Structured".
As always, there are a few variants on both of the structured models. By the look of it most of these are a mix of typing mistakes, and misunderstandings. They don't make a significant difference to the totals. There are an even smaller number of intriguing examples, though, where a contributor has used tagging based on "pet=". There are only a few of these, and it looks as though this might be an experiment that didn't go further, but it suggests that at least one contributor sees boarding kennels and catteries in terms of how they relate to other facilities for pets, rather than as a type of amenity, or as a service for animals.
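The three models above can be expressed as a small classifier. This is a sketch under stated assumptions rather than the processing behind the chart: the function name is hypothetical, the name pattern is a guess at what "picked up through the name tag" might mean, and the rarer variants (typing mistakes, "pet=" tagging) are ignored.

```python
import re

# Assumed name hint: "kennel", "kennels", "cattery" or "catteries" as a word.
NAME_HINT = re.compile(r"\b(kennels?|catter(y|ies))\b", re.IGNORECASE)


def tagging_model(tags):
    """Return 'Structured', 'Simple', 'Name only', or None for a tag dict."""
    amenity = tags.get("amenity")
    if amenity == "animal_boarding":
        # Documented scheme; detail lives under the animal_boarding key.
        return "Structured"
    if amenity in {"kennels", "cattery"}:
        # Undocumented but widely used amenity values.
        return "Simple"
    if NAME_HINT.search(tags.get("name", "")):
        return "Name only"
    return None
```

The checks run from most to least specific, so a feature that carries both structured tagging and a descriptive name is counted once, under "Structured". Name-only matches would still need a manual look, since a pub called "The Kennels" would pass the pattern.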

The blindingly obvious conclusion is that there are serious limitations in the current data. Inconsistent tagging might be a deterrent for users who want to render or process this data: but the real blocker is the very limited coverage.

In the case of kennels and catteries, it seems unlikely, as things stand, that anyone using this data will be interested in rendering or presenting it within an application, simply because there isn't (yet) enough coverage to make this viable.

There is no shortage of similar examples in the database, outside the mainstream,  where coverage is low.

To my mind this raises some interesting questions.

When the community is debating how best to tag features that fall outside the mainstream:

  • Who is using this data, and what are they using it for?
  • Do current approaches meet the needs of data users, can they be improved, and if so, how?
  • Could contributors be encouraged to add more useful data?

Presumably we think that one day coverage of these features will become sufficient to make rendering or application processing viable. If we don't think that, then why are we collecting this data at all?

  • Will the needs of data users change at that point?
  • If needs do change, how will that affect the data?
  • How will we tell when we have reached that stage?
Obviously I wouldn't be raising these questions if I held the view that the needs of data users were the same for all types of feature, and that they remained unchanged as the contents of the database evolve.

To me, the key question is "how do we tell when we have reached the stage that rendering or data presentation becomes viable?". But this might touch on some contentious issues. It's probably best to stop at that point, and see what others think.

Saturday, 8 November 2014

Bookies

The Gambling Commission publishes statistics on the number of bookmakers in Britain. Earlier this year they recorded 9,021 bookmaker premises, based on returns from operators. The number has been fairly constant in recent years.

A data user will be able to find about 1,280 of these in OSM (depending on how determined they are).




The preferred tag is "shop=bookmaker". That picks up 8% of all bookmakers in Britain. The most common alternative is "shop=betting" and that will pick up about 4% of the total.

There are a number of variants on these (shop=bet, bookmakers, betting_shop, bookies, turf_accountant). Together these pick up about 0.4% of the total.

A few contributors have gone down slightly different routes. Using the "amenity" tag rather than the "shop" tag accounts for another 0.2%, and various bookmaker-related values for "gambling" account for another 0.4%.

There are also some premises which look like a bookmaker (based on the operator or the name), but aren't marked as such. I found another 79 premises that might have added to the collection if they had been tagged with additional data. There are almost certainly more that I missed, but deep searches to find these get increasingly complicated and the results increasingly suspect.

I'm unable to find around 86% of British bookmakers in the OSM data.
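The percentages quoted in this post can be added up as a quick sanity check. The snippet below only re-uses figures given above; the variable names are invented for the example, and because the quoted percentages are rounded, the totals are approximate.

```python
TOTAL_PREMISES = 9021  # Gambling Commission figure quoted above

# Percentage of all British bookmakers found by each kind of tagging,
# as quoted (and rounded) in the text.
share_by_tagging = {
    "shop=bookmaker": 8.0,
    "shop=betting": 4.0,
    "other shop=* variants": 0.4,
    "amenity=*": 0.2,
    "gambling=*": 0.4,
}
untagged_lookalikes = 79  # premises recognisable only by name or operator

tagged_share = sum(share_by_tagging.values())
lookalike_share = 100 * untagged_lookalikes / TOTAL_PREMISES
found_share = tagged_share + lookalike_share
missing_share = 100 - found_share

print(f"tagged: {tagged_share:.1f}%  found: {found_share:.1f}%  "
      f"missing: {missing_share:.1f}%")
```

The tagged categories sum to about 13%, and the 79 untagged premises add a little under 1%, which is consistent with roughly 86% of bookmakers being absent from the data.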

Hotels

VisitEngland is the tourist board for England. It publishes statistics on the stock of different types of accommodation, broken down by local authority. These figures group all serviced accommodation together. So hotels, guest houses and B&Bs are all counted together, but non-serviced accommodation, such as camp-sites and self-catering holiday accommodation is counted separately. Importantly, they count establishments rather than businesses.

Comparing their figures with OSM data for hotels, guest-houses, hostels and B&Bs suggests that the map currently contains about 7,500 out of 32,000 of these establishments in England (i.e. 23%).

Roughly 15% are tagged as a hotel, roughly 6% as a guest house, 1% as a hostel, and under 1% as a B&B. There are also a small number of motels and some pubs with accommodation that are conventionally tagged. As usual, there is a smattering of weird spellings, and a few misunderstandings of tagging conventions, but these only account for a couple of dozen entries. They don't make a significant difference to the totals. More than 75% of establishments that provide serviced accommodation are missing from the data (or at least not easily found).




VisitEngland count all hotels together, and I've not been able to find a reliable-looking and detailed breakdown of different types of serviced accommodation. Smaller establishments seem to account for less than a third of the data, but I would think that they account for a lot more than a third of all establishments. Their trade association suggests there are about 25,000 guest houses and B&Bs in Britain. OSM contributors have located about 14% of that number (note - this figure relates to Britain, others to England). It's possible that some of these have been tagged as a hotel rather than a guest house or B&B, but on the face of it, smaller establishments appear to be the most under-represented.

Looking at the distribution by local authority, it seems that either VisitEngland is particularly sloppy at counting hotels in the Midlands, or OSM contributors are particularly diligent at mapping them. For the rest of us, this might be a good time of year to fill some gaps before visitors and proprietors start gearing up for the 2015 season.

If anyone is planning to map serviced accommodation they will probably want to pick up tourism=hotel, guest_house, hostel, bed_and_breakfast or motel and amenity=pub with accomodation=yes.
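As a rough illustration, the selection described above could be written as a simple filter over a feature's tags. The function and constant names are hypothetical, and the accommodation key is spelled exactly as it appears in this post, so it is worth checking against the live data before relying on it.

```python
# tourism=* values treated as serviced accommodation, as listed above.
SERVICED_TOURISM_VALUES = {
    "hotel", "guest_house", "hostel", "bed_and_breakfast", "motel",
}
# Key spelled as written in the post; verify the spelling in use
# in the database before relying on it.
PUB_ACCOMMODATION_KEY = "accomodation"


def is_serviced_accommodation(tags):
    """True if a tag dict matches the serviced accommodation selection."""
    if tags.get("tourism") in SERVICED_TOURISM_VALUES:
        return True
    return (
        tags.get("amenity") == "pub"
        and tags.get(PUB_ACCOMMODATION_KEY) == "yes"
    )
```

Keeping the key in a named constant means a different spelling can be tried with a one-line change.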

Saturday, 1 November 2014

QGIS 2.6

I have just downloaded the latest version of QGIS, and I'm using it as an excuse to play around combining different sources of data. Here is a mix of Bing Aerial imagery, hill-shading based on OS Opendata elevations (resampled) and OSM geometry for roads, buildings and water. Why not? It almost works.

Tuesday, 14 October 2014

Duke's Footway

I'm a few days late, but last Friday the dog and I explored another local walk. Today I've added it to the map. This one wasn't recorded by the Board of Health in their 1869 survey.

At the time it wasn't clear whether the path existed or not. The reason was this.

In the 1850s the Duke wanted to build a new wall around his park. He offered to build the wall 6 feet inside his own land, leaving the land outside as a footpath that anyone could use. Everyone thought this was a good idea, so it went ahead.

A decade went by before people seem to have realised that the new wall blocked an ancient footway, and the Duke hadn't offered to build a footpath: just to make the land available. People felt cheated: they had lost an ancient right of way, without gaining a footpath. There was a bit of argy-bargy, and the path wasn't included in the 1869 survey. Things must have got sorted eventually, because today this counts as a public footpath.

The path follows the wall that the Duke built in 1858. So it couldn't be easier to follow - just by staying close to the wall. The path itself isn't heavily used, or surfaced. But it is used, and fairly easy walking - though there are some muddy bits, and a couple of streams to cross. There are also a few stiles, but they have gaps in the fence that a labrador-sized labrador can get through. We know because we tested them.

As you climb the views east over town to the sea come and go with the roll of the land. There are less interesting views over the moor to the south. On reaching the top the views out to sea disappear, but there are fine views inland to the east and north.

There is also a radar dome which provides long-range early warning and control for the UK Air Surveillance And Control System (ASACS). Basically this warns of intruders into UK airspace. The enclosure, apparently, is home to rabbits, stoats, newts, frogs, birds and sometimes adders.



It's worth the climb. The top of the hill is called Cloudy Crags. I can't think why. We were particularly lucky with the way the sun was breaking through. The return journey was easier but less impressive. We came back on the road, and the rain started.  










Friday, 3 October 2014

1869 footway survey

I discovered a newspaper article from 1869 describing a survey of local footpaths.

In 1869 the town was expanding and increased traffic was damaging some footpaths. Farmers were starting to make agricultural improvements that involved ploughing up paths, or interrupting them with new drainage ditches and fences. Turnpike trusts, who had managed local roads for the previous century or so, were coming to the end of their life.

So it was an interesting time for footpaths.

There were nearly thirty footpaths in the survey, and the dog and I went to explore one of them. It ran for about 1.5 miles, is clearly described in the newspaper article, and clearly marked on Ordnance Survey maps of the time. It doesn't really lead from anywhere, or to anywhere, but we made a round trip, resulting in a walk which took about three hours.

This is still a public footpath, which pre-dates the later field pattern - so it cuts across the middle of fields and their boundaries. It is not marked with signposts, except at the start and finish, but it runs in a fairly straight line, and large stiles have been installed at all the fences. So the route is mostly easy to follow. The stiles, unfortunately, are almost 100% dog-proof, so following this path with a labrador involves quite a few diversions to discover suitable gates.




The area has an interesting history. It went through two phases of enclosure and improvement. Early enclosures around 1700 increased the value of the land. Before that nobody seemed to care exactly who had what rights to the common land. With increased value there were a number of disagreements which led to court cases, and fences being pulled down. A second wave of improvement had to wait until the resulting disputes were resolved in the middle of the 19th century.

The ruined house above dates from the second phase of improvement: around 1870-1890. From a similar period, the path crosses the old track-bed of the Alnwick / Cornhill railway, which was built between 1884 and 1887, and closed in 1953.



Nearby is a farm that was owned in the 18th century by an eccentric mathematician, who divided his estate into highly regular square fields, fathered five sons and at least five daughters, but still found time to build an organ for a local church, design a threshing machine, and build a flying machine, from leather and feathers. After summoning friends and servants to witness his first flight, he jumped off his granary stairs, and landed (unhurt) in a gooseberry bush.

"Chapeau" to him.

The path wasn't traced on OSM, but I've added it now. It's a pleasant walk, with more than a bit of interest. The weather was fine, I enjoyed myself, and the dog seemed happy.

In 1869 most of the paths in the survey existed purely for utilitarian reasons. Only a few have been lost, most remain, and this isn't the only one that looks interesting.

So I think it's fair to say that this theme can be continued...

Wednesday, 6 August 2014

Scots as a percentage of the population



Using 2011 Census data on National Identity.

Tuesday, 24 June 2014

More on rivers

I've been looking into a couple of questions about the OSM data on rivers.

The straightforward point is about naming of rivers. The guidance is to use the complete name on the waterway=river line. So "River Tyne" rather than "Tyne". However this guidance isn't followed consistently. About 12% of UK rivers (waterway=river) haven't been tagged with a name at all, roughly half are tagged with a name in the form "River xxxx", and the rest are tagged with a different form of name. A lot of the time that doesn't matter, and in any case some rivers have widely recognised names that don't include the "River..." prefix. To label "Afon Teifi" as "River Afon Teifi" would just be silly.

The bigger challenge is that there isn't consistency within naming of different segments of the same river. Rivers are long, and normally mapped in pieces. There isn't a widely adopted reference system for rivers, as far as I can see. So (alongside the geometry) names are the best way of associating different pieces of the same river.

Most of the larger UK rivers have segments tagged with a mix of different forms of the same name - including the River Avon (8% of length = "Avon"), River Thames (1% = "Thames"), River Derwent (11% = "Derwent"), River Trent (4% = "Trent") and River Don (6% = "Don"). Tagging with different forms of name gets in the way of other processing, so I try to get round this by standardising on a stripped form of name (i.e. removing "^River " with a regular expression). This doesn't work all the time, but for most purposes it works well enough.
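The stripping step, and the kind of length-weighted comparison quoted above, can be sketched in Python. The `^River ` pattern is the one described in the text; the function names and the `(name, length)` input format are assumptions for the example, and any lengths fed in would need to come from the actual geometry.

```python
import re
from collections import defaultdict

RIVER_PREFIX = re.compile(r"^River ")


def stripped_name(name):
    """Remove a leading 'River ' so 'River Tyne' and 'Tyne' compare equal."""
    return RIVER_PREFIX.sub("", name)


def short_form_share(segments):
    """Share of each river's mapped length tagged without the prefix.

    `segments` is an iterable of (name, length) pairs, in any consistent
    length unit. Returns {stripped name: fraction of length}.
    """
    total = defaultdict(float)
    short = defaultdict(float)
    for name, length in segments:
        key = stripped_name(name)
        total[key] += length
        if key == name:  # nothing was stripped, so the short form was used
            short[key] += length
    return {key: short[key] / total[key] for key in total}
```

Rivers whose usual name never carries the prefix, such as "Afon Teifi", will report a share of 1.0. That is expected rather than a sign of inconsistent tagging, and it is one of the cases where, as noted above, this approach doesn't work all the time.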

The next issue is a side-effect of mapping rivers as a mix of areas and lines. The map below shows the point just west of Hexham where the North Tyne and the South Tyne come together. Parts have been mapped as areas (waterway=riverbank) and parts have been mapped as lines (waterway=river). There's a gap between two of the areas that were drawn as riverbanks.

These look particularly odd here, because I've emphasised the problem by removing all but the river. In the default map the issues don't show up to the same extent. Bridges mask two of the three problems, and while the third is visible on the standard map, it tends to get lost amid other details.




But the default map isn't the only way this data is used. I'm interested in exploring the limits of using the same data for detailed maps, so I've been pondering on how best to handle this mix of lines and areas.

Of course the long-term answer is to add more detailed mapping. I'm doing that where I can, which tends to shift the problem further upstream. Meanwhile, in the short-term, varying the width at which I render waterway=river lines provides a partial solution. The issue is that different widths work for different parts of the river system. For example, further north, tributaries of the River Tweed have been mapped in quite a lot of detail, and at the point where the areas meet the lines the rivers tend to be narrower. Here's an example where exactly the same style for waterway=river works well enough when it is mixed with areas tagged waterway=riverbank.


Choosing a suitable width for the waterway=river line can look OK. A width that is too narrow will leave step changes like my first example. Unless I start chopping up overlapping lines and areas, then a width for the waterway=river line that is too large will obscure more detailed mapping of riverbanks.

Although there is a tagging scheme for explicitly recording the width of a river (width=n) it is rarely used in the UK, and very rare around here: so it is not a great deal of help to me in practice. Usage is patchy though, so there are places where others may find it more useful (see darker lines below for some idea of where width has been applied to waterway=river). For what it's worth, where it is specified the average width given for a river is just under 5 metres.



In theory it might be possible to estimate a different width for each river segment by analysing adjoining lines and any overlapping areas of riverbank. However, I suspect this would turn out to be too complicated to be of any practical use.

While I wait for more complete mapping, what I think I need is a reasonably sensible default river width. River widths vary, so there are always going to be cases where a single default is either too wide, or too narrow. But because I want to retain mapping of detailed data wherever possible, I would prefer to err on the side of choosing a default width that is more often too narrow, rather than one which is more often too wide.

As a rule of thumb, I reckon that a width equivalent to 9 metres on the ground works quite well around here, particularly with rivers that have been mapped in a fair amount of detail. Where it doesn't work so well there is an incentive to improve the mapping. As far as I can tell this is also about the right default width for rivers in the rest of the UK.

In more than 90% of cases where the width of a river is specified it is less than 10 metres. I'm not doing any special processing for such cases, because they don't occur around here, but there may be a case for handling river width where it is specified.

I don't have any evidence for the right thresholds for width elsewhere in the world, but consider this. Whether by accident or design, in Northern European latitudes, nine metres seems to be roughly the width that a default waterway=river line represents in the standard map. So on the standard map a mix of river lines and areas happens to turn out reasonably well at the point where a river is nine metres wide. I imagine that contributors who are influenced by the standard map will tend to stop drawing rivers as areas once the river width reduces to the equivalent of the default line width on the standard map. They may not realise they are doing this, but if they notice that riverbank areas don't add value to narrower rivers then this would be a sensible point for them to stop adding areas. If this is a global effect, then perhaps there could be some appropriate guidelines for data contributors.
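The relationship between a line width in pixels and a width on the ground depends on both zoom level and latitude. A minimal sketch of the conversion follows, assuming 256-pixel Web Mercator tiles; the function names are invented for the example, and nothing here says what line width the standard style actually uses at any zoom.

```python
from math import cos, radians

# Ground resolution at the equator at zoom 0 for 256-pixel tiles:
# Earth's equatorial circumference (2 * pi * 6378137 m) divided by 256.
EQUATOR_M_PER_PX_Z0 = 156543.03392804097


def metres_per_pixel(lat_deg, zoom):
    """Approximate ground metres covered by one pixel in Web Mercator."""
    return EQUATOR_M_PER_PX_Z0 * cos(radians(lat_deg)) / 2 ** zoom


def ground_width_to_pixels(width_m, lat_deg, zoom):
    """Line width in pixels that represents `width_m` on the ground."""
    return width_m / metres_per_pixel(lat_deg, zoom)


if __name__ == "__main__":
    # Pixel width equivalent to 9 metres at an example latitude of 55 degrees.
    for zoom in (14, 15, 16, 17):
        print(zoom, round(ground_width_to_pixels(9, 55, zoom), 2))
```

Because of the cosine term, a fixed pixel width stands for fewer metres on the ground the further a map is from the equator, which is why any "nine metres" equivalence can only hold for a band of latitudes at a given zoom.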

As supporting evidence, I've sampled about 130,000 points along UK riverbanks from the OSM database. The most common width between banks is 9-10 metres. This is well below the average width, or even the median, because there are many sections of river which are wider. But more than 10% of my sample showed a river width of 7-11 metres. The number of samples falls off quite slowly for wider sections of river, but quite quickly where a river area is narrower. Only 2% of my sample showed a river width less than 5 metres.


I haven't done a similar measurement of mapped width for streams. There are some streams mapped as areas, so it should be possible, but I have doubts about how accurate, and therefore how useful, the result would be.

The standard definition of the difference between a river and a stream is that a stream "can be jumped across by an active, able-bodied person". Intriguingly, the world record for a running long jump is just short of 9 metres. So if we wanted to be silly, we could argue that a default river width of 9 metres is consistent with one interpretation of the transition point between a river and a stream.

More realistically, several fitness measures suggest that the kind of distance a fit adult can achieve with a standing jump is in the region of 2-2.5 metres. I wouldn't try to jump a stream that wide. But the figure of 2.5 metres suits my purposes as a default width for a stream.

There are about 3,000 sections of stream with a specified width in the OSM data for the British Isles. The average width is just over 1 metre. About 85% of stream segments with a specified width are less than 2 metres, and almost half are less than 1 metre. So my chosen default is higher than the average, and higher than most specified stream widths. I may have set it too high. However, it seems only slightly wider than the standard map rendering of a stream. On larger scale maps I reckon that a stream shown as 2.5 metres wide is about the minimum that renders reasonably clearly, and there is scope to increase the width without streams becoming too dominant across a map of open countryside. It still leaves me with a bit of a big step between an effective minimum river width of 9 metres, and a stream width of 2.5 metres, but the quality of stream tracing looks like being more of an issue.

I'm going to leave further tuning of river widths as a problem for another day. However, with an eye on the long-term possibilities, it might help detailed mapping if contributors were encouraged to map rivers that are more than 10 metres wide as areas, and add a width tag to rivers that are less than 10 metres wide.  
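
The defaults chosen above can be summarised in a few lines. This is only a minimal sketch: the function name `rendered_width` is hypothetical, and honouring a `width` tag where one exists is an assumption for illustration (the processing described above applies the defaults only).

```python
from typing import Optional

# Default widths in metres, taken from the figures discussed above.
DEFAULT_WIDTH_M = {"river": 9.0, "stream": 2.5}


def rendered_width(waterway: str, width_tag: Optional[str] = None) -> Optional[float]:
    """Return a width in metres for a linear waterway, or None if unknown.

    Using the width tag when present is an assumption for illustration;
    the post itself only applies the defaults.
    """
    if width_tag is not None:
        try:
            value = float(width_tag.strip())
            if value > 0:
                return value
        except ValueError:
            pass  # unparseable width such as "wide": fall back to the default
    return DEFAULT_WIDTH_M.get(waterway)
```

Keeping the defaults in one small table makes it easy to retune them later without touching the styling.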

Friday, 13 June 2014

Pre-processing water

Like roads, water can be represented as either a line or an area. Roads represented as lines tend to dominate those represented as areas, but areas of water play the more important role. Lines of water (at least in my part of the world) are almost all small streams. They add detail, but don't dominate. Dealing with the sea is a special case of dealing with water. It needs to be handled differently, but that's not difficult.

I find the quality of data on coastal landscape (beaches, cliffs, rocky outcrops) is a bit patchy. It's important to me that I get the big areas of water right (rivers, lakes and the coastline). I'm less concerned with the finer detail of streams etc. I spot the occasional oddity, but on the whole I find that the current data on inland water is as complete and as accurate as I need - if not more so. As a result I've not done any systematic examination of the quality of data on inland water. The only way that I've used to examine data on coastal landscape is by eye.

Some ways that describe water are assumed to represent areas by default, others are assumed to represent lines by default. To collect areas of water I extract all the ways marked 'natural=water', 'waterway=riverbank', 'pond','dock', 'boatyard','lake','reservoir', or 'estuary'. All of these are assumed to represent areas. I also extract anything tagged 'area=yes' alongside 'waterway=...whatever'. Because I'm selecting an inclusive list of values I end up dropping some possible areas of water with this approach. That isn't very satisfactory. However, the list of potential "waterway" values is quite long, and varies across features that should represent areas of water ("marina"), those that could represent either a closed line or an area ("moat"), as well as those that probably don't represent water ("pumping-station") and those that probably need individual treatment ("weir", "slipway"). I can't see a simple approach that would handle all these cases robustly. Users who are interested in water features would have to, but (so far) I don't.
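
The selection rule described above might be sketched like this. The function name `is_water_area` is hypothetical, and reading 'pond', 'dock', 'boatyard', 'lake', 'reservoir' and 'estuary' as values of the "waterway" key is an assumption about the list above.

```python
# Waterway values treated as areas by default (reading the list in the text
# as values of the "waterway" key is an assumption).
AREA_WATERWAY_VALUES = {
    "riverbank", "pond", "dock", "boatyard", "lake", "reservoir", "estuary",
}


def is_water_area(tags: dict) -> bool:
    """Decide whether a way's tags mark it as an area of water."""
    if tags.get("natural") == "water":
        return True
    waterway = tags.get("waterway")
    if waterway is None:
        return False
    if waterway in AREA_WATERWAY_VALUES:
        return True
    # Any other waterway value counts only when explicitly flagged as an area.
    return tags.get("area") == "yes"
```

Because the list is inclusive, a value such as "marina" is dropped unless it also carries 'area=yes', which is the weakness noted above.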

In practice a few of the ways I select are not valid representations of an area, so I throw away anything that isn't a closed loop with more than three nodes and no self-intersection. What's left are easily converted to polygons, the geometry simplified, and then added to a table of water areas.
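
The three tests (closed loop, more than three nodes, no self-intersection) can be illustrated without a database. This is a plain-Python sketch, not the postgis checks themselves; the function names are hypothetical, a ring is assumed to be a list of (x, y) tuples with the first node repeated at the end, and a ring that touches itself at a node is treated as invalid.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1 or 0."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _within_box(a, b, c):
    """True if c lies inside the bounding box of segment ab."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def _segments_intersect(p1, p2, p3, p4):
    """True if segment p1-p2 crosses or touches segment p3-p4."""
    o1, o2 = _orient(p1, p2, p3), _orient(p1, p2, p4)
    o3, o4 = _orient(p3, p4, p1), _orient(p3, p4, p2)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint of one segment lies on the other.
    return ((o1 == 0 and _within_box(p1, p2, p3))
            or (o2 == 0 and _within_box(p1, p2, p4))
            or (o3 == 0 and _within_box(p3, p4, p1))
            or (o4 == 0 and _within_box(p3, p4, p2)))


def is_valid_ring(coords):
    """Closed loop, more than three nodes, and no self-intersection."""
    if len(coords) < 4 or coords[0] != coords[-1]:
        return False
    n = len(coords) - 1  # number of segments
    for i in range(n):
        for j in range(i + 1, n):
            # Neighbouring segments share a node by construction; skip them.
            if j == i + 1 or (i == 0 and j == n - 1):
                continue
            if _segments_intersect(coords[i], coords[i + 1],
                                   coords[j], coords[j + 1]):
                return False
    return True
```

The pairwise comparison is quadratic in the number of nodes, which is fine for a demonstration but is a good reason to leave the real work to the database.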

Multi-polygons of water are assembled in the same ways as multi-polygons of buildings (see earlier post), using multi-polygon relations that carry similar water-related tags to the simple polygons described above. They are added to the same table as the simple polygons.

I don't load coastline, or the sea, from the standard OSM extract. It's quicker and easier to pull the shapefiles from here, and import them into postgis. I am only working on UK data so I could then delete most of the world's oceans. I'm not sure whether this would make a significant performance difference. I thought I might run into problems because I'm working off two different data extracts, but of course the coastline is pretty stable, and it hasn't caused me any difficulty in practice.

While I'm dealing with areas of water I also add 'natural=beach' and 'natural=wetland' areas to the same table as I use for water areas. It's a pragmatic approach, which I adopted because I've found it makes life easier later on if I can treat all these areas together.

Because I put water, beach and wetland into the same table I need to classify these differently. However, I've never felt it necessary to make any further distinction between different types of water area. I use the same basic style for all, whether they are tagged as "riverbank", "canal", "dock", "boatyard", "pond", "lake", "reservoir" or whatever. So I don't bother with further classification. The original tags are all retained in case I ever need them.

In principle I treat every other way that is tagged "waterway" as a linear feature. The vast majority of these are marked "stream", "drain", "river", "canal", or "ditch". The rest of the waterway values in the UK are a bit of a ragbag. Most have a very small number of examples. Some data users may need to deal with these more systematically, but I've never found the need. I classify the most common values in a separate column, so I can handle different styles easily. I use an "enum" data type to impose a closed set of values. Everything other than the most common values is classified as "other" so I can spot any important instances and deal with them manually. I've not needed to (so far). 
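
The closed set of values can be mimicked outside the database. This sketch uses a Python Enum as a stand-in for the database "enum" type; the names `WaterwayClass` and `classify_waterway` are hypothetical.

```python
from enum import Enum


class WaterwayClass(Enum):
    """Closed set of linear waterway classes, mirroring the common values."""
    STREAM = "stream"
    DRAIN = "drain"
    RIVER = "river"
    CANAL = "canal"
    DITCH = "ditch"
    OTHER = "other"


def classify_waterway(value: str) -> WaterwayClass:
    """Map a raw "waterway" tag value onto the closed set."""
    try:
        return WaterwayClass(value)
    except ValueError:
        # Rare or mistyped values all land in OTHER, where they can be
        # spotted and dealt with manually.
        return WaterwayClass.OTHER
```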

I add a "label" column to the tables for both areas of water and linear waterways. In principle I could do all sorts of fancy stuff with this, but at present it simply contains the value of any "name" tag. 

There are some questions I haven't addressed when dealing with the water data at a relatively large scale. I'm not very happy with the way that I distinguish between areas and linear features, though it seems to work well enough in practice around here. I haven't found a very satisfactory way of dealing with the transition where a river moves from being represented as an explicit area, to being represented symbolically by a line. I've also realised that the data on coastal landscape is a bit patchy around here - so there's some map improvement work to be done.

As I wrote this I noticed some errors in the tests I've been using to extract linear water features. Some of the ways that represent areas of water have ended up included as linear ways as well. I hadn't noticed that earlier, so it obviously hasn't been much of a problem. However, I'm sure I can't be the only data user who is post-processing data in this kind of way, and I think that goes to show that some collective effort on post-processing techniques would benefit us all.

Wednesday, 11 June 2014

Pre-processing buildings

Including buildings makes a noticeable difference when mapping a built-up area. In some cases buildings will make sense of a townscape in a way that the road network alone cannot achieve. In a rural area buildings can be important landmarks. So it is nice to be able to include them.

It's not difficult, but nor is it entirely straightforward. When mapping a local area, the most obvious question is whether OSM contains sufficient data on buildings to make it worth the effort. Mapping of buildings isn't as mature as it is for roads. So the answer to the big question of whether it is worth it at all varies between different parts of the UK. Just by eye-balling the standard map it is obvious that some built-up areas have been comprehensively mapped with buildings. In some areas only a few important buildings have been mapped. In many areas a town centre has been thoroughly mapped, but the suburbs are sparse.

There are ways of assessing building coverage programmatically. The Office for National Statistics publishes outlines of the built-up areas in England. It's not hard to assess the degree to which buildings in OSM cover the built-up areas defined by the ONS outlines. As a rule of thumb I reckon that if more than 10% of a built-up area is covered by buildings then it is likely to be fairly comprehensively mapped. If less than 10% is covered then many buildings are likely to be missing. Based on that approach, and a sample of larger UK towns, about one in ten built-up areas are currently blessed with a worthwhile proportion of building data in OSM.
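
The rule of thumb is simple arithmetic once the areas are known. In this sketch the function names are hypothetical, and it is assumed that the building areas have already been clipped to the built-up outline and do not overlap one another.

```python
def building_coverage(building_areas_m2, built_up_area_m2):
    """Fraction of a built-up area covered by mapped buildings."""
    if built_up_area_m2 <= 0:
        raise ValueError("built-up area must be positive")
    return sum(building_areas_m2) / built_up_area_m2


def looks_well_mapped(coverage, threshold=0.10):
    """Apply the 10% rule of thumb described above."""
    return coverage > threshold
```

For example, 2,000 square metres of building inside a 10,000 square metre outline gives a coverage of 0.2, which passes the test; 500 square metres in the same outline gives 0.05, which does not.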

Ordnance Survey StreetView data includes representations of buildings. The quality of the outlines isn't always good, but it's still possible to use the data to test the extent to which buildings known to the Ordnance Survey are included in the OSM database. The technique is along the lines of established techniques for assessing coverage of the road network - look for Ordnance Survey buildings that don't overlap an OSM building. Measuring this for the whole country would involve a lot of data collection and crunching, so I've only done it locally. While eye-balling works for big gaps, like a missing suburb, I've found Streetview handy for identifying and fixing the odd building that has gone missing - particularly in more rural areas.

One further glitch is that a fairly significant minority of UK buildings in the OSM database (around 7%) have been sourced from OS Streetview data, rather than being freshly mapped. This data may be adequate for some applications, but these outlines weren't designed to support detailed mapping, and not everyone will want to use them.

So anyone who wants to represent one of the lucky 10% of towns is pretty much ready to go. Those who want to map the other 90% of towns have some more work to do before they can include reasonably comprehensive building data.

The next set of considerations relate to the geometry of buildings. Most of them are represented by polygons. More difficult cases are represented by multi-polygons; by collections of lines that need to be assembled into a polygon; or by a combination of these two. Processing these is quite complex, but once they are processed all valid examples can be placed in the same table as the simple polygons. Virtually all complex (but valid) cases can be reduced to a mix of polygons, and polygons with holes. Postgis treats both of these as the same geometry type. So in practice it's just as straightforward to use complex and simple building geometries. Getting the complex geometries sorted is not so straightforward.

To start with the case of simple polygons. Anything tagged "building" is assumed to represent an area. So it ought to be a closed loop, without self-intersections, and containing at least four nodes. Some don't obey these rules, and life is easier if we dispose of the few exceptions first. In my pre-processing I flag up any exceptions in the diagnostics, so I can fix those I need.

For more complex geometries I wanted to understand how this was done, rather than use a black-box approach. So this is my DIY method. I collect all the linestrings that are part of a relation tagged "multipolygon". I get postgis to merge linestrings that are connected, then remove any that won't convert to valid polygons, and any where the lines overlap. The remainder are converted to polygons, and marked "inner", or "outer" depending on whether they are contained within another polygon. The inner polygons are then grouped according to the "outer" polygon that they lie within, and each group converted to the intended multi-polygon - with the original tags from the relation transferred. They can then be inserted into the same table of buildings that holds all the simple polygons. Sometimes there is a multipolygon relation tagged "building" where one of the elements is also tagged "building". I don't need both, so I remove any individual elements that would otherwise end up being included twice.
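
The inner/outer step can be illustrated in plain Python. This is only a sketch of the containment idea, not the postgis processing itself: the function names are hypothetical, rings are assumed to be closed lists of (x, y) tuples that have already passed the validity and overlap checks, and containment is judged from a ring's first node only. Nested cases (an island inside a hole) are not handled.

```python
def point_in_ring(point, ring):
    """Ray-casting test for a point against a closed ring.

    Points lying exactly on the boundary are not handled reliably.
    """
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def group_rings(rings):
    """Return a list of (outer_ring, [inner_rings]) pairs."""
    def contained(i):
        return any(j != i and point_in_ring(rings[i][0], rings[j])
                   for j in range(len(rings)))

    # A ring is "outer" if no other ring contains it, whatever its tagging.
    outers = [i for i in range(len(rings)) if not contained(i)]
    grouped = {i: [] for i in outers}
    for i, ring in enumerate(rings):
        if i in grouped:
            continue
        for o in outers:
            if point_in_ring(ring[0], rings[o]):
                grouped[o].append(ring)
                break
    return [(rings[o], inners) for o, inners in grouped.items()]
```

Working from geometry alone is what lets this kind of approach ignore the "inner" and "outer" roles supplied by contributors, and cope with several disjoint outer rings in one relation.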

This process doesn't rely on contributors correctly tagging "inner" and "outer". It copes with examples where the contributor has included disjoint outer polygons in the same relation, and examples where a multipolygon relation only holds a single polygon. It doesn't fall over when the geometry is flawed - faulty components, or elements that overlap. It isn't completely robust. I've included diagnostics on some of the error cases, but not all. There are some errors that could probably be fixed automatically, but aren't - the whole building is just discarded. At the moment my approach doesn't handle cases where the "hole" in a multipolygon relation is defined by another multipolygon relation. There may be other problems I haven't encountered yet.

Some buildings (imports, I think) are plotted with more precision than I will ever need. So I also get postgis to "simplify" the geometry - i.e. reduce the number of nodes to a more sensible level.

Most buildings in the database are tagged as such. Some aren't. So I could end up with gaps in the map as a result of inconsistent tagging rather than incomplete mapping. For example, there is the question of how to handle structures that are marked as "man_made" but not "building". If I include something marked as "building=silo" then shouldn't I include something marked "man_made=silo" as well? Similar examples would include a pier, storage_tank, or gasometer. Tagging isn't always consistent. Some such structures marked as "man_made" are also marked as a "building". Some are tagged as a building, instead of "man_made". And sometimes the "man_made" key doesn't refer to a structure. For example "man_made=water_works" is often applied to the grounds in which the works stands, rather than to the structure itself. "Man_made=embankment" is fair enough, but it is not a building. So I can't simply treat every feature marked as "man_made" as though it was a structure. War memorials, commemorative columns and the like present similar problems with inconsistent approaches to "historic" tagging. Sometimes a valid area has a "shop" or "office" tag, but no "building" tag.

My compromise solution has been to collect a small selection of key/value pairs that I am fairly confident I can treat as "structures that are a sort of building". I include those in the same table as buildings, even when they don't have a "building" tag. I classify them differently so I can pick and choose, then handle them manually. It's not a very satisfactory solution, but it captures most cases that I've come across, and until I can find a better way it will have to do.

Finally, I've started to experiment with a limited amount of classification of different building types. Normally I have no need to differentiate between types of building, but I can envisage circumstances where it would be useful to highlight certain types (schools or hospitals, for example). So I have begun experimenting with this. The majority of buildings in OSM are tagged simply as a building. However, many can be re-classified from the value of the building key as being residential, industrial, retail, agricultural, commercial, transport or different types of institution. So buildings marked as "building=house" (or residential, terrace, apartments, semi, detached, flats, bungalow, etc.) can be fairly confidently treated as "residential". Other cases are more ambiguous ("building=store", or "block", for example). More sophisticated assessments based on a range of different tags should be able to classify more completely and more reliably, but I haven't explored that approach (yet).
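
A first pass at this kind of re-classification needs little more than a lookup. The sketch below covers only the residential values listed above; the function name and the labels "unclassified" and "other" are hypothetical, and the remaining categories (industrial, retail and so on) would each need a similar set of values that is not spelled out here.

```python
# Values of the "building" key that can fairly confidently be treated as
# residential, taken from the list in the text.
RESIDENTIAL_VALUES = {
    "house", "residential", "terrace", "apartments",
    "semi", "detached", "flats", "bungalow",
}


def classify_building(tags: dict):
    """Return a coarse class for a building, or None if it has no building tag."""
    value = tags.get("building")
    if value is None:
        return None
    if value in RESIDENTIAL_VALUES:
        return "residential"
    if value == "yes":
        return "unclassified"  # tagged simply as a building
    return "other"  # ambiguous or not yet mapped to a class, e.g. "store"
```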

Sunday, 8 June 2014

Between the data and the map

I have long felt that one of the easiest ways to prepare OSM data for use in a map is to create an intermediate database, holding separate tables for each of the different features. Rather than working directly from the original tagging and geometry these tables can contain a set of values that have been adapted to make it easier to work with the data.

Such an approach won't suit all, but I find that it allows me to work from a fairly large extract (typically the whole of the British Isles), and to quickly produce maps interactively. I can easily assess the way that contributors actually use tags, then pick what transformations I want, without worrying much about how quickly these will be performed. I can progressively refine the data as I learn more about the way that tags have actually been used, and I can make local edits on my own data without worrying about corrupting the base data. I tend to do this stuff in bursts of activity, with long gaps between, so there are some administrative advantages in having a clear trail which documents the different transformations I've tried, and keeping these scripts separate from work on individual maps. I don't need a very fast turnaround on changes to the data, so periodically I do another extract, and rebuild everything.

Transforming the original data into an intermediate form seems to be the norm for routing, but I'm a bit surprised there isn't more discussion of similar techniques for other applications. So this is how I handle data on linear roads, in the hope that it will be of use to others.

Some highway features are plotted as points (mini-roundabouts, turning circles) and some are plotted as areas. I do stuff with these as well, but for now, let's concentrate on how I transform data on highways that are represented by lines.

I download a planet extract covering the British Isles, and load this into a postgis database. I then extract all the linear highways into a separate table (anything tagged "highway" but not "area=yes"). I retain the geometry, the original identification (so I can backtrack and fix any problems later), and I keep the original tagging (as an hstore). To be more precise, I simplify the geometry slightly at this stage, since the base data is sometimes unnecessarily detailed for my purposes. This doesn't seem to make a big difference for roads, but it does for some other features. For ways that represent lines I do no validation on the geometry. For other features the geometry of polygons, and multi-polygons can be checked at this stage.

I then add separate columns for each of the different characteristics I might want to represent on a map.

Based on the original tags, I categorise each segment as either a "road", "track", or "path". I hold the data on paths and tracks in the same table as roads because there are areas of overlap, and keeping them together makes it easier to faff around in the grey areas between road and track or track and path. However, there are different considerations involved in classifying different types of path and track. There is enough to worry about right now on roads, so dealing with the others is best left aside for a while.

I then re-interpret the original highway tagging for roads, using three different schemes, based on different ways that I might want to use the data.

  • For mapping the topology of the road network I want styles that indicate the official classification of the road (and where the road is unclassified, a rough indication of importance). So I use the most common values of the "highway" tag: Motorway, Trunk, Primary, Secondary, Tertiary, and the associated Link roads; Unclassified, Residential, and Service, etc. Tagging of roads is mature and heavily used, so it is pretty robust. A core set of a dozen common values covers 99.9% of the road network in Britain, and does so quite consistently. I impose these values with an "enum" data type. When I'm working on the data there's a huge temptation to use a more flexible data type but I've found that forcing a closed set of values at this stage is a useful discipline. A dozen categories seems a lot, so I've tried to reduce the number, but found this doesn't work out well in practice. However, I do merge "living_street" and "residential". In principle these have different meanings, but in practice, I reckon they are often used as synonyms. So I reclassify any roads tagged "living_street" as "residential" in my own database. Everything else is marked as "other". The remaining 60 or so values of the "highway" tag account for less than 0.1% of British roads. Some of these values are well-defined, but more specialised than I am ever likely to need ("bus_guideway"). Some look like sensible tagging that hasn't yet been widely adopted ("passing_place"). Even if I wanted to use these, the data would be too sparse to be of much use. Others are presumably errors ("bridalway"). In total all these represent such a small proportion of the total network that they are hardly worth dealing with systematically. Still, one occurrence of a rare value can affect a map of a small area, and may need to be dealt with. That is why they all go into the category "other". By applying a contrasting style to "other" I can easily spot problems, and deal with them manually.
  • Within a map covering a wide sweep of landscape the roads are not really a dominant feature. Along with water, woods, the contour of the land, etc. they are one of a number of features that map readers will use as a guide to location. For mapping these cases I only need a limited number of road styles - enough to indicate that a road exists, and give some idea of its importance. Beyond that I can't be bothered with managing many different styles for different types of road. So I also re-classify the different values of "highway" to just "wide", "standard" or "narrow". When I have no need for more detailed classification this small set of values makes it very easy to manage appropriate styles.
  • In the two previous cases the width of the road as rendered is an indication of its importance, and is commonly exaggerated to improve visibility. However, across a small section of townscape roads will naturally be represented at a relatively large scale. They will need to relate to buildings and other features that are positioned at their actual location. Over-exaggerating the width can cause problems, and I find it helps if roads are drawn broadly in line with actual width. There is a well-defined tagging scheme to specify road width, but it is not widely used. So what I do here is estimate the width of the road. I extract the width value where one is available, where it is easily parsed, and where the result is wide enough to be reasonably visible (in practice this captures almost all values that are presented in metres). If no suitable value is available, then I next default to calculating an approximate width based on the number of lanes (where this is provided). Otherwise I use a default value for each different type of road. There are a few cases where this doesn't work well, but on the whole it looks fine. It seems to produce an estimate of width that is close enough for my purposes, and it is easy enough to render a width in metres when working in OSGB projection. 
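
The fallback chain for road width in the third scheme (width tag, then lanes, then a default per road type) can be sketched as follows. All the numbers here are placeholders chosen for illustration, not the values actually used: the minimum visible width, the width per lane, and the default table are assumptions, as are the function names.

```python
import re
from typing import Optional

# Accept plain numbers, optionally followed by a metre unit ("7.5", "7.5 m").
_METRES = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:m|metres?|meters?)?\s*$",
                     re.IGNORECASE)

MIN_VISIBLE_WIDTH_M = 2.0   # placeholder: narrower values are ignored
LANE_WIDTH_M = 3.0          # placeholder: assumed width of one lane
DEFAULT_WIDTH_M = {         # placeholder defaults per road type
    "motorway": 20.0,
    "primary": 9.0,
    "residential": 6.0,
    "service": 4.0,
}
FALLBACK_WIDTH_M = 5.0      # placeholder for any other road type


def parse_width(raw: Optional[str]) -> Optional[float]:
    """Return a usable width in metres, or None if the tag can't be used."""
    if raw is None:
        return None
    match = _METRES.match(raw)
    if not match:
        return None
    width = float(match.group(1))
    return width if width >= MIN_VISIBLE_WIDTH_M else None


def estimate_width(tags: dict, road_class: str) -> float:
    """Width tag first, then lane count, then a default for the road type."""
    width = parse_width(tags.get("width"))
    if width is not None:
        return width
    try:
        lanes = int(tags.get("lanes", ""))
    except ValueError:
        lanes = 0
    if lanes > 0:
        return lanes * LANE_WIDTH_M
    return DEFAULT_WIDTH_M.get(road_class, FALLBACK_WIDTH_M)
```

Values that don't parse cleanly as metres (feet, ranges, free text) simply drop through to the next step, which matches the idea of only using width values that are easily parsed.
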
For labelling I extract separate "name", and "reference" columns from the relevant tags. Having these values in separate columns isn't really necessary - I could extract the data from the tags whenever I needed it. However it makes life a little bit easier. In theory it would also allow me to do some fancy editing of the content (e.g. abbreviations, line breaks, hyphenation, ...). At present, though, I just retain the original values. I also add a "label" column which contains the name, if one exists, and the reference number if not. I did this because I liked the idea of standardising on always having a "label" column across a variety of different features (waterways, woods, whatever...). I thought it would be handy to always know where to find a label - so I set things up that way. In practice it didn't turn out to be as useful as I expected.

Experimenting with standardising the type of road surface hasn't proved very successful. Across Britain, more than 300 different values are used to describe the surface of a highway. These range from "asphalt" to "very_horrible". So it is a bit of a challenge to interpret this reliably. Differentiating between "metalled" and "unmetalled" is straightforward enough, though, based on common surface tags, and defaults for different types of road. So I've done that, but not used it much in practice.
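
The metalled/unmetalled split might look something like this. The two sets of surface values and the per-category defaults are assumptions for illustration (they are common OSM surface values, but not necessarily the ones used here), and the function name is hypothetical.

```python
from typing import Optional

# Placeholder sets of common surface values; assumptions for illustration.
METALLED = {"asphalt", "paved", "concrete", "paving_stones", "sett", "cobblestone"}
UNMETALLED = {"unpaved", "gravel", "dirt", "grass", "ground", "mud", "sand", "earth"}

# Placeholder defaults used when the surface tag is missing or unrecognised.
DEFAULT_METALLED = {"road": True, "track": False, "path": False}


def is_metalled(surface: Optional[str], category: str) -> bool:
    """Classify a highway as metalled or not from its surface tag and category."""
    if surface is not None:
        value = surface.strip().lower()
        if value in METALLED:
            return True
        if value in UNMETALLED:
            return False
    # Rare values such as "very_horrible" fall back to the category default.
    return DEFAULT_METALLED.get(category, False)
```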

There are other aspects of the data that might be worth more investigation. For example,

  • Is the current data consistent enough to differentiate between the different types of "service" roads: driveways, parking_aisles, alleys, etc. ?
  • Is there a practical way of dealing with pavements ("sidewalks=...")?
  • Many roads are commonly split into short segments because of the need to tag variations in detailed characteristics that change over short stretches. Where these characteristics are irrelevant it would be possible to merge the segments without loss of meaning. Is there any value in doing so - even if only for labelling?
Does anyone else have experience of these?

Friday, 30 May 2014

A tribute to County Series

About two years ago I was asked to help produce a newsletter for the local Civic Society. Since then we have used a mix of Open Street Map, Ordnance Survey Open Data and other sources to create maps which illustrate articles on various events, local walks, the historic townscape, development proposals, the visibility of wind farms, and the like. 

When we look at the history of the townscape we also make extensive use of old Ordnance Survey maps. Recently the National Library of Scotland provided access to images of the Ordnance Survey Six-inch maps of England and Wales, produced between 1842 and 1952. The level of detail in these mean that they are a fascinating source of information on how the town has developed over the last 150 years or so. 

It occurred to me that it might be interesting to bring these themes together, and try to recreate a similar style using Open Street Map data and some of the related tools. This is my first attempt, using QGIS, and based on OSM data loaded into a Postgis database.

I wouldn't want to pretend that this is as attractive as the originals. It certainly isn't as clean and uncluttered as modern styling. That's not really the point, but even within the intended scope, it would benefit from more work. However, I think the basic principles are becoming clear. Even at this stage, I thought it might be of interest to others, and I'd welcome suggestions from those with more experience on how best to take it forward.



As I expected, all this taught me something about using the different style options in QGIS. But it turned out that there were also some interesting challenges in standardising the quirks in OSM data at this level of detail. 

Taking a step back, I suspect there is untapped potential for groups who need static maps of a local area at quite a high level of detail. These groups have a choice of mapping tools that are sufficiently accessible for those without specialist expertise. OSM is in a unique position to provide the map data that these groups need. But there are still hurdles to overcome for anyone who wants to produce detailed local maps using OSM data. Mapping tools make it easy to access raw OSM data, but some of that raw data is notoriously difficult to use in practice. Exploring techniques for making the data more accessible may be of wider value than exploring techniques for producing a particular style of map. 

More of that later.