Friday, 25 September 2015

Lidar data

I'm not sure what it is going to be used for, but the Lidar data recently opened up by the Environment Agency is remarkable. Here's central Alnwick, hillshaded in QGIS, using 1m DSM elevations downloaded from here.

Edited: 26/9/2015

Thanks to Chris for the comment and pointers.

There's more.

If sea levels continue rising at 3.2mm p.a., Alnwick will look like this by 20,000 AD. Thankfully our local pub will still be above the waves.
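As a rough check on the arithmetic, here is a minimal sketch of the rise implied by that rate. It assumes a constant 3.2mm per year counted from 2015 (the year of the post), which is a simplification; the function name is purely illustrative.

```python
# Assumption: a constant rate of rise, counted from the year of the post.
RATE_MM_PER_YEAR = 3.2
START_YEAR = 2015
END_YEAR = 20000


def sea_level_rise_m(start_year, end_year, rate_mm_per_year=RATE_MM_PER_YEAR):
    """Total rise in metres between two years at a constant annual rate."""
    return (end_year - start_year) * rate_mm_per_year / 1000.0


rise = sea_level_rise_m(START_YEAR, END_YEAR)
print(f"Rise by {END_YEAR}: about {rise:.1f} m")
```

On those assumptions the rise comes to a little under 58 metres.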

And this is what I reckon the duke can see from the top of Alnwick Castle.

Sunday, 23 August 2015

OSM retail survey: Conclusions-2

This picks up from previous posts to consider more specifically what tools might help contributors.

The examples are rudimentary – stuff I've assembled for my own use, rather than robust tools for the wider community. If they have any value I hope it will be as prototypes for something more polished.

Missing data

There are about 385,000 retail properties in England that are missing from OSM, and the obvious way to help contributors is to point out where they are.

To help achieve the most rapid improvement across the whole country I have tried to find dense retail concentrations that haven't been thoroughly mapped yet.

These are the biggest concentrations of unmapped retail property in England and Wales: about 1,000 of them, each with an average of 100 missing retail outlets across an area of under 2 sq. km.

I've used a mix of Food Hygiene Data, Non-Domestic Rates, population data, and various other statistics to identify concentrations of retail outlets at a local level. I've done this for England and Wales. The same basic technique should work in Scotland because similar data is available, but the structure of the census geography, and data on non-domestic rates for Scotland is quite different, so the process needs tweaking, and I haven't got round to that yet.

My formula for estimating the number of retail premises at a local level can probably be improved, but it will never be perfect. At this stage I don't think it is good enough to reliably identify areas that are almost complete, because that needs more precision. But I think it is good enough to flag up areas that are far from complete. Contributors who are looking for significant concentrations of missing retail outlets should be able to do a quick check on the area. If it still looks empty on the map, they can head there with a reasonable expectation of adding enough new retail outlets to make the trip worthwhile.

Feedback based on local knowledge would be welcome, to help refine this a bit more.

Helping contributors to find nearby concentrations of missing retail outlets is one way to quickly increase the overall volume of data. A different starting point is to assume that thorough retail coverage in some areas has a higher value to data users than adding missing shops elsewhere. On that basis we may want to encourage contributors to concentrate first on mapping areas which we think have the highest potential value.

This example picks out a limited number of smaller towns and cities where OSM data might have high value (e.g. to students or visitors).

Areas coloured:
  • blue already contain more than 75% of my estimated number of retail outlets
  • green contain 50-75% of my estimated number of retail outlets
  • orange contain 25-50% of my estimated number of retail outlets
  • red contain less than 25% of my estimated number of retail outlets 
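The banding above can be sketched as a small function. This is only an illustration: the function name is hypothetical, and how values falling exactly on 25% or 50% are assigned is an assumption, since the bands as listed overlap at their edges.

```python
def coverage_colour(mapped, estimated):
    """Return the band colour for an area.

    mapped    -- number of retail outlets currently in OSM for the area
    estimated -- estimated number of retail outlets that exist in the area
    """
    if estimated <= 0:
        raise ValueError("estimated number of outlets must be positive")
    share = mapped / estimated
    if share > 0.75:       # "more than 75%"
        return "blue"
    if share >= 0.50:      # assumption: exactly 50% counts as green
        return "green"
    if share >= 0.25:      # assumption: exactly 25% counts as orange
        return "orange"
    return "red"           # "less than 25%"


for mapped in (90, 60, 30, 10):
    print(mapped, coverage_colour(mapped, 100))
```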

Each area is intended to cover a manageable size: one where a few contributors should quickly be able to bring retail content up to an impressive level. Larger cities are excluded on the basis that they justify a more systematic approach. My list is a bit arbitrary – it is intended to cover a mix of different towns of roughly similar size, distributed across the country. Are these really the areas where OSM retail data is likely to have most value? I doubt it, but that might be a useful discussion point in its own right. For each suggestion of a settlement that should be added, please feel free to suggest one that should be removed.

I can only assess how useful these estimates might be in areas that I know fairly well. Feedback on any unexpected results would be useful: to better understand where the technique can be improved.

Feedback to contributors

All contributors deserve to see the results of their work. But not all retail information is rendered on the standard map. And in my view it never can be (and never should be). So to encourage contributors I would like to see a decent alternative to the standard map which shows more complete retail information. When I want to check specific content of the database I use either a data extract, Overpass, or the “Map Data” overlay on the standard map view. I'm happy to do this, but for many contributors (and particularly for novices) none of these techniques are particularly user-friendly. I suspect that building such an alternative is beyond my own technical capabilities, but there are examples (based on various data extracts) that illustrate the kind of thing that can be done.

Data collection

When mapping retail areas there will normally be some shops already recorded in OSM, which need checking. Alongside other existing features such as road junctions, these also provide reference points for adding new data. When surveying retail premises, it's handy to have a crib sheet which shows the current state of the data, and on which to note any changes. This needs to show every relevant feature in the database, including some which won't be rendered on the standard map.

Below is an example generated (automagically, with some rather clunky SQL) from OSM data for Winchester High Street (the pedestrian part). It starts at the western (top) end.

I've set this up to collect any shops, amenities and offices within 25 metres of the highway centre line, and display them in order. This simplistic approach only exhibits the most basic information, and includes more features than I would really want: including shops beyond each end of the central line, up side streets, and occasionally from a nearby street running parallel. But it's easy enough to cross out any unnecessary entries. To allow for some additions there's an additional spacer inserted every 20 metres (roughly twice the width of a conventional shop front). I find sheets like this speed up the data collection process and make it easier to add notes.
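The selection and ordering described above might be sketched along these lines. This is not the SQL used to generate the sheet; it is a plain Python illustration that assumes coordinates have already been projected to metres, and all function names and sample features are hypothetical.

```python
import math


def project_onto_line(point, line):
    """Return (distance_along, offset) of a point relative to a polyline, in metres."""
    px, py = point
    best = (0.0, float("inf"))
    walked = 0.0
    for (ax, ay), (bx, by) in zip(line, line[1:]):
        dx, dy = bx - ax, by - ay
        seg_len = math.hypot(dx, dy)
        if seg_len == 0:
            continue
        # Position of the closest point on this segment, clamped to its ends.
        t = ((px - ax) * dx + (py - ay) * dy) / (seg_len ** 2)
        t = max(0.0, min(1.0, t))
        cx, cy = ax + t * dx, ay + t * dy
        offset = math.hypot(px - cx, py - cy)
        if offset < best[1]:
            best = (walked + t * seg_len, offset)
        walked += seg_len
    return best


def crib_sheet(line, features, max_offset=25.0, spacer_every=20.0):
    """List features within max_offset of the line, in order, with blank spacer rows."""
    rows = []
    for name, point in features:
        along, offset = project_onto_line(point, line)
        if offset <= max_offset:
            rows.append((along, name))
    rows.sort()
    sheet = []
    next_spacer = spacer_every
    for along, name in rows:
        while along >= next_spacer:
            sheet.append("")  # blank row for handwritten additions
            next_spacer += spacer_every
        sheet.append(f"{along:5.0f} m  {name}")
    return sheet


# Hypothetical street and features, in projected metres.
street = [(0.0, 0.0), (100.0, 0.0)]
features = [("Bakery", (5.0, 10.0)), ("Bank", (45.0, -8.0)),
            ("Far Cafe", (50.0, 60.0)), ("Chemist", (22.0, 3.0))]
sheet = crib_sheet(street, features)
print("\n".join(sheet))
```

Because positions are clamped to the ends of the line, anything within 25 metres of either end is still listed, which reproduces the over-inclusion noted above rather than curing it.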

Consistency checks

I hope I've made a clear case that across most of the UK adding missing retail data is a higher priority than cleaning up tagging inconsistencies. However, this isn't true everywhere, and pointers to inconsistencies could help contributors to clean local data.

Some basic consistency checks are easily carried out on Overpass:

But this isn't ideal for finding all quirky data within a local area, and finding more complex inconsistencies sometimes involves extensive processing that isn't really practical interactively. Overpass isn't the ideal solution here, but it is possible to do more crunching on a data extract. Here are some examples. Unlike Overpass, anything here that is fixed won't be quickly updated in the overlay (some of these quirks are already fixed, which could get annoying). Note that, for the sake of simplicity, this overlay only contains some of the features in the UK that exhibit these quirks.

Wednesday, 12 August 2015

OSM Retail Survey: Conclusions-1

OSM has thrived by bringing together a community with diverse interests, and aligning their efforts behind a common purpose. In thinking how best to improve retail coverage it seems useful to consider how different groups with different interests and different skills will be able to contribute.

The most obvious question for the community is how the existing tools might be improved. But I am not going to start there. Instead I will begin with how contributors might view the priorities - because that will determine which tools will be of greatest help.

My starting point is based on findings from the survey:
  • In some localised areas retail data in OSM is the most comprehensive retail data that is generally available. Because OSM data has a degree of structure it should be capable of supporting certain types of structured search that are extremely difficult to achieve in any other way. These are the areas where the data will be of greatest value to end users and hence of greatest interest to application providers.
  • We are still a long way from being able to offer comprehensive retail data across the whole of the UK. In the foreseeable future this means that most viable applications based on OSM data are likely to have a local focus, rather than aiming for national coverage. So far only a few areas have been really thoroughly mapped. One priority is to increase the number of thoroughly mapped areas.
  • Elsewhere, whatever issues data users find with the consistency and accuracy of UK retail data in OSM, the impact of those issues is small in comparison to the amount of retail data that is missing from OSM. Another priority is to reduce the volume of missing retail data.

To address missing data, I assume the community needs to expand the number of contributors, as well as encouraging existing contributors to add more basic retail data. We need to ensure that the process of collecting and contributing data is both satisfying and productive.

Most contributors have only a limited choice of where to map. The question we need to help them with is how to make the biggest impact in their local area. Some contributors have more choice of where they map. The question we need to help them with is where they can make the biggest impact.

If OSM is going to provide a decent platform for viable applications based on retail data, then the priority is to bring more areas of the UK up to a standard that compares with the best. OSM data doesn't have to be complete in order to be the best available source of retail data in a well-defined area: but it should be getting near complete. In towns and smaller cities individual contributors can quickly make an impact, by bringing retail data up to a good standard across a well-defined area. In an ideal world they might choose a location that would most interest potential application providers – perhaps a university town, or a city that attracts a large number of visitors.

I'd like to think that contributors who want to improve retail data will start by assessing how retail coverage currently stands in their chosen area. For a rough idea, they can examine the standard map, or for more precision they can compare the number of shops in the OSM database with an estimate of how many there should be. There are various ways to get that estimate, but that's a separate question which I'll defer for now.
  • If local coverage is currently under 25%, then this part of the map is still close to being a blank canvas. OSM data is far from providing the best source of retail data, and there will still be gaps in some of the most commonly mapped features, such as post-offices and pubs. The first priority is to make a start, develop technique, and demonstrate progress to encourage others. For a contributor's own motivation, they should begin with whatever interests them personally. This probably includes retail outlets that they are familiar with (i.e. ones that their family, friends and neighbours use regularly). Beyond that, major retail premises, such as supermarkets, banks and larger high street stores are relatively easy to tag, and are all properly rendered on the standard map. Their relatively large scale helps to build visibility. To help raise awareness add any retailers with a high public profile. This could include any well-known local specialists, those who advertise heavily, those who regularly feature in the local paper, or take an active part in the local chamber of trade.
  • If local coverage is around 25-50% then a fair number of shops will appear on the standard map, but there will still be plenty that are missing. Quite often some retail categories will have been well covered (pubs often seem to appear first), while others still have to be added. The priority now is to build momentum. The quickest results will be achieved in densely occupied retail zones such as the central shopping area and larger retail parks. Complete coverage is still some way off, and trying to include everything at this stage will slow things down. It is more important to include major outlets, and a representative sample of outlets that are of high public utility, and widespread interest. It seems to me that this should include retailers that cater for a broad section of the population – both their daily needs (convenience stores, post offices, pharmacies, take-aways, cafés, pubs), and more significant purchases (electrical goods, clothing, furniture, etc.). Others will have better insight into those catering for specific groups of customer (visitors, students, etc.), and in some towns these groups will be particularly important.
  • Once local coverage reaches around 50-75% then OSM data is providing some of the most complete retail data that is generally available. The standard map will contain a good number of shops – particularly in the town centre. But anyone familiar with the area will still find it fairly easy to spot shops that are missing. Particularly outside the main shopping areas there will be shops scattered across residential and commercial areas that haven't been added.  Now is the time for contributors to work towards something approaching complete coverage. Missing shops are likely to be the more specialised, smaller and more quirky independent retailers, shops outside the central retail core, in suburban shopping parades, and corner shops in residential areas.
  • Over 75% coverage means that locally OSM is capable of providing some of the most comprehensive retail data that is generally available. Contributors will find it increasingly time-consuming to deal with the remaining gaps, and the more difficult categories are the most likely to have been left aside. Now is the time to include them. This is also the time to verify that existing data is up to date and consistent. There will be opportunities to add value with information that will be of use to different types of data user. This might include features such as wheelchair access, ATMs, non-standard opening times, specialist services, etc. Beyond this, contributors have a choice. They can continue to add unreasonable levels of detail, that will never be used. Or better, broaden the scope of their survey into neighbouring towns and villages.

If this is broadly how things work, then I'd suggest the following priorities to help contributors:

  • Firstly, contributors are encouraged by seeing the results of their work. Currently the standard map is the main source of such feedback, but it doesn't show all retail outlets, and it doesn't render all retail characteristics that contributors add. It is unrealistic to expect the standard map to render everything, so I'd like to see a different way of showing contributors the results of their work: one that doesn't depend on adding increasing detail to the standard map.
  • Secondly, for areas where retail data is relatively thin, contributors who have some choice of where to map may benefit from guidance on where and how they can make the biggest impact. Tools that highlight areas where there is a substantial amount of missing retail data could save them time. Suggesting areas where retail data could be of high utility may influence their choice of where to map.
  • Thirdly, retail data generally has to be collected by survey, and there is a lot missing. Tools that help contributors collect data in the field (i.e. on the high street) will make the process more satisfying and help to speed it up.
  • And finally, once retail data in an area is relatively complete, the emphasis will change from improving coverage to improving consistency and adding value. Bulk edits won't help much, but tools that highlight inconsistencies and quirks in retail data will help contributors identify issues and improve quality.

Sunday, 2 August 2015

OSM Retail Survey: Part: 12a

Retail outlets in OSM are represented in different ways.

Very few shops have been added as a relation. Around two-thirds are a node, and one-third an area (almost always a closed way, occasionally as a relation between multiple ways).

Larger types of outlet (supermarkets, motor trade outlets, petrol stations, furniture showrooms, etc.) are more likely to be recorded as an area (with around 50% in that form); while smaller outlets such as post-offices and pharmacies are more likely to be recorded as a node (around 85% in that form).

In effect, about a third of retail features are represented only by their location. In around two-thirds of cases there is more information on the geometry. The most common ways of representing the geometry of a retail outlet are

  • as an area which represents both a shop and a building
  • as a node or area that represents a shop, and lies within an area that represents a building

These two cases are equally common in the data. In the real world, some shops are always going to be closely associated with a specific building, while others are always going to be perceived as a facility that happens to be located within a particular building. So it is reasonable to expect data users to find both of these approaches acceptable.

Tagging indicates that almost all (87%) of the areas that have been marked as a retail outlet are equivalent to a building footprint - i.e. they hold both a "shop" tag and a "building" tag. Among the remainder, some areas are tagged to represent landuse, and a few represent road surface, but most others carry no indication of what physical feature the area represents (i.e. no relevant tag added alongside “shop” or some type of “amenity”). The proportion of contributors who have used an area to represent landuse varies by the type of retail. In the case of petrol stations, for example, it is 3%, and in the case of garden centres it is almost 25%.

When a retail outlet is drawn as a closed area, with a “shop” tag, but no other indication of what that area represents, it is most likely that the area is intended to represent a building. A random check suggests that this is what contributors normally intended, but renderers cannot be certain, so the most likely outcome for retail features mapped as an area, without a “building” or “landuse” tag is that these features will not be rendered at all.
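A check for this kind of ambiguity could be sketched as follows. It is an illustration only: it assumes each retail area is available as a plain dictionary of tags from a data extract, the function name is hypothetical, and treating any "highway" tag as road surface is an assumption.

```python
from collections import Counter


def area_role(tags):
    """Guess what physical feature a retail area represents, from its tags alone."""
    if "building" in tags:
        return "building"
    if "landuse" in tags:
        return "landuse"
    if "highway" in tags:      # assumption: stands in for "road surface"
        return "highway"
    return "unspecified"       # a renderer cannot tell what this area is


# Hypothetical sample of areas carrying a shop tag.
sample = [
    {"shop": "supermarket", "building": "retail"},
    {"shop": "garden_centre", "landuse": "retail"},
    {"shop": "bakery"},
]
tally = Counter(area_role(tags) for tags in sample)
print(tally)
```

Areas that come back as "unspecified" are the ones a contributor might want to revisit and tag as a building where that is what was intended.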

Where landuse is specified on a retail area it is normally “retail” (76% of cases). Of the other generic urban landuse terms, “commercial” represents 16%. The remainder are mostly more specific terms such as “landuse=plant_nursery”.

Of the retail outlets that do not have their own area defined, and are represented only by a node, just under half are contained within a (separately defined) building. In most cases (75%) the type of building is not defined further (“building=yes”), and in 10% the building is described as a retail building.

There are a couple of thousand retail buildings containing at least one shop. The biggest clusters of retail nodes within such buildings represent individual outlets within large shopping centres (e.g. the St James Centre in Edinburgh). However, these only account for a small proportion of the total. Most buildings that contain shops only show a single shop node, and it is common for a single building to contain only a few retail outlets.

In around 20% of cases, the retail node within a building is the only retail feature within that building. It might be assumed in this case that a single shop occupies the whole building, but renderers cannot be certain whether there are other, unmapped shops within the same building. Their only safe option is to render both building and shop, and place the node in the position marked.

Although we can safely assume that almost all retail outlets should exist within a building, something over a third of all retail outlets in the database have no representation of a building associated with them. This is an indication of areas where building data is likely to be incomplete, but the information is of little value otherwise.

Some large retail outlets are made up of numerous different components: petrol stations and garden centres are common examples. For petrol stations there is relatively clear guidance on how the various components should be mapped. Guidance is less comprehensive when it comes to garden centres.

I've found 3,772 examples of “amenity=fuel” in the UK data, of which 70% are mapped as nodes, and 30% are areas. To map a service station as a node is simply to indicate the location. To map it as an area suggests that the contributor is at least aiming to provide more detailed visual information for rendering. Adding further detail on the products and services available, and detailed mapping of routes through the forecourt suggest that the contributor is aiming to support more sophisticated applications for more demanding users. To function properly, viable applications that can handle such complexity will require some consistency in the way that petrol stations are described.

My interpretation of the mapping guidance for petrol stations is that:

  • the building in the forecourt should be tagged as “amenity=fuel”: this guidance is generally followed, and is the approach in around 75% of cases where the petrol station has been mapped as an area. In around 3% of cases the area marked as “amenity=fuel” is also tagged to indicate retail landuse, which suggests it covers the whole site. In around 2% of cases it is also tagged as a shop, which suggests that it is intended to represent a building. However, renderers and applications cannot be certain that either is what the contributor intended. In around 20% of cases there is no indication from the tagging what the “amenity=fuel” area represents. Inspection suggests that in most cases it is the paved area around the pumps
  • the area around the pumps should be mapped as an area of highway: in practice this approach is only used in around 2% of cases, although there are a few more cases where the forecourt area is tagged as “amenity=parking”
  • use the shop tag alongside “amenity=fuel” to indicate other retail formats within the petrol station, such as a kiosk, or convenience store: this guidance is not generally followed – a shop tag is only used in 10% of petrol stations marked as an area, and only 5% of those marked as a node. Some of these petrol stations may truly have no other retail facilities, of course, but experience suggests that there are not many fuel outlets these days that only offer fuel
  • the routes through the forecourt should be mapped as one-way service roads: (this I've not measured)
  • any canopy should be mapped as “building=roof”: (this I've not measured)
  • add a node for toilets as an amenity: because fuel is treated as an amenity in the tagging, there is little problem in tagging coexistence of a petrol station with retail formats that are tagged as a shop, but there are potential issues around how best to tag co-existence with other common amenities. This guidance helps with adding toilets, but there is no guidance yet for other common amenities, such as a café
  • there is no guidance yet on how to map the wider extent of the site – which may include customer parking, children's play areas, etc.

The result is that rendering for some petrol stations presents a reasonable interpretation of the different components on the ground, but this is too unusual, and the underlying data is too inconsistent to be of further use.

Representing the perimeter of a complex retail outlet would provide geometry information that would be particularly useful for data users. This would offer a mechanism for aggregating different components within the same facility that have been mapped separately (such as identifying a petrol station with toilets and a café). However, for contributors there is confusion over how best to do this. The community is probably nearest to consensus on "landuse=retail", or "landuse=something more specific". However, this approach isn't widely adopted. In any case, it is virtually useless for anything more than rendering, because it loads the "landuse" tag with more than one meaning. Data users are not able to distinguish between cases where the "landuse" tag defines the outline of a specific outlet, and cases where it encompasses a wider area and more retail outlets.

In summary, the largest gap in the information on the geometry of retail premises is a lack of any information on the footprint for around two-thirds of retail features. In around 4% of cases there is some basic information on footprint, but data users will face considerable difficulty in interpreting what it means.

Where there is information on retail geometry it can help to identify gaps in data, particularly for building footprints.

The database also contains information on some more complex retail footprints, but this is not presented in a consistent way, and the various components are not sufficiently well-integrated for applications to make use of the information (other than basic rendering). There is some guidance for contributors on mapping more complex retail features, but this is not widely followed, and I have found no feedback mechanisms to encourage more consistent tagging of complex cases.

Friday, 31 July 2015

OSM Retail Survey: Part-12

I see that most readers of these posts come from outside the UK. I hope the information is of some interest and use to others, but I should probably emphasise that this is just a survey of UK retail data on OSM. The structure of the retail industry differs from place to place. The UK, for example, has a relatively high proportion of retail turnover through large retail chains: i.e. fewer small retailers than France, and fewer mid-sized retailers than Germany. It would be interesting to know whether similar patterns of OSM data are found elsewhere, or whether there are differences that we can all learn from.

After the most basic information on the location and type of shop, the most common (and for many data users, the most useful) supplementary information in OSM is the name of the retailer. Some form of name is provided for around 90% of retail premises, but the proportion is lower for certain types of outlet. Most UK Post Offices, for example, do not have a name tag.

Names will probably be most useful to the data user because they allow searches for a specific retailer. In the UK, understanding the name of the chain may carry relatively high value, because the name often gives a clear idea of the format and range of goods on sale (most people have an idea of what to expect in a branch of Argos, even if the tagging of shop type is inconsistent).

Names are also a useful characteristic when analysing retail data. Where chains have multiple similar branches, this provides a vehicle for checking tagging consistency across a chain, and information can be extracted on OSM coverage of the larger chains.

There are numerous variations in the way names are provided by contributors. These fall into several groups:

  • Variation in capitalisation: Aldi / ALDI, Asda / ASDA, Spar / SPAR, Lidl / LIDL, Pets at Home / Pets At Home
  • Variations in the use of abbreviations: Co-op, Co-Op / Co-operative, M&S / Marks & Spencer / Marks and Spencer, Toni&Guy / Toni & Guy / Toni and Guy, B&Q / B & Q / B and Q
  • Variations in the use of a possessive, with or without apostrophe: Sainsbury / Sainsbury's / Sainsburys, Wilkinson / Wilkinsons / Wilkinson's, McColl / McColl's, Jewson / Jewsons, Maplin / Maplins
  • Variations in the use of a definite article: Co-operative Food / The Co-operative Food, Money Shop / The Money Shop, Body Shop / The Body Shop, Carphone Warehouse / The Carphone Warehouse
  • Variations in spacing: WHSmith / WH Smith / W H Smith, Kwik Fit / Kwik-Fit, One Stop / One-Stop / OneStop, Phones 4U / Phones4U, TK Maxx / TKMaxx 
  • Including or omitting the location of the branch in the name: “Tesco, Swansea Marina”, “Batley Tesco”, “Tesco Barnstaple”, “Tesco Extra, Hartlepool”, “Tesco - Biggin Hill Express”, “Tesco (Greenock)”, “Tesco Rugeley Superstore”, “Tesco Ystrad Mynach”, “Tesco Extra (24hr) Maldon”, “Tesco South Tottenham Superstore”, and many, many more.

Different names for different formats within a chain are a slightly different issue. Tesco, Tesco Metro, Tesco Express and Tesco Extra can all be considered valid names for different retail formats used by Tesco. Similarly, Sainsbury's and Sainsbury's Local. These cannot really be considered variations in the same name. There is evidence, though, that they are not used consistently, so “Tesco” is sometimes used as a synonym for the various different formats (and at other times it isn't).

It is difficult to put a firm number on how many times all these different kinds of inconsistency occur, but it certainly runs into thousands (i.e. between 1% and 10% of mapped shops). So they are likely to cause some data users a degree of frustration. However, similar patterns recur, and some names are particularly prone to certain variations. With varying degrees of effort data users who place a high value on standardised names will probably be able to find ways to work round many of the differences that are most important to them.
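One possible work-round is to compare names on a simplified key rather than on the raw text. The sketch below is an illustration with hypothetical function names. It only smooths over capitalisation, "&" versus "and", a leading "The", spacing and punctuation; it deliberately does not try to reconcile abbreviations (M&S / Marks & Spencer), possessives (Sainsbury / Sainsbury's) or branch locations added to the name.

```python
import re
from collections import defaultdict


def name_key(name):
    """Reduce a retailer name to a simplified key for comparison."""
    key = name.casefold()               # ALDI -> aldi
    key = key.replace("&", " and ")     # B&Q -> b and q
    key = re.sub(r"^the\s+", "", key)   # The Body Shop -> body shop
    key = re.sub(r"[\W_]+", "", key)    # drop spaces, hyphens, apostrophes
    return key


def group_variants(names):
    """Group distinct spellings that share the same simplified key."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


names = ["B&Q", "B & Q", "B and Q", "WHSmith", "WH Smith", "W H Smith",
         "Kwik Fit", "Kwik-Fit", "Body Shop", "The Body Shop", "Aldi", "ALDI"]
for key, variants in group_variants(names).items():
    print(key, variants)
```

A data user would probably still need a short hand-made list of aliases for the variations this leaves alone.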

Occasionally the name will appear as the value in the "shop=" tag, but this is very rare. Various other tags are used to hold different types of name: primarily “name”, “operator”, and “brand”. In around 4% of cases the contributor has provided a name, plus details on the operator, brand, or both, and in less than 2% of cases they have provided information on the operator, or brand, but not the name of the outlet. There is some inconsistency in the way that “operator” and “brand” are used, which will make life a bit difficult for data users.

Data users who ignore “operator” and “brand”, and use just the “name” field will lose information from around 6% of recorded retail outlets. However, there are certain sub-sectors where the operator or brand tag are more widely used.

In the case of petrol stations, for example, both are used, but both the brand of fuel and the operator are scattered quite widely across the name, operator and brand tags. For car showrooms contributors normally use the name tag to show the name of the outlet, but it is not uncommon for the name tag to show the car manufacturer. In around 11% of cases the manufacturer appears under brand, but in some cases it appears under operator. This inconsistency in the use of names could be a lost opportunity to provide additional information to data users, but it is difficult to know how much of a problem it will really give them. Anyone who wants to search for a Volkswagen Dealer, for example, will need to search for “VW” or “Volkswagen”. If they are capable of doing that on the name tag, then they will be able to search the name, brand and operator tags almost as easily. If they go to this amount of effort, they will currently pick up around 30 VW dealerships in the UK. This is well short of the true figure of around 200. Again, missing data is a bigger problem here than inconsistent tagging. Sensible end-users who want to find a VW dealer will use the manufacturer's dealer search rather than OSM data.
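To illustrate how little extra effort it takes to search all three tags rather than just one, here is a minimal sketch. It assumes each outlet is available as a plain dictionary of tags; the function name and the sample dealerships are hypothetical.

```python
import re

NAME_TAGS = ("name", "brand", "operator")
VW_PATTERN = re.compile(r"\b(vw|volkswagen)\b", re.IGNORECASE)


def matches_any_name_tag(tags, pattern, keys=NAME_TAGS):
    """True if the pattern is found in any of the given tags."""
    return any(pattern.search(tags.get(key, "")) for key in keys)


# Hypothetical car showrooms, tagged in three different ways.
dealers = [
    {"shop": "car", "name": "VW Anytown"},
    {"shop": "car", "name": "Smith Motors", "brand": "Volkswagen"},
    {"shop": "car", "name": "Jones Garage", "operator": "Anytown Ford"},
]
found = [d["name"] for d in dealers if matches_any_name_tag(d, VW_PATTERN)]
print(found)
```

Searching "name" alone would miss the second dealer here, but no amount of searching will find dealerships that have not been mapped.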

For most types of retail there is no major issue with misuse of the “brand” tag. On the whole it is used consistently for car dealerships and filling stations, and little used elsewhere. There is, though, some confusion around tagging of convenience store names. In the UK these are often independently owned and operated, but trading under a well-known national franchise. Examples include SPAR, Londis, Costcutter, Premier Stores and Nisa. Together these represent around a third of the convenience store sector in the UK, so they have a significant presence, but none of the various combinations of “name”, “operator” and “brand” really captures the business model. As a result the way they are tagged is quite inconsistent (roughly 80% of the time the franchise is tagged as “name”, roughly 13% as “operator”, and 6% as “brand”).

Address details are attached to about a third of retail properties in OSM, but only 15% have a postcode, and the proportion with a complete street address is in the region of 5-10% (depending on what components a user requires in order to regard the address as complete).

Contact details (web site, or phone) are provided for around 10% of retail properties, although for certain types the proportion is higher. For restaurants and bars, for example, the proportion with contact details is around 20%, and for estate agents around 25%. It is more common to find a web site than a phone number, and comparatively rare to find both. In the case of restaurants, for example, 15% are tagged with a web site, 10% with a phone number, 5% with both, and 80% with neither. Restaurants have a higher level of coverage for contact details than most retail sectors on OSM, and yet, out of 60,000 restaurants in the UK, OSM has contact details for just over 2,000. This looks a long way from being a set of information that is viable enough to attract data users.
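The restaurant percentages overlap, so they are easier to check with a little arithmetic. This is a minimal sketch, assuming the 15% and 10% figures each include the 5% that have both; the function name is hypothetical.

```python
def contact_breakdown(website, phone, both):
    """Split shares (in percent) into 'either' and 'neither' using
    inclusion-exclusion: either = website + phone - both."""
    either = website + phone - both
    return {"either": either, "neither": 100 - either}


# Restaurant figures quoted in the text (percent of restaurants in OSM).
print(contact_breakdown(website=15, phone=10, both=5))
# → {'either': 20, 'neither': 80}
```

On that reading the figures are consistent with the 20% quoted for restaurants and bars.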

Information on accessibility (“wheelchair=”) is provided for around 4% of retail properties, with a higher proportion for cafés and restaurants (7%) and for supermarkets (8%). Where wheelchair information is provided the value is “yes” in 63% of cases, but “no” in 20% of cases, and “limited” in 16% of cases.

Information on opening hours is provided for under 3% of retail properties. The types of outlet that fare better than average for this information are an odd mixture: Supermarkets (but not Convenience Stores), Bars (but not Pubs), Pharmacies (but not Post Offices) and Bicycle Shops. Cafés and Restaurants rate only slightly higher than average.

It's worth looking more closely at an example where the opening hours have been provided more often than average, and where they could play an important role in any feasible application. If I wanted to find a pharmacy near to home, that is open on a Sunday afternoon, then the nearest pharmacy where I can check “opening_hours” on OSM is nearly 50km away (and as it happens, it isn't open on a Sunday afternoon). There are four pharmacists within 100km that OSM tells me are open on a Sunday afternoon, but only one of them shows a phone number. This is potentially an application area where the tagging structure is ready to support a viable application. In London, and a few other towns and cities, contributors have been diligent in adding sufficient detail to make a search viable (Stoke on Trent, Norwich, ...). But across most of the country, it doesn't look to me as though the data content is anywhere near ready to attract users to this type of data.

In summary, OSM has the potential to support more sophisticated searches of retail data than a simple location search, in the sense that data structures are in place to hold much valuable information. Where these are used, they are used fairly consistently. However, in most cases coverage of supplementary data is an issue. There is a considerable way to go before the supplementary data is sufficiently complete and consistent. The first priority is probably to work towards more consistent naming. Beyond that, at present, the potential applications for this data are hypothetical, and it is too early for an informed debate on priorities.

Monday, 27 July 2015

OSM Retail Survey: Part-11

The occupants of retail premises will change over time, and as a result we should expect retail data in OSM to continually evolve as well.

Most of the retail data within OSM is less than 5 years old, so the chances are that the bulk of this is still more-or-less current. Around 5% of the data is 5 years old, and 2% is 6 years old. A growing proportion of this older data could now be inaccurate, but across the country it is likely that the proportion of data that is out-of-date will only represent a few percent of the total.

In some places (e.g. Islington, Leeds, and Sheffield), more than 10% of shops were added to the database over five years ago. In parts of Kent more than 30% of the shops were added more than five years ago. So there may be a case for some local reviews of older data, to update anything that has changed since it was last recorded.

In most locations, though, the current priority will still be to add missing data, then later work towards greater accuracy.

In the longer term, that picture is likely to change. In the last 12 months 28,000 retail properties in England have been edited. That's 5% of retail properties that have either been added to OSM, or updated. Some of the changes in OSM data over the last year will have been to correct a spelling, or to adapt tagging, and will not have involved a re-survey. But we can't easily measure how much has been fully updated. So for now, let's be optimistic, and assume that every edit brings that particular shop up to date.

If nothing changed on the ground, then at this rate it will take more than a decade to approach complete coverage of retail. But the situation on the ground does change. In 2014 the average length of a retail lease was less than nine years, and almost half of retail leases were for less than four years. Retail leases used to be for a longer period, and because of peaks in construction activity in 1990 and 2000 an unusually high number of 25-year, and 15-year leases are currently due for renewal.

Not all retail property is leased, leases will sometimes be renewed without change of occupant, and some might carry forward for generations. So I don't know what proportion of OSM high street data we should expect to change over a year. If only 5% changes then current levels of editing activity are sufficient to maintain existing data, and gradually close the gap of missing data. But if we assume 10% of existing retail premises change over a year then the current rate at which OSM retail data is being edited will not be enough to deliver and maintain complete and accurate data on all retail properties in England.

Nationally, perhaps something like 7,000-14,000 entries on the database should be updated each year. Around where I live, the rate of change looks closer to 10% per year, rather than 5%, so I'm guessing a decent estimate of the national picture will be closer to the higher figure.

As database volumes rise there will be more to maintain. If contributors concentrate on adding missing retail properties, then by the time coverage reaches about 50%, the existing data will be going out of date as fast as new data is being added. If contributors concentrate on maintaining what has already been added, then they will have no time to add the missing 50% of retail properties. Either way, for the foreseeable future, there is going to be a lot of retail data that is either missing from OSM, or incorrect on OSM.
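The break-even point can be sketched with some simple arithmetic. This is my reading of the figures rather than a definitive model: it assumes a constant 28,000 edits per year, a total of 560,000 retail premises (inferred from 28,000 being 5% of the total), and that a fixed share of recorded premises changes each year. The function name is hypothetical.

```python
def break_even_coverage(edits_per_year, total_premises, churn_rate):
    """Coverage at which entries go stale as fast as contributors edit.

    Stale entries per year = churn_rate * coverage * total_premises,
    so the two rates are equal when coverage = edits / (churn * total).
    """
    return edits_per_year / (churn_rate * total_premises)


EDITS_PER_YEAR = 28_000   # from the text: edits in the last 12 months
TOTAL_PREMISES = 560_000  # assumption: 28,000 is 5% of the total

for churn in (0.05, 0.10):
    share = break_even_coverage(EDITS_PER_YEAR, TOTAL_PREMISES, churn)
    print(f"churn {churn:.0%}: break-even at {min(share, 1.0):.0%} coverage")
```

With 5% annual change the editing rate just keeps pace at full coverage. With 10% it stalls at about half, which is where the 50% figure above comes from.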

If we can wait long enough, other factors might help. A decline in the number of retail premises would also accelerate progress towards 100% coverage, and the chart shows the effect of a 2% reduction in the number of retail properties each year. Even if this is factored in, reaching a worthwhile level of retail cover still looks like a slow process. Too slow.

It is not only individual shops that change. Retail business models also evolve, and over the long term we should expect this to affect the choice of tags. Some formats which once were common on the high street no longer exist (ironmongers into hardware, then homeware). A traditional grocery, or a video rental shop is now unusual.

On the other hand, perhaps candle shops are returning to the high street (“shop=candle”), and e-cigarettes are a recent arrival. The data on chains of mobile phone shops may be an example of how this process continues. Currently these chains are tagged with a mix of “mobile_phone” (95%), and “electronics” (5%). Perhaps contributors are adapting their tagging, in recognition that an established speciality has now matured, and the offer is starting to evolve as retailers extend into adjacent markets.

Sunday, 26 July 2015

OSM Retail Survey: Part: 10

Specialised types of shop offer a narrow range of categories, but provide wide choice within their specialist area. Generalist retailers offer a broad range of product categories, with less choice within each category. Large generalists (e.g. supermarkets) are able to offer both numerous categories and broad choice.

We use different terms for a large “supermarket” (with breadth and depth), a small “convenience” store (with some breadth and less depth), and a “butcher” or “newsagent” which we expect to be more specialised. We expect a newsagent to offer a wider choice of newspapers and magazines than a convenience store, but we would still expect a convenience store to offer newspapers. We expect a convenience store to offer much more than newspapers, and we would be surprised if a newsagent offered nothing but newspapers. We expect a butcher to offer a wider choice of meat than a convenience store. Ours does excellent sandwiches, ready meals, pies, vegetables, and various other items as well. Although the principles are fairly clear, the precise boundaries between retail categories are always going to be difficult to pin down.

As a result, it doesn't matter how clear the definitions are for different terms covering different levels of specialisation. We should still expect some inconsistency in the way that different tags are used. Some retailers have a business model that is closer to the boundary than others, so it is inevitable that there will be a grey area where it is difficult to maintain a consistent boundary. The proper question isn't whether tagging ought to be consistent. It's whether there should be more consistency than we find.

To my mind there are several areas where the data does not look consistent enough. This is particularly true in the case of large stores which sell a broad range of goods (the big generalists).

For example, a data user who searches for “supermarket”, and relies on the wiki for the definition, will expect to find “a large store for groceries and other goods” or “a full service grocery store that often sells a variety of non-food products as well”. They will assume (perhaps because the wiki tells them) that “stores that do not provide full service grocery departments are generally not considered supermarkets”.

In practice they will find results that include a high proportion of outlets that fit this description, including most branches of the major chains that they will expect to find: ALDI, ASDA, Booths, Co-op, Iceland, Lidl, Morrison's, Sainsbury's, Tesco, Waitrose, etc. However, they will also pick up a lot of convenience stores, and some stores tagged “supermarket” where few shoppers would expect to find groceries: Argos, Homebase, Matalan, Mothercare, Pets at Home, etc.

I estimate that around 10% of the data that they retrieve will not be what they expect.

Commercial search engines face a similar problem, because smaller convenience stores often call themselves a supermarket, and this is inevitably picked up in their keyword searches. But OSM has a more structured data model. We should expect to perform better.

The situation with department stores is even more difficult for data users. The major chains are well covered, but they only represent about half of all retail outlets tagged as a department store. Data users who rely on the Wiki definition will be expecting “a large store with multiple clothing and other general merchandise departments”. They probably won't expect to pick up Poundstretcher, Argos, Matalan, Pets at Home, Staples, Superdrug, TK Maxx, etc. - but they will.

Wilkinson's (Wilko) is a difficult boundary case - with a particularly wide range of different key values for different branches.  My own view is that something like “homeware” would be the best description of their format, but only about 2% of contributors agree with me. And in practice, what should matter to data users is not what I think (even when I am right). What has to matter to data users is the consensus that develops across the majority of contributors. And in this particular case there is little consensus. It is difficult for anyone to know whether to consider Wilkinson's a department store or not. What is even more unsatisfactory for data users is that 25% of Wilkinson's stores are considered to be a department store, and even though that's the most popular option, 75% are tagged differently.

Neither of these examples is the result of a problem with the definition of the tags for a supermarket or a department store. The problem is that the same tags are being quite widely used for branches of chains where most contributors prefer an alternative. Good data on department stores and supermarkets is polluted by inconsistent data on other retail formats.

Looking further, the confusion lies partly in representing scale consistently, and partly in representing the degree of specialisation consistently.

Most specialists offer some categories of product that fall outside their main area of activity. Some position themselves as specialists in more than one area. As a result contributors can find it difficult to draw a consistent distinction between a specialist and a generalist outlet. If they are uncertain about the right specialist term to use, they tend to look for something more generic, and fall back on terms intended for generalists. This isn't entirely unreasonable behaviour. For a long time, the guidance, when in doubt, has been to pick a popular tag that best fits the situation (rather than inventing a new one). Contributors don't necessarily have an understanding of all the tags in use, and the result is that popular tags that were originally intended to apply to large outlets which offer a broad range are quite commonly used for smaller outlets offering a broad range, and for unusual specialists that are difficult for contributors to classify.

Looking at this another way, we have a choice of terms for shops which offer a broad range. Contributors who find it difficult to pick an appropriate tag veer towards picking one from a higher row in this table - they are the ones that are most widely used.

| | Primarily food | Primarily non-food | Hardware / building materials |
| Large generalists | | | “doityourself” (or sometimes “trade”) |
| Other generalists | | “general” (rare) or “variety” (for pound shops) | |
| | “bakery”, “butcher”, “cheese”, etc. | “clothes”, “beauty”, “houseware”, etc. | “garden_centre”, “paint”, etc. |

One result of tending towards tags for larger generalists is that supermarkets are over-represented in OSM. Industry figures show 6,410 stores in this category in the UK, whereas I found 7,045 (110%) in OSM. Convenience stores, on the other hand, are under-recorded. I found 9,717 out of 48,303 identified by the industry (i.e. just 20%).

It is obvious from the data that contributors find it difficult to make a distinction between a supermarket and a convenience store. In England and Wales the law on opening hours varies for different sizes of store, with restricted hours on Sunday for those of more than 280 sq. metres (3,000 sq. ft.). So a supermarket of less than 280 square metres (3,000 sq. ft.) would normally be considered a convenience store, and a convenience store of more than 280 square metres would be considered a supermarket. However, at least 9% of outlets marked as a supermarket in OSM (and recorded as an area rather than a node) have a floorspace of less than 280 sq. metres. Around one in three of the stores operated by one of the major convenience store chains is tagged as a supermarket. Convenience stores don't have to offer extended opening hours, we can't really expect contributors to measure the footprint, and the situation is further confused because some convenience stores describe themselves as a supermarket. The upshot is that almost a thousand convenience stores in OSM are marked as a supermarket. And meanwhile, because convenience stores are generally under-recorded, around 30% of the general grocery sector has yet to be added to OSM.
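The floorspace check can be sketched as below. This is an illustration under stated assumptions, not the method behind the figures above: the outlet list is hypothetical, the function name is made up, and the mapped building footprint is used as a stand-in for sales floorspace, which will overstate the size of some stores.

```python
SIZE_THRESHOLD_SQ_M = 280  # Sunday trading threshold quoted in the text


def expected_tag(floor_area_sq_m):
    """Suggest 'supermarket' or 'convenience' from floor area alone.

    A store of exactly 280 sq. m is treated as a convenience store here,
    since only larger stores face restricted Sunday hours.
    """
    if floor_area_sq_m > SIZE_THRESHOLD_SQ_M:
        return "supermarket"
    return "convenience"


# Hypothetical outlets: (current shop tag, footprint in square metres).
outlets = [
    ("supermarket", 1500.0),
    ("supermarket", 190.0),
    ("convenience", 120.0),
]

# Flag outlets whose tag disagrees with the size-based suggestion.
suspect = [(tag, area) for tag, area in outlets if tag != expected_tag(area)]
print(suspect)
# → [('supermarket', 190.0)]
```

A list like this is best treated as a prompt for a local check, since area alone cannot settle what a store actually sells.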

Changing tack, department stores sell a range of general merchandise, typically including clothing, household appliances, toys and games, personal-care products and garden equipment. Some also sell food, but non-specialised food stores are properly classified as supermarkets. With very few exceptions the major UK department store chains, such as John Lewis, Debenhams, and House of Fraser are tagged correctly as a department store. However, not all retail premises tagged “department_store” comfortably fit the description.

Examples include branches of Argos (normally tagged “catalogue”), TK Maxx and Matalan (normally tagged “clothes”), Poundland (normally tagged “variety_store”, sometimes “convenience” or “supermarket”), Mothercare (normally “baby_goods”, sometimes “clothes”), Wilkinson's (“department_store” for 25% of branches, plus a wide range of different alternatives).

The Wiki describes Do-It-Yourself-stores as being similar to hardware stores, except generally larger, stocking a wider range of products, and targeting customers who are non-professionals working on home improvements, redecorating, gardening, etc. Pure DIY stores are well covered in the database, and consistently tagged. In the case of Homebase, B&Q and Wickes, for example, more than two-thirds of branches are in the database, and well over 90% are tagged as “doityourself”.

The same is not true of builders' merchants (which according to the documentation are properly tagged as “trade”). Fewer than 10% of Jewsons, and Travis Perkins branches are in the database, and they are tagged with a mix of “doityourself”, “hardware”, and “trade”, with “doityourself” as the most common.

There seem to be two issues here. One is that many trade outlets also serve non-professionals, so their business model overlaps with the scope of “doityourself” (this is accepted in the documentation on “shop=trade”, but contributors are either uncomfortable with it, or simply don't recognise these as trade outlets). The other issue is that there are different degrees of specialisation in the trade side of the market. Specialists in supplying the trade with building materials, timber, plumbing, bathroom furniture, electrical goods, tools, etc. all seem to be under-recorded, and inconsistently tagged. Again, where there is no clear consensus, contributors have fallen back on common tags such as “doityourself” and “hardware”, that were originally intended for generalists supplying the non-professional, and so are more widely used.

Branches of Wilkinson's and Robert Dyas don't fit comfortably into any of the most common categories, so they tend to suffer from highly inconsistent tagging (department_store, doityourself or hardware). We could blame contributors, but surely some of the tagging inconsistency shows that there may be a need for:

  • more specific options to cover particular retail formats that do not comfortably fit the current categories
  • more generic options, so that contributors have an alternative to popular tags intended for large generalists