Sunday, 23 August 2015

OSM retail survey: Conclusions-2

This picks up from previous posts to consider more specifically what tools might help contributors.

The examples are rudimentary – stuff I've assembled for my own use, rather than robust tools for the wider community. If they have any value I hope it will be as prototypes for something more polished.

Missing data

There are about 385,000 retail properties in England that are missing from OSM, and the obvious way to help contributors is to point out where they are.

To help achieve the most rapid improvement across the whole country I have tried to find dense retail concentrations that haven't been thoroughly mapped yet.

These are the biggest concentrations of unmapped retail property in England and Wales: about 1,000 of them, each with an average of 100 missing retail outlets across an area of under 2 sq. km.

I've used a mix of Food Hygiene Data, Non-Domestic Rates, population data, and various other statistics to identify concentrations of retail outlets at a local level. I've done this for England and Wales. The same basic technique should work in Scotland because similar data is available, but the structure of the census geography and the data on non-domestic rates are quite different there, so the process needs tweaking, and I haven't got round to that yet.

My formula for estimating the number of retail premises at a local level can probably be improved, but it will never be perfect. At this stage I don't think it is good enough to reliably identify areas that are almost complete, because that needs more precision. But I think it is good enough to flag up areas that are far from complete. Contributors who are looking for significant concentrations of missing retail outlets should be able to do a quick check on the area. If it still looks empty on the map, they can head there with a reasonable expectation of adding enough new retail outlets to make the trip worthwhile.

Feedback based on local knowledge would be welcome, to help refine this a bit more.

Helping contributors to find nearby concentrations of missing retail outlets is one way to quickly increase the overall volume of data. A different starting point is to assume that thorough retail coverage in some areas has a higher value to data users than adding missing shops elsewhere. On that basis we may want to encourage contributors to concentrate first on mapping areas which we think have the highest potential value.

This example picks out a limited number of smaller towns and cities where OSM data might have high value (e.g. to students or visitors).

Areas coloured:
  • blue already contain more than 75% of my estimated number of retail outlets
  • green contain 50-75% of my estimated number of retail outlets
  • orange contain 25-50% of my estimated number of retail outlets
  • red contain less than 25% of my estimated number of retail outlets 
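The banding above can be sketched in a few lines. This is only an illustration: the function name is made up, and the handling of the exact 25%, 50% and 75% boundaries is an assumption, since the ranges as listed don't say which band a boundary value belongs to.

```python
def coverage_band(mapped: int, estimated: float) -> str:
    """Return the map colour for an area from its mapped and estimated outlet counts."""
    if estimated <= 0:
        raise ValueError("estimated must be positive")
    share = mapped / estimated
    if share > 0.75:       # more than 75% of the estimate already mapped
        return "blue"
    if share >= 0.50:      # 50-75% (boundary handling is an assumption)
        return "green"
    if share >= 0.25:      # 25-50% (boundary handling is an assumption)
        return "orange"
    return "red"           # less than 25%
```

An area with 10 mapped outlets against an estimate of 100 would come out red; one with 80 against 100 would come out blue.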

Each area is intended to cover a manageable size: one where a few contributors should quickly be able to bring retail content up to an impressive level. Larger cities are excluded on the basis that they justify a more systematic approach. My list is a bit arbitrary – it is intended to cover a mix of different towns of roughly similar size, distributed across the country. Are these really the areas where OSM retail data is likely to have most value? I doubt it, but that might be a useful discussion point in its own right. For each suggestion of a settlement that should be added, please feel free to suggest one that should be removed.

I can only assess how useful these estimates might be in areas that I know fairly well. Feedback on any unexpected results would be useful: to better understand where the technique can be improved.

Feedback to contributors

All contributors deserve to see the results of their work. But not all retail information is rendered on the standard map. And in my view it never can be (and shouldn't be). So to encourage contributors I would like to see a decent alternative to the standard map which shows more complete retail information. When I want to check specific content of the database I use either a data extract, Overpass, or the “Map Data” overlay on the standard map view. I'm happy to do this, but for many contributors (and particularly for novices) none of these techniques is particularly user-friendly. I suspect that building such a map is beyond my own technical capabilities, but there are examples (based on various data extracts) that illustrate the kind of thing that can be done.

Data collection

When mapping retail areas there will normally be some shops already recorded in OSM, which need checking. Alongside other existing features such as road junctions, these also provide reference points for adding new data. When surveying retail premises, it's handy to have a crib sheet which shows the current state of the data, and on which to note any changes. This needs to show every relevant feature in the database, including some which won't be rendered on the standard map.

Below is an example generated (automagically, with some rather clunky SQL) from OSM data for Winchester High Street (the pedestrian part). It starts at the western (top) end.

I've set this up to collect any shops, amenities and offices within 25 metres of the highway centre line, and display them in order. This simplistic approach only exhibits the most basic information, and includes more features than I would really want: including shops beyond each end of the central line, up side streets, and occasionally from a nearby street running parallel. But it's easy enough to cross out any unnecessary entries. To allow for some additions there's an additional spacer inserted every 20 metres (roughly twice the width of a conventional shop front). I find sheets like this speed up the data collection process and make it easier to add notes.
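The sheet itself came out of SQL, but the same idea can be sketched in plain Python. This is a rough illustration rather than the query actually used: the function names are invented, and it assumes the centre line and the features have already been projected to coordinates in metres.

```python
import math

def project_onto_line(point, line):
    """Return (distance from line, distance along line) for the nearest point on a polyline."""
    px, py = point
    best = (math.inf, 0.0)
    travelled = 0.0
    for (ax, ay), (bx, by) in zip(line, line[1:]):
        dx, dy = bx - ax, by - ay
        seg_len = math.hypot(dx, dy)
        if seg_len == 0:
            continue
        # Fraction along the segment of the closest point, clamped to its ends.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len**2))
        dist = math.hypot(px - (ax + t * dx), py - (ay + t * dy))
        if dist < best[0]:
            best = (dist, travelled + t * seg_len)
        travelled += seg_len
    return best

def crib_sheet(line, features, max_offset=25.0, spacer_every=20.0):
    """List features within max_offset of the line, in order, with blank spacer rows."""
    placed = []
    for name, x, y in features:
        offset, along = project_onto_line((x, y), line)
        if offset <= max_offset:
            placed.append((along, name))
    placed.sort()
    rows, next_spacer = [], spacer_every
    for along, name in placed:
        # Emit one blank row for each spacer interval passed since the last entry.
        while along >= next_spacer:
            rows.append("")
            next_spacer += spacer_every
        rows.append(name)
    return rows
```

Because the nearest point is clamped to the ends of the line, anything within 25 metres beyond either end is still picked up, which is the same over-inclusion described above.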

Consistency checks

I hope I've made a clear case that across most of the UK adding missing retail data is a higher priority than cleaning up tagging inconsistencies. However, this isn't true everywhere, and pointers to inconsistencies could help contributors to clean local data.

Some basic consistency checks are easily carried out on Overpass:
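As a purely illustrative sketch, the helper below builds an Overpass QL query for one possible quirk: features that carry both a shop tag and an amenity tag. The choice of check, the function name and the area lookup by name are all assumptions for the sake of the example, and some double-tagged features will be perfectly legitimate.

```python
def build_overpass_query(area_name: str, timeout: int = 25) -> str:
    """Build an Overpass QL query for features tagged as both a shop and an amenity."""
    return (
        f"[out:json][timeout:{timeout}];\n"
        # Look up the named area, then search nodes and ways inside it.
        f'area["name"="{area_name}"]->.searchArea;\n'
        "(\n"
        '  node["shop"]["amenity"](area.searchArea);\n'
        '  way["shop"]["amenity"](area.searchArea);\n'
        ");\n"
        "out center;\n"
    )

query = build_overpass_query("Winchester")
```

The resulting text can be pasted into an Overpass front end; looking up an area by name alone may match more than one place, so a real check would probably need a tighter area filter.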

But this isn't ideal for finding all quirky data within a local area, and finding more complex inconsistencies sometimes involves extensive processing that isn't really practical interactively. Overpass isn't the ideal solution here, but it is possible to do more crunching on a data extract. Here are some examples. Unlike Overpass, anything here that is fixed won't be quickly updated in the overlay (some of these quirks are already fixed, which could get annoying). Note that, for the sake of simplicity, this overlay only contains some of the features in the UK that exhibit these quirks.


Brian said...

Hi Peter

I'm interested in your source data for the Ladywood/NIA/ICC/Broad Street polygon as I'm not convinced there are 50 missing retail premises in that neighbourhood


Brian (brianboru OSM username)

gom1 said...

Thanks for your comment Brian.

Local knowledge easily trumps theoretical estimates, so I'm sure you are right. What is interesting to me is why I am wrong, because that might help to improve the estimate. I'd be interested to know how far you think I am out, and whether the following makes sense.

The underlying data that I am using to estimate the number of outlets in that area is:

44 retail properties from the Food Standards Agency Food Hygiene address data
Population of 482 based on ONS census data
Population density of 1,554 per sq km
Area of 0.310 sq km
38 retail properties uncovered in OSM (including shops and certain retail outlets marked as amenity / office)

My formula for estimating the number of retail outlets at this level of geography comes up with a prediction of 100 retail outlets for that area. However, the total prediction over the whole of Birmingham comes to more than the actual number of outlets published by the VOA. So I have adjusted all Birmingham predictions down by about 11%.
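A minimal sketch of that arithmetic, assuming the adjustment is a simple multiplicative scaling of exactly 11% (the figure is only given as "about 11%"). The function name is invented for illustration.

```python
def missing_outlets(raw_estimate: float, city_adjustment: float, mapped_in_osm: int) -> float:
    """Scale the raw prediction down by the city-wide adjustment, then subtract what is mapped."""
    adjusted_estimate = raw_estimate * (1 - city_adjustment)
    return adjusted_estimate - mapped_in_osm

# Figures quoted for this area: a prediction of 100, adjusted down by about 11%,
# against 38 retail properties already found in OSM.
shortfall = missing_outlets(100, 0.11, 38)
```

With these inputs the adjusted estimate is roughly 89 outlets and the shortfall roughly 51, which is where a figure of about 50 missing premises comes from.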

Across the whole country (as best I can tell) about 80% of my estimates come within about 20% of reality. I think that's good enough to point out areas with low coverage, but not good enough to identify areas with very high coverage. A few estimates look way out, and I'm still trying to understand how I could identify these.

Some possible reasons for the discrepancy here:

The most difficult areas to predict are certain large seaside towns (which tend to be under-stated, presumably because retail there is catering for large numbers of visitors) and smaller towns on the edge of large conurbations (which tend to be over-stated presumably because the neighbouring city is drawing away some trade). Perhaps the second could be related?

My map shows a simplified outline of the area, which is slightly different to the actual outline used in calculations. In some areas this means that one side of an important street has been added or removed, but that doesn't affect the calculations, and doesn't seem to be the case here.

In this case, by far the biggest influence on the estimate is the number of properties in the FSA database, most of which look like pubs and restaurants. Other factors have only limited effect on the final prediction, and in the end they more-or-less cancel each other out. I'm guessing that the problem here is dealing with a retail area where there is a high concentration of food outlets, and a low proportion of non-food retail. That is a pattern that shows up in some large conurbations, and one which my current model doesn't take into account.

What I haven't figured out is how to get round it (if, indeed, that is possible).

gom1 said...

It's also possible that I am picking up FSA outlets that I shouldn't be, or missing OSM features.

There are 22 premises that appear in both of my lists (All Bar One, Bank Restaurant, Boots, Cafe Rouge, Carluccios, Chilacas, Costa, Eat, Eds Diner, Gourmet Burger, Handmade Burger, La Tasca, Pitcher & Piano, Pizza Express, Reflex, Slug&Lettuce, Spar, Starbucks, Strada, Flapper & Firkin, Malt House, Wagamama)

These are the premises that appear for this area in my extract of FSA data, but do not appear in my extract of OSM data:

Barclaycard Arena
Brasshouse Centre
Caffe Nero
Cambrian Stores
Celebrity Indian Restaurant
Crescent Theatre
Ethos Urban
Eurest-Compass Group
Institution of Engineering and Technology
International Convention Centre
Lloyds TSB
National Sea Life Centre
Oak Kitchen
Thai Edge
Bannatyne Spa
Floating Coffee Company
Prince of Wales

And for completeness, these are the outlets that appear in my extract of OSM data, but not in my FSA data.

baguette du Monde
Cafe Vite
Castle Fine Art
Centenary Cafe
City Cafe
Prince of Wales
Teppan Yaki
The Brasshouse
W H Smith

There are also two unnamed cafes and one "shop=alcohol" which might match some of the FSA premises.

gom1 said...

One more thing I should add for anyone following this discussion who isn't familiar with the area.

We are discussing how to measure a few dozen retail outlets here: out of more than 11,000 in a city with a population over 1 million. Retail outlets across Birmingham have been thoroughly mapped on OSM, and among UK cities of this size retail is among the most well covered in OSM.

Marc Gemis said...

On one hand I see Brasshouse & Brasshouse Centre, on the other hand The Brasshouse.
Furthermore, one list contains Ethos Urban, the other contains Ethos.
In both lists you have an Edmunds.

Does this mean that your matching algorithm can/should be improved?

gom1 said...

Thanks for that Marc.

There may be some misunderstanding though - there is no matching algorithm. I'm working from counts of outlets within an area, not the individual names. The lists in my replies were produced by manually matching data extracts - in the hope that it would help local mappers find any missing outlets. Apart from me missing some, there will no doubt have been some name changes etc. Places close, and others open all the time.

How did you get on with the others on the list?