Wednesday, 22 July 2015

OSM Retail Survey: Part-6

To assess how OSM data compares with commercial services, I compared similar searches of retail data across different types of platform. I have not yet managed to do this programmatically, but a broad impression can be gained by comparing the results from a commercial search engine with the results of searching a similar area for equivalent tags on Overpass Turbo (http://overpass-turbo.eu/). The comparisons cannot be carried out precisely, so the approach relies on general impressions, and the findings are more qualitative than quantitative. Because this approach is so subjective, it would be interesting to hear the impressions that others have of similar comparisons.

OSM was searched by specific categories of shop. The equivalent searches of commercial engines relied on using similar keywords. While these two different approaches can produce similar numbers of results, there were also differences in the specific results that were obtained.

| Scope | OSM | Commercial | Notes |
|---|---|---|---|
| Supermarkets and convenience stores around Maidenhead | Around 50 examples | Around 50 examples | Similar coverage, and similar mix. Both identify many convenience stores as a supermarket |
| Pet shops in Truro | None | Three, plus some variants, such as pet charities | OSM retail coverage is incomplete. Commercial search is more effective |
| Fishmongers across Norfolk | Around 60 examples | Around 40 examples | OSM coverage better within Norwich (though some duplicates) but thin elsewhere. Commercial search produces more results in coastal towns which are less well-mapped in OSM |
| DIY on Tyneside | Around 80 examples | Around 40 examples | Both find branches of major chains. Commercial search picks up smaller stores by name match, including some false positives. OSM picks up some smaller hardware shops based on DIY tagging |
| Cafés in Harrogate | Around 20 examples | Around 70 examples | OSM retail coverage looks incomplete. Commercial search better at finding in-store cafés and similar, but also includes many false positives (e.g. restaurants) |

Inherently, the OSM search was looking for a particular “key=value” pair. I have no inside knowledge of exactly how commercial search engines do this, but it's well understood that, given a particular keyword, they use subtle algorithms to find equivalent matches within bodies of text. This includes some fuzzy searching using inflexions, synonyms and various matching algorithms to expand the scope of results beyond the specific keywords that were requested. For example, if we ask a search engine to find “pharmcy” we are not surprised when it corrects the spelling to “pharmacy” and then retrieves “pharmacies”, “pharmacist”, etc. We expect such a search to find retail pharmacies, but we are not surprised that it also retrieves university courses, job vacancies, drug manufacturers, and work by Damien Hirst as well.
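To make the idea of fuzzy expansion a little more concrete, here is a minimal sketch in Python. It is purely illustrative and is not a description of how any commercial search engine actually works. The `RELATED_FORMS` table and the `expand_query` function are hypothetical names invented for this example, and the spelling correction uses the standard library's `difflib.get_close_matches`.

```python
from difflib import get_close_matches

# Hypothetical lookup of related word forms, using the examples from the text.
# A real search engine would draw on far richer linguistic resources.
RELATED_FORMS = {
    "pharmacy": ["pharmacy", "pharmacies", "pharmacist"],
    "bakery": ["bakery", "bakeries", "baker"],
}


def expand_query(term, cutoff=0.8):
    """Correct a possibly misspelt keyword, then expand it to related forms.

    Returns the original term unchanged (as a one-item list) when nothing
    in the lookup table is a close enough match.
    """
    candidates = get_close_matches(term.lower(), list(RELATED_FORMS), n=1, cutoff=cutoff)
    if not candidates:
        return [term]
    return RELATED_FORMS[candidates[0]]


print(expand_query("pharmcy"))
```

The point of the sketch is only that the search terms actually used can differ from the keyword that was typed, which is why results drift beyond what was literally requested.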

I suspect that commercial search engines are also embedding some assumptions about major retail chains. So, for example, if I search for a hardware shop they seem to have some understanding that branches of B&Q and Wickes will also be of interest.

By contrast, the assumed behaviour of data retrieval in OSM is that it will be based on a search for nodes, ways and relations that satisfy a specific set of documented values within a limited subset of available keys. This implicit assumption about how data retrieval will work has an effect on the way that contributors choose how to represent data.

Certainly this model has advantages, and opens up opportunities for users of OSM data that may be difficult to achieve with services that operate on different search techniques. For example (and for some encouragement about the quality of data that OSM is already able to deliver), try searching for a café with wheelchair access in London.
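For anyone who wants to try the wheelchair-accessible café search, the sketch below builds an Overpass QL query that could be pasted into Overpass Turbo. It assumes the usual tagging of amenity=cafe and wheelchair=yes, and it selects the area by name alone, which may match more than one place called London; within Overpass Turbo the `{{geocodeArea:London}}` shortcut is probably a better choice. The function name `build_overpass_query` and its parameters are made up for this example.

```python
def build_overpass_query(key, value, area_name, extra_tags=None, timeout=25):
    """Return an Overpass QL query for key=value within a named area.

    extra_tags is an optional dict of further key=value filters that
    every result must also satisfy.
    """
    filters = f'["{key}"="{value}"]'
    for extra_key, extra_value in (extra_tags or {}).items():
        filters += f'["{extra_key}"="{extra_value}"]'
    return (
        f"[out:json][timeout:{timeout}];\n"
        # Assumption: the area is identified by its name tag only.
        f'area["name"="{area_name}"]->.searchArea;\n'
        # nwr matches nodes, ways and relations in one statement.
        f"nwr{filters}(area.searchArea);\n"
        "out center;"
    )


query = build_overpass_query("amenity", "cafe", "London", {"wheelchair": "yes"})
print(query)
```

Because every filter is an exact key=value match, the results should contain only objects that a contributor has explicitly tagged in this way, which is the source of both the precision and the gaps discussed here.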




It may, of course, be a reasonable assumption that future OSM data retrieval will be heavily based on searching for combinations of specific key=value pairs, but this may also be too limiting as a way to think about how things will work. For example, an application that is asked to find a cycle shop could search both "shop=bicycle" and "name similar to Halfords". A search for a hardware shop could well re-cast this as a search for any combination of shop=hardware / doityourself / trade, or any outlet that is part of a chain that has a name like B&Q, Homebase, Wickes, Jewson, etc.
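A rough sketch of how such a combined test might look is given below, working on tag dictionaries that have already been retrieved from OSM. The function names, the similarity threshold and the sample records are all assumptions made for illustration; "name similar to" is approximated here with a prefix check plus `difflib.SequenceMatcher` from the Python standard library, which is only one of many ways it could be done.

```python
from difflib import SequenceMatcher

# Tag values and chain names taken from the hardware shop example in the text.
HARDWARE_SHOP_VALUES = {"hardware", "doityourself", "trade"}
HARDWARE_CHAINS = ["B&Q", "Homebase", "Wickes", "Jewson"]


def name_similar(name, chain, threshold=0.8):
    """Return True if name starts with, or closely resembles, the chain name."""
    a, b = name.casefold(), chain.casefold()
    if a.startswith(b):
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold


def is_hardware_outlet(tags):
    """Match on the shop tag first, then fall back to the chain name."""
    if tags.get("shop") in HARDWARE_SHOP_VALUES:
        return True
    name = tags.get("name", "")
    return any(name_similar(name, chain) for chain in HARDWARE_CHAINS)


# Invented sample records, not real OSM data.
samples = [
    {"shop": "hardware", "name": "Smith & Sons"},
    {"shop": "yes", "name": "Wicks"},         # misspelt chain name
    {"name": "B&Q Extra"},                    # no shop tag at all
    {"shop": "bakery", "name": "Greggs"},
]
matches = [s for s in samples if is_hardware_outlet(s)]
print(matches)
```

The name fallback is what lets a search recover outlets that are mapped but tagged loosely, at the cost of admitting the kind of false positives that name matching produces in commercial search.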

In summary: at its best, OSM is capable of outperforming a commercial search engine in terms of both the quantity and precision of the results obtained. Generally, searches based on OSM data should retrieve fewer false positives because they can draw (to a greater extent) on a degree of data structure. However, successful retrieval of data from OSM relies heavily on the volume of data recorded for a particular type of shop within a particular area.

OSM coverage tends to vary more from place to place. In areas where OSM coverage is around 50-60% of retail premises, my impression is that data users can expect the results of a search of OSM to match the volume of data retrieved from a commercial search engine. Commercial search engines do not find every retail outlet, so in places where OSM coverage is almost complete, data users can expect better results from OSM. However, for most retail formats across much of the UK, a search for retail premises on OSM is less effective in retrieving results than a commercial search engine.
