Friday 31 July 2015

OSM Retail Survey: Part-12

I see that most readers of these posts come from outside the UK. I hope the information is of some interest and use to others, but I should probably emphasise that this is just a survey of UK retail data on OSM. The structure of the retail industry differs from place to place. The UK, for example, has a relatively high proportion of retail turnover through large retail chains: i.e. fewer small retailers than France, and fewer mid-sized retailers than Germany. It would be interesting to know whether similar patterns of OSM data are found elsewhere, or whether there are differences that we can all learn from.

After the most basic information on the location and type of shop, the most common (and for many data users, the most useful) supplementary information in OSM is the name of the retailer. Some form of name is provided for around 90% of retail premises, but the proportion is lower for certain types of outlet. Most UK Post Offices, for example, do not have a name tag.

Names will probably be most useful to the data user because they allow searches for a specific retailer. In the UK, understanding the name of the chain may carry relatively high value, because the name often gives a clear idea of the format and range of goods on sale (most people have an idea of what to expect in a branch of Argos, even if the tagging of shop type is inconsistent).

Names are also a useful characteristic when analysing retail data. Where chains have multiple similar branches, this provides a vehicle for checking tagging consistency across a chain, and information can be extracted on OSM coverage of the larger chains.

There are numerous variations in the way names are provided by contributors. These fall into several groups:

  • Variation in capitalisation: Aldi / ALDI,  Asda / ASDA, Spar / SPAR, Lidl / LIDL, Pets at Home / Pets At Home
  • Variations in the use of abbreviations: Co-op, Co-Op / Co-operative, M&S / Marks & Spencer / Marks and Spencer, Toni&Guy / Toni & Guy / Toni and Guy, B&Q / B & Q / B and Q
  • Variations in the use of a possessive, with or without apostrophe: Sainsbury / Sainsbury's / Sainsburys, Wilkinson / Wilkinsons / Wilkinson's, McColl / McColl's, Jewson / Jewsons, Maplin / Maplins
  • Variations in the use of a definite article: Co-operative Food / The Co-operative Food, Money Shop / The Money Shop, Body Shop / The Body Shop, Carphone Warehouse / The Carphone Warehouse
  • Variations in spacing: WHSmith / WH Smith / W H Smith, Kwik Fit / Kwik-Fit, One Stop / One-Stop / OneStop, Phones 4U / Phones4U, TK Maxx / TKMaxx 
  • Including or omitting the location of the branch in the name: “Tesco, Swansea Marina”, “Batley Tesco”, “Tesco Barnstaple”, “Tesco Extra, Hartlepool”, “Tesco - Biggin Hill Express”, “Tesco (Greenock)”, “Tesco Rugeley Superstore”, “Tesco Ystrad Mynach”, “Tesco Extra (24hr) Maldon”, “Tesco South Tottenham Superstore”, and many, many more.
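
As a rough illustration of the kind of workaround a data user might apply, the sketch below folds the simpler variations (capitalisation, “&” against “and”, a leading “The”, an apostrophe-s, spacing and hyphens) into a single comparison key. The function name and the set of rules are assumptions made for illustration, not an established OSM convention. It does nothing for abbreviations such as M&S, for possessives written without an apostrophe, or for branch locations embedded in the name.

```python
import re

def normalise_name(name: str) -> str:
    """Fold a shop name into a crude comparison key (illustrative helper)."""
    key = name.casefold().replace("&", " and ").strip()
    key = re.sub(r"^the\s+", "", key)    # leading definite article
    key = re.sub(r"['’]s\b", "", key)    # possessive written with an apostrophe
    return re.sub(r"[\W_]+", "", key)    # spacing, hyphens, other punctuation

# Three spellings of the same chain collapse to one key.
variants = ["B&Q", "B & Q", "B and Q"]
keys = {normalise_name(v) for v in variants}
```

A key of this kind is only useful for grouping and matching. The original name should be kept for display, and anything the rules do not cover (“Sainsburys” against “Sainsbury”, for instance) would still need a hand-built lookup.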

Different names for different formats within a chain are a slightly different issue. Tesco, Tesco Metro, Tesco Express and Tesco Extra can all be considered valid names for different retail formats used by Tesco. Similarly, Sainsbury's and Sainsbury's Local. These cannot really be considered variations of the same name. There is evidence, though, that they are not used consistently, so “Tesco” is sometimes used as a synonym for the various different formats (and at other times it isn't).

It is difficult to put a firm number on how many times all these different kinds of inconsistency occur, but it certainly runs into thousands (i.e. between 1% and 10% of mapped shops). So they are likely to cause some data users a degree of frustration. However, similar patterns recur, and some names are particularly prone to certain variations. With varying degrees of effort, data users who place a high value on standardised names will probably be able to find ways to work round many of the differences that are most important to them.

Occasionally the name will appear as the value in the "shop=" tag, but this is very rare. Various other tags are used to hold different types of name: primarily “name”, “operator”, and “brand”.  In around 4% of cases the contributor has provided a name, plus details on the operator, brand, or both, and in less than 2% of cases they have provided information on the operator, or brand, but not the name of the outlet. There is some inconsistency in the way that “operator” and “brand” are used, which will make life a bit difficult for data users.

Data users who ignore “operator” and “brand”, and use just the “name” field will lose information from around 6% of recorded retail outlets. However, there are certain sub-sectors where the operator or brand tags are more widely used.

In the case of petrol stations, for example, both are used, but both the brand of fuel and the operator are scattered quite widely across the name, operator and brand tags. For car showrooms contributors normally use the name tag to show the name of the outlet, but it is not uncommon for the name tag to show the car manufacturer. In around 11% of cases the manufacturer appears under brand, but in some cases it appears under operator. This inconsistency in the use of names could be a lost opportunity to provide additional information to data users, but it is difficult to know how much of a problem it will really give them. Anyone who wants to search for a Volkswagen Dealer, for example, will need to search for “VW” or “Volkswagen”. If they are capable of doing that on the name tag, then they will be able to search the name, brand and operator tags almost as easily. If they go to this amount of effort, they will currently pick up around 30 VW dealerships in the UK. This is well short of the true figure of around 200. Again, missing data is a bigger problem here than inconsistent tagging. Sensible end-users who want to find a VW dealer will use the manufacturer's dealer search rather than OSM data.
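
To show how little extra effort is involved in searching all three tags rather than just the name, here is a minimal sketch. It assumes each outlet is already available as a plain dictionary of tags; the helper name and the sample records are invented for illustration and are not real OSM data.

```python
import re

NAME_KEYS = ("name", "brand", "operator")

def mentions(tags: dict, terms) -> bool:
    """True if any term appears as a whole word in name, brand or operator."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return any(pattern.search(tags.get(key, "")) for key in NAME_KEYS)

# Invented sample records, for illustration only.
outlets = [
    {"shop": "car", "name": "Anytown Motors", "brand": "Volkswagen"},
    {"shop": "car", "name": "VW Anytown"},
    {"shop": "car", "name": "Anytown Ford", "operator": "Anytown Motor Group"},
]
vw_dealers = [o for o in outlets if mentions(o, ["VW", "Volkswagen"])]
```

The whole-word match is a deliberate choice, so that a short term such as “VW” is not picked up inside a longer word.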

For most types of retail there is no major issue with misuse of the “brand” tag. On the whole it is used consistently for car dealerships and filling stations, and little used elsewhere. There is, though, some confusion around tagging of convenience store names. In the UK these are often independently owned and operated, but trading under a well-known national franchise. Examples include SPAR, Londis, Costcutter, Premier Stores and Nisa. Together these represent around a third of the convenience store sector in the UK, so they have a significant presence, but none of the various combinations of “name”, “operator” and “brand” really captures the business model. As a result the way they are tagged is quite inconsistent (roughly 80% of the time the franchise is tagged as “name”, roughly 13% as “operator”, and 6% as “brand”).

Address details are attached to about a third of retail properties in OSM, but only 15% have a postcode, and the proportion with a complete street address is in the region of 5-10% (depending on what components a user requires in order to regard the address as complete).

Contact details (web site, or phone) are provided for around 10% of retail properties, although for certain types the proportion is higher. For restaurants and bars, for example, the proportion with contact details is around 20%, and for estate agents around 25%. It is more common to find a web site than a phone number, and comparatively rare to find both. In the case of restaurants, for example, 15% are tagged with a web site, 10% with a phone number, 5% with both, and 80% with neither. Restaurants have a higher level of coverage for contact details than most retail sectors on OSM, and yet, out of 60,000 restaurants in the UK, OSM has contact details for just over 2,000. This looks a long way from being a set of information that is viable enough to attract data users.
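
The restaurant percentages only add up if the 15% and the 10% each include the 5% that have both. A quick check of that arithmetic, using only the figures quoted above (the 2,000 is a rounded stand-in for “just over 2,000”):

```python
# Percentages of OSM restaurants, as quoted in the text.
website, phone, both = 15, 10, 5

either = website + phone - both   # web site or phone (or both)
neither = 100 - either

# Restaurants with contact details as a share of all UK restaurants.
share_of_uk = 2_000 / 60_000
```

So roughly one restaurant in thirty across the UK has any contact details in OSM.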

Information on accessibility (“wheelchair=”) is provided for around 4% of retail properties, with a higher proportion for cafés and restaurants (7%) and for supermarkets (8%). Where wheelchair information is provided the value is “yes” in 63% of cases, but “no” in 20% of cases, and “limited” in 16% of cases.

Information on opening hours is provided for under 3% of retail properties. The types of outlet that fare better than average for this information are an odd mixture: Supermarkets (but not Convenience Stores), Bars (but not Pubs), Pharmacies (but not Post Offices) and Bicycle Shops. Cafés and Restaurants rate only slightly higher than average.

It's worth looking more closely at an example where the opening hours have been provided more often than average, and where they could play an important role in any feasible application. If I wanted to find a pharmacy near to home, that is open on a Sunday afternoon, then the nearest pharmacy where I can check “opening_hours” on OSM is nearly 50km away (and as it happens, it isn't open on a Sunday afternoon). There are four pharmacists within 100km that OSM tells me are open on a Sunday afternoon, but only one of them shows a phone number. This is potentially an application area where the tagging structure is ready to support a viable application. In London, and a few other towns and cities, contributors have been diligent in adding sufficient detail to make a search viable (Stoke on Trent, Norwich,...). But across most of the country, it doesn't look to me as though the data content is anywhere near ready to attract users to this type of data.

In summary, OSM has the potential to support more sophisticated searches of retail data than a simple location search, in the sense that data structures are in place to hold much valuable information. Where these are used, they are used fairly consistently. However, in most cases coverage of supplementary data is an issue. There is a considerable way to go before the supplementary data is sufficiently complete and consistent. The first priority is probably to work towards more consistent naming. Beyond that, at present, the potential applications for this data are hypothetical, and it is too early for an informed debate on priorities.

Monday 27 July 2015

OSM Retail Survey: Part-11

The occupants of a retail premise will change over time, and as a result we should expect retail data in OSM to continually evolve as well.

Most of the retail data within OSM is less than 5 years old, so the chances are that the bulk of this is still more-or-less current. Around 5% of the data is 5 years old, and 2% is 6 years old. A growing proportion of this older data could now be inaccurate, but across the country it is likely that the proportion of data that is out-of-date will only represent a few percent of the total.

In some places (e.g. Islington, Leeds, and Sheffield), more than 10% of shops were added to the database over five years ago. In parts of Kent more than 30% of the shops were added more than five years ago. So there may be a case for some local reviews of older data, to update anything that has changed since it was last recorded.

In most locations, though, the current priority will still be to add missing data, then later work towards greater accuracy.

In the longer term, that picture is likely to change. In the last 12 months 28,000 retail properties in England have been edited. That's 5% of retail properties that have either been added to OSM, or updated. Some of the changes in OSM data over the last year will have been to correct a spelling, or to adapt tagging, and will not have involved a re-survey. But we can't easily measure how much has been fully updated. So for now, let's be optimistic, and assume that every edit brings that particular shop up to date.

If nothing changed on the ground, then at this rate it will take more than a decade to approach complete coverage of retail. But the situation on the ground does change. In 2014 the average length of a retail lease was less than nine years, and almost half of retail leases were for less than four years. Retail leases used to be for a longer period, and because of peaks in construction activity in 1990 and 2000 an unusually high number of 25-year, and 15-year leases are currently due for renewal.

 Not all retail property is leased, leases will sometimes be renewed without change of occupant, and some might carry forward for generations. So I don't know what proportion of OSM high street data we should expect to change over a year. If only 5% changes then current levels of editing activity are sufficient to maintain existing data, and gradually close the gap of missing data. But if we assume 10% of existing retail premises change over a year then the current rate at which OSM retail data is being edited will not be enough to deliver and maintain complete and accurate data on all retail properties in England.

Nationally, perhaps something like 7,000-14,000 entries on the database should be updated each year. Around where I live, the rate of change looks closer to 10% per year, rather than 5%, so I'm guessing a decent estimate of the national picture will be closer to the higher figure.

As database volumes rise there will be more to maintain. If contributors concentrate on adding missing retail properties, then by the time coverage reaches about 50%, the existing data will be going out of date as fast as new data is being added. If contributors concentrate on maintaining what has already been added, then they will have no time to add the missing 50% of retail properties. Either way, for the foreseeable future, there is going to be a lot of retail data that is either missing from OSM, or incorrect on OSM.
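
The break-even point can be sketched with a very simple model. It assumes, as above, that every edit brings one shop fully up to date, that a fixed share of premises changes each year, and that the total number of premises is about 560,000 (the figure implied if 28,000 edits represent 5%). The function name is hypothetical, and the model ignores everything else.

```python
EDITS_PER_YEAR = 28_000
TOTAL_PREMISES = 560_000   # implied by 28,000 edits being 5% of retail properties

def sustainable_coverage(churn_rate: float) -> float:
    """Share of all premises that the current edit rate can keep up to date.

    Mapped data goes stale at churn_rate * coverage * TOTAL_PREMISES per year;
    coverage stops growing once that equals EDITS_PER_YEAR.
    """
    return min(1.0, EDITS_PER_YEAR / (churn_rate * TOTAL_PREMISES))

at_five_percent = sustainable_coverage(0.05)   # churn matches the edit rate
at_ten_percent = sustainable_coverage(0.10)    # churn is double the edit rate
```

On these assumptions a 5% rate of change is just sustainable, while a 10% rate of change caps coverage at about half of all retail premises.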

 If we can wait long enough, other factors might help. A decline in the number of retail premises would also accelerate progress towards 100% coverage, and the chart shows the effect of a 2% reduction in the number of retail properties each year. Even if this is factored in, reaching a worthwhile level of retail cover still looks like a slow process. Too slow.


It is not only individual shops that change. Retail business models also evolve, and over the long term we should expect this to affect the choice of tags. Some formats which once were common on the high street no longer exist (ironmongers into hardware, then homeware). A traditional grocery, or a video rental shop is now unusual.

On the other hand, perhaps candle shops are returning to the high street (“shop=candle”), and e-cigarettes are a recent arrival. The data on chains of mobile phone shops may be an example of how this process continues. Currently these chains are tagged with a mix of “mobile_phone” (95%), and “electronics” (5%). Perhaps contributors are adapting their tagging, in recognition that an established speciality has now matured, and the offer is starting to evolve as retailers extend into adjacent markets.

Sunday 26 July 2015

OSM Retail Survey: Part-10

Specialised types of shop offer a narrow range of categories, but provide wide choice within their specialist area. Generalist retailers offer a broad range of product categories, with less choice within each category. Large generalists (e.g. supermarkets) are able to offer both numerous categories and broad choice.

We use different terms for a large “supermarket” (with breadth and depth), a small “convenience” store (with some breadth and less depth), and a “butcher” or “newsagent” which we expect to be more specialised. We expect a newsagent to offer a wider choice of newspapers and magazines than a convenience store, but we would still expect a convenience store to offer newspapers. We expect a convenience store to offer much more than newspapers, and we would be surprised if a newsagent offered nothing but newspapers. We expect a butcher to offer a wider choice of meat than a convenience store. Ours does excellent sandwiches, ready meals, pies, vegetables, and various other items as well. Although the principles are fairly clear, the precise boundaries between retail categories are always going to be difficult to pin down.

As a result, it doesn't matter how clear the definitions are for different terms covering different levels of specialisation. We should still expect some inconsistency in the way that different tags are used. Some retailers have a business model that is closer to the boundary than others, so it is inevitable that there will be a grey area where it is difficult to maintain a consistent boundary. The proper question isn't whether tagging ought to be consistent. It's whether there should be more consistency than we find.

To my mind there are several areas where the data does not look consistent enough. This is particularly true in the case of large stores which sell a broad range of goods (the big generalists).

For example, a data user who searches for “supermarket” and relies on the wiki for the definition, will expect to find “a large store for groceries and other goods” or “a full service grocery store that often sells a variety of non-food products as well”. They will assume (perhaps because the wiki tells them) that “stores that do not provide full service grocery departments are generally not considered supermarkets”.

In practice they will find results that include a high proportion of outlets that fit this description, including most branches of the major chains that they will expect to find: ALDI, ASDA, Booths, Co-op, Iceland, Lidl, Morrison's, Sainsbury's, Tesco, Waitrose, etc. However, they will also pick up a lot of convenience stores, and some stores tagged “supermarket” where few shoppers would expect to find groceries: Argos, Homebase, Matalan, Mothercare, Pets at Home, etc.

I estimate that around 10% of the data that they retrieve will not be what they expect.

Commercial search engines face a similar problem, because  smaller convenience stores often call themselves a supermarket, and this is inevitably picked up in their keyword searches. But OSM has a more structured data model. We should expect to perform better.

The situation with department stores is even more difficult for data users. The major chains are well covered, but they only represent about half of all retail outlets tagged as a department store. Data users who rely on the Wiki definition will be expecting “a large store with multiple clothing and other general merchandise departments”. They probably won't expect to pick up Poundstretcher, Argos, Matalan, Pets at Home, Staples, Superdrug, TK Maxx, etc. - but they will.

Wilkinson's (Wilko) is a difficult boundary case - with a particularly wide range of different key values for different branches.  My own view is that something like “homeware” would be the best description of their format, but only about 2% of contributors agree with me. And in practice, what should matter to data users is not what I think (even when I am right). What has to matter to data users is the consensus that develops across the majority of contributors. And in this particular case there is little consensus. It is difficult for anyone to know whether to consider Wilkinson's a department store or not. What is even more unsatisfactory for data users is that 25% of Wilkinson's stores are considered to be a department store, and even though that's the most popular option, 75% are tagged differently.

Neither of these examples is the result of a problem with the definition of the tags for a supermarket or a department store. The problem is that the same tags are being quite widely used for branches of chains where most contributors prefer an alternative. Good data on department stores and supermarkets is polluted by inconsistent data on other retail formats.

Looking further, the confusion lies partly in representing scale consistently, and partly in representing the degree of specialisation consistently.

Most specialists offer some categories of product that fall outside their main area of activity. Some position themselves as specialists in more than one area. As a result contributors can find it difficult to draw a consistent distinction between a specialist and a generalist outlet. If they are uncertain about the right specialist term to use, they tend to look for something more generic, and fall back on terms intended for generalists. This isn't entirely unreasonable behaviour. For a long time, the guidance, when in doubt, has been to pick a popular tag that best fits the situation (rather than inventing a new one). Contributors don't necessarily have an understanding of all the tags in use, and the result is that popular tags that were originally intended to apply to large outlets which offer a broad range are quite commonly used for smaller outlets offering a broad range, and for unusual specialists that are difficult for contributors to classify.

Looking at this another way, we have a choice of terms for shops which offer a broad range. Contributors who find it difficult to pick an appropriate tag veer towards picking one from a higher row in this table - they are the ones that are most widely used.


                  | Primarily food                      | Primarily non-food                              | Hardware / building materials
Large generalists | “supermarket”                       | “department_store”                              | “doityourself” (or sometimes “trade”)
Other generalists | “convenience”                       | “general” (rare) or “variety” (for pound shops) | “hardware”
Specialists       | “bakery”, “butcher”, “cheese”, etc. | “clothes”, “beauty”, “houseware”, etc.          | “garden_centre”, “paint”, etc.


One result of tending towards tags for larger generalists is that supermarkets are over-represented in OSM. Industry figures show 6,410 stores in this category in the UK, whereas I found 7,045 (110%) in OSM. Convenience stores, on the other hand are under-recorded. I found 9,717 out of 48,303 identified by the industry (i.e. just 20%).
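
The two coverage figures come straight from dividing the OSM counts by the industry counts. The short sketch below reproduces that arithmetic with the numbers quoted above; the variable names are just labels chosen for the example.

```python
# Store counts quoted in the text.
industry = {"supermarket": 6_410, "convenience": 48_303}
osm = {"supermarket": 7_045, "convenience": 9_717}

# OSM count as a fraction of the industry count, per category.
coverage = {kind: osm[kind] / industry[kind] for kind in industry}
```

A ratio above 1 for supermarkets can only mean that some outlets tagged “supermarket” in OSM are not counted as supermarkets by the industry.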

It is obvious from the data that contributors find it difficult to make a distinction between a supermarket and a convenience store. In England and Wales the law on opening hours varies for different sizes of store, with restricted hours on Sunday for those of more than 280 sq. metres (3,000 sq. ft.). So a supermarket of less than 280 square metres (3,000 sq. ft.) would normally be considered a convenience store, and a convenience store of more than 280 square metres would be considered a supermarket. However, at least 9% of outlets marked as a supermarket in OSM (and recorded as an area rather than a node) have a floorspace of less than 280 sq. metres. Around one in three of the stores operated by one of the major convenience store chains is tagged as a supermarket. Convenience stores don't have to offer extended opening hours, we can't really expect contributors to measure the footprint, and the situation is further confused because some convenience stores describe themselves as a supermarket. The upshot is that almost a thousand convenience stores in OSM are marked as a supermarket. And meanwhile, because convenience stores are generally under-recorded, around 30% of the general grocery sector has yet to be added to OSM.
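
For outlets mapped as an area, a floorspace check along these lines is easy to automate once the footprint has been measured. The sketch below assumes the areas have already been calculated in square metres (that step is not shown), treats a store of exactly 280 sq. metres as a convenience store, and uses invented sample areas. The function names are hypothetical.

```python
SUNDAY_TRADING_LIMIT_M2 = 280   # threshold quoted in the text (about 3,000 sq. ft.)

def format_by_area(area_m2: float) -> str:
    """Retail format suggested by floorspace alone."""
    return "supermarket" if area_m2 > SUNDAY_TRADING_LIMIT_M2 else "convenience"

def share_below_limit(areas_m2) -> float:
    """Fraction of the given footprints that fall under the threshold."""
    areas = list(areas_m2)
    return sum(a < SUNDAY_TRADING_LIMIT_M2 for a in areas) / len(areas)

# Invented footprints for outlets tagged "supermarket", for illustration only.
sample_areas = [150, 400, 900, 2500]
small_share = share_below_limit(sample_areas)
```

A building footprint is only a proxy for sales floorspace, so a check like this is better suited to flagging candidates for review than to retagging anything automatically.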

Changing tack, department stores sell a range of general merchandise, typically including clothing, household appliances, toys and games, personal-care products and garden equipment. Some also sell food, but non-specialised food stores are properly classified as supermarkets. With very few exceptions the major UK department store chains, such as John Lewis, Debenhams, and House of Fraser are tagged correctly as a department store. However, not all retail premises tagged “department_store” comfortably fit the description.

Examples include branches of Argos (normally tagged “catalogue”), TK Maxx and Matalan (normally tagged “clothes”), Poundland (normally tagged “variety_store”, sometimes “convenience” or “supermarket”), Mothercare (normally “baby_goods”, sometimes “clothes”), Wilkinson's (“department_store” for 25% of branches, plus a wide range of different alternatives).

The Wiki describes Do-It-Yourself-stores as being similar to hardware stores, except generally larger, stocking a wider range of products, and targeting customers who are non-professionals working on home improvements, redecorating, gardening, etc. Pure DIY stores are well covered in the database, and consistently tagged. In the case of Homebase, B&Q and Wickes, for example, more than two-thirds of branches are in the database,  and well over 90% are tagged as “doityourself”.

The same is not true of builders' merchants (which according to the documentation are properly tagged as “trade”). Fewer than 10% of Jewsons, and Travis Perkins branches are in the database, and they are tagged with a mix of “doityourself”, “hardware”, and “trade”, with “doityourself” as the most common.

There seem to be two issues here. One is that many trade outlets also serve non-professionals, so their business model overlaps with the scope of “doityourself” (this is accepted in the documentation on “shop=trade”, but contributors are either uncomfortable with it, or simply don't recognise these as trade outlets). The other issue is that there are different degrees of specialisation in the trade side of the market. Specialists in supplying the trade with building materials, timber, plumbing, bathroom furniture, electrical goods, tools, etc. all seem to be under-recorded, and inconsistently tagged. Again, where there is no clear consensus, contributors have fallen back on common tags such as “doityourself” and “hardware”, that were originally intended for generalists supplying the non-professional, and so are more widely used.

Branches of Wilkinson's and Robert Dyas don't fit comfortably into any of the most common categories, so they tend to suffer from highly inconsistent tagging (department_store, doityourself or hardware). We could blame contributors, but surely some of the tagging inconsistency shows that there may be a need for:

  • more specific options to cover particular retail formats that do not comfortably fit the current categories
  • more generic options, so that contributors have an alternative to popular tags intended for large generalists 

Saturday 25 July 2015

OSM Retail Survey: Part-9

False synonyms

True synonyms add to the confusion, provoke debate, and may discourage some data users, but in practice I suspect “false synonyms” are a bigger problem. By this, I mean tags that are used interchangeably by contributors, even when they are not true synonyms according to the guidelines. Again, we can use major chains to do some cross-checking of whether tags with similar meanings are applied consistently.

  • Almost every major chain of pharmacies has a mix of outlets tagged as “shop=pharmacy” and “shop=chemist”. 
  • Similarly “alcohol”, “wine”, “beverages” seem to be used interchangeably for chains of off-licences and wine merchants, with “alcohol” as the most common of these. The less common “off-licence” is not widely used on retail outlets
  • For chains such as Ladbrokes and William Hill, “bookmaker” is the most common, but “betting” and “gambling” are also quite common
  • There is a lot of overlap between outlets that are described by the relatively common “doityourself”, and the less common “hardware”, “building_supplies”, “trade”
  • For “mobile_phone” the less common alternatives are “phone” and “electronics”. Tagging “electronics” could be a symptom of an evolving retail format, whereas “phone” looks like a false synonym.

The documentation in the wiki makes it reasonably clear that the above are not true synonyms, but contributors have treated them as synonyms in the sense that similar branches of the same chain use a mix of different values. As a result, data users are unable to tell where there is a true difference, and where there is imprecise tagging. In effect data users are being pushed to treat these as synonyms, even though they are documented as having different meanings.

These are examples of retail formats that contributors have difficulty with. Data users, those who maintain the documentation, and those who advocate changes to tagging need to be sensitive to where these occur. We'll look in more detail at some common examples shortly.

Multi-specialists

The above are all examples of specialist retailers. Multiple specialities are another area that give contributors a problem. Halfords is one of the most easily identified examples. How best to tag a store that offers both bicycles and car parts? The solutions that contributors have come up with include around 30 different variants:

  • Choosing just one of the options: “bicycle”, “automotive”, “car_accessories”, “auto_accessories”, “car_parts” and ignoring any other area of specialisation
  • Contributing a list of options separated by semi-colons: “bicycle;car_parts”, “car;bicycle”, “bicycle; car_accessories”, “motor;bicycle”
  • Using a more generic category: “doityourself”, “hardware”

The usual way to assign multiple values to a key is a list separated by semi-colons. In practice this is not widely used for shops (less than one in a thousand examples), but there are examples which give an idea of other multiple specialities that are giving contributors problems:

  • “hairdresser;beauty”
  • “kitchen;bathroom” 
  • “greengrocer;florist”
  • “dry_cleaning;laundry”
  • “art;frame”
  • “car;bicycle”
  • “shoe_repair;key_cutting”
  • “bicycle;car_parts”
  • “tattoo;piercing”

Noticeably, these are all pairs. Happily, there don't seem to be any long lists of shop types. Contributors recognise that the intention is to record mixed types of speciality shop, not to list all the categories of goods for sale.

The limited number of examples means that these won't give data users a great problem. If they choose to ignore them they won't lose much data. If they prefer to break out the list then it won't give them much difficulty. More importantly, to my mind, contributors are sending signals here about retail formats that they find it difficult to categorise. This could be valuable information for those who maintain the documentation, and those who advocate changes to tagging.
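
Breaking out a semi-colon list takes only a line or two. This is a minimal sketch with a hypothetical helper name; it also tolerates the stray spaces seen in values such as “bicycle; car_accessories”.

```python
def shop_values(value: str) -> list[str]:
    """Split a semi-colon separated shop value into its individual parts."""
    return [part.strip() for part in value.split(";") if part.strip()]

parts = shop_values("bicycle; car_accessories")
```

A single-valued tag comes back as a one-item list, so the same code can be applied to every shop value without a special case.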

Friday 24 July 2015

OSM Retail Survey: Part-8

Consistency

One way to assess tagging consistency is to examine differences in tagging across similar outlets of the larger retail chains. Contributors don't always agree on how to tag similar shops, they don't always follow the guidelines, the guidelines aren't static, and they aren't always consistent.

Regardless of what the documentation might say, and the merits of any minority view, in practice data users will have to follow the consensus that has been adopted by the majority of contributors.

In principle crowd-sourcing will end up tagging most of a retail chain with the “correct” tag. By examining variations in tagging across a retail chain we can get an idea of the proportion of outlets that have been tagged according to the consensus, and how many fall outside the consensus. Data users will be able to accommodate variations, to some extent, but they won't be able to accommodate all of them.

  • In the case of banks, for example, there is very little variation in tagging: 100% of Barclays,  HSBC, Natwest, and Lloyds / TSB branches are tagged “amenity=bank”. 
  • In other sectors, Subway doesn't fall far behind the consistency of banks at 95% tagged “amenity=fast_food”.
  • At the other extreme there are more challenging examples. Wilkinson's seems to be one of the more difficult chains for contributors to classify: “department_store” is the most common choice, but only accounts for 25% of examples. “hardware”, “variety_store”, “doityourself”, “supermarket”, “general”, “convenience”, “household” and “houseware” are also popular. Robert Dyas, with a similar retail format, faces similar difficulties. 
  • In the case of Halfords 42% of branches are tagged “shop=bicycle” (which doesn't really capture their business format) and the rest use a wide variety of tags. 
  • Argos has 40% tagged “shop=catalogue” and the rest a variety. 
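
The consistency figures quoted here are simply the share of a chain's branches that carry the most common value. A minimal sketch of that measure is shown below; the function name and the ten sample branches are invented for illustration.

```python
from collections import Counter

def consensus(values):
    """Return the most common tag value and the share of branches using it."""
    counts = Counter(values)
    value, n = counts.most_common(1)[0]
    return value, n / len(values)

# Ten invented branches of one chain, for illustration only.
branches = ["catalogue"] * 4 + ["department_store"] * 3 + ["electronics"] * 3
top_value, share = consensus(branches)
```

A share close to 1 indicates a settled consensus, as with the banks. A share below one half means that even the most popular value is a minority choice.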

In general the most specialised chains tend to be tagged more consistently, and the most consistent tagging of all is found within chains of smaller outlets, with a well-established, widely understood,  unambiguous specialisation (“estate_agent”, “funeral_directors”, “hairdresser”, “toys”, “optician”, “laundry”, and “travel_agency”).

Less consistent tagging is found in chains where the specialisation is more ambiguous (“gift”, “catalogue”, “accessories”).

There are many shops that are not part of a chain, and we can't easily assess how consistently they are tagged. But if we assume that the pattern of tagging inconsistencies across retail chains is repeated across the whole of the retail market, then we can get some idea of how consistent tagging might be overall. In practice there tends to be more consistency across larger chains, and less across smaller chains, so results vary according to how widely we cast the net. As a broad indication we should probably anticipate that something in the region of 20% of retail outlets have been tagged with a value that differs from the one that the majority of contributors would choose (and hence the value that data users would have to expect).

Some variation is inevitable: retail business models evolve over time, and vary from place to place; different contributors place different emphasis on different characteristics; tagging guidelines change as they are refined. However, if 20% of existing data is tagged with a value different to the one that most contributors would choose, then across England there are almost 30,000 retail premises in the database that data users will find it hard to recognise, and which should perhaps be brought more into line. After the 385,000 missing retail premises, it seems to me that this must rank as the second largest data quality issue.

Synonyms

Many community discussions of tagging inconsistencies revolve around synonyms. The controversy often lies in deciding when different contributors are using different terms to describe exactly the same thing, and when they are using different terms to describe subtle differences.

Any list of synonyms invites debate, but examples that are unlikely to be controversial, and where the difference is more than a spelling mistake, would probably include travel_agent / travel_agency, newspaper / newsagent, jewellery / jewelry, and deli / delicatessen. I suspect that most would also count baby / baby_goods, seafood / fish / fishmonger, bathroom_furnishing / bathroom, beauty_salon / beauty, etc. as true synonyms.

If this is anywhere near a complete list, then true synonyms do not look like a significant problem across all retail data. Including spelling mistakes they account for fewer than 1% of all shops in the database. However, they represent a higher proportion of data within some categories of shop, and they can account for a significant proportion of the more unusual categories.

The retail categories where synonyms are likely to present the greatest problem are where they account for a significant proportion of an important category. Everyone will have different ideas of what makes a proportion significant, and a category important, so it is worth considering a couple of real examples.

I reckon there are about 2,500 delicatessens in the UK, and I can find just under 500 in the database. Of those, 456 are tagged “shop=deli”, and 39 are tagged “shop=delicatessen”. Any data user who searches for “shop=deli” will miss 39 delicatessens in the database with the “wrong” tag value, and will miss about 2,000 delicatessens that aren't in the database at all (or at least not with a recognisable tag). Of the two, the bigger problem is surely the 2,000 missing delicatessens.

On the other hand, some synonyms have a more balanced mix of values. Out of 950 independent fishmongers in the UK, 80% haven't been recorded at all.  Of the 20% of fishmongers that are in the database, 47% are tagged “shop=seafood”, 44% are tagged “shop=fishmonger”, and 9% are tagged “shop=fish”. This is more problematic, because anyone who looks for just one of the values is going to miss about half of the available data. Nevertheless, I suspect that anyone who is thinking of trawling the data for a fishmonger is still going to be scuppered by the 80% that are missing from the database altogether, not the inconvenience of testing for two or three different tag values.

I reckon that even within the more problematic categories the issues with synonyms aren't difficult to manage. Where data volumes are small, it is not difficult to fix the data. Where data volumes are large, and one value is dominant, then data users who don't look for a synonym will only lose a small proportion of the data. Where volumes are large and synonyms equally matched then keen data users will go to the trouble of testing for several different values.
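For a data user, "testing for several different values" can be as simple as folding known synonyms into one value before counting or searching. The sketch below assumes a small hand-made mapping taken from the synonym pairs mentioned in this post. Which value is treated as canonical in each group is my assumption for illustration.

```python
from collections import Counter

# Illustrative mapping only; the canonical value chosen for each
# group is an assumption, not a statement of agreed tagging.
SYNONYMS = {
    "delicatessen": "deli",
    "fishmonger": "seafood",
    "fish": "seafood",
    "travel_agent": "travel_agency",
    "jewellery": "jewelry",
}

def canonical(value):
    """Map a shop value to its canonical form, leaving unknown values alone."""
    return SYNONYMS.get(value, value)

def count_by_canonical(values):
    """Count shop values after folding synonyms together."""
    return Counter(canonical(v) for v in values)

# Made-up sample of shop values.
print(count_by_canonical(["deli", "delicatessen", "fish", "seafood", "fishmonger"]))
```

The mapping has to be maintained by hand, which is manageable precisely because the list of true synonyms is short.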

Problems with spelling and synonyms are not difficult to fix, but they are relatively small in number, so not the highest priority. The bigger challenge is to achieve greater consistency in the choice of tag for similar shops. The data can provide some pointers on how to do that, but they will wait for the next post.

Thursday 23 July 2015

OSM Retail Survey: Part-7

This material seems to be generating quite a bit of interest, and I'm starting to get questions asking about what it means in practice. We'll come to that, but first I'd like to consider a different aspect of data quality. So far most of the focus has been on coverage: what proportion of retail features have been added to the database.

Coverage of one common category of shop has not been considered yet, though. In around 1% of cases the intent of the contributor was clearly to indicate that this was a shop that was not in use. These include “shop=closed”, “shop=empty”, and most commonly “shop=vacant”.

High street vacancy rates across the UK are currently averaging around 10%. Out of around 50,000 vacant shops, we have data on just over 1,000 (2%). This is one of the lowest levels of coverage that we have identified. We can probably assume that contributors are most active in the most vibrant high streets (i.e. those with fewest vacancies), but this still suggests that vacant shops are badly under-recorded in OSM. It is difficult to say whether that means the missing vacant shops are completely un-recorded, or recorded in a way that is difficult to recognise. Either way they are not readily available to data users. However, that probably doesn't matter greatly. It's difficult to imagine many users who would value an application that can find the nearest vacant shop.

But data quality is not just about completeness. We must also question whether the recorded data accurately represents what is on the ground.

In my efforts to uncover as many retail premises as possible I've identified over 2,000 different tag combinations. Around 80 of those account for more than 95% of retail premises. The most common 26 account for 85%. Among the 2,000 are around 200 minor spelling mistakes. These represent 10% of the tagging variations, but a much smaller proportion of the data.

My estimate of the number of spelling mistakes is based on calculating the Levenshtein distance between different values of the shop tag. Where there are only one or two differences in spelling between one tag and another, my initial premise is that the less common variant is a spelling mistake for the more common alternative. However, this approach also picks up some correct values of the shop tag, that have to be eliminated manually from the sample (“shop=car” and “shop=card” for example only differ in one character, but are not spelling variations of each other). The approach is bound to miss some more complex spelling mistakes, but hopefully not too many. I think it is capturing the great majority.
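The approach can be sketched in a few lines. This is a minimal illustration rather than the code I actually used: the function names and the `known_distinct` parameter (the manually maintained list of pairs such as car / card that are not spelling variants) are assumptions. The sample counts come from the table of examples further down this post.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character from a
                current[j - 1] + 1,            # insert a character into a
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def likely_misspellings(counts, max_distance=2, known_distinct=frozenset()):
    """Return (rarer value, commoner value) pairs within max_distance edits."""
    values = sorted(counts, key=counts.get, reverse=True)
    pairs = []
    for i, common in enumerate(values):
        for rare in values[i + 1:]:
            if counts[rare] >= counts[common]:
                continue  # equal counts: no basis for picking a "correct" one
            if frozenset((rare, common)) in known_distinct:
                continue  # manually excluded, e.g. car / card
            if levenshtein(rare, common) <= max_distance:
                pairs.append((rare, common))
    return pairs

counts = {"carpet": 494, "carpets": 32, "card": 77, "cards": 60,
          "off_licence": 17, "off_license": 10}
print(likely_misspellings(counts))
```

Comparing every pair of values is quadratic, but with around 2,000 distinct values that is still only a few million short comparisons.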

Variations in the use of plural and singular forms account for around 60% of these errors; differences in capitalisation for around 8%; and differences in hyphenation and underscores for around 6%. The remaining 25% of near matches are more diverse. Overall this approach detected spelling mistakes in around 0.7% of shop tagging.

The "proper" values of “shop” that are most commonly misspelled are “card” (cards), “carpet” (carpets), “solicitors” (solicitor), and “off_licence” (off_license, offlicence, off-licence).

Given the controversies over bulk editing, it may be worth noting that
  • the number of retail features in the database which contain a spelling mistake in the shop key is in the order of 1,000 (compared to 385,000 missing retail premises). 
  • around a third of the spelling mistakes in the shop tag are unique occurrences
  • many spelling mistakes are an unusual spelling of a value which itself is comparatively rare (or a non-standard use of the “shop” tag)
  • data users are probably just going to ignore these - the volumes of lost data are too small to justify a lot of effort on their part   
In other words, it looks as though very few of these cases are suitable for bulk editing: virtually all either need to be checked and fixed manually, or can be more easily fixed manually than with a bulk edit.

Examples of spelling mistakes which occur more than a couple of times include:

Less common value | Occurrences | More common equivalent | Normal occurrences
cards | 60 | card | 77
carpets | 32 | carpet | 494
solicitor | 22 | solicitors | 32
crafts | 16 | craft | 152
kitchens | 15 | kitchen | 287
bathrooms | 14 | bathroom | 72
antique | 13 | antiques | 395
game | 13 | games | 29
chandler | 13 | chandlery | 15
bookmakers | 12 | bookmaker | 1,542
opticians | 12 | optician | 1,226
communications | 12 | communication | 14
beds | 11 | bed | 100
tile | 11 | tiles | 67
window | 11 | windows | 35
fireplaces | 11 | fireplace | 26
printers | 11 | printer | 25
furnishing | 10 | furnishings | 23
off_license | 10 | off_licence | 17
grocer | 9 | grocery | 52
estate agent | 8 | estate_agent | 1,656
accountants | 8 | accountant | 23

Wednesday 22 July 2015

OSM Retail Survey: Part-6

To assess how OSM data compares to commercial services, similar searches of retail data were run on both types of platform. I have not yet managed to do this programmatically, but a broad impression can be gained by comparing the results from a commercial search engine with the results of searching a similar area for equivalent tags on Overpass Turbo (http://overpass-turbo.eu/). The comparisons cannot be carried out precisely, so the approach relies on general impressions, and the findings are more qualitative than quantitative. Because this approach is so subjective, it would be interesting to hear the impressions that others have of similar comparisons.

OSM was searched by specific categories of shop. The equivalent searches of commercial engines relied on using similar keywords. While these two different approaches can produce similar numbers of results, there were also differences in the specific results that were obtained.

Scope | OSM | Commercial | Notes
Supermarkets and convenience stores around Maidenhead | Around 50 examples | Around 50 examples | Similar coverage, and similar mix. Both identify many convenience stores as a supermarket
Pet shops in Truro | None | Three, plus some variants, such as pet charities | OSM retail coverage is incomplete. Commercial search is more effective
Fishmongers across Norfolk | Around 60 examples | Around 40 examples | OSM coverage better within Norwich (though some duplicates) but thin elsewhere. Commercial search produces more results in coastal towns which are less well-mapped in OSM
DIY on Tyneside | Around 80 examples | Around 40 examples | Both find branches of major chains. Commercial search picks up smaller stores by name match, including some false-positives. OSM picks up some smaller hardware shops based on DIY tagging
Cafés in Harrogate | Around 20 examples | Around 70 examples | OSM retail coverage looks incomplete. Commercial search better at finding in-store cafés and similar, but also includes many false positives (e.g. restaurants)

Inherently, the OSM search was looking for a particular “key=value” pair. I have no inside knowledge of exactly how commercial search engines do this, but it's well understood that - given a particular keyword - they use subtle algorithms to find equivalent matches within bodies of text. This includes some fuzzy searching using inflexions, synonyms and various matching algorithms to expand the scope of results beyond the specific keywords that were requested. For example, if we ask a search engine to find “pharmcy” we are not surprised when it corrects the spelling to “pharmacy” and then retrieves “pharmacies”, “pharmacist”, etc. We expect such a search to find retail pharmacies, but we are not surprised that it also retrieves university courses, job vacancies, drug manufacturers, and work by Damien Hirst as well.

I suspect that commercial search engines are also embedding some assumptions about major retail chains. So, for example, if I search for a hardware shop they seem to have some understanding that branches of B&Q and Wickes will also be of interest.

By contrast, the assumed behaviour of data retrieval in OSM is that it will be based on a search for nodes, ways and relations that satisfy a specific set of documented values within a limited subset of available keys. This implicit assumption about how data retrieval will work has an effect on the way that contributors choose to represent data.

Certainly this model has advantages, and opens up opportunities for users of OSM data that may be difficult to achieve with services that operate on different search techniques. For example (and for some encouragement about the quality of data that OSM is already able to deliver), try searching for a café with wheelchair access in London.




It may, of course, be a reasonable assumption, that future OSM data retrieval will be heavily based on searching for combinations of specific key=value pairs, but this may also be too limiting as a way to think about how things will work. For example, an application that is asked to find a cycle shop could search both "shop=bicycle" and "name similar to Halfords". A search for a hardware shop could well re-cast this as a search for any combination of shop=hardware / doityourself / trade, or any outlet that is part of a chain that has a name like B&Q, Homebase, Wickes, Jewson, etc.
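A minimal sketch of such a re-cast search, for the hardware example: a feature matches either on its shop value or on a chain name. The function name and the crude name normalisation are my assumptions. A real application would need fuzzier matching (this version would miss “B and Q”, for instance).

```python
HARDWARE_TAGS = {"hardware", "doityourself", "trade"}
HARDWARE_CHAINS = ("b&q", "homebase", "wickes", "jewson")

def is_hardware_shop(tags):
    """True if a feature matches by shop value or by chain name."""
    if tags.get("shop") in HARDWARE_TAGS:
        return True
    # Crude normalisation: lower-case and drop spaces, so "B & Q" -> "b&q".
    name = tags.get("name", "").lower().replace(" ", "")
    return any(name.startswith(chain) for chain in HARDWARE_CHAINS)

# Made-up features, represented as plain dictionaries of OSM tags.
features = [
    {"shop": "doityourself", "name": "Example DIY"},
    {"shop": "yes", "name": "B & Q"},
    {"shop": "bakery", "name": "Example Bakery"},
]
print([f["name"] for f in features if is_hardware_shop(f)])
```

The name test rescues branches of known chains whose shop value is unhelpful, at the cost of maintaining a list of chain names.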

In summary: at its best, OSM is capable of outperforming a commercial search engine in terms of both the quantity and precision of the results obtained. Generally searches based on OSM data should retrieve fewer false positives because they can draw (to a greater extent) on a degree of data structure. However, successful retrieval of data from OSM relies heavily on the volume of data recorded, of a particular type of shop, within a particular area.

OSM coverage tends to vary more from place to place. In areas where OSM coverage is around 50-60% of retail premises then my impression is that data users can expect the results of a search of OSM to match the volume of data retrieved from a commercial search engine. Commercial search engines do not find every retail outlet, so in places where OSM coverage is almost complete data users can expect better results from OSM. However, for most retail formats, across much of the UK a search for retail premises on OSM is less effective in retrieving results than a commercial search engine.

OSM Retail Survey: Part-5

Apart from estimating overall coverage, it should also be possible to provide feedback on the type of coverage within a town or similar area. In one small town that I am fairly familiar with, pubs had been thoroughly mapped, but none of the cafés or shops had been mapped. It is relatively easy to measure that kind of discrepancy in the OSM data, and contributors might find that kind of feedback useful as a pointer to areas that need more attention.

It isn't difficult to derive some broad rules of thumb about the balance between different types of retail premises that might suggest where coverage looks incomplete. Across all of the data that I have extracted, 48% of retail premises are shops, and the rest offer either refreshments or services. There are a number of places where the mix is quite different. Of course it may be that some of these towns have an extraordinarily large number of cafés and pubs. More likely, contributors haven't got round to adding many shops yet.
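As a rough sketch of that rule of thumb: compute the share of shops in each town and flag towns that sit a long way from the overall 48%. The tolerance of 20 percentage points, the category names and the sample towns are all assumptions made for illustration.

```python
def shop_share(counts):
    """Share of a town's retail premises that are shops (None if no data)."""
    total = sum(counts.values())
    return counts.get("shops", 0) / total if total else None

def flag_unusual_mix(towns, expected=0.48, tolerance=0.20):
    """Return towns whose shop share differs from `expected` by more than `tolerance`."""
    flagged = {}
    for town, counts in towns.items():
        share = shop_share(counts)
        if share is not None and abs(share - expected) > tolerance:
            flagged[town] = round(share, 2)
    return flagged

# Made-up counts for two hypothetical towns.
towns = {
    "Town A": {"shops": 5, "refreshments": 30, "services": 15},
    "Town B": {"shops": 48, "refreshments": 30, "services": 22},
}
print(flag_unusual_mix(towns))  # {'Town A': 0.1}
```

A flagged town is only a prompt to go and look; the mix may genuinely be unusual on the ground.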



Similarly, there are towns where there don't seem to be as many cafés and pubs as one would normally expect. Again, this could reflect reality on the ground, but it might also point to areas that deserve some more attention.



Following the same line of thought, it ought to be possible to measure the mix on individual shopping streets. For this experiment I used the centre of Nottingham. I have no local knowledge of Nottingham, but the coverage of retail premises there is comprehensive - so the data is relatively easy to work with. Here the mix of retail premises is highlighted on any street where there is a decent sample to work with. The proportion of Shops is shown in Cyan; Refreshments (cafés, pubs, etc) in Yellow, and Services (banks, estate agents, etc) in Magenta. Green implies areas where shops and refreshments predominate. Orange implies that refreshments and services predominate (i.e. comparatively few shops). The idea was to test whether it is possible to give contributors an overall impression of the contents of the map which they can compare against local knowledge of how the town centre is organised – at a broader level than the detailed location of individual shops. It has flaws, and the data is difficult to manipulate - so I'm not convinced the approach is practical - but it might point a way towards better alternatives.



Contributors with an interest in mapping particular types of retail may be able to take advantage of the fact that similar types of retail tend to cluster together. On OSM, 85% of clothes shops have another clothes shop within 100 metres (25% have at least 10 more clothes shops within 100 metres); 70% of banks have another bank within 100 metres; 60% of estate agents and 60% of fast food outlets have an estate agent / fast food outlet within 100 metres; 40% of pubs have another pub within 100 metres. Identifying this kind of cluster might be helpful for some kinds of location search, and it may also provide useful feedback to contributors, who are able to compare the state of the map against local knowledge to identify clusters that look incomplete.
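Figures like these can be derived by checking, for each outlet, whether another outlet of the same type lies within 100 metres. The sketch below uses the haversine formula on latitude / longitude pairs and a brute-force comparison. It is an illustration under those assumptions, not the method behind the percentages above, and the coordinates are made up.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

def share_with_neighbour(points, radius_m=100):
    """Share of points that have at least one other point within radius_m."""
    if not points:
        return 0.0
    hits = 0
    for i, (lat1, lon1) in enumerate(points):
        if any(i != j and haversine_m(lat1, lon1, lat2, lon2) <= radius_m
               for j, (lat2, lon2) in enumerate(points)):
            hits += 1
    return hits / len(points)

# Made-up coordinates: two outlets about 55 m apart, one about 2 km away.
shops = [(53.4800, -2.2400), (53.4805, -2.2400), (53.5000, -2.2400)]
print(share_with_neighbour(shops))
```

Comparing every pair is fine for a single town; a national extract would want a spatial index.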

Here, for example is a map of Manchester showing clusters of clothing shops that can be identified from existing data. The analysis began with a broad definition of a clothing shop (shop=clothes, shoes, fashion, boutique, or department store) then used R clustering capabilities (the DBSCAN algorithm) on a data extract to find areas where there are more than five clothing shops within 100 metres of each other. This particular example is probably of limited use to those of us who are unfamiliar with Manchester (and also, for that matter, for those of us who are unfamiliar with shopping for clothes). But on the face of it, there must be quite a lot of missing clothes shops in Manchester, and the presence and absence of clusters in the data might point local fashion-conscious mappers to areas that deserve attention.
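The map was produced in R, so the following is not the code used. It is a small pure-Python version of the DBSCAN algorithm, included only to show the idea. It assumes coordinates already projected to metres, and it reads "more than five shops within 100 metres" as a minimum of six points (counting the shop itself), which is my interpretation.

```python
from math import hypot

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (x, y) in enumerate(points)
                if hypot(x - xi, y - yi) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue             # already assigned to a cluster
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also a core point: keep growing
    return labels

# Made-up coordinates in metres: six shops along one street, two isolated shops.
shops = [(0, 0), (10, 0), (20, 0), (30, 0), (40, 0), (50, 0), (1000, 1000), (2000, 0)]
print(dbscan(shops, eps=100, min_pts=6))  # [0, 0, 0, 0, 0, 0, -1, -1]
```

Points labelled -1 are the isolated shops; each non-negative label marks one cluster that could be drawn on the map.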


SK53 has just pointed out that it should be possible to extend this kind of approach using Food Hygiene data to identify retail areas, and compare them with OSM data. I haven't tried yet, but it sounds like a promising idea.



Here is an additional example, picking up on the idea that Food Hygiene data might be used to identify suburban areas that need more attention. The Food Hygiene data shows location and food hygiene status for a variety of retail outlets, including pubs, supermarkets, takeaways, restaurants, cafes and some other types of retailer. Of course, the same data could also be used to identify individual outlets that are missing, but since the data only covers certain types of outlet, the aim here is more general. The idea is to identify suburban areas where there may be several missing retail outlets, including some that don't offer food. 

These are Liverpool suburbs where there is Food Hygiene data on at least five retail outlets, but none appear in OSM. Relatively few areas in the UK fit these rather crude criteria. More sophisticated approaches must be possible, but refining them will need more experimentation, and that will take longer. Meanwhile this suggests that the general approach should work in principle.



And another example, covering Sunderland. This uses the more granular ONS Lower Layer Super Output Areas. Those rendered are where OSM contains no retail outlet, but the Food Standards Agency has at least one Food Hygiene Record (for a high-street business type). The darker the polygon, the more FSA records it contains, and hence the more retail outlets are likely to be missing from OSM. 





And a third example, for Sheffield, showing the difference between the number of Food Hygiene Records (for high-street business types), and the number of OSM retail features that fall within each LSOA. Once again, the figures aren't directly comparable. The aim is to highlight areas where the OSM data is implausibly thin, so the figures are no more than a proxy measure of how great the shortfall is likely to be. Areas are not coloured where the volume of OSM data is equal to or larger than the FSA Food Hygiene figure - but this doesn't necessarily imply that they are complete. The real message is "if you go to the dark red areas you should find lots of unmapped shops to add".
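The comparison behind the Sheffield map amounts to a subtraction per area, keeping only the areas where OSM has fewer features than the Food Hygiene data. A minimal sketch, assuming the two sets of counts have already been aggregated by LSOA. The area codes and counts below are placeholders, not real data.

```python
def shortfall_by_area(fsa_counts, osm_counts):
    """Return {area: FSA count - OSM count} for areas where OSM has fewer, largest first."""
    gaps = {}
    for area, fsa in fsa_counts.items():
        gap = fsa - osm_counts.get(area, 0)
        if gap > 0:  # areas where OSM >= FSA are left uncoloured
            gaps[area] = gap
    return dict(sorted(gaps.items(), key=lambda item: item[1], reverse=True))

# Placeholder area codes and made-up counts.
fsa = {"area-1": 12, "area-2": 3, "area-3": 5}
osm = {"area-1": 2, "area-2": 7}
print(shortfall_by_area(fsa, osm))  # {'area-1': 10, 'area-3': 5}
```

As noted above, the two counts are not directly comparable, so the result is a ranking of where to look rather than a count of missing shops.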




OSM Retail Survey: Part-4

It is often useful to raise our sights from the data that has been recorded in OSM, and consider the data that hasn't been recorded. As discussed above, the overall level of retail coverage in England is 27% of retail premises. However, there are variations in the extent to which different types of shop are recorded.

There are a number of ways to assess OSM coverage of a specific sector at a national level. The approach, broadly, is to count the number of shops of a particular type that are already recorded in OSM, estimate the total number that should be there, and compare the two.
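The arithmetic is simple enough to show directly. This sketch uses three pairs of figures quoted later in this post (OSM count, estimated UK total); the function name is just for illustration.

```python
def coverage(osm_count, estimated_total):
    """Proportion of the estimated real-world total that appears in OSM."""
    return osm_count / estimated_total

# (OSM count, estimated UK total), taken from the examples in this post.
figures = {
    "post_office": (7622, 11696),
    "bicycle": (1631, 2500),
    "deli": (456, 2500),
}
for tag, (found, total) in figures.items():
    print(f"{tag}: {coverage(found, total):.0%}")
# post_office: 65%
# bicycle: 65%
# deli: 18%
```

The result is only as good as the estimate of the total, which is usually the weaker of the two numbers.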

There are various sources of statistics that give basic information on numbers of different types of shop at a national level. Getting useful estimates of numbers at a more local level is a bigger challenge.


  • It isn't difficult to find information on the number of branches for major retail chains - through their own publications, from business reporting, or from Wikipedia. The OSM Wiki has a detailed page on major UK Retail chains. 
  • There are national statistics which can be used for some sectors and types of shop (for example, see UKBA01a Enterprise/local units by 4 Digit SIC and UK Regions; and Retail Hereditaments by Administrative Area issued by the Valuation Office Agency). 
  • For independent specialists, many trade associations publish figures on the size of their sector.
  • Press articles and market research companies will sometimes publish figures on the number of different types of retailer. 
  • When all else fails, searching a directory  (such as Yellow Pages) can give some idea of the likely number of outlets. 
  • Where retailers need to be licensed (tattoo parlours, for example), I thought it would be easy to obtain figures at a local authority level. No doubt this would be possible (through FOIA, for example), but so far I haven't found a more accessible source of licensing statistics. Local figures may be easier to find from the licensing authority, and any ideas would be welcome on where to find national figures.

For specific retail locations:


  • within the OSM community, Robert Whittaker has tools relating to Post Offices on his Post Hoc pages (http://robert.mathmos.net/osm/postboxes/). 
  • Beyond the community, most large retail chains publish a list of branches, and some have given permission for this information to be added to OSM. 
  • Trade associations for specialist independent retailers don't normally seem to provide information on the location of individual members, but some may.
  • The NHS provides lists of pharmacies and opticians.

Using this approach, and taking three examples where we might expect levels of recording to be relatively high:

  • There are 11,696 post offices in the UK, and I have found 7,622 (65%) of them in OSM. Around 90% of built-up areas with a population of more than 5,000 have a post office in OSM. The 72 that don't might be good places to find missing post offices. More generally, post offices can be used as an indicator of a wider gap in coverage. They are one of the types of retail outlet that are likely to be added before other retail properties. So a larger settlement with a missing post office is likely to contain other retail properties that need to be added. 
  • There are 11,647 community pharmacies in England. I found 4,225 tagged as a pharmacy (36% of the total), and another 483 tagged as “chemist”. Strictly speaking “chemist” is for shops that don't supply prescriptions, but has been quite widely used as a synonym for “pharmacy”. Taken together we locate about 40% of community pharmacies. More than half are missing. I imagine that towns the size of Braintree, Grantham, Peterlee, Melton Mowbray, Haverhill, Maghull, and Congleton have a community pharmacy – but none seems to be recorded in OSM (they are displayed in Google location searches). Only around half of the pharmacies in England are specialised shops – the rest are an operation embedded within another store. Community pharmacies that operate from within a large supermarket seem to be under-recorded in OSM. 
  • There are about 2,500 specialist bicycle shops in the UK, and I have found 1,631 (65%) in the OSM database. The largest bicycle retailer, Halfords, has 465 branches across the UK, of which I found 364 (78%). I'm not sure how many of those offer bicycles, but OSM says that 151 of them do (41%). That must surely under-state the true figure.

And some examples where I expected coverage to be relatively low:

  • Figures suggest that there are about 1,400 pound stores in the UK. I've found 586 tagged with “variety_store”, and another 170 or so with alternative tagging. This means that tagging is inconsistent, but suggests that coverage is over 50% - i.e. more than I expected. Perhaps my estimate of the total is too low.
  • There are 8,500 Charity Shops in England, 900 in Scotland and 500 in Wales. I should find 9,900 in my data extract. Depending on how carefully I interpret the data, I can find between 1,751 and 1,994 (18-20%). Around 90% are tagged as “shop=charity” but there is a smattering of others tagged according to their specialisation: “shop=clothes”, “shop=secondhand” or “shop=books”



Primary tag value | OSM count (UK) | OSM count (England) | Estimated actual (UK) | Estimated actual (England) | Approx. coverage
pub | 34,937 | 31,180 | 48,000 | - | 73%
restaurant | 16,062 | 13,855 | 60,000 | - | 27%
fast_food | 15,762 | 13,794 | - | 41,295 | 33%
cafe | 13,137 | 11,280 | 16,501 | - | 80%
convenience | 13,108 | 11,212 | 48,303 | - | 27%
supermarket | 8,720 | 7,352 | 6,410 | - | 119%
post_office | 7,622 | 6,199 | 11,696 | - | 65%
hairdresser | 7,187 | 6,366 | 38,300 | - | 19%
fuel | 6,207 | 5,190 | 8,588 | - | 72%
bank | 5,946 | 5,089 | 8,961 | - | 66%
pharmacy | 4,871 | 4,225 | - | 11,647 | 36%
charity | 1,682 | 1,476 | 9,900 | 8,500 | 17%
bicycle | 1,631 | 1,406 | 2,500 | - | 65%
beauty | 1,543 | 1,361 | 13,000 | - | 12%
bookmaker | 1,386 | 1,242 | 9,128 | - | 15%
optician | 1,161 | 1,041 | 7,250 | - | 16%
florist | 983 | 885 | 8,000 | - | 12%
alcohol | 784 | 665 | 5,575 | 4,195 | 14%
variety_store | 586 | 528 | 1,400 | - | 42%
deli | 456 | 391 | 2,500 | - | 18%
seafood | 87 | 70 | 950 | - | 9%



It is interesting to look in more detail at how data users might interpret some specific examples.

Finding a pharmacist (i.e. someone who can dispense prescriptions) could be the basis of a useful application, and there have been various attempts to develop appropriate tagging, but the actual data is quite complex for data users to interpret.

Values of “pharmacy” and “chemist” can appear for “amenity” and “shop”; “dispensing” can be set to “yes”, “no” or sometimes the name of the outlet. And all of these can be combined in different ways, alongside other values of “amenity” and “shop”.

  • “amenity=pharmacy” alongside any value of “shop=*” and either “dispensing=yes” or no value for “dispensing”: this is in line with the various guidelines, and unambiguously indicates that prescriptions will be dispensed. This accounts for almost 90% of cases in the data
  • “shop=chemist” without any indication of “dispensing”: is correct tagging for a place where prescriptions will NOT be dispensed, but examining actual examples suggests that it is widely mis-used for pharmacies. So in practice it has to be regarded as ambiguous. It represents almost 10% of cases.
  • “amenity=pharmacy” with “dispensing=no”: is inconsistent tagging, and not in line with the guidelines, but can still be interpreted fairly confidently as a place where prescriptions will NOT be dispensed. It accounts for around 1% of cases
  • “shop=chemist” without “amenity=pharmacy”, and with “dispensing=no” is correct tagging, and unambiguously a place where prescriptions will NOT be dispensed. It only accounts for 0.1% of cases.
  • “shop=pharmacy”, “amenity=chemist”, with or without other values are examples of incorrect use of the tags, but small in volume (less than 0.5%), and often appear alongside a correct tag (e.g. “shop=pharmacy”+“amenity=pharmacy”): the incorrect tag values can safely be ignored by data users without sacrificing significant amounts of relevant data

The above figures are calculated from pharmacies and chemists recorded in the database. So it is worth recalling that this only accounts for 40% of actual pharmacies, and around 60% of these outlets do not appear in the database at all.

In practice data users are going to be reasonably confident that they have found a dispensing pharmacist where “amenity=pharmacy” is present, and “dispensing” is either absent, or set to anything other than “no”. They will have to treat “shop=chemist” as ambiguous in this context. In practice they will probably ignore everything else because the complexity of the logic increases out of all proportion to the quantity of reliable data that it can uncover. In summary they will confidently interpret 90% of the data in the database, and find just over one in three pharmacies. If they interpret the data more loosely they will be able to point their users to about 40% of real pharmacies. If they want to find more, then at present they will have to look elsewhere for their data.
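The interpretation rules described above can be written down as a short function. This is a sketch of how a data user might encode them, not an agreed standard: the function name and the four return values are my own, and treating “shop=chemist” with “dispensing=yes” as ambiguous is an assumption, since that combination isn't covered above.

```python
def dispensing_status(tags):
    """Classify a feature as 'dispensing', 'not_dispensing', 'ambiguous' or 'unknown'."""
    dispensing = tags.get("dispensing")
    if tags.get("amenity") == "pharmacy":
        # Absent, "yes", or anything other than "no" is read as dispensing.
        return "not_dispensing" if dispensing == "no" else "dispensing"
    if tags.get("shop") == "chemist":
        # Correct tagging for a non-dispensing shop, but widely mis-used,
        # so only an explicit dispensing=no is taken at face value.
        return "not_dispensing" if dispensing == "no" else "ambiguous"
    # shop=pharmacy, amenity=chemist, etc. are ignored: too rare to justify the logic.
    return "unknown"

examples = [
    {"amenity": "pharmacy"},
    {"amenity": "pharmacy", "dispensing": "no"},
    {"shop": "chemist"},
    {"shop": "chemist", "dispensing": "no"},
    {"shop": "pharmacy"},
]
for tags in examples:
    print(tags, "->", dispensing_status(tags))
```

An application that only wants confident answers would keep the "dispensing" results; one that prefers wider coverage would also show the "ambiguous" ones, with a caveat.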

Next we will look at how coverage by type of retail outlet might be used to provide useful feedback to contributors and data users.....