Consistency
One way to assess tagging consistency is to examine differences in tagging across similar outlets of the larger retail chains. Contributors don't always agree on how to tag similar shops, they don't always follow the guidelines, the guidelines aren't static, and they aren't always consistent. Regardless of what the documentation might say, and the merits of any minority view, in practice data users will have to follow the consensus that has been adopted by the majority of contributors.
In principle crowd-sourcing will end up tagging most of a retail chain with the “correct” tag. By examining variations in tagging across a retail chain we can get an idea of the proportion of outlets that have been tagged according to the consensus, and how many fall outside the consensus. Data users will be able to accommodate variations, to some extent, but they won't be able to accommodate all of them.
- In the case of banks, for example, there is very little variation in tagging: 100% of Barclays, HSBC, NatWest, and Lloyds / TSB branches are tagged “amenity=bank”.
- In other sectors, Subway doesn't fall far behind the consistency of banks at 95% tagged “amenity=fast_food”.
- At the other extreme there are more challenging examples. Wilkinson's seems to be one of the more difficult chains for contributors to classify: “department_store” is the most common choice, but only accounts for 25% of examples. “hardware”, “variety_store”, “doityourself”, “supermarket”, “general”, “convenience”, “household” and “houseware” are also popular. Robert Dyas, with a similar retail format, faces similar difficulties.
- In the case of Halfords 42% of branches are tagged “shop=bicycle” (which doesn't really capture their business format) and the rest use a wide variety of tags.
- Argos has 40% tagged “shop=catalogue” and the rest a variety.
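For anyone who wants to repeat this kind of check, the measure behind the percentages above can be sketched in a few lines of Python. This is only a rough illustration: the function name `consensus_share` is my own, and the sample list is made up to mimic a chain where the most common value covers a quarter of outlets. It is not real data for any chain.

```python
from collections import Counter


def consensus_share(tag_values):
    """Return (most common tag value, share of outlets that use it)."""
    counts = Counter(tag_values)
    value, n = counts.most_common(1)[0]
    return value, n / len(tag_values)


# Hypothetical sample of 20 outlets from one chain (illustrative only).
sample = (
    ["department_store"] * 5
    + ["hardware"] * 4
    + ["variety_store"] * 4
    + ["doityourself"] * 3
    + ["supermarket"] * 2
    + ["general"] * 2
)

value, share = consensus_share(sample)
print(f"{value}: {share:.0%} of outlets follow the consensus")
```

A chain like the banks would score close to 100% on this measure, while a hard-to-classify chain ends up with a low share spread across many values.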
In general the most specialised chains tend to be tagged more consistently, and the most consistent tagging of all is found within chains of smaller outlets, with a well-established, widely understood, unambiguous specialisation (“estate_agent”, “funeral_directors”, “hairdresser”, “toys”, “optician”, “laundry”, and “travel_agency”).
Less consistent tagging is found in chains where the specialisation is more ambiguous (“gift”, “catalogue”, “accessories”).
There are many shops that are not part of a chain, and we can't easily assess how consistently they are tagged. But if we assume that the pattern of tagging inconsistencies across retail chains is repeated across the whole of the retail market, then we can get some idea of how consistent tagging might be overall. In practice there tends to be more consistency across larger chains, and less across smaller chains, so results vary according to how widely we cast the net. As a broad indication we should probably anticipate that something in the region of 20% of retail outlets have been tagged with a value that differs from the one that the majority of contributors would choose (and hence the value that data users would have to expect).
Some variation is inevitable: retail business models evolve over time, and vary from place to place; different contributors place different emphasis on different characteristics; tagging guidelines change as they are refined. However, if 20% of existing data is tagged with a value different to the one that most contributors would choose, then across England there are almost 30,000 retail premises in the database that data users will find it hard to recognise, and which should perhaps be brought more into line. After the 385,000 missing retail premises, it seems to me that this must rank as the second largest data quality issue.
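To put the two problems side by side, here is a back-of-envelope calculation using only the figures quoted above. The number of mapped premises is not quoted directly; it is inferred here by treating "almost 30,000" as exactly 20% of the mapped total. I am also assuming that mapped plus missing premises gives the full total for England. Both are assumptions, so treat the results as rough.

```python
# Figures quoted in the text (England only); all of them are approximate.
MISTAGGED = 30_000   # "almost 30,000" premises tagged against the consensus
MISTAGGED_PCT = 20   # "something in the region of 20%"
MISSING = 385_000    # retail premises not in the database at all

# Inferred, not quoted: premises already mapped, if 30,000 is 20% of them.
mapped = MISTAGGED * 100 // MISTAGGED_PCT

# Assumption: every retail premises is either mapped or missing.
total = mapped + MISSING

print(f"implied mapped premises: about {mapped:,}")
print(f"missing from the database: {MISSING / total:.0%} of all premises")
print(f"mapped but inconsistently tagged: {MISTAGGED / total:.0%} of all premises")
```

On these assumptions the missing premises outnumber the inconsistently tagged ones by more than ten to one, which is why inconsistency ranks second rather than first.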
Synonyms
Many community discussions of tagging inconsistencies revolve around synonyms. The controversy often lies in deciding when different contributors are using different terms to describe exactly the same thing, and when they are using different terms to describe subtle differences. Any list of synonyms invites debate, but examples that are unlikely to be controversial, and where the difference is more than a spelling mistake, would probably include travel_agent / travel_agency, newspaper / newsagent, jewellery / jewelry, and deli / delicatessen. I suspect that most would also count baby / baby_goods, seafood / fish / fishmonger, bathroom_furnishing / bathroom, beauty_salon / beauty, etc. as true synonyms.
If this is anywhere near a complete list, then true synonyms do not look like a significant problem across all retail data. Including spelling mistakes they account for fewer than 1% of all shops in the database. However, they represent a higher proportion of data within some categories of shop, and they can account for a significant proportion of the more unusual categories.
The retail categories where synonyms are likely to present the greatest problem are where they account for a significant proportion of an important category. Everyone will have different ideas of what makes a proportion significant, and a category important, so it is worth considering a couple of real examples.
I reckon there are about 2,500 delicatessens in the UK, and I can find just under 500 in the database. Of those, 456 are tagged “shop=deli”, and 39 are tagged “shop=delicatessen”. Any data user who searches for “shop=deli” will miss 39 delicatessens in the database with the “wrong” tag value, and will miss about 2,000 delicatessens that aren't in the database at all (or at least not with a recognisable tag). Of the two, the bigger problem is surely the 2,000 missing delicatessens.
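The delicatessen arithmetic can be laid out as a small sketch. The function name `search_gaps` is hypothetical, and the 2,500 total is my own estimate rather than a measured figure, so the "not in database" number is only as good as that estimate.

```python
def search_gaps(estimated_total, tag_counts, searched_value):
    """Split an estimated real-world total into found, mis-tagged and unmapped."""
    in_database = sum(tag_counts.values())
    found = tag_counts.get(searched_value, 0)
    return {
        "found": found,
        "missed_by_synonym": in_database - found,
        "not_in_database": estimated_total - in_database,
    }


# Counts from the text: 456 tagged shop=deli, 39 tagged shop=delicatessen.
deli = search_gaps(2500, {"deli": 456, "delicatessen": 39}, "deli")
print(deli)
```

The synonym costs a search for “shop=deli” 39 results, while the gap in coverage costs it about 2,000.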
On the other hand, some synonyms have a more balanced mix of values. Out of 950 independent fishmongers in the UK, 80% haven't been recorded at all. Of the 20% of fishmongers that are in the database, 47% are tagged “shop=seafood”, 44% are tagged “shop=fishmonger”, and 9% are tagged “shop=fish”. This is more problematic, because anyone who looks for just one of the values is going to miss about half of the available data. Nevertheless, I suspect that anyone who is thinking of trawling the data for a fishmonger is still going to be scuppered by the 80% that are missing from the database altogether, not the inconvenience of testing for two or three different tag values.
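The same point can be shown by adding up how much of the data a user recovers as they test for more values. This is a minimal sketch using the percentages quoted above; the variable names are my own, and the 950 total is an estimate.

```python
from itertools import accumulate

ESTIMATED_TOTAL = 950   # independent fishmongers in the UK (estimate from the text)
MAPPED_PCT = 20         # share of them present in the database
TAG_PCT = {"seafood": 47, "fishmonger": 44, "fish": 9}  # split of the mapped 20%

# Test for the most common value first, then add the others one at a time.
ordered = sorted(TAG_PCT.items(), key=lambda item: item[1], reverse=True)
cumulative = list(accumulate(pct for _, pct in ordered))

mapped = ESTIMATED_TOTAL * MAPPED_PCT // 100
print(f"fishmongers in the database: about {mapped}")

for (value, _), pct_of_mapped in zip(ordered, cumulative):
    pct_of_all = pct_of_mapped * MAPPED_PCT / 100
    print(
        f"after adding shop={value}: {pct_of_mapped}% of mapped data, "
        f"{pct_of_all:.1f}% of all fishmongers"
    )
```

Even with all three values tested, the user cannot get past the 20% that has actually been mapped.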
I reckon that even within the more problematic categories the issues with synonyms aren't difficult to manage. Where data volumes are small, it is not difficult to fix the data. Where data volumes are large, and one value is dominant, then data users who don't look for a synonym will only lose a small proportion of the data. Where volumes are large and synonyms equally matched then keen data users will go to the trouble of testing for several different values.
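For a data user who does want to test for several values, one simple approach is to fold each group of synonyms onto a single value before counting or searching. The sketch below is one way of doing that, not an established method. The two groups come from the examples in this post, the choice of the most-used value as the canonical one is my assumption, and `build_canonical_map` and `normalise` are hypothetical helper names.

```python
# Each group maps tag value -> usage weight, taken from the figures in this post.
SYNONYM_GROUPS = [
    {"deli": 456, "delicatessen": 39},              # counts in the database
    {"seafood": 47, "fishmonger": 44, "fish": 9},   # percentages, used as weights
]


def build_canonical_map(groups):
    """Map every value in a group to the most widely used value in that group."""
    mapping = {}
    for group in groups:
        canonical = max(group, key=group.get)
        for value in group:
            mapping[value] = canonical
    return mapping


def normalise(values, mapping):
    """Replace known synonyms with their canonical value; leave others alone."""
    return [mapping.get(value, value) for value in values]


mapping = build_canonical_map(SYNONYM_GROUPS)
print(normalise(["delicatessen", "fish", "fishmonger", "bakery"], mapping))
```

Picking the dominant value follows the same logic as the rest of this post: data users end up following whatever the majority of contributors have chosen. Where two values are almost evenly matched, as with seafood and fishmonger, the choice is close to arbitrary.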
Problems with spelling and synonyms are not difficult to fix, but they are relatively small in number, so not the highest priority. The bigger challenge is to achieve greater consistency in the choice of tag for similar shops. The data can provide some pointers on how to do that, but they will wait for the next post.