Thursday 27 June 2013

Looking at where photos are taken

The photographs that we take tell a story about where we have been, what caught our eye, and the memories we want to keep and share with others. 

Photo sharing web sites, such as Flickr, hold a considerable amount of information about the pictures that have been taken in a particular area. Looking at where others chose to record their visit could help us to plan ours. If the information says anything about the experience of visitors then it might even inform discussion on how to promote tourism, or identify priorities for helping visitors to get the most out of their visit.

I've been experimenting with different ways of looking at location information from photographs across a variety of places. Here's an example for Durham - a city that I used to know well, but which I haven't visited for a while (too long).


It's no surprise that most activity falls around the well-known features. The river and bridges attract attention, and some well-known viewpoints also stand out. Beyond that, it needs more local knowledge to distinguish between less well-known features and anomalies in the data. Looking at the results for my own town, we can see several picturesque locations that are known to locals, but seem to be largely ignored by visitors. There are other features that seem to catch the eye of visitors, but which we tend to take for granted.

If this looks as though it could be interesting, then here's how it was done.

Firstly, I extracted data from Flickr, using the API method "flickr.photos.search" with a suitable bounding box, and a list of "extras" to pull coordinates and related information. This extracted data for about 28,000 pictures. However, it's pretty obvious that contributors are not always as careful as they might be about recording the correct location of their photograph. You can't eliminate this problem entirely, but it is possible to improve data quality by being a bit selective, and tweaking the process later can help to distinguish between noise and useful information.
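As a rough illustration of that first step, here is a minimal Python sketch that builds a flickr.photos.search request URL. It is not the code I used. The bounding box values, the choice of extras, and the function name are assumptions made for the example, and you would need your own API key.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def build_search_url(api_key, bbox, page=1):
    """Build a flickr.photos.search URL for geotagged photos inside bbox.

    bbox is (min_lon, min_lat, max_lon, max_lat) in decimal degrees.
    """
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "bbox": ",".join(str(v) for v in bbox),
        "has_geo": 1,
        # "geo" asks for latitude, longitude and accuracy with each photo.
        "extras": "geo,date_taken,owner_name",
        "per_page": 250,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)


# Illustrative box roughly around central Durham (an assumption, not my real box).
DURHAM_BBOX = (-1.60, 54.76, -1.55, 54.79)
url = build_search_url("YOUR_API_KEY", DURHAM_BBOX)
print(url)
```

Results come back a page at a time, so a real extraction would loop over the page number. As far as I can tell from the Flickr documentation, geo queries also expect some limiting parameter such as a date range, so a working request would probably need one of those as well.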

Flickr records the precision of the location information that contributors provide. They call it "accuracy", based on the scale of the map that the contributor was using when they marked the position. If they were zoomed into the finest detail, then accuracy is set to 16. If they were trying to position the picture within a wider area, then accuracy is set to a lower number. I ignored any image where the accuracy of the location was less than 15. This doesn't guarantee that the location is accurate, but it does imply that the contributor has gone to the effort of zooming into detail before marking the location. So it eliminates cases where the contributor has taken a rough stab at the location from a map of a large area. 
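The accuracy filter is simple enough to sketch in a few lines of Python. The sample records below are made up, and the field names assume the photo dictionaries that come back when "geo" is included in the extras.

```python
MIN_ACCURACY = 15  # the threshold described above


def precise_enough(photo, min_accuracy=MIN_ACCURACY):
    """True if the contributor was zoomed in far enough when placing the photo."""
    # int() copes with the value arriving as either a string or a number.
    return int(photo["accuracy"]) >= min_accuracy


# Hypothetical records, for illustration only.
photos = [
    {"id": "1", "accuracy": "16"},
    {"id": "2", "accuracy": 15},
    {"id": "3", "accuracy": "11"},
]

kept = [p for p in photos if precise_enough(p)]
print([p["id"] for p in kept])
```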

This still leaves the common problem of contributors who go to the trouble of picking a precise location, then attach a large batch of different images at the same point. Where there was a batch of images from the same contributor at exactly the same location I only used the first image, and discarded the others. That way any clusters are going to be formed either by multiple contributors marking very similar locations, or by the same contributor taking the trouble to mark different locations in close proximity.
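Here is a small Python sketch of that de-duplication step, assuming each record carries "owner", "latitude" and "longitude" fields. The records are invented for the example.

```python
def first_per_owner_location(photos):
    """Keep only the first photo for each (owner, latitude, longitude) combination."""
    seen = set()
    kept = []
    for photo in photos:
        key = (photo["owner"], photo["latitude"], photo["longitude"])
        if key not in seen:
            seen.add(key)
            kept.append(photo)
    return kept


# Hypothetical records: owner "a" has dropped three photos on one spot.
photos = [
    {"id": "1", "owner": "a", "latitude": "54.7732", "longitude": "-1.5763"},
    {"id": "2", "owner": "a", "latitude": "54.7732", "longitude": "-1.5763"},
    {"id": "3", "owner": "a", "latitude": "54.7732", "longitude": "-1.5763"},
    {"id": "4", "owner": "b", "latitude": "54.7732", "longitude": "-1.5763"},
    {"id": "5", "owner": "a", "latitude": "54.7751", "longitude": "-1.5760"},
]

unique = first_per_owner_location(photos)
print([p["id"] for p in unique])
```

Note that a second contributor at exactly the same spot is kept, because the key includes the owner.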

By this stage my sample was down to about 1,600 recorded locations. Some of that data is still noise, but much is clustered around real points of interest.

Analysing clusters depends on calculating the distances between points, so I converted latitude and longitude from Flickr to Ordnance Survey eastings and northings. That way, distances are calculated in metres rather than degrees.
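To show why the conversion matters, here is a Python sketch that uses a simple local flat-earth approximation. This is a stand-in, not the Ordnance Survey transformation: it assumes a spherical Earth and an origin I have picked near the centre of Durham, which is good enough over a few kilometres to make the point.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean radius of a spherical Earth (an approximation)


def to_local_metres(lat, lon, lat0, lon0):
    """Approximate offsets in metres (east, north) from the origin (lat0, lon0)."""
    east = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    north = math.radians(lat - lat0) * EARTH_RADIUS_M
    return east, north


# Illustrative origin roughly in central Durham (an assumption).
ORIGIN = (54.776, -1.576)

# Move 0.001 degrees east, then 0.001 degrees north, and compare the distances.
east, _ = to_local_metres(54.776, -1.575, *ORIGIN)
_, north = to_local_metres(54.777, -1.576, *ORIGIN)
print(round(east, 1), round(north, 1))
```

At this latitude a thousandth of a degree is roughly 111 metres going north but only about 64 metres going east, so distances measured directly in degrees would be stretched in one direction.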

I then fed the data into R so that I had access to a choice of different clustering algorithms. There is a wide selection of these in R, and I experimented with a few, but ended up using DBSCAN, basically because it seems to work reasonably well with this kind of data.

It's a bit more complicated than this, but in principle DBSCAN considers two points as being within the same cluster if they are close to each other. You have to define "close together". I did this by trial and error, and ended up setting "epsilon" to 20 metres. If the figure is set too high, points tend to merge into huge clusters. If the figure is too low then you end up artificially distinguishing between small clusters at a level of detail that the data quality doesn't justify. 
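The effect of epsilon can be seen with a simplified Python sketch. This is not DBSCAN itself, just the "in principle" version described above: points are linked if they are within epsilon of each other, and linked points form a group. The coordinates are made up, in metres.

```python
import math


def count_groups(points, eps):
    """Count groups when points within eps metres of each other are linked."""
    unvisited = set(range(len(points)))
    groups = 0
    while unvisited:
        # Start a new group from any remaining point and follow the links.
        stack = [unvisited.pop()]
        groups += 1
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= eps]
            for j in near:
                unvisited.remove(j)
                stack.append(j)
    return groups


# Two hypothetical points of interest about 50 metres apart.
points = [(0, 0), (5, 0), (0, 5), (8, 8),
          (60, 0), (65, 5), (70, 0)]

for eps in (2, 20, 60):
    print(eps, count_groups(points, eps))
```

With these points, a very small epsilon leaves every point on its own, 20 metres picks out the two groups, and 60 metres merges everything into one.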

In effect the whole approach relies on the assumption that dense clusters of activity are separated by corridors of less dense "noise". In areas where large features with thinly spread activity are combined with smaller features with dense activity, this approach doesn't work very well. Consider the case of a stately home with large grounds, for example. Even if there is a similar level of activity in both house and gardens, the house will tend to appear as a cluster, while the grounds do not, because the results are spread over a much larger area and just end up looking like noise.

The other parameter that DBSCAN needs is a figure for the minimum number of points in a cluster. This has two effects. Firstly it helps to eliminate "noise" in the data. For example, a minimum cluster size of 2 means that any arbitrary freestanding point would not be considered to be part of a cluster.

But there is also a side-effect which I suspect is more important with this kind of data. If two "real" clusters come fairly close together, and there is a bit of "noise" between them, they will tend to merge into a single cluster. Setting a reasonable threshold for cluster size avoids the risk that a few isolated points form a bridge between two clusters that should really be separate.

There is a related issue that occurs in some other areas, but not in Durham. Long, thin clusters of points can occur along a well-known path. Linking all the pieces of path together depends on a fairly consistent line of points, but normally the density of points along a path will vary. Here a low threshold for minimum cluster size can help to avoid things breaking up. Across the centre of Durham the density of activity is quite high and quite consistent, so in this case, using trial and error, I picked a relatively large figure of 5 as the minimum cluster size.
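The bridging effect is easier to see with a small example. What follows is a minimal pure-Python sketch of DBSCAN, not the R implementation I used. It counts a point as one of its own neighbours, which is an assumption about the convention, and the coordinates are invented: two tight groups of five points with three stray points strung between them.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks unclustered points."""
    labels = [None] * len(points)

    def neighbours(i):
        # Includes point i itself (distance zero).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_pts:  # only dense points extend the cluster
                queue.extend(nearby)
    return labels


group_a = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
bridge = [(22, 1), (40, 1), (58, 1)]  # stray points between the two groups
group_b = [(76, 0), (80, 0), (76, 4), (80, 4), (78, 2)]
points = group_a + bridge + group_b

loose = dbscan(points, eps=20, min_pts=2)
strict = dbscan(points, eps=20, min_pts=5)
print(loose)
print(strict)
```

With a minimum of 2 the stray points chain the two groups into a single cluster. With a minimum of 5 the strays are too sparse to extend a cluster, so the two groups stay separate and the middle stray is left as noise.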

Finally I loaded the list of points into a PostGIS database, and used a concave hull to draw the cluster outlines. I then plotted the final result using QGIS.
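For anyone who wants to try the outline step without a spatial database, here is a Python sketch that draws a convex hull around one cluster. To be clear, this is a simpler stand-in and not what I did: a concave hull follows the shape of the points more tightly, whereas a convex hull is the "rubber band" around them. The points are invented.

```python
def convex_hull(points):
    """Return the convex hull as a list of vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b turns left (counter-clockwise).
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


# One hypothetical cluster, in metres.
cluster_points = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
outline = convex_hull(cluster_points)
print(outline)
```

The interior point drops out and the four corners remain as the outline.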

Suggestions for improvement are welcome.