Friday, 13 June 2014

Pre-processing water

Like roads, water can be represented as either a line or an area. Roads represented as lines tend to dominate those represented as areas, but areas of water play the more important role. Lines of water (at least in my part of the world) are almost all small streams. They add detail, but don't dominate. Dealing with the sea is a special case of dealing with water. It needs to be handled differently, but that's not difficult.

I find the quality of data on coastal landscape (beaches, cliffs, rocky outcrops) is a bit patchy. It's important to me that I get the big areas of water right (rivers, lakes and the coastline). I'm less concerned with the finer detail of streams etc. I spot the occasional oddity, but on the whole I find that the current data on inland water is as complete and as accurate as I need - if not more so. As a result I've not done any systematic examination of the quality of data on inland water. The only way that I've used to examine data on coastal landscape is by eye.

Some ways that describe water are assumed to represent areas by default; others are assumed to represent lines. To collect areas of water I extract all the ways marked 'natural=water', 'waterway=riverbank', 'pond', 'dock', 'boatyard', 'lake', 'reservoir', or 'estuary'. All of these are assumed to represent areas. I also extract anything tagged 'area=yes' alongside 'waterway=...whatever'. Because I'm selecting from a fixed list of values, I end up dropping some possible areas of water with this approach. That isn't very satisfactory. However, the list of potential "waterway" values is quite long, and ranges across features that should represent areas of water ("marina"), those that could represent either a closed line or an area ("moat"), those that probably don't represent water ("pumping-station") and those that probably need individual treatment ("weir", "slipway"). I can't see a simple approach that would handle all these cases robustly. Users who are particularly interested in water features would have to find one, but (so far) I don't.
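As a rough illustration, the selection rule described above could be sketched in Python as follows. This is not the code used for the map: the function name, the dictionary representation of tags, and the reading of 'pond', 'dock' and the rest as values of the "waterway" key are all assumptions made for the example.

```python
# Hypothetical sketch of the area-selection rule; names are illustrative only.
# Assumption: 'pond', 'dock', etc. are read as values of the "waterway" key.
AREA_WATERWAY_VALUES = {
    "riverbank", "pond", "dock", "boatyard", "lake", "reservoir", "estuary",
}


def is_water_area(tags: dict) -> bool:
    """Return True if a way's tags mark it as an area of water."""
    if tags.get("natural") == "water":
        return True
    waterway = tags.get("waterway")
    if waterway is None:
        return False
    if waterway in AREA_WATERWAY_VALUES:
        return True
    # Any other waterway value counts as an area only with an explicit area=yes.
    return tags.get("area") == "yes"
```

With this rule a way tagged "waterway=moat" is treated as an area only when it also carries "area=yes", which is exactly where some possible areas get dropped.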

In practice a few of the ways I select are not valid representations of an area, so I throw away anything that isn't a closed loop with more than three nodes and no self-intersection. What's left is easily converted to polygons, which are simplified and then added to a table of water areas.
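The validity test can be sketched in plain Python, without any GIS library. This is a minimal illustration under stated assumptions: a way is given as a list of (x, y) tuples, "more than three nodes" is read as more than three node references including the repeated closing node, and only proper crossings and repeated vertices are treated as self-intersection. The function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area test: >0 if c is left of a->b, <0 if right, 0 if collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _segments_cross(p1: Point, p2: Point, p3: Point, p4: Point) -> bool:
    """True if segments p1-p2 and p3-p4 cross at a point interior to both."""
    d1, d2 = _orient(p3, p4, p1), _orient(p3, p4, p2)
    d3, d4 = _orient(p1, p2, p3), _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def is_valid_ring(coords: List[Point]) -> bool:
    """Closed loop, more than three node references, no self-intersection."""
    if len(coords) <= 3 or coords[0] != coords[-1]:
        return False
    ring = coords[:-1]  # drop the repeated closing node
    if len(set(ring)) != len(ring):
        return False  # a vertex is visited twice, so the ring touches itself
    n = len(ring)
    for i in range(n):
        a, b = ring[i], ring[(i + 1) % n]
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last segments share the closing node
            c, d = ring[j], ring[(j + 1) % n]
            if _segments_cross(a, b, c, d):
                return False
    return True
```

The pairwise check is quadratic in the number of nodes, which is fine for a sketch but would be slow on very large ways; a spatial database or geometry library would normally do this test instead.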

Multi-polygons of water are assembled in the same ways as multi-polygons of buildings (see earlier post), using multi-polygon relations that carry similar water-related tags to the simple polygons described above. They are added to the same table as the simple polygons.

I don't load coastline, or the sea, from the standard OSM extract. It's quicker and easier to pull the shapefiles from here, and import them into postgis. I am only working on UK data so I could then delete most of the world's oceans. I'm not sure whether this would make a significant performance difference. I thought I might run into problems because I'm working off two different data extracts, but of course the coastline is pretty stable, and it hasn't caused me any difficulty in practice.

While I'm dealing with areas of water I also add 'natural=beach' and 'natural=wetland' areas to the same table as I use for water areas. It's a pragmatic approach, which I adopted because I've found it makes life easier later on if I can treat all these areas together.

Because I put water, beach and wetland into the same table I need to classify these differently. However, I've never felt it necessary to make any further distinction between different types of water area. I use the same basic style for all, whether they are tagged as "riverbank", "canal", "dock", "boatyard", "pond", "lake", "reservoir" or whatever. So I don't bother with further classification. The original tags are all retained in case I ever need them.
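A three-way classification of this kind is small enough to show in a few lines of Python. The function name and the class labels are assumptions for illustration, and the sketch assumes the feature has already been selected as water, beach or wetland.

```python
def classify_area(tags: dict) -> str:
    """Classify an already-selected area as 'beach', 'wetland' or 'water'."""
    natural = tags.get("natural")
    if natural == "beach":
        return "beach"
    if natural == "wetland":
        return "wetland"
    # Every kind of water area (riverbank, dock, lake, ...) shares one class.
    return "water"
```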

In principle I treat every other way that is tagged "waterway" as a linear feature. The vast majority of these are marked "stream", "drain", "river", "canal", or "ditch". The rest of the waterway values in the UK are a bit of a ragbag. Most have a very small number of examples. Some data users may need to deal with these more systematically, but I've never found the need. I classify the most common values in a separate column, so I can handle different styles easily. I use an "enum" data type to impose a closed set of values. Everything other than the most common values is classified as "other" so I can spot any important instances and deal with them manually. I've not needed to (so far). 
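The closed set of values with an "other" fallback can be mimicked in Python with an Enum. This is only an analogue of a database enum column, not the actual table definition; the class and function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class WaterwayClass(Enum):
    """Closed set of classes for linear waterways (illustrative only)."""
    STREAM = "stream"
    DRAIN = "drain"
    RIVER = "river"
    CANAL = "canal"
    DITCH = "ditch"
    OTHER = "other"


def classify_waterway(value: Optional[str]) -> WaterwayClass:
    """Map a raw waterway tag value onto the closed set of classes."""
    try:
        return WaterwayClass(value)
    except ValueError:
        # Anything outside the common values is kept, but flagged as OTHER.
        return WaterwayClass.OTHER
```

Filtering on the OTHER class then gives a short list of unusual features to inspect by hand.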

I add a "label" column to the tables for both areas of water and linear waterways. In principle I could do all sorts of fancy stuff with this, but at present it simply contains the value of any "name" tag. 

There are some questions I haven't addressed when dealing with the water data at a relatively large scale. I'm not very happy with the way that I distinguish between areas and linear features, though it seems to work well enough in practice around here. I haven't found a very satisfactory way of dealing with the transition where a river moves from being represented as an explicit area to being represented symbolically by a line. I've also realised that the data on coastal landscape is a bit patchy around here, so there's some map improvement work to be done.

As I wrote this I noticed some errors in the tests I've been using to extract linear water features. Some of the ways that represent areas of water have ended up included as linear ways as well. I hadn't noticed that earlier, so it obviously hasn't been much of a problem. However, I'm sure I can't be the only data user who is post-processing data in this kind of way, and I think that goes to show that some collective effort on post-processing techniques would benefit us all.


malenki said...

To the list of water bodies you should add landuse=reservoir.

> I haven't found a very satisfactory way of dealing with transition where a
> river moves from being represented as an explicit area, to be represented
> symbolically by a line.

The usual and approved mapping standard is to map the river as waterway=river, since this is the fastest way, and to add waterway=riverbank as an extra, which takes more time. So theoretically there should be no problem of a transition from way to area and vice versa. But sadly, in practice I've already found (and enhanced) waterways which ended on riverbanks or natural=water and continued on the other side.

gom1 said...

Thanks Malenki.

Simply, yes to your comment on reservoirs.

As you say, all the rivers around here are represented as a series of areas. These areas ("riverbank") continue from the sea as long as the river is fairly wide.

Sometimes these separate areas don't join up. I view this as a mapping error, which I have fixed when I know the river.

Eventually the rivers get narrow, and there is a point where they are only represented by a line. The line might be tagged as a stream or a river at this point. Contributors can differ on whether they are plotting a stream or a river. This introduces inconsistencies.

And there is (almost) always a point where the representation of a river changes from an area to a line. It is very difficult to choose a line width for streams / rivers which consistently matches the width of the river area at the point where this change occurs.

So a map of a small area at large scale often has a step in the representation of the width. And across a large area, at a small scale, a collection of rivers might look more or less dense where they are represented as lines rather than areas.

The simple solution to this is to use a pale colour for rivers so they become less visible. But I wonder if there is a better way.

gom1 said...

Thanks for the tip on boatyard. Missed that one!