The city that I live in, Wrocław, provides GPS locations of trams and buses. It’s really cool since at least few mobile apps use it now to provide users more information about their everyday commute. Especially here it’s really helpful — we have a problem with public transport delays, but it’s a topic for a different post.
We’ve been collecting this data with friends (Radek Czemerys, Mateusz Dryzek) for about 9 months hoping, that proper analysis of it may result in interesting conclusions (especially that we also keep track of the delay of every vehicle). If you would like to have a look at our dataset and maybe play with it a little, let me know.
Lots of points
We take a snapshot of all vehicles’ locations every minute and save it our database. Our data shows that on average there’re about 200 vehicles driving around the city (with an assumption that every vehicle has a GPS tracker).
So for every day we have about 300 000 locations (200 vehicles * 24 hours * 60 minutes). Each location also holds data about course number (unique route from first stop to last), type of vehicle (tram/bus), vehicle number, direction (final destination) and delay (updated on every stop).
Let’s visualize data from one month for just one line: tram 1. It’s about 175 000 points. I used awesome carto.com for this visualization, do check it out if you work with spatial data, it makes analyzing and visualizing it really comfortable.
Color of point depends on the direction of the vehicle. As you can see from the map and the table below, two destinations (red and green) are more popular than two others. You can see them on the map in top left and bottom right corners.
Direction Number of points 23022 91190 24416 79777 2 2339 24404 193
Let’s have a closer look at selected parts of the map (including first two directions) below. With more zoom, you can see that a) it’s really lots of points b) sometimes they are not so accurate (I hope, in another case some of the trams ended their journey in the Odra river).
Give me the route
I’d like to get a polyline best representing route of the line heading to specific direction. There are at least few ways I can do it:
a) Go to the website of Municipal Transport Company and look for it
Sadly all I got is a pdf with the schema of public transport in Wrocław and it’s not accurate.
b) Go to the website providing public data of Wrocław and get GTFS files
These files describe public transport in Wrocław quite well, but not all GTFS files are provided, for example shapes.txt, file which contains rules for drawing vehicle routes on the map.
To be fair I could take list of all stops of a given line and just connect them, but the accuracy of that polyline wouldn’t be acceptable.
c) Track one course of each line and direction, connect the points et voilà, I’d have all the routes I wanted
This approach is not bad, but I wouldn’t really trust it:
– GPS accuracy is quite bad sometimes, so taking just one signal every xseconds/minutes seems dangerous
– trams and buses sometimes take different route due to accidents that happen on their original route
d) Take the average path between all the points we have (our choice)
This approach will let us ignore the outliers, in our case points that are outside of wanted route due to a) GPS lack of accuracy b) detour. Also, finally we can have some use of our gigabytes of data, so let’s do it!
After digging some information about approaches to this kind of problem, I’ve found few possibilities:
It’d give me a curve with minimum distance to each location. I need a polyline, but it wouldn’t be a problem to sample it afterward. This approach felt a bit too complicated for our problem, maybe next time.
b) Map matching
It’d let me match my locations to the map of roads in Wrocław. It seems great, has some open source implementations and I will probably use it someday, but this time I’d like to focus on averaging thousands of points, while this approach works well even with one sequence of positions.
c) Cluster analysis (our choice)
Let’s group similar objects (in our case similarity is based on the distance between locations), then find centers of found groups and make a polyline between these points.
There are dozens of clustering algorithms. Everyone probably heard about k-means, which is one of the simplest yet often used clustering. It has few drawbacks, like the fact that you have to specify the number of clusters or that by default it works with euclidean distance and our planet is, well, not flat (or is it?). Let’s use a different algorithm: Density-based spatial clustering of applications with noise, or as everyone calls it…
It’s perfect for us, because:
– it can work with any distance function (we’ll use Haversine)
– it doesn’t need the number of clusters as input, we describe clusters using other parameters (more info later)
– it can ignore noise/outliers
I will use python for working with data and visualizing our results. Full code is available on GitHub here. Python is not my primary weapon, so if you have any remarks or hints about my .py code quality, let me know.
Our data is stored in CSV file (which you can get here). It contains 5000 locations of tram number 1 (with one selected direction):
Latitude, Longitude 51.124065, 17.041012 51.124054, 17.040968 51.124054, 17.041037 51.12404, 17.04099 51.124054, 17.040976 (...)
Let’s load it into pandas dataframe and plot all the points on a map:
df = pandas.read_csv('locations.csv', index_col = False, header=0) superMap = folium.Map(location=[51.107885, 17.038538], zoom_start=14, tiles='OpenStreetMap') drawPointsOnMap(superMap, df.values, '#3186cc', 20)
We clearly see a line which we would like to get. Our hope is that it’s going to be represented by (let’s say) less than 100 points, not 5000. Also, we see a detour on our map:
Compared to the main route, there’re really just a few points on the detour route. We hope that DBSCAN is going to ignore these points.
Now, let’s use dbscan algorithm to create clusters. We will use some snippets from article Clustering to Reduce Spatial Data Set Size written by Geoff Boeing (thanks, Geoff!). We will show centers of clusters as red points:
clustersCenters = getDbScanClustersCenters(df, 0.03, 2) drawPointsOnMap(superMap, clustersCenters, '#ff0000', 30)
Ooops, it doesn’t look so well. Detour points are part of about 15 clusters and we expected to get a bit more points in other parts of the route. Let’s take a look at that line:
clustersCenters = getDbScanClustersCenters(df, 0.03, 2)
Numerical values here are parameters of DBSCAN algorithm:
- 0.03 is the maximum distance (in km, so here it’s 30 meters) that objects can be away from each other while being in the same cluster
- 2 is the minimum number of objects that cluster has to consist. If it has less, it’s considered to be noise
Let’s increase 2 to 10. If cluster will contain less than 10 objects, it’s going to be ignored:
clustersCenters = getDbScanClustersCenters(df, 0.03, 10)
Way better! You can play with parameters to get more accurate representation, for me this is okay.
Let’s make a line
We now have all the points that will create our line, it would be good to have a starting point. We could just find out where all the trams go (direction value) and use this place, but let’s think how we would do it having only cloud of GPS points with no specific order.
To find it, we are going to take a random red point, mark it as visited (so ignore it later), find point closest to it, mark it as visited, find point closest to the second point, mark it as visited… at the end, hopefully, we are going to get a point located at the end of the line.
To get a line, we should just start with found beginning point and connect it to the closest point, then connect the next point to its closest neighbor, then again…
Above you can see the result polyline, it contains 55 points and is exactly what we wanted. It represents the cloud of points well and is comparable to something that a person would draw based on those almost 100 000 points we had for this (line, direction).