I recently published a series of articles on analyzing GPS data from personal sport tracking software using Python. I’ve started using R again lately and, while I like Python, I really like R. R is not a general-purpose language like Python, and is therefore far less commonly studied. It was mission-built for this type of work, and, as is often the case with custom-built tools, it is powerful, comfortable, and, frankly, fun to work with. I thought it might be interesting to compare the process in Python described in the previous articles with the process in R.
Libraries
In Python, data frames come from the pandas library, which is built on numpy, with matplotlib as the primary plotting tool. Data frames and vector processing are native to R. The dplyr library provides convenience functions for manipulating data. The amazing ggplot2 provides the functionality of matplotlib and seaborn and more, with simple syntax. The sf package, which stands for Simple Features, provides the geometry and geospatial functionality that geopandas and shapely provide in Python. I’ll use ggspatial here for basemap tiles where I used contextily in Python, and I’ll add patchwork for convenient side-by-side display.
library(gpx)
library(sf)
library(dplyr)
library(ggplot2)
library(ggspatial)
library(patchwork)
Syntax
A few notes about syntax to start off: R uses the <- arrow for assignment, but accepts = as well. Data frame slicing uses the same [row,column] approach with explicit or boolean values. Unlike pandas, which distinguishes between df.loc and df.iloc, in R you can slice using numeric indices or names without distinction. On a similar topic, R uses one-based indexing and ranges include the end, so df[1:2, ] in R is equivalent to Python’s df.iloc[:2], or more explicitly df.iloc[0:2], and the second row is accessed directly with df[2, ] instead of df.iloc[1].
There’s an important twist to this which will catch you multiple times. Even though the syntax is [row,column], if you supply only one numeric index, without a comma, it will be interpreted as a column index. So df[1] gets the first column, while df[1, ] gets the first row.
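The indexing rules above are easier to see on a small example. This is only a sketch: the data frame df and its columns id and name are invented for illustration and have nothing to do with the trek data.

```r
# Toy data frame, invented purely for illustration
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

first_two_rows <- df[1:2, ]        # rows 1 and 2; the end of the range is included
first_row      <- df[1, ]          # with the comma: a one-row data frame
first_col      <- df[1]            # without the comma: a one-column data frame
second_name    <- df[2, "name"]    # numeric row index mixed with a column name
big_ids        <- df[df$id > 1, ]  # boolean row selection
```

In pandas terms, df[1:2, ] plays the role of df.iloc[0:2], and df[df$id > 1, ] plays the role of df[df["id"] > 1].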
The main syntactical difference is R’s extensive use of piping, using the native pipe operator |> (available since R 4.1), or the older %>% from the magrittr package. This looks similar to accessing a series of an object’s methods through a chain of dots in Python, but it isn’t. The pipe in R works like the pipe in a Linux shell, simply passing the output of one function to the next function as its first argument. This is one of my personal favorite aspects of working in R, since it allows for natural expression of a series of steps which constitute a workflow. ggplot2 takes a similar syntactical approach, layering elements of the plot by chaining using the + operator.
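To make the pipe concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The vector x and the toy data frame are made up for illustration, the |> operator assumes R 4.1 or later, and the ggplot2 part only runs if that package is installed.

```r
# Made-up numbers, just to have something to pass along
x <- c(4, 9, 16)

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: each result becomes the first argument of the next call
piped <- x |> sqrt() |> mean() |> round(1)

# ggplot2 builds a plot in a similar spirit, adding layers with +
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  toy <- data.frame(x = 1:3, y = c(2, 4, 8))  # hypothetical data
  p <- ggplot(toy, aes(x = x, y = y)) +
    geom_line() +
    labs(title = "Toy plot")
}
```

Both forms compute the same value; the piped version simply reads in the order the steps happen.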
Loading the data
Let’s get started. With Python, we needed to parse the raw gpx data, which is in an XML format, into a CSV file, import that into a pandas data frame, and then turn that into a geopandas data frame. I used beautifulsoup for the parsing. R, fortunately, has a gpx library that allows us to go straight from gpx into a data frame. Let’s see what that looks like. The str() command will let us know what’s inside.
trek_data <- read_gpx("data/b3/Workout-2024-09-06-16-29-37.gpx")
str(trek_data)
List of 3
$ routes :List of 1
..$ :'data.frame': 0 obs. of 4 variables:
.. ..$ Elevation: logi(0)
.. ..$ Time : logi(0)
.. ..$ Latitude : logi(0)
.. ..$ Longitude: logi(0)
$ tracks :List of 1
..$ River Vale:'data.frame': 137 obs. of 6 variables:
.. ..$ Elevation : num [1:137] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
.. ..$ Time : POSIXct[1:137], format: "2024-09-06 16:29:37" "2024-09-06 16:29:37" ...
.. ..$ Latitude : num [1:137] 41 41 41 41 41 ...
.. ..$ Longitude : num [1:137] -74 -74 -74 -74 -74 ...
.. ..$ extensions: logi [1:137] NA NA NA NA NA NA ...
.. ..$ Segment ID: int [1:137] 1 1 1 1 1 1 1 1 1 1 ...
$ waypoints:List of 1
..$ :'data.frame': 0 obs. of 4 variables:
.. ..$ Elevation: logi(0)
.. ..$ Time : logi(0)
.. ..$ Latitude : logi(0)
.. ..$ Longitude: logi(0)
The result is not a data frame, but a list of three lists, each of which holds data frames. The second one, called tracks, is the only one with observations, so we can start with that. Don’t forget that R does not zero-index lists, so we use 2, not 1, and extract it with double square brackets.
trek_tracks <- trek_data[[2]]
str(trek_tracks)
List of 1
$ River Vale:'data.frame': 137 obs. of 6 variables:
..$ Elevation : num [1:137] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
..$ Time : POSIXct[1:137], format: "2024-09-06 16:29:37" "2024-09-06 16:29:37" ...
..$ Latitude : num [1:137] 41 41 41 41 41 ...
..$ Longitude : num [1:137] -74 -74 -74 -74 -74 ...
..$ extensions: logi [1:137] NA NA NA NA NA NA ...
..$ Segment ID: int [1:137] 1 1 1 1 1 1 1 1 1 1 ...
This gets us closer: now we have a list of one. Let’s pull that out and display the first two rows.
trek <- trek_tracks[[1]]
trek[1:2,]
Elevation Time Latitude Longitude extensions Segment ID
1 -4 2024-09-06 16:29:37 41.01128 -74.0101 NA 1
2 -4 2024-09-06 16:29:37 41.01128 -74.0101 NA 1
Note the comma, which is very important. If only one value is supplied, it chooses columns instead of rows.
head(trek[1:2], 2)
Elevation Time
1 -4 2024-09-06 16:29:37
2 -4 2024-09-06 16:29:37
And the final frame looks like:
str(trek)
'data.frame': 137 obs. of 6 variables:
$ Elevation : num -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
$ Time : POSIXct, format: "2024-09-06 16:29:37" "2024-09-06 16:29:37" ...
$ Latitude : num 41 41 41 41 41 ...
$ Longitude : num -74 -74 -74 -74 -74 ...
$ extensions: logi NA NA NA NA NA NA ...
$ Segment ID: int 1 1 1 1 1 1 1 1 1 1 ...
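The two positional extractions above can also be written by name, which may be easier to read later. The sketch below does not read a real file: mock is a hand-built list that only imitates the shape of the str() output shown earlier, with made-up values, so treat it as an illustration of list indexing rather than of the gpx package itself.

```r
# Hand-built stand-in for the nested list shown by str(); values are made up
mock <- list(
  routes    = list(data.frame()),
  tracks    = list(`River Vale` = data.frame(
    Elevation    = c(-4, -4),
    Latitude     = c(41.01128, 41.01128),
    `Segment ID` = c(1L, 1L),
    check.names  = FALSE  # keep the space in "Segment ID"
  )),
  waypoints = list(data.frame())
)

by_position <- mock[[2]][[1]]                   # second element, then its first element
by_name     <- mock[["tracks"]][["River Vale"]] # the same data frame, selected by name

wrapped <- mock[2]  # single brackets keep the outer list wrapper (a list of length 1)

# A column name containing a space needs backticks with $, or a quoted [[ ]]
segment_ids <- by_name$`Segment ID`
```

Single brackets return a shorter list, while double brackets return the element itself, which is why the double form is used when drilling down to the data frame.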
Importing a collection of treks
Now that I know “where” the information is, I can go ahead and import a series of files and combine them into a single data frame. As I did with Python, I will assign a unique identifier to each trek, and then combine them. The R equiva