Accessing Wikimedia Pageviews Data

Oliver Keyes

2020-05-10

The Wikimedia websites (such as Wikipedia) are visited by hundreds of millions of people a year, and so the open datasets of pageviews contain useful information on the subjects that interest people around the globe.

The Wikimedia Foundation recently launched a new API for this data that is officially supported and distinguishes between different types of users and different types of traffic. The pageviews package serves as a client for that API.

Per-article data

The most granular data available is on a per-article basis, and can be accessed with article_pageviews. This takes a project, in the form language.project_class, an article title (with or without spacing), start and (optionally) end dates, specified as R date objects, and (should you choose) the platform and user type to return. By default, it reformats the resulting data as a data.frame:

library(pageviews)
str(article_pageviews(project = "de.wikipedia", article = "R_(Programmiersprache)",
                      start = as.Date("2015-11-01"), end = as.Date("2015-11-02"),
                      user_type = c("user", "automated"),
                      platform = c("desktop", "mobile-web")))
## 'data.frame':    8 obs. of  8 variables:
##  $ project    : chr  "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
##  $ language   : chr  "de" "de" "de" "de" ...
##  $ article    : chr  "R_(Programmiersprache)" "R_(Programmiersprache)" "R_(Programmiersprache)" "R_(Programmiersprache)" ...
##  $ access     : chr  "desktop" "desktop" "mobile-web" "mobile-web" ...
##  $ agent      : chr  "user" "user" "user" "user" ...
##  $ granularity: chr  "daily" "daily" "daily" "daily" ...
##  $ date       : POSIXct, format: "2015-11-01" "2015-11-02" ...
##  $ views      : num  207 444 86 75 0 0 0 0

With the user_type and platform arguments you can include or exclude spiders and other automata, and switch between overall pageviews, pageviews to the desktop site, or pageviews to the mobile site or app. Both user_type and platform accept either a single string or a character vector; when vectors are supplied, every combination of user_type and platform is queried and returned.
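
As a rough illustration of why the call above returns 8 rows, the combinations can be enumerated in base R. This is only a sketch of the arithmetic, not a description of how the package builds its queries internally; the variable names are made up for the example.

```r
# Values taken from the article_pageviews() call shown above.
user_types <- c("user", "automated")
platforms  <- c("desktop", "mobile-web")

# Every pairing of user type and platform (2 x 2 = 4 combinations).
combos <- expand.grid(user_type = user_types, platform = platforms,
                      stringsAsFactors = FALSE)

# Number of days in the requested range, both ends included.
n_days <- length(seq(as.Date("2015-11-01"), as.Date("2015-11-02"), by = "day"))

# One row per combination per day.
expected_rows <- nrow(combos) * n_days

print(combos)
print(expected_rows)
```

Four combinations over two days give the 8 observations seen in the str() output.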

Per-project data

Per-project data can also be retrieved using (you guessed it) project_pageviews. This looks very similar to article_pageviews with one major difference: you can specify the granularity of the data, returning either daily or hourly data. The platform, user type and start/end date options are identical.

str(project_pageviews())
## 'data.frame':    1 obs. of  7 variables:
##  $ project    : chr "wikipedia"
##  $ language   : chr "en"
##  $ access     : chr "all-access"
##  $ agent      : chr "all-agents"
##  $ granularity: chr "daily"
##  $ date       : POSIXct, format: "2015-10-01"
##  $ views      : num 2.72e+08
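
A request for hourly data might look like the sketch below. It assumes the granularity option is passed as an argument named granularity with the value "hourly", and that start and end accept Date objects as they do for article_pageviews; check ?project_pageviews for the exact signature in your installed version. The wrapper hourly_project_views is a hypothetical helper, not part of the package.

```r
# Hypothetical helper (not part of the pageviews package): fetch hourly
# project-level pageviews between two dates.
hourly_project_views <- function(project, start, end) {
  pageviews::project_pageviews(
    project = project,
    start = start,
    end = end,
    granularity = "hourly"  # assumed argument name and value
  )
}

# Usage (requires the pageviews package and network access):
# hourly <- hourly_project_views("en.wikipedia",
#                                as.Date("2015-11-01"), as.Date("2015-11-02"))
# str(hourly)
```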

Top articles

With top_articles you can get data about the top 1,000 articles by pageviews on a project within a given timeframe (and on a given platform). This uses distinct year, month and day arguments; for month and day, you can select “all”, indicating that you want the top articles overall in a given year or month, respectively.

str(top_articles())
## 'data.frame':    1000 obs. of  8 variables:
##  $ project    : chr  "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
##  $ language   : chr  "en" "en" "en" "en" ...
##  $ article    : chr  "Main_Page" "Special:Search" "Annie_Besant" "Landslide_(Fleetwood_Mac_song)" ...
##  $ access     : chr  "all-access" "all-access" "all-access" "all-access" ...
##  $ granularity: chr  "day" "day" "day" "day" ...
##  $ date       : POSIXct, format: "2015-10-01" "2015-10-01" ...
##  $ rank       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ views      : num  18704051 2897671 292707 289704 194791 ...

The underlying data source here is likely to change, because it currently contains some automata, so be aware that there may be noise or unexpected entries in data from this function.
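
Separately from the automata issue, the list also contains entries such as Main_Page and Special:Search that are not ordinary articles. If those count as noise for your purposes, one possible cleanup is sketched below. The data frame is rebuilt by hand from the first four rows of the output above so the example runs offline; the filtering rule is an assumption about what you want to drop, not something the package does for you.

```r
# First four rows of the top_articles() output shown above, rebuilt by hand.
top <- data.frame(
  article = c("Main_Page", "Special:Search", "Annie_Besant",
              "Landslide_(Fleetwood_Mac_song)"),
  rank    = c(1, 2, 3, 4),
  views   = c(18704051, 2897671, 292707, 289704),
  stringsAsFactors = FALSE
)

# Flag the main page and anything in the Special: namespace.
is_noise <- top$article == "Main_Page" | grepl("^Special:", top$article)

# Keep only the remaining entries.
content_only <- top[!is_noise, ]
print(content_only)
```

With real data you would replace the hand-built data frame with the result of top_articles().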

Bugs and feature suggestions

If there’s something missing in the client that’s supported by the API proper, please submit an issue! And, while the author of this package doesn’t maintain the API, if you need functionality that isn’t covered by the API, you can still submit an issue on the client repo - I’ll just courier it over to the Wikimedia bug-tracker.