Accessing Wikimedia Pageviews Data

The Wikimedia websites (such as Wikipedia) are visited by hundreds of millions of people a year, and so the open datasets of pageviews contain useful information on the subjects that interest people around the globe.

The Wikimedia Foundation recently launched a new, officially supported API for this data that distinguishes between different types of users and different types of traffic. The pageviews package serves as a client for that API.

Per-article data

The most granular data available is on a per-article basis, and can be accessed with article_pageviews. This takes a project in the form language.project_class (for example, de.wikipedia), an article title (with or without spacing), a start date and (optionally) an end date, both specified as R date objects, and (should you choose) the platform and user type to return. By default, it reformats the resulting data as a data.frame:

library(pageviews)
str(article_pageviews(
  project = "de.wikipedia", article = "R_(Programmiersprache)",
  start = as.Date("2015-11-01"), end = as.Date("2015-11-02"),
  user_type = c("user", "automated"), platform = c("desktop", "mobile-web")
))

## 'data.frame':    8 obs. of  8 variables:
##  $ project    : chr  "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
##  $ language   : chr  "de" "de" "de" "de" ...
##  $ article    : chr  "R_(Programmiersprache)" "R_(Programmiersprache)" "R_(Programmiersprache)" "R_(Programmiersprache)" ...
##  $ access     : chr  "desktop" "desktop" "mobile-web" "mobile-web" ...
##  $ agent      : chr  "user" "user" "user" "user" ...
##  $ granularity: chr  "daily" "daily" "daily" "daily" ...
##  $ date       : POSIXct, format: "2015-11-01" "2015-11-02" ...
##  $ views      : num  207 444 86 75 0 0 0 0

With the user_type and platform arguments you can include or exclude spiders and other automata, and switch between overall pageviews, pageviews to the desktop site, or pageviews to the mobile site or app. Both user_type and platform accept either a single string or a character vector; when vectors are supplied, every combination of user_type and platform is queried and returned.
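To see why the call above returns 8 observations, the sketch below enumerates the combinations by hand in base R. It is only an illustration of the counting, not the package's internal implementation; the variable names (combos, days, n_rows) are made up for this example.

```r
# Values taken from the article_pageviews() call shown earlier.
user_type <- c("user", "automated")
platform  <- c("desktop", "mobile-web")

# Every platform / user_type pairing (2 x 2 = 4 combinations).
combos <- expand.grid(platform = platform, user_type = user_type,
                      stringsAsFactors = FALSE)

# The requested date range covers two days (start and end inclusive,
# as the output above shows rows for both dates).
days <- seq(as.Date("2015-11-01"), as.Date("2015-11-02"), by = "day")

# One row per combination per day.
n_rows <- nrow(combos) * length(days)
print(combos)
print(n_rows)  # 8, matching the "8 obs." in the output above
```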

Per-project data

Per-project data can be retrieved with (you guessed it) project_pageviews. It works much like article_pageviews, with one major difference: you can specify the granularity of the data, returning either daily or hourly data. The platform, user type and timestamp options are identical.

str(project_pageviews())

## 'data.frame':    1 obs. of  7 variables:
##  $ project    : chr "wikipedia"
##  $ language   : chr "en"
##  $ access     : chr "all-access"
##  $ agent      : chr "all-agents"
##  $ granularity: chr "daily"
##  $ date       : POSIXct, format: "2015-10-01"
##  $ views      : num 2.72e+08
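Hourly data can be rolled up to daily totals after the fact. The sketch below does this in base R on a hand-built data frame that only mimics the date and views columns shown above; the numbers are invented for illustration and are not real pageview counts. In practice the frame would come from an hourly project_pageviews call.

```r
# Hypothetical hourly data: 48 hourly timestamps covering two days (UTC).
# The view counts are made up; a real frame would come from the API client.
hourly <- data.frame(
  date  = as.POSIXct("2015-10-01 00:00:00", tz = "UTC") + 3600 * 0:47,
  views = rep(c(100, 200), each = 24)
)

# Derive the calendar day for each hourly timestamp.
hourly$day <- as.Date(hourly$date, tz = "UTC")

# Sum the hourly views within each day.
daily <- aggregate(views ~ day, data = hourly, FUN = sum)
print(daily)
```

Pinning the time zone to UTC keeps hours near midnight from being assigned to the wrong day on machines with a different local time zone.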

Top articles

With top_articles you can get data about the top 1,000 articles by pageviews on a project within a given timeframe (and on a given platform). This uses distinct year/month/day arguments; for month and day you can pass “all” to request the top articles across the whole year or month, respectively.

str(top_articles())

## 'data.frame':    1000 obs. of  8 variables:
##  $ project    : chr  "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
##  $ language   : chr  "en" "en" "en" "en" ...
##  $ article    : chr  "Main_Page" "Special:Search" "Annie_Besant" "Landslide_(Fleetwood_Mac_song)" ...
##  $ access     : chr  "all-access" "all-access" "all-access" "all-access" ...
##  $ granularity: chr  "day" "day" "day" "day" ...
##  $ date       : POSIXct, format: "2015-10-01" "2015-10-01" ...
##  $ rank       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ views      : num  18704051 2897671 292707 289704 194791 ...

The underlying data source currently includes some automated traffic and is likely to change, so be aware that there may be noise or unexpected entries in data from this function.
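One simple way to tidy such a result is to drop entries that are not regular articles before ranking. The sketch below works on a hand-built data frame holding the first four rows of the output above. Treating Main_Page and Special: pages as noise is an assumption made for this example; it is a heuristic, it does not detect automated traffic, and it is not something the package does for you.

```r
# First four rows of the top_articles() output shown above, rebuilt by hand.
top <- data.frame(
  article = c("Main_Page", "Special:Search", "Annie_Besant",
              "Landslide_(Fleetwood_Mac_song)"),
  rank    = 1:4,
  views   = c(18704051, 2897671, 292707, 289704),
  stringsAsFactors = FALSE
)

# Assumption: the main page and anything in the Special: namespace
# are not articles a reader chose by topic, so we filter them out.
is_noise <- top$article == "Main_Page" | grepl("^Special:", top$article)
cleaned  <- top[!is_noise, ]

# Re-number the ranks so they are contiguous after filtering.
cleaned$rank <- seq_len(nrow(cleaned))
print(cleaned)
```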

Oliver Keyes

2020-05-10

Bugs and feature suggestions