The Wikimedia websites (such as Wikipedia) are visited by hundreds of millions of people a year, and so the open datasets of pageviews contain useful information on the subjects that interest people around the globe.
The Wikimedia Foundation recently launched a new, officially supported API for this data that distinguishes between different types of users and different types of traffic. The pageviews package serves as a client for that API.
The most granular data available is on a per-article basis, and can be accessed with article_pageviews. This takes a project, in the form language.project_class, an article title (with or without spacing), start and (optionally) end dates, specified as R date objects, and (should you choose) the platform and user type to return. By default, it reformats the resulting data as a data.frame:
library(pageviews)
# Query two days of German Wikipedia traffic for one article,
# for every combination of user type and platform.
str(article_pageviews(project = "de.wikipedia",
                      article = "R_(Programmiersprache)",
                      start = as.Date("2015-11-01"),
                      end = as.Date("2015-11-02"),
                      user_type = c("user", "automated"),
                      platform = c("desktop", "mobile-web")))
## 'data.frame': 8 obs. of 8 variables:
## $ project : chr "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
## $ language : chr "de" "de" "de" "de" ...
## $ article : chr "R_(Programmiersprache)" "R_(Programmiersprache)" "R_(Programmiersprache)" "R_(Programmiersprache)" ...
## $ access : chr "desktop" "desktop" "mobile-web" "mobile-web" ...
## $ agent : chr "user" "user" "user" "user" ...
## $ granularity: chr "daily" "daily" "daily" "daily" ...
## $ date : POSIXct, format: "2015-11-01" "2015-11-02" ...
## $ views : num 207 444 86 75 0 0 0 0
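Because the result is an ordinary data.frame, base R tools apply to it directly. As a small sketch, the following rebuilds the first four rows of the output above by hand (only the access and views columns, whose values are fully visible there) and totals the views per platform. The object names sample_views and totals are made up for illustration; in practice the data would come from article_pageviews itself.

```r
# Stand-in for the first four rows of the result printed above;
# in real use this would be the data.frame returned by article_pageviews().
sample_views <- data.frame(
  access = c("desktop", "desktop", "mobile-web", "mobile-web"),
  views  = c(207, 444, 86, 75),
  stringsAsFactors = FALSE
)

# Total views per platform over the queried days.
totals <- aggregate(views ~ access, data = sample_views, FUN = sum)
print(totals)
```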
With the user_type and platform arguments you can include or exclude spiders and other automata, and switch between overall pageviews, pageviews to the desktop site, or pageviews to the mobile site or app. Both user_type and platform can be a single character string or a character vector; when vectors are supplied, all combinations of user_type and platform will be queried and returned.
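To see why the article_pageviews example earlier returns eight rows, it can help to enumerate the combinations by hand: two user types, two platforms and two days give 2 × 2 × 2 = 8. The sketch below only illustrates that counting with base R; it is not a description of how the package builds its requests, and the name combos is made up for the example.

```r
# Illustration only: enumerate the combinations covered by a vectorised query.
user_type <- c("user", "automated")
platform  <- c("desktop", "mobile-web")
days      <- c("2015-11-01", "2015-11-02")

# expand.grid varies its first argument fastest, so the day changes on every
# row, the platform every two rows and the user type every four rows.
combos <- expand.grid(date = days, access = platform, agent = user_type,
                      stringsAsFactors = FALSE)
print(combos)
nrow(combos)
```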
Per-project data can also be retrieved using (you’ve guessed it) project_pageviews. This looks very similar to article_pageviews with one major difference: you can specify the granularity of the data, returning either daily or hourly data. The platform, user type and timestamp options are identical.
## 'data.frame': 1 obs. of 7 variables:
## $ project : chr "wikipedia"
## $ language : chr "en"
## $ access : chr "all-access"
## $ agent : chr "all-agents"
## $ granularity: chr "daily"
## $ date : POSIXct, format: "2015-10-01"
## $ views : num 2.72e+08
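For hourly rather than daily data, a call might look like the sketch below. It assumes that project_pageviews accepts project, start and end in the same form as article_pageviews, and that the granularity argument takes the value "hourly"; check the function's help page for the exact signature in your installed version. The wrapper name hourly_project_views is hypothetical and not part of the package.

```r
# Hypothetical convenience wrapper (not part of the pageviews package).
# Assumes project_pageviews() takes project, start, end and granularity.
hourly_project_views <- function(project, start, end) {
  pageviews::project_pageviews(project = project,
                               start = start,
                               end = end,
                               granularity = "hourly")
}

# Usage (requires the pageviews package and network access):
# hourly <- hourly_project_views("en.wikipedia",
#                                as.Date("2015-10-01"),
#                                as.Date("2015-10-02"))
# str(hourly)
```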
With top_articles you can get data about the top 1,000 articles by pageviews on a project within a given timeframe (and on a given platform). This uses distinct year/month/day arguments; for month and day, you can select “all”, indicating that you want the top articles overall in a given year or month respectively.
## 'data.frame': 1000 obs. of 8 variables:
## $ project : chr "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
## $ language : chr "en" "en" "en" "en" ...
## $ article : chr "Main_Page" "Special:Search" "Annie_Besant" "Landslide_(Fleetwood_Mac_song)" ...
## $ access : chr "all-access" "all-access" "all-access" "all-access" ...
## $ granularity: chr "day" "day" "day" "day" ...
## $ date : POSIXct, format: "2015-10-01" "2015-10-01" ...
## $ rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ views : num 18704051 2897671 292707 289704 194791 ...
The underlying data source here is likely to change, because it currently contains some automata, so be aware that there may be noise or unexpected entries in data from this function.
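One kind of unexpected entry is visible in the output above: Main_Page and Special:Search are not articles in the usual sense. A minimal sketch of one possible cleanup follows. The helper drop_non_articles is hypothetical (it is not part of the package), it operates on a hand-built stand-in that reuses the four rows visible above, and it does nothing about automated traffic inflating the counts of genuine articles.

```r
# Hypothetical helper, not part of the pageviews package: drop the main page
# and anything in the Special: namespace from a top_articles()-style result.
drop_non_articles <- function(top) {
  is_non_article <- grepl("^Special:", top$article) | top$article == "Main_Page"
  top[!is_non_article, , drop = FALSE]
}

# Stand-in built from the first four rows of the output above.
sample_top <- data.frame(
  article = c("Main_Page", "Special:Search", "Annie_Besant",
              "Landslide_(Fleetwood_Mac_song)"),
  rank    = c(1, 2, 3, 4),
  views   = c(18704051, 2897671, 292707, 289704),
  stringsAsFactors = FALSE
)

cleaned <- drop_non_articles(sample_top)
print(cleaned)
```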
If there’s something missing in the client that’s supported by the API proper, please submit an issue! And, while I don’t maintain the API, if you need functionality that it doesn’t cover, you can still submit an issue on the client repo - I’ll just courier it over to the Wikimedia bug-tracker.