Public Subject Attention via Wikipedia Page View Statistics

Public attention is an interesting field of study. The internet not only gives near-instant access to information on virtually any subject; the page access statistics gathered by website operators also make it possible to study what people are paying attention to. For the omnipresent Wikipedia, these access statistics are made available via http://stats.grok.se, a server that provides the information as file dumps as well as through a web API. This package provides an easy-to-use, consistent and traffic-minimizing approach to making those data accessible within R.


Author

Peter Meißner

Last Update

2016-03-20

Status (current version on Github)

Version on Github 1.1.10

Purpose

The wikipediatrend package is designed to make Wikipedia page access statistics available in R in as convenient a way as possible.

Consequently, the package provides:

  • daily page views as data frames
  • page views for user set time spans
  • page views for multiple articles in one function call
  • page views for articles in different language domains
  • a function to check whether an article exists under other titles in other language domains
  • background caching of results to minimize function execution time as well as server load

Installation

A stable version of the package can be found on CRAN and installed via ...

install.packages("wikipediatrend")

... while the current development version can be retrieved by using install_github() from the devtools package ...

devtools::install_github("petermeissner/wikipediatrend")

After loading the package, several functions are available.

library(wikipediatrend)

Usage

The workhorse of the package is the wp_trend() function:

wp <- wp_trend(page = c("Fever","Fieber"), 
               from = "2013-08-01", 
               to   = "2015-12-31", 
               lang = c("en","de"))
## http://stats.grok.se/json/en/201308/Fever
## http://stats.grok.se/json/en/201309/Fever
## http://stats.grok.se/json/en/201310/Fever
## http://stats.grok.se/json/de/201308/Fieber
## http://stats.grok.se/json/de/201309/Fieber
## http://stats.grok.se/json/de/201310/Fieber
## http://stats.grok.se/json/de/201311/Fieber
# (... messages shortened)
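
The messages above suggest that one request is made per article, language and month, with URLs following the pattern http://stats.grok.se/json/<lang>/<yyyymm>/<page>. The following is a minimal sketch of how such a list of URLs could be assembled. build_grok_urls() is a hypothetical helper and not part of wikipediatrend; it assumes that pages and languages are paired element-wise and that titles need no URL encoding. It only builds strings and makes no request.

```r
# Hypothetical helper (NOT part of wikipediatrend): builds one URL per
# page/language pair and month, following the pattern seen in the messages.
build_grok_urls <- function(page, lang, from, to) {
  # first day of every month touched by the time span
  months <- seq(
    from = as.Date(format(as.Date(from), "%Y-%m-01")),
    to   = as.Date(to),
    by   = "month"
  )
  yyyymm <- format(months, "%Y%m")
  # assumption: pages and languages are paired element-wise
  unlist(
    Map(
      function(p, l) sprintf("http://stats.grok.se/json/%s/%s/%s", l, yyyymm, p),
      page, lang
    ),
    use.names = FALSE
  )
}

urls <- build_grok_urls(
  page = c("Fever", "Fieber"),
  lang = c("en", "de"),
  from = "2013-08-01",
  to   = "2013-10-31"
)
print(urls)
```

Requesting whole months rather than single days fits the month variable in the returned data and keeps the number of requests small.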

The function returns a data frame with seven variables (date, count, lang, page, rank, month, title), paralleling the data provided by the stats.grok.se server:

head(wp)
##   date       count lang page   rank month  title 
## 1 2013-08-01  486  de   Fieber 1391 201308 Fieber
## 2 2013-08-01 2768  en   Fever  5014 201308 Fever 
## 3 2013-08-02  476  de   Fieber 1391 201308 Fieber
## 4 2013-08-02 2529  en   Fever  5014 201308 Fever 
## 5 2013-08-03  429  de   Fieber 1391 201308 Fieber
## 6 2013-08-03 2113  en   Fever  5014 201308 Fever
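
Because the result is an ordinary data frame, the usual base R tools apply. The sketch below sums daily views per language. The data frame wp_mock is a hand-made stand-in that copies the six rows printed above, so the example runs without contacting any server; it omits the rank, month and title columns.

```r
# Stand-in data copied from the head(wp) output above (no request is made)
wp_mock <- data.frame(
  date  = as.Date(c("2013-08-01", "2013-08-01", "2013-08-02",
                    "2013-08-02", "2013-08-03", "2013-08-03")),
  count = c(486, 2768, 476, 2529, 429, 2113),
  lang  = c("de", "en", "de", "en", "de", "en"),
  page  = c("Fieber", "Fever", "Fieber", "Fever", "Fieber", "Fever"),
  stringsAsFactors = FALSE
)

# total page views per language over the three days
totals <- aggregate(count ~ lang, data = wp_mock, FUN = sum)
print(totals)
```

The same call should work on a real wp_trend() result, assuming it keeps the count and lang columns shown above.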

Furthermore, wikipediatrend provides a helper function, wp_linked_pages(), which queries Wikipedia to find out whether a particular article exists in other languages as well:

wp_linked_pages("Hitsche", lang="de")
##    page             lang   title           
## 1  Schame           bar    Schame          
## 2  Reposapeus       ca     Reposapeus      
## 3  Footstool        en     Footstool       
## 4  Reposapi%C3% ... es     Reposapiés      
## 5  %D8%B2%DB%8C ... fa     زیرپایی         
## 6  Pouf             fr     Pouf            
## 7  Skabelo          io     Skabelo         
## 8  Voetenb%C3%A ... nds-NL Voetenbänksi ...
## 9  Fotskammel       nn     Fotskammel      
## 10 Podn%C3%B3%C ... pl     Podnóżek

Vignette

For more detailed usage, have a look at the vignette accompanying the package:

vignette("using-wikipediatrend", package="wikipediatrend")

... or read it on CRAN, or build it from scratch from Github.

Some examples for using page view statistics in general

  • politan.ch (2015-10-04): Welche Ständeratskandidaturen interessieren?. politan.ch. http://www.politan.ch/welche-standeratskandidaturen-interessieren/

  • politan.ch (2015-05-25): Wenn Klicks Stimmen wären. politan.ch. http://www.politan.ch/wenn-klicks-stimmen-waren/

  • Munzert, Simon (2015): Using Wikipedia Page View Statistics to Measure Issue Salience. WEBDATANET CONFERENCE 2015. http://conference.webdatanet.eu/uploads/submission/full_paper/35/munzert-wikipedia-webdatanet.pdf

  • Wilkerson, Bill (2015): Post-Republican debate on Wikipedia follow-up: before and after public interest in the candidates. http://www.wrwilkerson.com/ . http://www.wrwilkerson.com/blog/2015/8/15/post-republican-debate-on-wikipedia-follow-up-before-and-after-public-interest-in-the-candidates

  • Taha Yasseri and Jonathan Bright (2015): Predicting elections from online information flows: towards theoretically informed models. http://arxiv.org/abs/1505.01818

  • Mellon, Jonathan (2014): Internet Search Data and Issue Salience: The Properties of Google Trends as a Measure of Issue Salience. Journal of Elections, Public Opinion and Parties 24(1):45-72. http://www.tandfonline.com/doi/abs/10.1080/17457289.2013.846346

  • Yla Tausczik, Kate Faasse, James W. Pennebaker, Keith J. Petrie (2012): Public Anxiety and Information Seeking Following the H1N1 Outbreak: Blogs, Newspaper Articles, and Wikipedia Visits. Health Communication, Vol. 27, Iss. 2. http://www.tandfonline.com/doi/pdf/10.1080/10410236.2011.571759

  • Ripberger, Joseph T. (2011): Capturing curiosity: using Internet search trends to measure public attentiveness. Policy Studies Journal 39(2):239-259. http://onlinelibrary.wiley.com/doi/10.1111/j.1541-0072.2011.00406.x/full

(Did I miss your application? Make a pull request, open an issue or drop me a line, and I will add it here.)

Thanks

Fernando Reis, Eryk Walczak, Simon Munzert, Kristin Lindemann

Credits

  • Parts of the package's code have been shamelessly copied and modified from the R base package written by the R Core Team. This concerns the wp_date() generic and its methods and is detailed in the help files.

News

NEWS wikipediatrend

  • BUGFIX
  • tests, README generation as well as vignette generation would fail due to stats.grok.se serving no data after 2016-01-16
  • examples were restricted to data prior to that date to make tests and vignette/README generation run smoothly again
  • FEATURE
  • wp_trend() results came unordered; results are now ordered by date, lang, title (making it easy to use something like plot( wp_trend("test", from="2015-01-01", to="2015-01-30")[,c("date", "count")], type="b"))
  • cut loose unnecessary RCurl dependency
  • BUGFIX
  • wp_linked_pages() would sometimes return links that do not link to wikipedia pages but to other projects - that has been fixed.
  • FEATURE
  • if provided with bad server response data that could not be parsed, wp_trend() would fail without giving further information; now it reports whatever data was sent back from the server along with the warning
  • BUGFIX
  • the package's functions would fail combined with rvest versions >= 0.3.0 - that has been fixed
  • BUGFIX
  • wp_trend() would fail with an uninformative error if the page or lang input contained NA - now it fails with a more informative error: 'Error: all(!is.na(page)) is not TRUE'

  • vignette would fail due to NA as page/lang input of wp_trend() - code has been changed to prevent such

  • CRAN COMPLIANCE
  • imports from non-'base' R packages are now made explicit
  • BUGFIX
    • adding checks for empty data returned by server (might cause breaking of wp_trend()) and adding tests for such cases
  • BUGFIX
  • modifying vignette to comply with CRAN policies: replaced non-ASCII character in R code by its \u-escape sequence ( \u00e4 )

  • modifying vignette to comply with CRAN policies: making code evaluation for code that uses non-mainstream repository hosted packages optional on machines that do not have those installed

  • BUGFIX
    • modifying vignette to comply with CRAN policies (dropping lines installing packages if not present)
  • modifying caching to comply with CRAN policies

  • changing default folder of cache file from temp (basename(tempdir())) to Rtemp ( tempdir() )

  • adding ghrr as additional repo to comply with CRAN policies

  • changing default folder of cache file from home (~) to temp (basename(tempdir()))

  • feature: caching has been overhauled

  • feature: wp_trend() now tries to guess whether page was supplied as a title with possible special characters or as a (url-encoded) URL part, and takes care of further processing

  • bug-fix: special character support of the package was lousy and prevented the usage of articles in non-standard languages ( - especially on Windows)

    • introduction of the wp_df class to allow for a print.wp_df that a) shortens long strings on print b) does not use format() (format() causes UTF-8 characters to be replaced by "<U+xxxx>" strings (probably only))
    • using a package specific write_utf8_csv() and read_utf8_csv() to be able to store and cache data for articles with special character names (even under Windows, write.csv() does not allow enforcing a specific encoding)
  • bug-fix / backward compatibility: with version 1.0.0 old parameters for wp_trend() were causing errors

  • bug-fix: wp_cache_reset() would stop with an error if called twice in a row - fixed

  • api-change: option userAgent deleted: the default is to send information on versions of R, wikipediatrend, curl as well as RCurl

  • api-change: option requestFrom deleted: the default is to not send the header

  • feature: wp_trend() now by default caches data retrievals in a temporary file

  • feature: wp_trend(file="save.csv") now allows specifying a file where retrievals are stored (this will always add to the already existing data)

  • feature: wp_trend() now allows specifying more than one page and/or language at a time. Data will then be retrieved for every combination of page-language and date

  • feature: the caching system is persistent: wp_cache_file() will report the file used for caching; wp_cache_reset() will reset the cache; wp_cache_load() will return its content as data.frame()

  • feature: while wp_trend() now (invisibly) returns only data from the current request at hand, the new function wp_cache() will retrieve data from cache files (by default / if no file name is specified it retrieves data from .wp_trend_cache)

  • api-change: the data returned by wp_trend(), cached in the cache file and retrieved by wp_cache() now consists of more variables: date, count, project, title, rank, month

  • feature: testthat tests now check base functionality of the package

  • bug-fix: non-existing page views for a month led to an error, fixed.

  • bug-fix: wp_trend() now checks date inputs better for logical inconsistencies

  • first publication on CRAN


1.1.10 by Peter Meissner, 2 years ago


Report a bug at https://github.com/petermeissner/wikipediatrend/issues


Browse source code at https://github.com/cran/wikipediatrend


Authors: Peter Meissner [aut, cre], R Core Team [ctb]



Task views: Web Technologies and Services


GPL (>= 2) license


Imports jsonlite, stringr, rvest, httr, utils, xml2, hellno

Suggests testthat, knitr, ggplot2, devtools, dplyr, magrittr, AnomalyDetection, BreakoutDetection, cranlogs

