Wikipedia access traces

This directory contains a trace of 10% of all user requests issued to Wikipedia (in all languages) during the period between September 19th 2007 and January 2nd 2008.

Note: only parts of the trace are currently available. This is only temporary, and we are working hard to put the entire trace online as soon as possible.

The trace comprises one request per line. Each line contains:

  • A monotonically increasing counter (useful for sorting the trace in chronological order)
  • The timestamp of the request in Unix notation with millisecond precision
  • The requested URL
  • A flag to indicate if the request resulted in a database update or not
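
As a rough illustration of the line format above, here is a minimal Python sketch of a parser. Several details are assumptions, because the description above does not specify them: that fields are separated by whitespace, that the timestamp is written as fractional seconds (e.g. 1190146243.326), and that the update flag is the token "save". The names TraceRecord and parse_trace_line are hypothetical, and the flag token is a parameter so it can be adjusted after inspecting the actual files.

```python
from typing import NamedTuple


class TraceRecord(NamedTuple):
    counter: int       # monotonically increasing request counter
    timestamp: float   # Unix time; assumed to be fractional seconds
    url: str           # the requested URL
    is_update: bool    # True if the request caused a database update


def parse_trace_line(line: str, update_flag: str = "save") -> TraceRecord:
    """Parse one trace line into a TraceRecord.

    Assumes four whitespace-separated fields; `update_flag` is the
    (assumed) token that marks a request as a database update.
    Raises ValueError if the line does not have exactly four fields.
    """
    counter, timestamp, url, flag = line.split()
    return TraceRecord(int(counter), float(timestamp), url, flag == update_flag)
```

Since the counter increases monotonically, a list of parsed records can be put back in chronological order with sorted(records, key=lambda r: r.counter).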

Note that the trace does not contain any information about clients (their IP addresses, location, etc.). There is no need to ask for it: if I had this information, I would have made it public.

If you plan to use this trace, please send me a note and cite the trace as follows in your articles:

@Article{,
  author = 	 {Urdaneta, Guido and Pierre, Guillaume and van Steen, Maarten},
  title = 	 {Wikipedia Workload Analysis for Decentralized Hosting},
  volume =       {53},
  number =       {11},
  pages =        {1830--1845},
  month =        {July},
  year = 	 {2009},
  journal = 	 {Elsevier Computer Networks},
  note = 	 {\url{http://www.globule.org/publi/WWADH_comnet2009.html}}
}

We publish this data with authorization from the Wikimedia Foundation in the hope that it will be useful to the scientific community. Special thanks to them for making this trace available to us and allowing us to publish it! In particular, we owe a great deal to Gerard Meijssen and Tim Starling, without whom none of this would have been possible.
