Workloads

This page contains the results of collecting workloads.

WORKLOADS

Wikipedia

We have 284 GB of compressed HTTP requests, representing 20 billion requests, or 10% of Wikipedia's workload over a 107-day period. This dataset is derived from the "Wikipedia Workload Analysis for decentralized hosting" paper.
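For a sense of scale, the figures quoted above can be turned into average request rates. This is only a back-of-the-envelope sketch: it assumes the 10% figure is a uniform sample over the whole period, that the quoted counts are exact, and that GB means 10^9 bytes. The constant and function names are illustrative, not part of the dataset.

```python
# Rough rates derived from the figures quoted for the Wikipedia trace.
TRACE_REQUESTS = 20e9      # requests in the trace
TRACE_DAYS = 107           # length of the collection period
SAMPLE_FRACTION = 0.10     # share of the full workload (assumed uniform)
COMPRESSED_BYTES = 284e9   # compressed size, assuming decimal gigabytes

SECONDS_PER_DAY = 86_400


def average_rate(requests: float, days: float) -> float:
    """Mean requests per second over the whole period."""
    return requests / (days * SECONDS_PER_DAY)


trace_rate = average_rate(TRACE_REQUESTS, TRACE_DAYS)
full_rate = trace_rate / SAMPLE_FRACTION           # extrapolated full workload
bytes_per_request = COMPRESSED_BYTES / TRACE_REQUESTS

print(f"trace:      {trace_rate:,.0f} req/s on average")
print(f"full site:  {full_rate:,.0f} req/s on average (extrapolated)")
print(f"compressed: {bytes_per_request:.1f} bytes per request")
```

These are averages only; they say nothing about peak load or daily variation in the trace.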

Epinions.com

From here.

Waiting for response

  • Nokia (Yekesa): willing to give us data distributions and read/write workloads but not actual data
  • Yahoo (Brian Cooper)

Others

These are uninteresting for whatever reason (e.g. read-only).

  • Ensembl Genetic DB: we have a 4.7GB raw-text file containing (read-only) queries collected from an on-line public installation of the Ensembl Genetic DB.
  • Adam Seering might eventually be able to get us PostgreSQL query logs for MIT ESP, but not from the periods when it is actively in use (during Splash in March and November)
  • We can get complete access to FeedMe by joining the project, but the workload is very small
  • CarTel data can be obtained from Eugene but it's read-only

Dead-ends

  • Microsoft: we haven't actually pressed Phil Bernstein, but getting data seems unlikely.
  • Foursquare: only the firehose
  • Twitter: only sampled stream; have a fuller stream from Jeff Terrace (via Yahoo dev acct); OHR paper authors refusing to release now
  • OkCupid
  • ITA
  • Facebook
  • Wordpress.com
  • Slashdot