Run a Pig script with Elastic MapReduce using ngram data on S3

IT & Programming > Other IT & Programming

The client has made the following changes to the job.


Job canceled.

Oct 14, 2012

Job Description

  • Posted: Thu, Oct 11, 2012
  • Time Left: Closed
  • Location: Anywhere
  • Client prefers freelancers from:
  • Start: Immediately
  • Budget: Less than $500
  • Fixed Price Job
  • Elance Escrow Protection
  • W9 Not Required

I'm looking for a way to order the Google Books Ngrams by frequency.

The original dataset is here: [obscured]/ngrams/datasets. Inside each file, the ngrams are sorted alphabetically and then chronologically.

The data is also hosted on S3 in US East: [obscured]/datasets/8172056142375670

I need to find the 10,000 most frequent 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams from 1980 until today.
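
To make the request concrete, here is a minimal Python sketch of the aggregation, not a Pig script and not a definitive implementation. It assumes "most frequent" means the largest total occurrence count summed over all years from 1980 onward (1980 included), and that the records have already been read as (ngram, year, count) tuples. The function name top_ngrams and its parameters are hypothetical.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Tuple


def top_ngrams(
    records: Iterable[Tuple[str, int, int]],
    since_year: int = 1980,
    n: int = 10_000,
) -> List[Tuple[str, int]]:
    """Return the n n-grams with the largest summed count since since_year.

    Assumption: each record is (ngram, year, occurrence_count).
    """
    totals: Counter = Counter()
    for ngram, year, count in records:
        if year >= since_year:  # assumption: the cutoff year is inclusive
            totals[ngram] += count
    # nlargest keeps only n candidates instead of sorting the whole vocabulary
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


if __name__ == "__main__":
    sample = [
        ("circumvallate", 1978, 313),  # dropped: before 1980
        ("circumvallate", 1980, 5),
        ("rampart", 1981, 7),
        ("rampart", 1990, 1),
    ]
    print(top_ngrams(sample, n=2))
```

The same function would be run once per n-gram order (1 through 5). On the full dataset the grouping step would have to be distributed, which is what the Pig/Elastic MapReduce setup is for.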

The datasets contain data for multiple years:

As an example, here are the 30,000,000th and 30,000,001st lines from file 0
of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):

circumvallate 1978 313 215 85
circumvallate 1979 183 147 77

The first line tells us that in 1978, the word "circumvallate" (which means
"surround with a rampart or other fortification", in case you were wondering)
occurred 313 times overall, on 215 distinct pages and in 85 distinct books
from our sample.
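
The record layout described above can be sketched as a small Python parser. This is an illustration under assumptions: the field names (match_count, page_count, volume_count) are hypothetical labels for the three counts, and the fields are assumed to be separated by whitespace as in the two lines shown.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    ngram: str
    year: int
    match_count: int   # total occurrences in that year
    page_count: int    # distinct pages
    volume_count: int  # distinct books


def parse_line(line: str) -> NgramRecord:
    """Parse one record such as 'circumvallate 1978 313 215 85'."""
    # Split from the right so that multi-word n-grams keep their inner spaces.
    ngram, year, matches, pages, volumes = line.strip().rsplit(None, 4)
    return NgramRecord(ngram, int(year), int(matches), int(pages), int(volumes))


if __name__ == "__main__":
    record = parse_line("circumvallate 1978 313 215 85")
    print(record.ngram, record.year, record.match_count)
```

Splitting from the right matters for the 2-gram to 5-gram files, where the n-gram itself contains spaces.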

Pig makes things like this very easy and straight-...

Desired Skills
Amazon Web Services
Job ID: 34257954
There are no proposals yet.