I'm looking for a way to order the Google Books Ngrams by frequency.
The original dataset is here: [obscured] /ngrams/datasets. Inside each file the ngrams are sorted alphabetically and then chronologically.
The data is also hosted on S3 in US East: [obscured] /datasets/8172056142375670
I need to find the 10,000 most frequent 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams from 1980 until today.
The datasets contain data for multiple years:
As an example, here are the 30,000,000th and 30,000,001st lines from file 0
of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):
circumvallate 1978 313 215 85
circumvallate 1979 183 147 77
The first line tells us that in 1978, the word "circumvallate" (which means
"surround with a rampart or other fortification", in case you were wondering)
occurred 313 times overall, on 215 distinct pages and in 85 distinct books
from our sample.
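To make the task concrete, here is a rough Python sketch of the aggregation for a single file. It rests on several assumptions not stated in the posting: each line holds the n-gram followed by four whitespace-separated numeric fields (year, match count, page count, volume count), all lines for a given n-gram are contiguous (which the alphabetical sort should guarantee), and "frequency" means the summed match count from 1980 onward. The names parse_line and top_ngrams are hypothetical.

```python
import heapq
from itertools import groupby


def parse_line(line):
    """Split one record into (ngram, year, match_count)."""
    # Split from the right so multi-word n-grams (2-grams and up) stay intact.
    ngram, year, match_count, _pages, _volumes = line.rstrip("\n").rsplit(None, 4)
    return ngram, int(year), int(match_count)


def top_ngrams(lines, n=10_000, min_year=1980):
    """Return the n most frequent n-grams as (total_count, ngram) pairs.

    Assumes all lines for one n-gram are adjacent, so the input can be
    streamed without holding every n-gram in memory.
    """
    records = (parse_line(line) for line in lines if line.strip())
    totals = (
        # Sum match counts for the years of interest, one n-gram at a time.
        (sum(count for _, year, count in group if year >= min_year), ngram)
        for ngram, group in groupby(records, key=lambda rec: rec[0])
    )
    # Drop n-grams that never occur in the period, then keep the n largest.
    return heapq.nlargest(n, (pair for pair in totals if pair[0] > 0))


if __name__ == "__main__":
    # The first two lines come from the example above; the rest are
    # made-up records for illustration only.
    sample = [
        "circumvallate\t1978\t313\t215\t85\n",
        "circumvallate\t1979\t183\t147\t77\n",
        "circus\t1980\t100\t90\t50\n",
        "circus\t1981\t50\t40\t30\n",
        "the cat\t1985\t500\t400\t300\n",
    ]
    print(top_ngrams(sample, n=2))  # [(500, 'the cat'), (150, 'circus')]
```

Since each n-gram size is split across several files, one way to combine them would be to run this per file and then apply heapq.nlargest once more to the concatenated per-file results. That is only valid if no n-gram appears in more than one file, which I have not verified.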
Pig makes things like this very easy and straight-...