Ruby Indexing Code for Solr Search of Patents
We currently have a basic, manual process for ingesting the patent data provided by Google (available at http://www.google.com/googlebooks/uspto-patents-grants-text.html) into a Solr search index. The collection contains three different file types: one in a custom format and two XML-based. The files are a little unusual in that each comprises several XML document blocks (one per patent) concatenated into a single file, which can trip up some of the standard XML tools. We currently have code that parses the three patent formats and loads the results into a PostgreSQL database, after which we reindex into Solr.
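To illustrate the concatenation issue: a standard XML parser expects one document per file, so the per-patent blocks usually have to be split apart first. A minimal sketch of one approach, assuming each patent block begins with its own XML declaration (the helper name and sample data are hypothetical):

```ruby
# Hypothetical helper: splits a file containing several concatenated XML
# documents (one per patent) into standalone document strings, each of
# which can then be fed to a standard XML parser such as Nokogiri.
def split_concatenated_xml(text)
  # Assumption: every patent block starts with its own "<?xml ..." declaration,
  # so a zero-width lookahead split keeps the declaration on each chunk.
  text.split(/(?=<\?xml )/).reject { |chunk| chunk.strip.empty? }
end

# Tiny stand-in for a real patent grant file:
sample = <<~XML
  <?xml version="1.0"?><patent><id>US1</id></patent>
  <?xml version="1.0"?><patent><id>US2</id></patent>
XML

docs = split_concatenated_xml(sample)
# docs now holds two well-formed XML documents.
```

Streaming the split (reading the file in chunks rather than loading it whole) would matter for the larger grant files, but the boundary logic stays the same.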
We would like to make this process much more efficient and automated. Here is how I see it happening, although I am definitely open to suggestions and better solutions in your proposal. I would like you to create a set of two IronWorkers (iron.io) that solve this problem. First, there should be a worker that we can schedule periodically; it parses the Google patent search page ( http://www.google.com/googlebooks/uspto-patents-grants-text.html ), extracts the patent filenames, and consults a PostgreSQL database to see which ones haven't been processed yet. It should then copy those files into S3 and start a second IronWorker for each of them. The second IronWorker should pull down / stream its file from S3, parse it, and store the appropriate output in both a PostgreSQL database and the Solr search index. You should also provide some friendly rake tasks for launching workers, so that I can process all new files, process a specific file, or reprocess everything.
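The scraping step of the first worker might look like the sketch below. The link pattern is an assumption about how the listing page is laid out, and the sample HTML is invented; in the real worker the page body would be fetched with something like `URI.open(url).read` from the stdlib `open-uri` library:

```ruby
# Hypothetical sketch of the first worker's scraping step: given the HTML
# of the listing page, pull out the zip filenames linked there. Each name
# would then be checked against the PostgreSQL "processed files" table.
def extract_patent_filenames(html)
  # Assumption: each weekly file appears as an href ending in ".zip".
  html.scan(%r{href="[^"]*/([^/"]+\.zip)"}i).flatten.uniq
end

# Invented sample markup standing in for the live page:
sample = <<~HTML
  <a href="http://example.com/grants/2012/ipg120103.zip">ipg120103.zip</a>
  <a href="http://example.com/grants/2012/ipg120110.zip">ipg120110.zip</a>
HTML

files = extract_patent_filenames(sample)
# => ["ipg120103.zip", "ipg120110.zip"]
```

Each filename not yet marked as processed would then be uploaded to S3 and queued as a task for the second worker.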
As I mentioned, we have existing Ruby code to parse the patent files. It is somewhat fragmented at the moment, so there will be some work to unify one parser's code with the other two, but it should be relatively simple. In this solution, performance is key, as there are 8MM+ patent documents that may have to be reindexed.
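One performance consideration worth flagging in a proposal: at 8MM+ documents, committing to Solr per document is far too slow, so the second worker should send documents in large batches with a single commit per batch. A minimal sketch of the batching logic, with plain hashes standing in for parsed patents (the batch size is a guess to be tuned, and the commented lines show where an RSolr client would plug in):

```ruby
# Performance sketch: group parsed patent documents into batches so the
# worker can issue one Solr add (and one commit) per batch instead of
# one per document. In the real worker this would wrap an RSolr client:
#
#   require 'rsolr'
#   solr = RSolr.connect(url: 'http://localhost:8983/solr/patents')
#
def index_in_batches(docs, batch_size: 1000)
  docs.each_slice(batch_size) do |batch|
    yield batch            # in the worker: solr.add(batch)
  end
  # in the worker: solr.commit  -- one commit at the end, not per document
end

# Exercise the batching logic with 2,500 stand-in documents:
batch_sizes = []
stub_docs = (1..2500).map { |i| { id: "US#{i}" } }
index_in_batches(stub_docs) { |batch| batch_sizes << batch.size }
batch_sizes  # => [1000, 1000, 500]
```

The same batching shape applies to the PostgreSQL writes (multi-row inserts rather than row-at-a-time), which is where most of the reindexing time would otherwise go.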