Import Latest DMOZ RDF Files Into SQL Server 2012 Database File
This job is pretty simple. We need the contents of the following two large files:
structure.rdf.u8.gz
content.rdf.u8.gz
imported into a new SQL Server 2012 database called "DMOZ".
The database should have two tables:
categories
  categoryID int (Primary Key)
listings
  listingID bigInt (Primary Key, identity)
  categoryID int (Foreign Key -> categories.categoryID)
The categories table is populated by structure.rdf.u8.gz and the listings table is populated by content.rdf.u8.gz.
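As a rough illustration of reading the structure file without loading it all into memory, here is a minimal Python sketch. It assumes each category appears as a <Topic r:id="..."> element followed by a <catid> line, which is how the DMOZ dumps are commonly laid out but should be checked against the actual files. The function names are hypothetical, and a line-based scan is used here in place of a full RDF/XML parser.

```python
import gzip
import io
import re
from html import unescape

# Assumed markup: <Topic r:id="Top/..."> followed by <catid>123</catid>.
TOPIC_OPEN = re.compile(r'<Topic r:id="([^"]*)"')
CATID = re.compile(r'<catid>(\d+)</catid>')

def iter_categories(stream):
    """Yield (category_id, path) pairs from a text stream, one line at a time."""
    current_path = None
    for line in stream:
        topic = TOPIC_OPEN.search(line)
        if topic:
            # Decode XML entities such as &amp; in the category path.
            current_path = unescape(topic.group(1))
        catid = CATID.search(line)
        if catid and current_path is not None:
            yield int(catid.group(1)), current_path
            current_path = None

def iter_categories_from_gz(filename):
    """Stream categories straight from the compressed dump (hypothetical helper)."""
    with gzip.open(filename, "rt", encoding="utf-8", errors="replace") as f:
        yield from iter_categories(f)

sample = io.StringIO(
    '<Topic r:id="Top/Sports/Baseball">\n'
    '  <catid>12345</catid>\n'
    '</Topic>\n'
    '<Topic r:id="Top/Sports/Baseball/Cards">\n'
    '  <catid>67890</catid>\n'
    '</Topic>\n'
)
parsed = list(iter_categories(sample))
```

Each yielded pair can then be batched into the categories table with whatever bulk-insert method the chosen SQL Server client supports.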
The only tricky part is that the topCategoryID column must be generated by the parser/importer, as it is not contained in the RDF file. A "topCategory" is the category that sits one level higher on the breadcrumb than the current category.
So given the following two hypothetical categories:
categoryID: 12345 path: Top/Sports/Baseball
categoryID: 67890 path: Top/Sports/Baseball/Cards
categoryID 67890's topCategoryID is 12345 since Top/Sports/Baseball is one level higher than Top/Sports/Baseball/Cards. Again, this information is NOT contained in the raw RDF file, so your code needs to take care of this.
If necessary, assign a topCategoryID of 0 to a high level category such as "Top/Sports", if the "Top" category doesn't already have a categoryID assigned to it by DMOZ. We're not sure exactly how they treat this root/Top category. Hopefully this makes sense.
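The topCategoryID rule described above could be sketched as follows in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input is a list of (categoryID, path) pairs already extracted from the structure file, and the full path-to-ID map is assumed to fit in memory. Any category whose parent path has no categoryID gets 0.

```python
def parent_path(path):
    """Return the path one breadcrumb level up, or None for a root like 'Top'."""
    head, sep, _ = path.rpartition("/")
    return head if sep else None

def assign_top_category_ids(categories, missing_id=0):
    """Map each categoryID to its topCategoryID.

    categories: iterable of (category_id, path) pairs.
    missing_id: value used when the parent path has no categoryID.
    """
    categories = list(categories)
    id_by_path = {path: cid for cid, path in categories}
    return {
        cid: id_by_path.get(parent_path(path), missing_id)
        for cid, path in categories
    }

example = [
    (12345, "Top/Sports/Baseball"),
    (67890, "Top/Sports/Baseball/Cards"),
]
top_ids = assign_top_category_ids(example)
# 67890 maps to 12345; 12345 maps to 0 because "Top/Sports" is not in this sample.
```

Building the lookup first and resolving parents in a second pass means the order of categories in the file does not matter.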
Also keep in mind that a given listing can appear in multiple categories.
We're not sure if the content.rdf.u8.gz file contains actual "listingID"s. If it does, then simply use the listingID supplied by the RDF file. If not, then make this an identity column that starts at 1 and increments as listings are added to it.
We're looking for someone who is familiar with RDF parsing, and with parsing these large DMOZ files in particular. For someone who knows what they're doing, this shouldn't take very long. Use your coding language of choice, as we don't need the code you used to do the parsing. We just need the final .mdf file (or a .bak if that makes more sense, since the database will be huge). We don't care how you get the job done as long as it gets done correctly with 100% accuracy.
Finally, you MUST use the latest two RDF files contained in the links above.
Quick turnaround time and low cost will be important in who we award the job to. New Elancers are welcome to bid, as this would be a great job to build your Elance resume.