I have a large set (10K) of HTML documents created by many different authors using many different tools.
The set contains four types of documents:
Type 1 (200 pages): fewer documents of this type
Type 2 (100-140 pages): fewer documents of this type
Type 3 (20-40 pages): more documents of this type
Type 4 (10-20 pages): more documents of this type
Each type foll...
Skills: Bash, HTML, Java, Perl