Unix App to OCR Headline Text from Noisy Color Backgrounds
This job is to develop an application that runs on large multi-core Linux servers, reads an image in one of the usual formats (starting with JPG and PNG), and extracts any text the image contains.
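As a rough illustration of how the multi-core requirement might be met, the sketch below fans a batch of images out over a fixed thread pool. `TextRecognizer` and `BatchOcr` are hypothetical names introduced here, not part of any OCR package; the sketch also assumes the chosen OCR engine can safely be called from several threads at once, which would need checking for any real package.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical adapter around whichever OCR package is chosen; not a real library API. */
interface TextRecognizer {
    String recognize(Path image) throws Exception;
}

final class BatchOcr {
    /**
     * Runs the recognizer over every image using a fixed pool of worker threads.
     * Assumes the recognizer is thread-safe. Results keep the input order.
     */
    static Map<Path, String> run(List<Path> images, TextRecognizer ocr, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per image so idle cores pick up the next file.
            List<Future<String>> futures = new ArrayList<>();
            for (Path image : images) {
                futures.add(pool.submit(() -> ocr.recognize(image)));
            }
            // Collect results in submission order; get() blocks until each task is done.
            Map<Path, String> results = new LinkedHashMap<>();
            for (int i = 0; i < images.size(); i++) {
                results.put(images.get(i), futures.get(i).get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would typically pass `Runtime.getRuntime().availableProcessors()` as the thread count, though the best value depends on how much memory each OCR call needs.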
Images will typically be as large as 1000x1000 px, but sizes vary. Backgrounds will be in color, and you can expect them to contain a lot of edges that get in the way of character recognition!
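To give a feel for the kind of preprocessing such backgrounds may call for, here is a minimal sketch that converts a color image to gray levels and binarizes it with Otsu's global threshold, using only the JDK. The class name `Binarizer` is made up for this example. A single global threshold is unlikely to be enough for busy, edge-heavy backgrounds; it is shown only as a possible first step before handing the image to an OCR package.

```java
import java.awt.image.BufferedImage;

final class Binarizer {
    /** Gray level (0..255) of a packed RGB pixel, using integer Rec. 601 luma weights. */
    static int luma(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /** Otsu's method: the gray level that maximizes the between-class variance. */
    static int otsuThreshold(BufferedImage img) {
        int[] hist = new int[256];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                hist[luma(img.getRGB(x, y))]++;
            }
        }
        long total = (long) img.getWidth() * img.getHeight();
        long sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += (long) i * hist[i];

        long sumLow = 0, countLow = 0;
        double bestVar = -1;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            countLow += hist[t];
            if (countLow == 0) continue;        // no pixels at or below t yet
            long countHigh = total - countLow;
            if (countHigh == 0) break;          // every pixel is at or below t
            sumLow += (long) t * hist[t];
            double meanLow = (double) sumLow / countLow;
            double meanHigh = (double) (sumAll - sumLow) / countHigh;
            double diff = meanLow - meanHigh;
            double var = (double) countLow * countHigh * diff * diff;
            if (var > bestVar) { bestVar = var; best = t; }
        }
        return best;
    }

    /** Pixels brighter than the threshold become white, all others black. */
    static BufferedImage binarize(BufferedImage img) {
        int t = otsuThreshold(img);
        BufferedImage out = new BufferedImage(img.getWidth(), img.getHeight(),
                BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                out.setRGB(x, y, luma(img.getRGB(x, y)) > t ? 0xFFFFFF : 0x000000);
            }
        }
        return out;
    }
}
```

An input file could be loaded with `javax.imageio.ImageIO.read(new java.io.File(...))`, which handles both JPG and PNG.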
Restricting the app to specific fonts is not practical in this application, and the fonts used in the images are likely to be decorative. You may assume that the text is in the Roman (Latin) alphabet (i.e. not kanji, for instance).
We'll provide several hundred sample images from our specific problem set for you to work on. A sample image is attached. The resulting app must extract the correct text from at least 80% of them for the project to be accepted.
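The acceptance criterion could be checked along the following lines. This is only a sketch under stated assumptions: the posting does not define "correct", so the example treats a result as correct when it equals the expected text after trimming, collapsing whitespace, and ignoring case, and it reads the 80% figure as a minimum. `AcceptanceCheck` and its methods are hypothetical names.

```java
import java.util.Locale;
import java.util.Map;

final class AcceptanceCheck {
    /** Assumed normalization: trim, collapse runs of whitespace, ignore case. */
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Counts images whose extracted text matches the expected text after normalization. */
    static int countCorrect(Map<String, String> expected, Map<String, String> actual) {
        int correct = 0;
        for (Map.Entry<String, String> e : expected.entrySet()) {
            String got = actual.get(e.getKey());   // null if the app produced no result
            if (got != null && normalize(got).equals(normalize(e.getValue()))) {
                correct++;
            }
        }
        return correct;
    }

    /** True when at least 80% of the images are correct (integer math avoids rounding issues). */
    static boolean accepted(int correct, int total) {
        return total > 0 && correct * 100L >= 80L * total;
    }
}
```

Both maps are keyed by image file name; the expected map would come from hand-labelled sample images and the actual map from the app's output.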
We are looking for someone who knows available off-the-shelf OCR packages (either freeware or relatively low-cost), and who can use these to assemble a complete app for us. We have software expertise in other areas, but this is an initial proof-of-concept project for us, and we'd rather find someone who can get this done easily and quickly.
Our main programming language is Java, but C/C++ is equally acceptable for this project.