I need an experienced cURL programmer to write a highly efficient site crawler with the following features:
Takes a domain as start input
Crawls every page on the domain (or, optionally, a limited number of pages) by:
-Following internal links
-Reading existing sitemap file if one exists
-Scraping Google for "site:domain.com" hits and checking them for pages missed by the steps above.
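To illustrate the "following internal links" step, here is a minimal sketch of what is meant. The function name extractInternalLinks is hypothetical, and the sketch assumes PHP's DOM extension is available. It only handles absolute, protocol-relative and root-relative links; path-relative links and the sitemap and Google steps are left out.

```php
<?php
// Hypothetical helper: return the unique same-host URLs linked from a page.
function extractInternalLinks(string $html, string $baseUrl): array
{
    $scheme = parse_url($baseUrl, PHP_URL_SCHEME) ?: 'http';
    $host   = parse_url($baseUrl, PHP_URL_HOST);
    if ($html === '' || !$host) {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed, so parser warnings are suppressed.
    @$doc->loadHTML($html);

    $found = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // empty link or same-page fragment
        }
        $parts = parse_url($href);
        if ($parts === false) {
            continue; // unparseable URL
        }
        if (isset($parts['scheme'])
            && !in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
            continue; // mailto:, javascript:, etc.
        }
        if (isset($parts['host'])) {
            if (strcasecmp($parts['host'], $host) !== 0) {
                continue; // external link
            }
            // Protocol-relative links (//host/path) inherit the base scheme.
            $url = isset($parts['scheme']) ? $href : $scheme . ':' . $href;
        } elseif ($href[0] === '/') {
            $url = $scheme . '://' . $host . $href; // root-relative link
        } else {
            continue; // path-relative links are not resolved in this sketch
        }
        // Drop any fragment so /page and /page#top count as one URL.
        $url = explode('#', $url, 2)[0];
        $found[$url] = true;
    }
    return array_keys($found);
}
```

Each returned URL would go into the crawl queue unless it has already been visited.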
The crawler should store the following to a MySQL database:
Anything from the curl_getinfo function.
Page meta tags and values (e.g. keywords, description and a few other "standard" tags)
The contents of all heading tags (H1-H5).
All of the above within reasonable character count limitations.
The crawler must be able to read various character sets and convert them to UTF-8 for storage, with HTML entities converted to plain text.
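As a rough illustration of the conversion described above, something along these lines would be acceptable. The function names and the 255-character default limit are hypothetical. The sketch assumes the mbstring extension is loaded and that the charset named in the Content-Type header is one mbstring recognizes.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type value such as
// the one curl_getinfo() reports, e.g. "text/html; charset=ISO-8859-1".
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: convert text to UTF-8, decode HTML entities to plain
// characters, and cut the result to a character (not byte) limit.
function toUtf8PlainText(string $text, ?string $charset, int $maxChars = 255): string
{
    // Assumption: fall back to UTF-8 when no charset was declared.
    $utf8  = mb_convert_encoding($text, 'UTF-8', $charset ?? 'UTF-8');
    $plain = html_entity_decode($utf8, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return mb_substr(trim($plain), 0, $maxChars, 'UTF-8');
}
```

A real implementation would also need to handle charsets declared only in a meta tag, and charset names that mbstring does not know.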
The crawler should be written in well-structured and commented PHP code, utilizing the cURL library and making use of its ability to make multiple simultaneous requests, i.e. curl_multi_*...
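For reference, the kind of curl_multi_* usage meant here looks roughly like the sketch below. The function name fetchAll, the option values and the result layout are hypothetical. A production crawler would additionally cap the number of concurrent handles and refill the pool as transfers finish.

```php
<?php
// Hypothetical helper: fetch several URLs in parallel and return, per URL,
// the curl_getinfo() data, the response body and any error message.
function fetchAll(array $urls, int $timeout = 10): array
{
    $multi   = curl_multi_init();
    $handles = [];

    // Keying by URL means duplicate URLs are fetched only once.
    foreach (array_unique($urls) as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,  // keep the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true,  // follow redirects
            CURLOPT_TIMEOUT        => $timeout,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for socket activity instead of busy-looping.
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = [
            'info'  => curl_getinfo($ch),
            'body'  => curl_multi_getcontent($ch),
            'error' => curl_error($ch),
        ];
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

The 'info' array of each result is what would be stored in MySQL alongside the parsed meta and heading data.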