The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.
Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
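One way to honor both kinds of politeness policy is to check each host's robots.txt before fetching and to enforce a minimum gap between successive requests to the same host. The sketch below uses Python's standard `urllib.robotparser`; the class name `PolitenessPolicy` and the one-second default delay are illustrative choices, not part of any standard.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PolitenessPolicy:
    """Per-host robots.txt checks plus a minimum delay between requests.
    A minimal sketch; real crawlers also cache robots.txt with expiry and
    honor Crawl-delay directives."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay   # seconds between hits to one host
        self.robots = {}                     # host -> RobotFileParser (or None)
        self.last_fetch = {}                 # host -> time of last request

    def _parser_for(self, host):
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            try:
                rp.read()        # network fetch of robots.txt
            except OSError:
                rp = None        # robots.txt unreachable: allow by default
            self.robots[host] = rp
        return self.robots[host]

    def allowed(self, url, agent="*"):
        """True if robots.txt permits this agent to fetch the URL."""
        rp = self._parser_for(urlparse(url).netloc)
        return True if rp is None else rp.can_fetch(agent, url)

    def wait_if_needed(self, url):
        """Sleep just long enough to keep requests to one host spaced out."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_fetch.get(host, 0.0)
        if elapsed < self.default_delay:
            time.sleep(self.default_delay - elapsed)
        self.last_fetch[host] = time.monotonic()
```

The fetcher would call `allowed(url)` once per URL and `wait_if_needed(url)` immediately before each download.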
The crawler should have the ability to execute in a distributed fashion across multiple machines.
The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
• Performance and efficiency:
The crawl system should make efficient use of various system resources including processor, storage and network bandwidth.
Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching "useful" pages first.
In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine's index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page.
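A simple way to approximate a page's rate of change is to adapt the revisit interval after each fetch: revisit sooner when the page has changed, back off when it has not. The multiplicative halving/doubling below is an illustrative heuristic, not a specific published policy; the bounds are assumed values.

```python
def next_revisit_interval(current_interval, page_changed,
                          min_interval=3600.0, max_interval=30 * 86400.0):
    """Return the next revisit interval in seconds.

    Halve the interval if the page changed since the last fetch,
    double it otherwise, clamped to [min_interval, max_interval].
    """
    if page_changed:
        interval = current_interval / 2.0
    else:
        interval = current_interval * 2.0
    return max(min_interval, min(max_interval, interval))
```

Over repeated crawls the interval converges toward the page's observed change rate: frequently updated pages settle near `min_interval`, static pages drift toward `max_interval`.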
Crawlers should be designed to be extensible in many ways - to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.
Web Crawler Functions:
• Controller Module - This module provides the Graphical User Interface (GUI) for the web crawler and is responsible for controlling its operation. The GUI enables the user to enter the start URL, set the maximum number of URLs to crawl, and view the URLs as they are fetched. The controller drives the Fetcher and Parser modules.
o URL frontier – managing the URLs to be fetched
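The frontier can be sketched as a FIFO queue paired with a set of already-seen URLs, so each URL is handed to the fetcher at most once. The class name `URLFrontier` is an illustrative choice; production frontiers additionally prioritize by politeness constraints and page importance.

```python
from collections import deque

class URLFrontier:
    """Minimal FIFO frontier: yields each added URL at most once."""

    def __init__(self, seeds=()):
        self._queue = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Enqueue a URL unless it has been seen before."""
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        """Return the next URL to fetch, or None if the frontier is empty."""
        return self._queue.popleft() if self._queue else None
```

The crawl loop repeatedly calls `next_url()`, fetches the page, and feeds any extracted links back in through `add()`.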
• Fetcher Module - This module starts by fetching the page at the start URL specified by the user. It then extracts all the links on each fetched page and continues fetching until the maximum number of URLs is reached.
o DNS resolution – determining the host (web server) from which to fetch a page defined by a URL
o Fetching module – downloading a remote webpage for processing
o Processing URLs - The links in HTML files are represented as URLs. There are two kinds of URLs: absolute and relative. Your program must handle both, as described below.
o Absolute URLs - An absolute URL fully specifies the address of the referenced document. These are the addresses that you would type into a web browser in order to visit a web page. For example, consider the following absolute URL: “http://www.espn.com:80/basketball”
o Restricting the Crawler's Scope - For this crawler, an absolute URL must start with either http:// or file:. Any link that starts with neither prefix is treated as a relative URL.
o Relative URLs - Relative URLs are all the links that do not begin with either http:// or file:. For example: “../../images/nasdaq.jpg”. A relative URL is resolved against the URL of the page on which it appears.
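Both kinds of link can be reduced to one canonical absolute form with Python's standard `urllib.parse`: `urljoin` resolves a relative link against the page it appeared on (and passes absolute links through unchanged), and `urldefrag` drops any `#fragment`. The helper name `normalize_link` is an illustrative choice.

```python
from urllib.parse import urljoin, urldefrag

def normalize_link(base_url, link):
    """Resolve a possibly relative link against the page URL it appeared
    on and strip any #fragment, so absolute and relative links share one
    canonical absolute form."""
    absolute, _fragment = urldefrag(urljoin(base_url, link))
    return absolute
```

With this in place the fetcher never needs to branch on the two URL kinds; every extracted link goes through the same normalization before entering the frontier.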
o Optimization module – Speeds up the crawl by skipping the download of images and other content not needed for link extraction.
o Multi-level crawling – The user can define the crawling depth (level).
o Query Optimizer - Fetches URLs more efficiently by downloading several pages concurrently in a multi-threaded environment.
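The multi-threaded fetching described above can be sketched with Python's standard `concurrent.futures` thread pool. The function name `fetch_all` and the injectable `fetch_one` callable are illustrative choices: `fetch_one` stands in for whatever routine downloads a single URL (with its own error handling), which also makes the sketch easy to test without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch_one, max_workers=8):
    """Download many URLs concurrently.

    `fetch_one` is any callable taking a URL and returning its result
    (e.g. the page body).  Returns a dict mapping each URL to its result,
    in the same order the URLs were given.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs results correctly
        return dict(zip(urls, pool.map(fetch_one, urls)))
```

Because page downloads are dominated by network latency rather than CPU, threads overlap the waiting time and raise overall throughput even under Python's GIL.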
• Parser Module - This module parses the pages fetched by the Fetcher module and saves their contents to disk.
o Parsing module – extracting text and links
o Duplicate elimination – detecting URLs and page contents that have already been processed recently.
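Duplicate elimination can be sketched with two sets: one of URLs already processed and one of content fingerprints, so identical page bodies fetched under different URLs are caught too. The class name `DuplicateDetector` is an illustrative choice; this catches exact duplicates only, while near-duplicate detection would need shingling or a similar technique.

```python
import hashlib

class DuplicateDetector:
    """Tracks already-processed URLs and exact-duplicate page contents."""

    def __init__(self):
        self._seen_urls = set()
        self._seen_digests = set()

    def seen_url(self, url):
        """True if this URL was already processed; records it otherwise."""
        if url in self._seen_urls:
            return True
        self._seen_urls.add(url)
        return False

    def seen_content(self, content):
        """True if an identical page body was already saved.  Only SHA-1
        fingerprints are kept, not the page bodies themselves."""
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if digest in self._seen_digests:
            return True
        self._seen_digests.add(digest)
        return False
```

The parser would skip saving a page when either check returns True, keeping the on-disk store free of exact duplicates.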