A Web crawler is an Internet bot that assists in Web indexing. It works through a website one page at a time until all of the pages have been indexed. Web crawlers collect information about a website and the links related to it, and can also be used to validate HTML code and hyperlinks.
A Web crawler is also known as a Web spider, automatic indexer or simply crawler.
Web crawlers collect information such as the URL of the website, meta tag information, Web page content, the links on each page and the destinations those links lead to, the page title and any other relevant data. They keep track of the URLs that have already been downloaded so the same page is not fetched twice. The behavior of a Web crawler is determined by a combination of policies: a selection policy, a re-visit policy, a parallelization policy and a politeness policy. Web crawlers also face several challenges, namely the large and continuously evolving World Wide Web, content selection trade-offs, social obligations and dealing with adversaries.
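To make these mechanics concrete, the sketch below is a minimal, library-independent crawl loop in plain PHP: a frontier queue, a visited set so no page is downloaded twice, and a crude politeness delay. The seed URL, download cap and one-second delay are arbitrary values chosen for the illustration.

```php
<?php
// Minimal illustrative crawl loop (plain PHP, no library): a frontier queue,
// a "visited" set to avoid downloading the same page twice, and a politeness
// delay between requests. Seed URL, cap and delay are placeholder values.

$frontier     = ['http://www.example.com/'];   // URLs waiting to be fetched
$visited      = [];                            // URLs already downloaded
$maxDownloads = 20;                            // arbitrary cap so the example terminates

while (!empty($frontier) && count($visited) < $maxDownloads) {
    $url = array_shift($frontier);             // breadth-first: take the oldest URL
    if (isset($visited[$url])) {
        continue;                              // already fetched, skip it
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);          // robots.txt handling omitted for brevity
    if ($html === false) {
        continue;
    }

    // Parse the page and report its title.
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $title = $doc->getElementsByTagName('title')->item(0);
    echo $url, ' -> ', $title ? trim($title->textContent) : '(no title)', "\n";

    // Discover new URIs: here, every absolute link on the page.
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, 'http') === 0 && !isset($visited[$href])) {
            $frontier[] = $href;
        }
    }

    sleep(1);                                  // crude politeness policy
}
```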
Web crawlers are a key component of Web search engines and other systems that examine Web pages. They index Web content, allow users to run queries against that index and return the pages that match. Web crawlers are also used in Web archiving, in which large sets of webpages are periodically collected and preserved, and in data mining, where pages are analyzed for different properties and statistics.
Open source web crawler
PHP web spider is an open source Web crawler with the following features:
- supports two traversal algorithms: breadth-first and depth-first
- supports crawl depth limiting, queue size limiting and max downloads limiting
- supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
- comes with a useful set of URI filters, such as Domain limiting
- supports custom URI filters, both prefetch (URI) and postfetch (Resource content); a sketch of the idea follows this list
- supports custom request handling logic
- comes with a useful set of persistence handlers (memory and file; Redis soon to follow)
- supports custom persistence handlers
- collects statistics about the crawl for reporting
- dispatches useful events, allowing developers to add even more custom behavior
- supports a politeness policy
- will soon come with many default discoverers: RSS, Atom, RDF, etc.
- will soon support multiple queueing mechanisms (file, memcache, redis)
- will eventually support distributed spidering with a central queue
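As an illustration of the custom URI filter idea mentioned in the list, the sketch below defines a small filter contract in plain PHP and a domain-limiting prefetch filter built on it. The interface, class and method names are invented for this illustration and are not the library's actual API; check the library's Filter classes for the real contract.

```php
<?php
// Hypothetical illustration of a prefetch URI filter; the interface shipped
// with the library may have a different name and signature.
interface UriFilter
{
    /** Return true if the URI should be skipped (filtered out). */
    public function match(string $uri): bool;
}

// Only allow URIs on a fixed set of hosts, i.e. domain limiting.
class DomainLimitFilter implements UriFilter
{
    /** @var string[] */
    private $allowedHosts;

    public function __construct(array $allowedHosts)
    {
        $this->allowedHosts = $allowedHosts;
    }

    public function match(string $uri): bool
    {
        $host = parse_url($uri, PHP_URL_HOST);
        // Filter the URI out when its host is not in the allowed list.
        return !in_array($host, $this->allowedHosts, true);
    }
}

// A crawler would call match() on every discovered URI before queueing it.
$filter = new DomainLimitFilter(['www.example.com']);
var_dump($filter->match('http://www.example.com/about')); // bool(false): keep it
var_dump($filter->match('http://other.example.org/'));    // bool(true): skip it
```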
Basic usage is very simple; the code for it can be found in example/example_simple.php. For a more complete, real-world example with logging, caching and filters, see example/example_complex.php.
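The sketch below follows the spirit of that simple example. It is written from the library's documented usage rather than copied from the repository, so the class, method and property names shown (Spider, XPathExpressionDiscoverer, the discoverer-set and queue-manager accessors) should be checked against example_simple.php for the installed version; the seed URL is a placeholder.

```php
<?php
require 'vendor/autoload.php';

use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;

// NOTE: names below follow the library's documented example and may differ
// between releases; treat example/example_simple.php as authoritative.

// Create the spider with a seed URI (placeholder URL).
$spider = new Spider('http://www.example.com/');

// Add a URI discoverer: here, every <a> tag on the page.
// Without at least one discoverer the spider stops after the seed page.
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer('//a'));

// Keep the example small: crawl one level deep, at most ten queued URIs.
$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;

// Execute the crawl.
$spider->crawl();

// Process the downloaded resources from the persistence handler.
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo $resource->getCrawler()->filterXpath('//title')->text(), "\n";
}
```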