Open Source Web Crawler for Java

Overview

crawler4j

crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.

Installation

Using Maven

Add the following dependency to your pom.xml:

    <dependency>
        <groupId>edu.uci.ics</groupId>
        <artifactId>crawler4j</artifactId>
        <version>4.4.0</version>
    </dependency>

Using Gradle

Add the following dependency to your build.gradle file:

compile group: 'edu.uci.ics', name: 'crawler4j', version: '4.4.0'

Quickstart

You need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded page. The following is a sample implementation:

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore urls that
     * have css, js, gif, ... extensions and to only accept urls that start
     * with "https://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("https://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}

As can be seen in the above code, there are two main functions that should be overridden:

  • shouldVisit: This function decides whether the given URL should be crawled or not. The example above filters out URLs with .css, .js, and media-file extensions and only allows pages within the 'www.ics.uci.edu' domain.
  • visit: This function is called after the content of a URL is downloaded successfully. You can easily get the URL, text, links, HTML, and unique id of the downloaded page.

You should also implement a controller class which specifies the seeds of the crawl, the folder in which intermediate crawl data should be stored and the number of concurrent threads:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        // Instantiate the controller for this crawl.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // For each crawl, you need to add some seed URLs. These are the first
        // URLs that are fetched, and then the crawler starts following links
        // found in these pages.
        controller.addSeed("https://www.ics.uci.edu/~lopes/");
        controller.addSeed("https://www.ics.uci.edu/~welling/");
        controller.addSeed("https://www.ics.uci.edu/");

        // The factory which creates instances of crawlers.
        CrawlController.WebCrawlerFactory<MyCrawler> factory = MyCrawler::new;

        // Start the crawl. This is a blocking operation, meaning that your code
        // will reach the line after this only when crawling is finished.
        controller.start(factory, numberOfCrawlers);
    }
}

More Examples

  • Basic crawler: the full source code of the above example with more details.
  • Image crawler: a simple image crawler that downloads image content from the crawled domain and stores it in a folder. This example demonstrates how binary content can be fetched using crawler4j.
  • Collecting data from threads: this example demonstrates how the controller can collect data/statistics from crawling threads.
  • Multiple crawlers: this is a sample that shows how two distinct crawlers can run concurrently. For example, you might want to split your crawling into different domains and then take different crawling policies for each group. Each crawling controller can have its own configurations.
  • Shutdown crawling: this example shows how crawling can be terminated gracefully by sending the 'shutdown' command to the controller (a short sketch of this follows the list).
  • Postgres/JDBC integration: this shows how to save the crawled content into a Postgres database (or any other JDBC repository), thanks to rzo1.
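
The graceful-shutdown case can be summarized in a short sketch. This is only an illustration, not the linked example itself: it assumes the MyCrawler class from the Quickstart, uses an arbitrary 30-second run, and then asks the controller to shut down and waits for the crawler threads to finish.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ShutdownController {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl/root");

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://www.ics.uci.edu/");

        // Start the crawl without blocking the current thread.
        controller.startNonBlocking(MyCrawler.class, 7);

        // Let the crawl run for a while, then request a graceful shutdown.
        Thread.sleep(30_000);
        controller.shutdown();

        // Block until all crawler threads have terminated.
        controller.waitUntilFinish();
    }
}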

Configuration Details

The controller class has a mandatory parameter of type CrawlConfig. Instances of this class can be used for configuring crawler4j. The following sections describe some of the configuration details.
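
To give a sense of how these options fit together, here is a small illustrative sketch that combines several of the settings described below into a single CrawlConfig (all values are arbitrary examples):

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/data/crawl/root"); // folder for intermediate crawl data
config.setMaxDepthOfCrawling(2);                  // see "Crawl depth"
config.setMaxPagesToFetch(1000);                  // see "Maximum number of pages to crawl"
config.setIncludeHttpsPages(true);                // see "Enable SSL"
config.setIncludeBinaryContentInCrawling(false);  // see "Enable Binary Content Crawling"
config.setPolitenessDelay(200);                   // milliseconds, see "Politeness"
config.setResumableCrawling(false);               // see "Resumable Crawling"
config.setUserAgentString("crawler4j (https://github.com/yasserg/crawler4j/)"); // see "User agent string"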

Crawl depth

By default there is no limit on the depth of crawling, but you can set one. For example, assume that you have a seed page "A", which links to "B", which links to "C", which links to "D". So, we have the following link structure:

A -> B -> C -> D

Since, "A" is a seed page, it will have a depth of 0. "B" will have depth of 1 and so on. You can set a limit on the depth of pages that crawler4j crawls. For example, if you set this limit to 2, it won't crawl page "D". To set the maximum depth you can use:

crawlConfig.setMaxDepthOfCrawling(maxDepthOfCrawling);

Enable SSL

To enable SSL (crawling of HTTPS pages), simply:

CrawlConfig config = new CrawlConfig();

config.setIncludeHttpsPages(true);

Maximum number of pages to crawl

Although by default there is no limit on the number of pages to crawl, you can set a limit on this:

crawlConfig.setMaxPagesToFetch(maxPagesToFetch);

Enable Binary Content Crawling

By default, crawling binary content (e.g., images, audio, etc.) is turned off. To enable crawling of these files:

crawlConfig.setIncludeBinaryContentInCrawling(true);

See an example here for more details.
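
As a rough illustration of handling binary content (this is not the linked image-crawler example; the target directory, content-type check, and file naming are assumptions), a visit() override inside a WebCrawler subclass could store fetched image bytes like this (imports shown for reference):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import edu.uci.ics.crawler4j.crawler.Page;

@Override
public void visit(Page page) {
    String contentType = page.getContentType();
    // Only handle image responses; everything else is ignored in this sketch.
    if (contentType != null && contentType.startsWith("image")) {
        byte[] imageBytes = page.getContentData();
        String url = page.getWebURL().getURL();
        // Derive a simple file name from the URL hash; adjust as needed.
        Path target = Paths.get("/data/images", url.hashCode() + ".img");
        try {
            Files.write(target, imageBytes);
        } catch (IOException e) {
            System.err.println("Could not store image from " + url + ": " + e.getMessage());
        }
    }
}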

Politeness

crawler4j is designed to be very efficient and can crawl domains very fast (e.g., it has been able to crawl 200 Wikipedia pages per second). However, since this is against crawling policies and puts a huge load on servers (and they might block you!), crawler4j waits at least 200 milliseconds between requests by default (since version 1.3). This parameter can be tuned:

crawlConfig.setPolitenessDelay(politenessDelay);

Proxy

Should your crawl run behind a proxy? If so, you can use:

crawlConfig.setProxyHost("proxyserver.example.com");
crawlConfig.setProxyPort(8080);

If your proxy also needs authentication:

crawlConfig.setProxyUsername(username);
crawlConfig.setProxyPassword(password);

Resumable Crawling

Sometimes you need to run a crawler for a long time, and it is possible that it terminates unexpectedly. In such cases, it might be desirable to resume the crawl. You can resume a previously stopped/crashed crawl using the following setting:

crawlConfig.setResumableCrawling(true);

However, you should note that it might make the crawling slightly slower.

User agent string

The user-agent string is used to identify your crawler to web servers. See here for more details. By default, crawler4j uses the following user agent string:

"crawler4j (https://github.com/yasserg/crawler4j/)"

However, you can override it:

crawlConfig.setUserAgentString(userAgentString);

License

Copyright (c) 2010-2018 Yasser Ganjisaffar

Published under Apache License 2.0, see LICENSE

Comments
  • Converting Concrete Classes to Interfaces

    While working on an internal company crawling project I found your project. Using the tutorials I was able to create a custom crawler and crawl controller to perform the functions that we needed. However, the website crawl did not run as expected, so I dug into the crawler4j source code.

    By changing a number of the project's concrete classes to interfaces, and modifying the hierarchy, I was able to inject our own crawl database that we were able to use to monitor status of the crawl and debug why it was stopping short of our expected number of pages.

    Although a number of the changes are personal preferences regarding code format/style, they do include improvements that I believe others in the community could benefit from.

    opened by nathanjoyes 12
  • Unhandled TIKA exception: "java.lang.NoClassDefFoundError"

    Hi, I updated Crawler4j to version 4.4.0 and while running it I came upon the following exception.

    Exception in thread "Crawler 1" java.lang.NoClassDefFoundError: org/apache/cxf/jaxrs/ext/multipart/ContentDisposition
    	at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:73)
    	at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)
    	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    	at edu.uci.ics.crawler4j.parser.BinaryParseData.setBinaryContent(BinaryParseData.java:70)
    	at edu.uci.ics.crawler4j.parser.Parser.parse(Parser.java:56)
    	at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:472)
    	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:306)
    	at java.lang.Thread.run(Thread.java:748)
    
    opened by LSmyrnaios 9
  • Authentication not working with Crawler4j

    I tried to set up a crawler project, but neither form authentication nor basic authentication is working at all. What changes do I need to make in my Java class?

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        String frontier = "/tmp/webCrawler/tmp_" + System.currentTimeMillis();
        config.setCrawlStorageFolder(frontier);
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        AuthInfo authInfo2 = new BasicAuthInfo("admin", "admin", host);
        config.addAuthInfo(authInfo2);
        PageFetcher pf = new PageFetcher(config);
        CrawlController ctrl = new CrawlController(config, pf, robotstxtServer);
        ctrl.addSeed(+host+"/index.html");
        ctrl.startNonBlocking(CustomWebCrawler.class, 5);

    I created another class which extends the WebCrawler class from Crawler4j.

    Kindly provide a solution or a document to refer to.

    opened by abhinkraj7 9
  • Sleepycat dependency?

    Hello,

    It looks like the sleepycat dependency needs to be fetched from the Oracle Maven repository.

    1. Any alternatives?
    2. If so, should this be mentioned in README.md?

    Thanks, Stephan

    opened by StephanSchmidt 8
  • Factory instead of hardcoded class.newInstance()

    I would like to suggest that adding the possibility to use a factory to create new web crawlers would be of great value.

    Since a web crawler could hold a few custom services (e.g., classifiers, database services), a factory would be a very nice way to make crawler4j usable, for example, via Spring.

    A few years ago an issue was created on Google Code (https://code.google.com/p/crawler4j/issues/detail?id=144) which is a duplicate of my request, but nothing happened. Is there a reason for not including a factory approach in the codebase?
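
    As an illustration of what such a factory could enable, here is a minimal sketch; MyService, ServiceAwareCrawler, and the constructor injection are hypothetical names that only show the idea, not an agreed API:

    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;

    public class FactoryExample {

        // Hypothetical application service (e.g. a classifier or a database client).
        public static class MyService {
            public void record(String url) {
                System.out.println("Visited: " + url);
            }
        }

        // Crawler that receives its dependency through the constructor.
        public static class ServiceAwareCrawler extends WebCrawler {
            private final MyService service;

            public ServiceAwareCrawler(MyService service) {
                this.service = service;
            }

            @Override
            public void visit(Page page) {
                service.record(page.getWebURL().getURL());
            }
        }

        // The factory closes over the (possibly Spring-managed) service instance,
        // so every crawler created by the controller shares it.
        public static CrawlController.WebCrawlerFactory<ServiceAwareCrawler> factory(MyService service) {
            return () -> new ServiceAwareCrawler(service);
        }
    }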

    Thanks in advance.

    opened by rzo1 8
  • Option to specify DefaultCookieStore

    Not all sites have regular "form" or "basic" auth, and because of that it would be great to be able to specify our own "cookies" to start with. In our case we did this and it seems to be working fine:

    CrawlConfig.java

      public CookieStore getDefaultCookieStore() {
          return defaultCookieStore;
      }
    
      public void setDefaultCookieStore(CookieStore cookieStore) {
          this.defaultCookieStore = cookieStore;
      }
    

    PageFetcher.java

    clientBuilder.setDefaultCookieStore(config.getDefaultCookieStore());
    

    Controller

                BasicCookieStore cookieStore = new BasicCookieStore();
                for (Map.Entry<String, String> entry : getLoginCookies().entrySet()){
                    BasicClientCookie cookie = new BasicClientCookie(entry.getKey(), entry.getValue());
                    cookie.setSecure(true);
                    cookie.setDomain("127.0.0.1");
                    cookie.setPath("/");
                    cookieStore.addCookie(cookie);
                }
                config.setDefaultCookieStore(cookieStore);
    
                PageFetcher pageFetcher = new PageFetcher(config);
    

    Thanks!

    opened by davidacampos 8
  • Fixed HttpException: Unsupported cookie policy [DEFAULT]

    I encountered the following error:

    WARN 2016-04-13/17:00:26.428 [Crawler 1] WebCrawler -|79581|- Unhandled exception while fetching http://mp.weixin.qq.com/s?__biz=MzIzNzA0ODQxOA%3D%3D&mid=401543508&idx=1&sn=292e96fd82ca1207de11af1aacd03353: null
    INFO 2016-04-13/17:00:26.429 [Crawler 1] WebCrawler -|79581|- Stacktrace:
    org.apache.http.client.ClientProtocolException
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at edu.uci.ics.crawler4j.fetcher.PageFetcher.fetchPage(PageFetcher.java:274)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:323)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:278)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.http.HttpException: Unsupported cookie policy: default
        at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:150)
        at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:132)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:193)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)

    This is caused by incorrect usage of the cookie policy. More information can be found at the links below, and I think the STANDARD policy is the better choice:

    https://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/client/config/CookieSpecs.html

    https://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/constant-values.html#org.apache.http.client.config.CookieSpecs.DEFAULT
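
    For reference, a minimal sketch (assuming Apache HttpClient 4.x; this is not the exact change applied inside PageFetcher) of pointing a client builder at the STANDARD cookie spec:

    import org.apache.http.client.config.CookieSpecs;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    // Use the RFC 6265-compliant STANDARD cookie spec instead of the
    // unsupported "default" policy string.
    RequestConfig requestConfig = RequestConfig.custom()
            .setCookieSpec(CookieSpecs.STANDARD)
            .build();

    CloseableHttpClient httpClient = HttpClients.custom()
            .setDefaultRequestConfig(requestConfig)
            .build();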

    opened by IVANOPT 7
  • support for script src tag

    Hello!

    Thank you for the good code! I really like it.

    Request: please add support for the script src tag, which is currently not handled.