Apache Nutch is an extensible and scalable web crawler

Overview

Apache Nutch README

For the latest information about Nutch, please visit our website at:

https://nutch.apache.org/

and our wiki, at:

https://cwiki.apache.org/confluence/display/NUTCH/Home

To get started using Nutch, read the tutorial:

https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial

Contributing

To contribute a patch, follow these instructions (note that installing Hub is not strictly required, but is recommended).

0. Download and install hub.github.com
1. File JIRA issue for your fix at https://issues.apache.org/jira/projects/NUTCH/issues
- you will get an issue id NUTCH-xxx, where xxx is the issue number.
2. git clone https://github.com/apache/nutch.git
3. cd nutch
4. git checkout -b NUTCH-xxx
5. edit files (please try and include a test case if possible)
6. git status (make sure it shows what files you expected to edit)
7. Make sure that your code complies with the [Nutch codeformatting template](https://raw.githubusercontent.com/apache/nutch/master/eclipse-codeformat.xml), which is basically two-space indents
8. git add <file1> [<file2> ...]
9. git commit -m "fix for NUTCH-xxx contributed by <your username>"
10. git fork
11. git push -u <your git username> NUTCH-xxx
12. git pull-request

IDE setup

Generate Eclipse project files

ant eclipse

and follow the instructions in Eclipse's "Importing existing projects" guide.

For IntelliJ IDEA, first install the IvyIDEA plugin, then run ant eclipse.

Then open the project in IntelliJ. You may see popups like "Ant build scripts found" or "Frameworks detected - IvyIDEA Framework detected"; just follow the simple steps in these dialogs.

You must configure nutch-site.xml before running. Make sure you've added the http.agent.name and plugin.folders properties; plugin.folders normally points to <project_folder>/build/plugins.
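
A minimal sketch of such a nutch-site.xml (the agent name and plugin path here are placeholder values):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <!-- any non-empty agent name; the fetcher refuses to run without one -->
        <value>MyNutchCrawler</value>
      </property>
      <property>
        <name>plugin.folders</name>
        <!-- normally <project_folder>/build/plugins -->
        <value>/path/to/nutch/build/plugins</value>
      </property>
    </configuration>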

Now create a Java application run configuration, choose org.apache.nutch.crawl.Injector as the main class, and add two paths as program arguments: the first is the crawldb directory, the second is the directory of URL files for the injector to read. Then run your configuration.
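
For reference, the same two arguments drive the command-line injector (the paths below are only examples):

    bin/nutch inject crawl/crawldb urls/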

If you still see the error "No plugins found on paths of property plugin.folders='plugins'", update plugin.folders in nutch-default.xml; this is a quick workaround, but should not be used permanently.

Export Control

This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See https://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Nutch uses the PDFBox API in its parse-tika plugin for extracting textual content and metadata from encrypted PDF files. See https://pdfbox.apache.org/ for more details on PDFBox.

Comments
  • WARC exporter for the CommonCrawlDataDumper

    This adds the possibility of exporting the Nutch segments to WARC files.

    From the usage point of view, a couple of new command-line options are available:

    • -warc: enables export into WARC files; if not specified, the default JACKSON formatter is used.
    • -warcSize: defines a maximum file size for each WARC file; if not specified, a default of 1 GB per file is used, as recommended by the WARC ISO standard.

    The usual -gzip flag can be used to compress the output WARC files; a sketch of a full invocation follows.
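
    As a sketch, an invocation enabling these options might look like this (the segment and output paths are examples, and the -warcSize value is assumed to be in bytes):

      bin/nutch commoncrawldump -outputDir /tmp/warc-out \
        -segment crawl/segments/20230101000000 \
        -warc -warcSize 1000000000 -gzip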

    Some changes were made to the default CommonCrawlDataDumper, essentially to the Factory and to the Formats. These changes avoid creating a new instance of a CommonCrawlFormat for each URL read from the segments.

    opened by jorgelbg 51
  • [DO NOT MERGE/DISCUSSION] add cleaned up version of momer's protocol-selenium plugin

    Hi,

    For some time, I have been using @momer's Selenium plugin for Nutch 2.3, which has worked wonders in AJAX crawling. Lately, however, I've noticed the following issues with it:

    1. It does not support HTTPS. While adding support for HTTPS into it, I realized that:
    2. It does a lot of non-Selenium stuff that makes it inefficient, including making every request twice.

    I've cleaned up the code (unfortunately, the update history pointing to him is gone because of the lazy copy), making sure it uses Selenium and only Selenium.

    So far it has the following weaknesses:

    • Selenium's Wait does not appear to be properly used (also an issue in @momer's code).
    • It does not fill out the WebPage fields up to snuff. I do think someone more experienced with Selenium could get this done, or I might add support in the future.
    • It does not fetch robots.txt properly.

    Due to these issues, I think merging is a bit premature. However, I'd like to point out that I do not think @momer's code should be merged either.

    opened by eivindveg 19
  • fix for Nutch 1973 by sujen1412

    API calls are documented at https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI. This pull request includes Index, Generate, Fetch, Parse, Update, InvertLinks, Dedup and Readdb jobs.
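
    As an illustration, an inject job could be created through that API with a request shaped roughly like this (host, port, and payload fields follow the wiki page; treat the exact field names as assumptions):

      curl -X POST -H 'Content-Type: application/json' \
        http://localhost:8081/job/create \
        -d '{"type": "INJECT", "confId": "default", "crawlId": "crawl01", "args": {"url_dir": "urls/"}}'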

    opened by sujen1412 17
  • NUTCH-2144 : override db.ignore.external to exempt interesting external domain URLs

    • Add extension point org.apache.nutch.net.URLExemptionFilter
    • Modify FetcherThread and ParseOutputFormat to integrate new extension point
    • Add extension urlfilter-ignoreexempt
    • Modify build configs to include the new extension

    Resolves https://issues.apache.org/jira/browse/NUTCH-2144
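
    For orientation, the new extension point is an interface along these lines (a sketch; the exact signature is an assumption based on the issue description):

      package org.apache.nutch.net;

      import org.apache.hadoop.conf.Configurable;
      import org.apache.nutch.plugin.Pluggable;

      /**
       * Decides whether an outlink to an external domain is exempted from
       * the db.ignore.external rule; consulted by FetcherThread and
       * ParseOutputFormat.
       */
      public interface URLExemptionFilter extends Pluggable, Configurable {
        /** @return true if toUrl should be followed although it is external */
        boolean filter(String fromUrl, String toUrl);
      }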

    opened by thammegowda 14
  • NUTCH-1541 Indexer plugin to write CSV

    • adds an indexer plugin which writes a configurable CSV index
    • works only in local mode in combination with -noCommit
    CSVIndexWriter - write index as CSV file (comma separated values)
      indexer.csv.fields       : ordered list of fields (columns) in the CSV file
      indexer.csv.separator    : separator between fields (columns), default: , (U+002C, comma)
      indexer.csv.quotechar    : quote character used to quote fields containing separators or quotes, default: " (U+0022, quotation mark)
      indexer.csv.escapechar   : escape character used to escape a quote character, default: " (U+0022, quotation mark)
      indexer.csv.recordsep    : separator between records (rows) resp. documents, default: \r\n (DOS-style line breaks)
      indexer.csv.valuesep     : separator between multiple values of one field, default: | (U+007C)
      indexer.csv.maxfieldvalues : max. number of values of one field, useful, e.g., for the anchor texts field, default: 12
      indexer.csv.maxfieldlength : max. length of a single field value in characters, default: 4096.
      indexer.csv.charset      : encoding of CSV file, default: UTF-8
      indexer.csv.header       : write CSV column headers, default: true
      indexer.csv.outpath      : output path / directory, default: csvindexwriter. 
        CAVEAT: existing output directories are removed!
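
    Using the property names above, a hypothetical nutch-site.xml fragment selecting the exported columns might look like:

      <property>
        <name>indexer.csv.fields</name>
        <!-- ordered list of columns; these field names are examples -->
        <value>id,title,content</value>
      </property>
      <property>
        <name>indexer.csv.outpath</name>
        <!-- CAVEAT: an existing directory at this path is removed -->
        <value>csvindexwriter</value>
      </property>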
    
    opened by sebastian-nagel 13
  • fix for NUTCH-1480 contributed by r0ann3l

    With this patch we can now have many instances of the same IndexWriter class, each with a different configuration. We can also copy, rename, or remove document fields for every index writer individually. In addition, the parameters needed by the index writers now live in separate XML files, so they are no longer in nutch-site.xml.

    opened by r0ann3l 12
  • NUTCH-2373 Index writer plugin for hbase implemented

    An index writer for HBase, like the existing index writers for Solr, Elasticsearch, etc. The expected HBase table description and the NutchDocument-to-HBase mapping are read from a mapping file similar to the one used by indexer-solr, and NutchDocument fields are written into a table on an HBase server.
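
    For context, the indexer-solr style mapping file referred to above maps NutchDocument fields to target fields roughly like this (a sketch of that familiar format, not the exact schema this PR defines):

      <mapping>
        <fields>
          <!-- source = NutchDocument field, dest = target field/column -->
          <field dest="title" source="title"/>
          <field dest="content" source="content"/>
          <field dest="url" source="url"/>
        </fields>
        <uniqueKey>id</uniqueKey>
      </mapping>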

    TODO: functionality to send and set Kerberos authentication configuration for secure HDFS.

    opened by kaidul 10
  • fix for NUTCH-2460 contributed by Hussein Alahmad

    Use the headless option of Firefox and Chrome in protocol-selenium.

    The --headless option was added to Firefox in version 55 and to Chrome in version 59. This is much better than relying on Xvfb and its associates. We can add it as a property in the config file. I'm trying it on my local machine and will create a pull request when I finish testing it.

    I've tested it using Firefox 57.0, geckodriver 0.19.1, and Selenium 3.7.1.

    Important note: you need to add the following property to nutch-default.xml or nutch-site.xml for the headless option to work.

    selenium.firefox.headless (default: false): a Boolean value indicating whether Firefox should run headless. Make sure that the Firefox version is 55 or later and the Selenium WebDriver version is 3.6.0 or later. Currently this option exists only for 'firefox'.
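
    Rendered as a config entry, that property would look like this in nutch-site.xml:

      <property>
        <name>selenium.firefox.headless</name>
        <value>true</value>
        <description>Whether Firefox should run headless. Requires Firefox 55
        or later and Selenium WebDriver 3.6.0 or later. Default: false.
        Currently this option exists only for 'firefox'.</description>
      </property>
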
    opened by hussein-alahmad 9
  • fix for NUTCH-2234 and NUTCH-2236

    Upgrade Elasticsearch and Lucene dependencies, which, in turn, requires updates to Guava and Hadoop dependencies:

    • Elasticsearch 1.4.1 -> Elasticsearch 2.3.3
    • Lucene 4.10.2 -> 5.5.0
    • Guava 16.0.1 -> Guava 18.0
    • Hadoop 2.4.0 -> 2.7.2
    opened by naegelejd 9
  • NUTCH-2248 CSS parser plugin

    As described on JIRA:

    This plugin allows collecting URI links from CSS (stylesheets). This is useful for collecting parent stylesheets, fonts, and images needed to display web pages as intended.

    Parsed Outlinks do not have associated anchors, and no additional text/content is parsed from the stylesheet.

    opened by naegelejd 9
  • NUTCH-2184 Enable IndexingJob to function with no crawldb

    OK folks, this issue addresses https://issues.apache.org/jira/browse/NUTCH-2184 by

    • rebasing the NUTCH-2184v2.patch against master branch
    • making the IndexerMapReduceMapper and IndexerMapReduceReducer in IndexerMapReduce code explicit so that these functions can be tested
    • adding some MRUnit tests for the IndexerMapReduceMapper and IndexerMapReduceReducer
    • removing some trivial imports which are unused
    • formatting ivy.xml, which has somehow (again) become a dog's dinner
    • adding default constructor to NutchIndexAction()

    Any questions, please let me know. I would really appreciate it if people could pull this code and try it out in your test or local environment. Thanks, and thanks also to Markus for the original suggestions for tests, etc.

    opened by lewismc 9
  • NUTCH-2490 Develop Gradle Core Build for Apache Nutch

    This is a WIP for https://issues.apache.org/jira/browse/NUTCH-2940. The work was conducted by @AzureTriple @imanzanganeh @jbsimmon @LilyPerr and @Lirongxuan1 from the 2022 USC Senior CS Capstone Program.

    Most of the core build is in place. No plugin sub-projects have been implemented yet.

    I intend to continue work on the core build until it is completed. I will then move on to the plugin sub-projects.

    opened by lewismc 1
  • NUTCH-2938 Use Any23's RepositoryWriter to write structured data to Rdf4j repository

    PR addresses https://issues.apache.org/jira/browse/NUTCH-2938. We could improve the performance of this plugin if we could reuse the repository connection; however, I am not entirely sure how to do that right now because this is done down in the Any23 layer.

    opened by lewismc 0
  • WIP StatsD metrics example

    Until I complete the Nutch metrics work and we agree on NUTCH-2909, I would ask that you don't take this too seriously yet. I'm submitting it early so anyone interested can see where I thought we could go with this one. Thanks for any feedback.

    opened by lewismc 0
  • NUTCH-2793 indexer-csv: make it work in distributed mode

    Before the change, the output file name was hard-coded to "nutch.csv". When running in distributed mode, multiple reducers would clobber each other's output.

    After the change, the filename is taken from the first open(cfg, name) initialization call, where name is a unique file name generated by IndexerOutputFormat, derived from Hadoop's FileOutputFormat. The CSV files are now named like part-r-000xx.

    opened by pmezard 6
  • NUTCH-1870 XSL parse filter

    • apply patch contributed by @albinscode
    • load configuration files from classpath and address thread-safety

    Note: not ready yet:

    • TODOs in code
    • unit tests fail (with DOM built by tagsoup parser)
    • see also open points in NUTCH-1870
    opened by sebastian-nagel 2