Apache Nutch is an extensible and scalable web crawler

Overview

Apache Nutch README

For the latest information about Nutch, please visit our website at:

https://nutch.apache.org/

and our wiki, at:

https://cwiki.apache.org/confluence/display/NUTCH/Home

To get started using Nutch, read the tutorial:

https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial

Contributing

To contribute a patch, follow these instructions (note that installing Hub is not strictly required, but is recommended).

0. Download and install hub.github.com
1. File JIRA issue for your fix at https://issues.apache.org/jira/projects/NUTCH/issues
- you will get an issue id NUTCH-xxx, where xxx is the issue number.
2. git clone https://github.com/apache/nutch.git
3. cd nutch
4. git checkout -b NUTCH-xxx
5. edit files (please try and include a test case if possible)
6. git status (make sure it shows what files you expected to edit)
7. Make sure that your code complies with the [Nutch codeformatting template](https://raw.githubusercontent.com/apache/nutch/master/eclipse-codeformat.xml), which is basically two-space indents
8. git add <file1> [<file2> ...]
9. git commit -m "fix for NUTCH-xxx contributed by <your username>"
10. git fork
11. git push -u <your git username> NUTCH-xxx
12. git pull-request

IDE setup

Generate Eclipse project files

ant eclipse

and follow the instructions in Eclipse's "Importing existing projects" guide.

For IntelliJ IDEA, first install the IvyIDEA plugin, then run ant eclipse.

Then open the project in IntelliJ. You may see popups like "Ant build scripts found" or "Frameworks detected - IvyIDEA Framework detected"; just follow the simple steps in these dialogs.

You must configure nutch-site.xml before running. Make sure you've added the http.agent.name and plugin.folders properties; plugin.folders normally points to <project_folder>/build/plugins.
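
A minimal sketch of such a nutch-site.xml (the agent name and plugin path here are placeholder values):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <!-- any non-empty agent name; the fetcher refuses to run without one -->
        <value>MyNutchCrawler</value>
      </property>
      <property>
        <name>plugin.folders</name>
        <!-- normally <project_folder>/build/plugins -->
        <value>/path/to/nutch/build/plugins</value>
      </property>
    </configuration>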

Now create a Java application run configuration, choose org.apache.nutch.crawl.Injector as the main class, and add two paths as program arguments: the first is the crawldb directory, the second is the directory of URL files for the injector to read. Then run your configuration.
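
For reference, the same two arguments drive the command-line injector (the paths below are only examples):

    bin/nutch inject crawl/crawldb urls/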

If you still see the error "No plugins found on paths of property plugin.folders='plugins'", update plugin.folders in nutch-default.xml; this is a quick workaround, but should not be used permanently.

Export Control

This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See https://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Nutch uses the PDFBox API in its parse-tika plugin for extracting textual content and metadata from encrypted PDF files. See https://pdfbox.apache.org/ for more details on PDFBox.

Comments
  • WARC exporter for the CommonCrawlDataDumper

    This adds the possibility of exporting the Nutch segments to WARC files.

    From the usage point of view, a couple of new command-line options are available:

    • -warc: enables export into WARC files; if not specified, the default JACKSON formatter is used.
    • -warcSize: defines a maximum file size for each WARC file; if not specified, a default of 1 GB per file is used, as recommended by the WARC ISO standard.

    The usual -gzip flag can be used to compress the output WARC files; a sketch of a full invocation follows.
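
    As a sketch, an invocation enabling these options might look like this (the segment and output paths are examples, and the -warcSize value is assumed to be in bytes):

      bin/nutch commoncrawldump -outputDir /tmp/warc-out \
        -segment crawl/segments/20230101000000 \
        -warc -warcSize 1000000000 -gzip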

    Some changes were made to the default CommonCrawlDataDumper, essentially to the Factory and to the Formats. These changes avoid creating a new instance of a CommonCrawlFormat for each URL read from the segments.

    opened by jorgelbg 51
  • [DO NOT MERGE/DISCUSSION] add cleaned up version of momer's protocol-selenium plugin

    Hi,

    For some time, I have been using @momer's Selenium plugin for Nutch 2.3, which has worked wonders in AJAX crawling. Lately, however, I've noticed the following issues with it:

    1. It does not support HTTPS. While adding support for HTTPS into it, I realized that:
    2. It does a lot of non-Selenium stuff that makes it inefficient, including making every request twice.

    I've cleaned up the code (unfortunately, the update history pointing to him is gone because of the lazy copy), making sure it uses Selenium and only Selenium.

    So far it has the following weaknesses:

    • Selenium's Wait does not appear to be properly used (also an issue in @momer's code).
    • It does not fill out the WebPage fields up to snuff. I do think someone more experienced with Selenium could get this done, or I might add support in the future.
    • It does not fetch robots.txt properly.

    Due to these issues, I think merging is a bit premature. However, I'd like to point out that I do not think @momer's code should be merged either.

    opened by eivindveg 19
  • fix for Nutch 1973 by sujen1412

    API calls are documented at https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI. This pull request includes Index, Generate, Fetch, Parse, Update, InvertLinks, Dedup and Readdb jobs.
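
    As an illustration, an inject job could be created through that API with a request shaped roughly like this (host, port, and payload fields follow the wiki page; treat the exact field names as assumptions):

      curl -X POST -H 'Content-Type: application/json' \
        http://localhost:8081/job/create \
        -d '{"type": "INJECT", "confId": "default", "crawlId": "crawl01", "args": {"url_dir": "urls/"}}'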

    opened by sujen1412 17
  • NUTCH-2144 : override db.ignore.external to exempt interesting external domain URLs

    • Add extension point org.apache.nutch.net.URLExemptionFilter
    • Modify FetcherThread and ParseOutputFormat to integrate new extension point
    • Add extension urlfilter-ignoreexempt
    • Modify build configs to include the new extension

    Resolves https://issues.apache.org/jira/browse/NUTCH-2144
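
    For orientation, the new extension point is an interface along these lines (a sketch; the exact signature is an assumption based on the issue description):

      package org.apache.nutch.net;

      import org.apache.hadoop.conf.Configurable;
      import org.apache.nutch.plugin.Pluggable;

      /**
       * Decides whether an outlink to an external domain is exempted from
       * the db.ignore.external rule; consulted by FetcherThread and
       * ParseOutputFormat.
       */
      public interface URLExemptionFilter extends Pluggable, Configurable {
        /** @return true if toUrl should be followed although it is external */
        boolean filter(String fromUrl, String toUrl);
      }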

    opened by thammegowda 14
  • NUTCH-1541 Indexer plugin to write CSV

    • adds an indexer plugin which writes a configurable CSV index
    • works only in local mode in combination with -noCommit
    CSVIndexWriter - write index as CSV file (comma separated values)
      indexer.csv.fields       : ordered list of fields (columns) in the CSV file
      indexer.csv.separator    : separator between fields (columns), default: , (U+002C, comma)
      indexer.csv.quotechar    : quote character used to quote fields containing separators or quotes, default: " (U+0022, quotation mark)
      indexer.csv.escapechar   : escape character used to escape a quote character, default: " (U+0022, quotation mark)
      indexer.csv.recordsep    : separator between records (rows) resp. documents, default: \r\n (DOS-style line breaks)
      indexer.csv.valuesep     : separator between multiple values of one field, default: | (U+007C)
      indexer.csv.maxfieldvalues : max. number of values of one field, useful, e.g., for the anchor texts field, default: 12
      indexer.csv.maxfieldlength : max. length of a single field value in characters, default: 4096.
      indexer.csv.charset      : encoding of CSV file, default: UTF-8
      indexer.csv.header       : write CSV column headers, default: true
      indexer.csv.outpath      : output path / directory, default: csvindexwriter. 
        CAVEAT: existing output directories are removed!
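
    Using the property names above, a hypothetical nutch-site.xml fragment selecting the exported columns might look like:

      <property>
        <name>indexer.csv.fields</name>
        <!-- ordered list of columns; these field names are examples -->
        <value>id,title,content</value>
      </property>
      <property>
        <name>indexer.csv.outpath</name>
        <!-- CAVEAT: an existing directory at this path is removed -->
        <value>csvindexwriter</value>
      </property>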
    
    opened by sebastian-nagel 13
  • fix for NUTCH-1480 contributed by r0ann3l

    With this patch we can now have many instances of the same IndexWriter class, each with a different configuration. We can also copy, rename, or remove document fields for every index writer individually. In addition, the parameters needed by the index writers now live in separate XML files, so they are no longer in nutch-site.xml.

    opened by r0ann3l 12
  • NUTCH-2373 Index writer plugin for hbase implemented

    An index writer for HBase, like the existing index writers for Solr, Elasticsearch, etc. The expected HBase table description and the NutchDocument-to-HBase mapping are read from a mapping file similar to the one used by indexer-solr, and NutchDocument fields are written into a table on an HBase server.
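
    For context, the indexer-solr style mapping file referred to above maps NutchDocument fields to target fields roughly like this (a sketch of that familiar format, not the exact schema this PR defines):

      <mapping>
        <fields>
          <!-- source = NutchDocument field, dest = target field/column -->
          <field dest="title" source="title"/>
          <field dest="content" source="content"/>
          <field dest="url" source="url"/>
        </fields>
        <uniqueKey>id</uniqueKey>
      </mapping>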

    TODO: functionality to send and set Kerberos authentication configuration for secure HDFS.

    opened by kaidul 10
  • fix for NUTCH-2460 contributed by Hussein Alahmad

    Use the headless option of Firefox and Chrome in protocol-selenium.

    The --headless option was added to Firefox in version 55 and to Chrome in version 59. This is much better than relying on Xvfb and its associates. We can add it as a property in the config file. I'm trying it on my local machine and will create a pull request when I finish testing it.

    I've tested it using Firefox 57.0, geckodriver 0.19.1, and Selenium 3.7.1.

    Important note: you need to add the following property to nutch-default.xml or nutch-site.xml for the headless option to work.

    selenium.firefox.headless (default: false): a Boolean value indicating whether Firefox should run headless. Make sure that the Firefox version is 55 or later and the Selenium WebDriver version is 3.6.0 or later. Currently this option exists only for 'firefox'.
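
    Rendered as a config entry, that property would look like this in nutch-site.xml:

      <property>
        <name>selenium.firefox.headless</name>
        <value>true</value>
        <description>Whether Firefox should run headless. Requires Firefox 55
        or later and Selenium WebDriver 3.6.0 or later. Default: false.
        Currently this option exists only for 'firefox'.</description>
      </property>
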
    opened by hussein-alahmad 9
  • fix for NUTCH-2234 and NUTCH-2236

    Upgrade Elasticsearch and Lucene dependencies, which, in turn, requires updates to Guava and Hadoop dependencies:

    • Elasticsearch 1.4.1 -> Elasticsearch 2.3.3
    • Lucene 4.10.2 -> 5.5.0
    • Guava 16.0.1 -> Guava 18.0
    • Hadoop 2.4.0 -> 2.7.2
    opened by naegelejd 9
  • NUTCH-2248 CSS parser plugin

    As described on JIRA:

    This plugin allows collecting URI links from CSS (stylesheets). This is useful for collecting parent stylesheets, fonts, and images needed to display web pages as intended.

    Parsed Outlinks do not have associated anchors, and no additional text/content is parsed from the stylesheet.

    opened by naegelejd 9
  • NUTCH-2184 Enable IndexingJob to function with no crawldb

    OK folks, this issue addresses https://issues.apache.org/jira/browse/NUTCH-2184 by

    • rebasing the NUTCH-2184v2.patch against master branch
    • making the IndexerMapReduceMapper and IndexerMapReduceReducer in IndexerMapReduce code explicit so that these functions can be tested
    • adding some MRUnit tests for the IndexerMapReduceMapper and IndexerMapReduceReducer
    • removing some trivial imports which are unused
    • formatting ivy.xml, which has somehow (again) become a dog's dinner
    • adding default constructor to NutchIndexAction()

    Any questions, please let me know. I would really appreciate it if people could pull this code and try it out in your test or local environment. Thanks, and thanks also to Markus for the original suggestions for tests, etc.

    opened by lewismc 9
  • NUTCH-2490 Develop Gradle Core Build for Apache Nutch

    This is a WIP for https://issues.apache.org/jira/browse/NUTCH-2940. The work was conducted by @AzureTriple @imanzanganeh @jbsimmon @LilyPerr and @Lirongxuan1 from the 2022 USC Senior CS Capstone Program.

    Most of the core build is in place. No plugin sub-projects have been implemented yet.

    I intend to continue work on the core build until it is completed. I will then move on to the plugin sub-projects.

    opened by lewismc 1
  • NUTCH-2938 Use Any23's RepositoryWriter to write structured data to Rdf4j repository

    PR addresses https://issues.apache.org/jira/browse/NUTCH-2938. We could improve the performance of this plugin if we could reuse the repository connection; however, I am not entirely sure how to do that right now because this is done down in the Any23 layer.

    opened by lewismc 0
  • WIP StatsD metrics example

    Until I complete the Nutch metrics work and we agree on NUTCH-2909, I would ask that you don't take this too seriously yet. I'm submitting it early so anyone interested can see where I thought we could go with this one. Thanks for any feedback.

    opened by lewismc 0
  • NUTCH-2793 indexer-csv: make it work in distributed mode

    Before the change, the output file name was hard-coded to "nutch.csv". When running in distributed mode, multiple reducers would clobber each other's output.

    After the change, the filename is taken from the first open(cfg, name) initialization call, where name is a unique file name generated by IndexerOutputFormat, derived from Hadoop's FileOutputFormat. The CSV files are now named like part-r-000xx.

    opened by pmezard 6
  • NUTCH-1870 XSL parse filter

    • apply patch contributed by @albinscode
    • load configuration files from classpath and address thread-safety

    Note: not ready yet:

    • TODOs in code
    • unit tests fail (with DOM built by tagsoup parser)
    • see also open points in NUTCH-1870
    opened by sebastian-nagel 2