A scalable, mature and versatile web crawler based on Apache Storm

Overview

storm-crawler


StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License and is written mostly in Java.

Quickstart

NOTE: These instructions assume that you have Apache Maven installed. You will need to install Apache Storm to run the crawler.

The version of Storm to use must match the one defined in the pom.xml file of your topology. The major version of StormCrawler mirrors that of Apache Storm, i.e. whereas StormCrawler 1.x used Storm 1.2.3, the current version requires Storm 2.3.0. Our Ansible-Storm repository contains resources to install Apache Storm using Ansible.

Once Storm is installed, the easiest way to get started is to generate a brand new StormCrawler project using:

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=2.2

You'll be asked to enter a groupId (e.g. com.mycompany.crawler), an artifactId (e.g. stormcrawler), a version, and a package name.

This will not only create a fully formed project containing a POM with the dependency above but also the default resource files, a default CrawlTopology class and a configuration file. Enter the directory you just created (it should have the same name as the artifactId you specified earlier) and follow the instructions in the README file.

Alternatively if you can't or don't want to use the Maven archetype above, you can simply copy the files from archetype-resources.

Have a look at the code of the CrawlTopology class, the crawler-conf.yaml file and the files in src/main/resources/: they are all that is needed to run a crawl topology, as all the other components come from the core module.
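
For reference, the generated README walks you through building and running the topology. A typical invocation looks like the following (the jar, package and class names are examples based on the groupId/artifactId entered above; adjust them to your own choices). The -conf and -local flags are handled by the generated topology class: -local runs the crawl in local mode instead of submitting it to the cluster.

mvn clean package

storm jar target/stormcrawler-1.0-SNAPSHOT.jar com.mycompany.crawler.CrawlTopology -conf crawler-conf.yaml -local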

Getting help

The WIKI is a good place to start your investigations but if you are stuck please use the tag stormcrawler on StackOverflow or ask a question in the discussions section.

DigitalPebble Ltd provides commercial support and consulting for StormCrawler.

Thanks


YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.

We are very grateful to our sponsors for their continued support.

Comments
  • Project reorganization

    Project reorganization

    Given our discussion in #24, and previous discussions on Kafka spouts, HBase indexing, etc, we should think about reorganizing the project so that we have a core SDK and external SDK(s).

    I was thinking something like

    root
        pom.xml
        |-> crawler-core
            pom.xml
        |-> crawler-external
            pom.xml

    The external sub-project would include things that depend on external technologies and libraries.

    opened by jakekdodd 34
  • Java topology doesn't read configurations with Storm 2

    Java topology doesn't read configurations with Storm 2

    • [x] Bug report. If you’ve found a bug, please include a test if you can; it makes it a lot easier to fix things. Use the label 'bug' on the issue.

    I am getting the following:

    16:33:27.858 [Thread-35-spout-executor[12, 12]] ERROR c.d.s.e.p.AbstractSpout - Can't connect to ElasticSearch
    java.lang.IllegalArgumentException: hosts must not be null nor empty
            at org.elasticsearch.client.RestClient.builder(RestClient.java:173) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
            at com.digitalpebble.stormcrawler.elasticsearch.ElasticSearchConnection.getClient(ElasticSearchConnection.java:117) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
    

    Elasticsearch is up and running. I cannot find a 'hosts:' parameter anywhere, and the 'es-conf.yaml' parameters are set to localhost or http://localhost:9200
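
    For reference, the 'hosts' in that error message are typically derived from the address settings in es-conf.yaml, following the es.<component>.addresses naming visible elsewhere on this page. A sketch (adjust to your setup):

    es.status.addresses: "http://localhost:9200"
    es.indexer.addresses: "http://localhost:9200"
    # similar address keys exist for the other Elasticsearch components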

    bug 
    opened by AaronNGray 22
  • Batch PreparedStatements in SQL status updater bolt, fixes #610

    Batch PreparedStatements in SQL status updater bolt, fixes #610

    This PR changes the behaviour of the SQL StatusUpdaterBolt by batching the INSERTs, i.e. the discovered URLs, which are far more frequent than updates.

    @cruftex please let me know what you think. We can make it SQL-agnostic in a separate step.
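
    For readers unfamiliar with JDBC batching, here is a minimal, self-contained sketch of the idea (table and column names are made up, not the actual schema used by the SQL module): discovered URLs are queued on a PreparedStatement and flushed in a single round trip.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;

    public class BatchInsertSketch {
        public static void main(String[] args) throws Exception {
            // connection settings and schema are placeholders
            try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/crawl", "user", "pwd");
                 PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO urls (url, status, nextfetchdate) VALUES (?, ?, ?)")) {
                con.setAutoCommit(false);
                for (String url : new String[] { "https://example.com/a", "https://example.com/b" }) {
                    ps.setString(1, url);
                    ps.setString(2, "DISCOVERED");
                    ps.setTimestamp(3, new Timestamp(System.currentTimeMillis()));
                    ps.addBatch(); // queue the INSERT instead of executing it immediately
                }
                ps.executeBatch(); // one round trip for all the discovered URLs
                con.commit();
            }
        }
    }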

    SQL 
    opened by jnioche 22
  • Add unified way of initializing classes via string and configuring them.

    Add unified way of initializing classes via string and configuring them.

    Hello @jnioche,

    I extracted the code for initializing classes from string and configuring them from my fork https://github.com/FelixEngl/storm-crawler/tree/local_version. (This branch contains all fixes/changes that I made. I'll try to extract PR after PR until both are on the same level.)

    I don't know how far you got with #937, but this one has a more reasonable size, with the trade-off that it introduces some new warnings due to missing @Contract, @NotNull and @Nullable annotations in various sub-classes and sub-interfaces. These warnings will be fixed by either #937 or a future PR.
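
    For readers not following the fork, the general pattern being unified is roughly the following (a schematic sketch with made-up names, not the actual API introduced by this PR): load a class by its fully qualified name, check its type, instantiate it and hand it the Storm configuration.

    import java.util.Map;

    public class InstantiationSketch {

        // Hypothetical stand-in for a component that can receive the Storm config
        public interface Configurable {
            void configure(Map<String, Object> stormConf);
        }

        public static <T extends Configurable> T create(String className, Class<T> expected,
                Map<String, Object> conf) throws ReflectiveOperationException {
            Class<?> clazz = Class.forName(className);
            if (!expected.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(className + " does not implement " + expected.getName());
            }
            T instance = expected.cast(clazz.getDeclaredConstructor().newInstance());
            instance.configure(conf); // configure right after instantiation
            return instance;
        }
    }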

    Best Regards

    Felix

    Signed-off-by: Felix Engl [email protected]

    enhancement 
    opened by FelixEngl 15
  • [External][Solr] The Solr storage should use 'autoSoftCommit' and 'autoCommit'

    [External][Solr] The Solr storage should use 'autoSoftCommit' and 'autoCommit'

    Currently the Solr storage modules send hard commits to Solr.

    I don't think this is the right way to proceed. These commits put too much pressure on Solr. See the following screenshot.

    (screenshot: Solr metrics under commit pressure, 2015-08-12)

    Instead, these modules should let Solr commit the documents with 'autoCommit' and 'autoSoftCommit' sections in the 'solrconfig' file.
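
    For reference, the kind of solrconfig.xml sections the modules would rely on instead of issuing hard commits themselves (values are only examples, tune them to your load):

    <!-- hard commit: flush to stable storage regularly without opening a new searcher -->
    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

    <!-- soft commit: make documents visible to searches more frequently -->
    <autoSoftCommit>
      <maxTime>5000</maxTime>
    </autoSoftCommit>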

    external SOLR 
    opened by ludovic-boutros 15
  • Add a ScrollSpout to read all the documents from a shard

    Add a ScrollSpout to read all the documents from a shard

    Implements #688 and fixes #684

    This adds a ScrollSpout for ES as well as a mechanism to the AbstractStatusUpdaterBolt so that it stores a tuple without modifying any of its content.

    The following Flux illustrates its use

    name: "reindexer"
    
    includes:
        - resource: true
          file: "/crawler-default.yaml"
          override: false
    
        - resource: false
          file: "crawler-conf.yaml"
          override: true
    
        - resource: false
          file: "es-conf.yaml"
          override: true
    
    config:
      es.status2.addresses: "localhost"
      es.status2.index.name: "status2"
      es.status2.doc.type: "status"
      es.status2.routing: true
      es.status2.routing.fieldname: "key"
      es.status2.bulkActions: 500
      es.status2.flushInterval: "1s"
      es.status2.concurrentRequests: 5
      es.status2.settings:
        cluster.name: "elasticsearch"
      topology.max.spout.pending: 5000
      topology.workers: 1
    
    spouts:
      - id: "spout"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.ScrollSpout"
        parallelism: 10
    
    bolts:
      - id: "status"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
        parallelism: 4
        constructorArgs:
          - "status2"
    
    streams:
      - from: "spout"
        to: "status"
        grouping:
          streamId: "status"
          type: CUSTOM
          customClass:
            className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
            constructorArgs:
              - "byDomain"
    

    The target index (here 'status2') has to be initialised just like the source one. It can of course live on a separate cluster.

    It is a good idea to set "refresh_interval": "-1" in the configuration of the target index to speed up the writes. This can be set to any value afterwards when crawling with the new index.

    Copying the content of a status index can be useful e.g. when changing the number of shards or the way the documents are assigned to them - for instance if using domains with a different version of crawler-commons, see #684

    elasticsearch 
    opened by jnioche 14
  • Update to Elasticsearch 2.x

    Update to Elasticsearch 2.x

    Based on #257, uses ES 2.3.1

    The only limitation is that the node client does not work for now, but the transport one does.

    @w0mbat thanks for your work on #257 - I've built a new branch but yours was a great starting point. Any chance you could give this one a try?

    elasticsearch 
    opened by jnioche 14
  • Allowing the parse output to output more than one document

    Allowing the parse output to output more than one document

    This addresses https://github.com/DigitalPebble/storm-crawler/issues/117 to some extent. It also encapsulates the parsing data/metadata of a URL in a ParseResult. Each URL has exactly one ParseResult, which in turn holds at least one ParseData instance for the data of the "parent" URL. If subdocuments are extracted in any ParseFilter, the URLs and ParseData of each subdocument are added to the parent ParseResult. The ParserBolt then emits each ParseData contained in the ParseResult as a tuple: at the very least one ParseData, or any number if subdocuments are extracted. This also implies that there is no difference between the information extracted for the parent URL and for any subdocument; each ParseData gets emitted as a tuple.
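
    A schematic sketch of the data model described above (illustrative classes only, not the exact API of this PR): one ParseResult per fetched URL, holding one ParseData for the parent document plus one per extracted subdocument, each of which is then emitted as its own tuple.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ParseResultSketch {

        // per-document parse output (text, metadata, outlinks would live here)
        static class ParseData {
            String text;
        }

        // one ParseResult per fetched URL: the parent document plus any subdocuments
        static class ParseResult {
            final Map<String, ParseData> byUrl = new LinkedHashMap<>();

            ParseData get(String url) {
                return byUrl.computeIfAbsent(url, u -> new ParseData());
            }
        }

        public static void main(String[] args) {
            ParseResult result = new ParseResult();
            result.get("https://example.com/feed").text = "parent document";
            result.get("https://example.com/feed#item1").text = "subdocument added by a ParseFilter";
            // the ParserBolt would emit one tuple per ParseData held in the ParseResult
            result.byUrl.forEach((url, data) -> System.out.println("emit tuple for " + url));
        }
    }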

    I made a simple diagram (storm-parsing) to try to explain this :) I hope it makes things a little clearer than my explanation.

    opened by jorgelbg 14
  • Sitemaps parser

    Sitemaps parser

    See #38

    This adds a SiteMapParserBolt and a test class. It also reorganizes the test resources and makes the ParsingTester more generic.

    As discussed, this Bolt uses a non-default stream 'status' to output the newly discovered URLs and the default stream for documents that are not marked as being sitemaps.

    Could you please review this? Thanks!

    enhancement 
    opened by jnioche 14
  • OkHttp protocol: make connection pool configurable

    OkHttp protocol: make connection pool configurable

    OkHttp's ConnectionPool by default "holds up to 5 idle connections which will be evicted after 5 minutes of inactivity." A pool of this size is suitable for site crawls but not for larger crawls over many different sites.

    Note: in the current version (4.9.2) the connection pool is implemented as a linked queue, and searching for a pooled connection does not scale well. To scale to pool sizes beyond 1000, a set of clients must be used, each with its own connection pool.

    Notes:

    • so far, only partially tested: need to increase the pool size and run a test crawl to measure the impact
    • proxied connections are unchanged, that is, for every fetch a client is created anew and no connection pool is used. Depending on the proxy manager, it could make sense to define a connection pool ahead and pass it to the client builder. Since proxy information is included in the okhttp address (stored in the connection pool) it should be possible to pool proxied connections.
    • the documentation could be moved from crawler-default.yaml to the wiki
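
    A minimal sketch of what making the pool configurable could look like (the configuration key in the comment is a hypothetical name, not an existing setting):

    import java.util.concurrent.TimeUnit;

    import okhttp3.ConnectionPool;
    import okhttp3.OkHttpClient;

    public class PoolConfigSketch {
        public static void main(String[] args) {
            // values that would come from the crawler configuration,
            // e.g. a key such as okhttp.connection.pool.max.idle (hypothetical)
            int maxIdleConnections = 256;
            long keepAliveMinutes = 5;

            OkHttpClient client = new OkHttpClient.Builder()
                    .connectionPool(new ConnectionPool(maxIdleConnections, keepAliveMinutes, TimeUnit.MINUTES))
                    .build();

            System.out.println("idle connections: " + client.connectionPool().idleConnectionCount());
        }
    }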
    enhancement fetcher 
    opened by sebastian-nagel 13
  • Multi proxy support

    Multi proxy support

    enhancement core 
    opened by sam-ulrich1 13
  • If StormCrawler above 2.5 uses JDK 11, why are the archetype poms not updated to 11?

    If StormCrawler above 2.5 uses JDK 11, why are the archetype poms not updated to 11?

    The pom files in the archetype should be updated to Java 11 rather than JDK 8: https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/archetype/src/main/resources/archetype-resources/pom.xml

    archetype 
    opened by msghasan 14
  • Blocking fetcher thread

    Blocking fetcher thread

    Hi @jnioche !

    Thanks again for all your work! Now, let me describe our fetcher thread issue.

    Summary

    Our cluster has 6 worker nodes. We fetch more than 3 million URLs per day with our topology. It is deployed on 16 worker slots and uses 16 fetchers, one per worker slot.

    OkClient.HttpProtocol

    The worst issue was spotted with the OkClient.HttpProtocol. Sometimes, one of the worker nodes jumps to 100% CPU usage. For example, worker 5 in this case:

    (screenshot: CPU usage per worker node)

    On the StormCrawler dashboard, we can see the fetcher count increase up to 50 (our fetcher limit):

    (screenshot: fetcher count)

    Worse, in another case, all the topologies are impacted:

    (screenshot: all topologies impacted)

    All fetchers are impacted, and the topology runs slowly. The only way to fix the problem is to kill and redeploy the topology. During the kill phase, the logs confirm some blocked threads:

    2022-05-30 06:37:06.557 o.a.s.d.w.Worker ShutdownHook-shutdownFunc [INFO] Shutting down executors ...
    2022-05-30 06:37:07.028 o.a.s.e.ExecutorShutdown ShutdownHook-shutdownFunc [INFO] Shutting down executor fetcher:[30, 30]
    2022-05-30 06:37:07.077 c.d.s.b.FetcherBolt Thread-21-fetcher-executor[30, 30] [ERROR] Interrupted exception caught in execute method
    2022-05-30 06:37:07.077 c.d.s.b.FetcherBolt Thread-21-fetcher-executor[30, 30] [ERROR] Interrupted exception caught in execute method
    2022-05-30 06:37:07.077 c.d.s.b.FetcherBolt Thread-21-fetcher-executor[30, 30] [ERROR] Interrupted exception caught in execute method
    2022-05-30 06:37:07.077 c.d.s.b.FetcherBolt Thread-21-fetcher-executor[30, 30] [ERROR] Interrupted exception caught in execute method

    HttpClient.HttpProtocol

    We tried changing the protocol to fix this issue. The CPU has never reached 100% again, but periodically some fetcher threads are not released.

    (screenshot: fetcher thread count)

    After a few days, those “zombie” threads accumulate. We redeploy the topology often (for functional updates) and, obviously, a new deployment resets the thread count.

    For now, the issue is less critical than the OkClient one, but we are trying to understand it. Do you have any ideas, or have you seen a similar case?

    bug fetcher 
    opened by Mikwiss 3
  • Delete redirected pages

    Delete redirected pages

    From a user

    Links that were once pages and then turned into redirects are our issue. Our content management system auto-creates clean URLs. If the title of a page is changed, the clean URL changes and the old URL is redirected to the new one. The old URL stays in our index unless manually removed. When a link changes from FETCHED to REDIRECT, it would be ideal if the corresponding document were removed from the index.
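
    A rough sketch of the requested behaviour (hypothetical code, not an existing StormCrawler component): something listening to the status stream issues a delete for the indexed document whenever a URL moves to a redirection status.

    import org.elasticsearch.action.delete.DeleteRequest;

    import com.digitalpebble.stormcrawler.persistence.Status;

    public class DeletionSketch {

        // called for every (url, status) update coming out of the status stream;
        // docId would be the same identifier used when the page was indexed
        static DeleteRequest maybeDelete(String indexName, String docId, Status status) {
            // a page that used to be FETCHED and is now a redirect should disappear from the index
            if (status == Status.REDIRECTION) {
                return new DeleteRequest(indexName, docId);
            }
            return null; // nothing to delete
        }
    }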

    core 
    opened by jnioche 1
  • ES IndexerBolt - Fix behaviour of afterBulk

    ES IndexerBolt - Fix behaviour of afterBulk

    Hi @jnioche,

    I was looking into https://github.com/DigitalPebble/storm-crawler/pull/989#discussion_r918581042 and reviewed the old code in order to make sure that I get the intended behaviour. (see https://github.com/FelixEngl/storm-crawler/blob/834347e53f79376d3a79f125a6203c91d062e04f/external/elasticsearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java)

    Now I am wondering: shouldn't it be enough to only process the first encounter of a BulkResponseElement with a specific id, and otherwise just print the required log events and update the counters accordingly?

    Because the old code worked like this (if I got that right):

    :START afterBulk
    
    :ITERATION 1
    + waitAck ---------------+
    | "A" | [tuple1, tuple3] |
    | "B" | [tuple2]         |
    +------------------------+
    
    + bulk_response ---------------+
    | 1. (id: "A", state: SUCCESS) |
    | 2. (id: "B", state: SUCCESS) |
    | 3. (id: "A", state: FAILURE) |
    +------------------------------+
    
    response = bulk_response.removeFirst() : (id: "A", state: SUCCESS)
    tuples = waitAck.getIfPresent(response.id) : [tuple1, tuple3]
    for(tuple in tuples){
        // process all tuples as state: SUCCESS
        ...
    }
    waitAck.invalidate(response.id) // Immediate removal
    :ITERATION 1
    
    :ITERATION 2
    + waitAck -------+
    | "B" | [tuple2] |
    +----------------+
    
    + bulk_response ---------------+
    | 2. (id: "B", state: SUCCESS) |
    | 3. (id: "A", state: FAILURE) |
    +------------------------------+
    
    response = bulk_response.removeFirst() : (id: "B", state: SUCCESS)
    tuples = waitAck.getIfPresent(response.id) : [tuple2]
    for(tuple in tuples){
        // process all tuples as state: SUCCESS
        ...
    }
    waitAck.invalidate(response.id) // Immediate removal
    :ITERATION 2
    
    :ITERATION 3
    + waitAck -------+
    +----------------+
    
    + bulk_response ---------------+
    | 3. (id: "A", state: FAILURE) |
    +------------------------------+
    
    response = bulk_response.removeFirst() : (id: "A", state: FAILURE)
    tuples = waitAck.getIfPresent(response.id) : null
    LOG.warn("could not find unacked tuple for A")
    :ITERATION 3
    
    :STOP afterBulk
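
    In Java, the proposed "first encounter wins" behaviour could look roughly like this (a sketch following the same hypothetical shape as the pseudocode above, not the actual IndexerBolt code):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import com.google.common.cache.Cache;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.tuple.Tuple;
    import org.elasticsearch.action.bulk.BulkItemResponse;
    import org.elasticsearch.action.bulk.BulkResponse;

    public class AfterBulkSketch {

        // Only the first bulk-response item seen for a given id decides the fate of the
        // waiting tuples; later items with the same id would only be logged and counted.
        static void afterBulk(BulkResponse bulkResponse, Cache<String, List<Tuple>> waitAck,
                OutputCollector collector) {
            Set<String> alreadyHandled = new HashSet<>();
            for (BulkItemResponse item : bulkResponse.getItems()) {
                String id = item.getId();
                if (!alreadyHandled.add(id)) {
                    continue; // duplicate id: update counters / log only
                }
                List<Tuple> tuples = waitAck.getIfPresent(id);
                if (tuples == null) {
                    continue; // "could not find unacked tuple for <id>"
                }
                for (Tuple t : tuples) {
                    if (item.isFailed()) {
                        collector.fail(t);
                    } else {
                        collector.ack(t);
                    }
                }
                waitAck.invalidate(id); // remove the entry once it has been handled
            }
        }
    }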
    

    Best Regards

    Felix

    opened by FelixEngl 6
  • ConcurrentModificationException thrown by metrics in Fetcher executor

    ConcurrentModificationException thrown by metrics in Fetcher executor

    2022-07-15 09:57:16.851 o.a.s.e.e.ReportError Thread-43-fetcher-executor[15, 15] [ERROR] Error
    java.lang.RuntimeException: java.lang.RuntimeException: java.util.ConcurrentModificationException
    	at org.apache.storm.utils.Utils$1.run(Utils.java:411) ~[storm-client-2.4.0.jar:2.4.0]
    	at java.lang.Thread.run(Thread.java:829) [?:?]
    Caused by: java.lang.RuntimeException: java.util.ConcurrentModificationException
    	at org.apache.storm.executor.Executor.accept(Executor.java:301) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:113) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.JCQueue.consume(JCQueue.java:89) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:154) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:140) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.Utils$1.run(Utils.java:396) ~[storm-client-2.4.0.jar:2.4.0]
    	... 1 more
    Caused by: java.util.ConcurrentModificationException
    	at java.util.HashMap$HashIterator.nextNode(HashMap.java:1511) ~[?:?]
    	at java.util.HashMap$EntryIterator.next(HashMap.java:1544) ~[?:?]
    	at java.util.HashMap$EntryIterator.next(HashMap.java:1542) ~[?:?]
    	at org.apache.storm.metric.api.MultiCountMetric.getValueAndReset(MultiCountMetric.java:35) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.metric.api.MultiCountMetric.getValueAndReset(MultiCountMetric.java:18) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.Executor.metricsTick(Executor.java:339) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.bolt.BoltExecutor.tupleActionFn(BoltExecutor.java:200) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.Executor.accept(Executor.java:297) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:113) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.JCQueue.consume(JCQueue.java:89) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:154) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:140) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.Utils$1.run(Utils.java:396) ~[storm-client-2.4.0.jar:2.4.0]
    	... 1 more
    
    
    bug 
    opened by jnioche 0
Releases(2.7)
  • 2.7(Dec 20, 2022)

    What's Changed

    • Dependency upgrades #1016
    • Opensearch module in https://github.com/DigitalPebble/storm-crawler/pull/1011
    • Maven archetype for Opensearch
    • [WARC] Backward compatible storage of HTTP/2 headers by @sebastian-nagel in https://github.com/DigitalPebble/storm-crawler/pull/1010
    • Ignore empty fields indexer in https://github.com/DigitalPebble/storm-crawler/pull/1019
    • Handle single quotes in value of http-equiv="refresh" #1020

    Full Changelog: https://github.com/DigitalPebble/storm-crawler/compare/2.6...2.7

  • 2.6(Nov 28, 2022)

  • storm-crawler-2.5(Aug 31, 2022)

    In a nutshell

    • various dependency upgrades (JSoup, CrawlerCommons, Tika, Elasticsearch)
    • Java 11
    • bugfix AggregationSpout does not release IsInQuery boolean sometimes
    • various improvements to URLFrontier module

    In more detail

    • FEATURE-964: custom crawl delay per page by @juli-alvarez in https://github.com/DigitalPebble/storm-crawler/pull/967
    • Issue 970 HttpProtocol doesn't consider http.content.limit in test for filesize by @wowasa in https://github.com/DigitalPebble/storm-crawler/pull/972
    • Add ChannelManager for local channel management and constants to Spout.java by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/982
    • Fix error when spaces in path to test-resources of StatusBoltTest in ElasticSearch-Module by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/985
    • Add unit test basics for URLFrontier. by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/984
    • Fix starvation and busy waiting of StatusUpdaterBolt.java, add Constants. by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/983
    • Fix starvation and busy waiting of ES StatusUpdaterBolt (Fixes #986) by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/988
    • Fix starvation and busy waiting of ES IndexerBolt by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/989
    • HttpProtocol use the md protocol.set-headers to add custom header by url by @Mikwiss in https://github.com/DigitalPebble/storm-crawler/pull/993

    New Contributors

    • @wowasa made their first contribution in https://github.com/DigitalPebble/storm-crawler/pull/972

    Full Changelog: https://github.com/DigitalPebble/storm-crawler/compare/2.4...storm-crawler-2.5

  • 2.4(Apr 13, 2022)

    • Upgrade to Apache Storm 2.4
    • Upgrade to Elasticsearch 7.17.2
    • bugfix: Setting "maxDepth": 0 in urlfilter.json prevents ES seed injection #959
    • Allow compatibility.mode for rest client to connect to ES8+ #962

    Full Changelog: https://github.com/DigitalPebble/storm-crawler/compare/2.3...2.4

  • 2.3(Mar 21, 2022)

    https://digitalpebble.blogspot.com/2022/03/whats-new-in-stormcrawler-23.html

    What's Changed

    • Bump xercesImpl from 2.12.1 to 2.12.2 in /core by @dependabot in https://github.com/DigitalPebble/storm-crawler/pull/942
    • General Code Refactoring and Good Practices by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/937
    • Add unified way of initializing classes via string and configuring them. by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/943
    • Rewrote LinkParseFilter + added XPathFilter + tests for JSOUPFilters by @jnioche in https://github.com/DigitalPebble/storm-crawler/pull/953
    • ISSUE-954: Issue with the order of emit and emitOutlink for redirections in FetcherBolt by @juli-alvarez in https://github.com/DigitalPebble/storm-crawler/pull/955

    New Contributors

    • @FelixEngl made their first contribution in https://github.com/DigitalPebble/storm-crawler/pull/937

    Full Changelog: https://github.com/DigitalPebble/storm-crawler/compare/2.2...2.3

  • 2.2(Jan 11, 2022)

  • 1.18(May 5, 2021)

  • storm-crawler-1.17(Jul 20, 2020)

  • 1.16(Jan 14, 2020)

  • 1.15(Sep 19, 2019)

  • 1.12(Nov 22, 2018)

  • 1.11(Oct 18, 2018)

  • 1.10(Jun 14, 2018)

  • 1.9(May 25, 2018)

  • 1.8(Mar 19, 2018)

  • 1.7(Nov 28, 2017)

    Dependencies updates

    • crawler-commons 0.9 #513

    Core

    • (bugfix) ParserBolts should use outlinks from parsefilters #498
    • LD_JSON parsefilter #501
    • okhttp: store request and response headers verbatim in metadata #506
    • (bugfix) okhttp protocol does not store headers in metadata #507
    • HTTP clients should handle http.accept.language and http.accept #499
    • Selenium protocol follows redirections #514
    • RemoteDriverProtocol needs multiple instances #505
    • SitemapParserBolt should force mime-type based on the clue #515

    Elasticsearch

    • ES Spout: define filter query via config #502
    • Upgrade to ES 6.0 #517

    We recommend that all users move to this version. If you wish to remain on an older version of Elasticsearch, you can simply keep your existing version of the StormCrawler Elasticsearch module while upgrading StormCrawler core.

    This version improves the processing of sitemaps, via #515 and the use of crawler-commons 0.9, where we fixed the SAX parsing and extended its coverage. We also added improvements to our okhttp-based protocol implementation. If your crawl is a wide one with potentially any sort of content, then you should go for okhttp over the default httpclient implementation. See our comparison of protocol implementations on the WIKI.
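
    Switching between protocol implementations is a one-line configuration change; assuming the usual keys (check crawler-default.yaml for the authoritative names):

    http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
    https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"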

    Finally, if you want to extract semantic data represented in ld-json then you'll love #501.

  • 1.5.1(Jun 2, 2017)

    Minor release

    • Improvement FetcherBolt to limit max size of internal queues #470
    • Bugfix Can't get Sitemaps from robots.txt #471
    • Upgrade Tika 1.15 #473
Owner
DigitalPebble Ltd