A scalable, mature and versatile web crawler based on Apache Storm

Overview

storm-crawler


StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License and is written mostly in Java.

Quickstart

NOTE: These instructions assume that you have Apache Maven installed. You will need to install Apache Storm to run the crawler.

The version of Storm to use must match the one defined in the pom.xml file of your topology. The major version of StormCrawler mirrors that of Apache Storm, i.e. whereas StormCrawler 1.x used Storm 1.2.3, the current version requires Storm 2.3.0. Our Ansible-Storm repository contains resources to install Apache Storm using Ansible.

Once Storm is installed, the easiest way to get started is to generate a brand new StormCrawler project using:

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=2.2

You'll be asked to enter a groupId (e.g. com.mycompany.crawler), an artifactId (e.g. stormcrawler), a version, and a package name.

This will not only create a fully formed project containing a POM with the dependency above but also the default resource files, a default CrawlTopology class and a configuration file. Enter the directory you just created (it should have the same name as the artifactId you specified earlier) and follow the instructions in the README file.

Alternatively if you can't or don't want to use the Maven archetype above, you can simply copy the files from archetype-resources.

Have a look at the code of the CrawlTopology class, the crawler-conf.yaml file and the files in src/main/resources/: they are all that is needed to run a crawl topology, as all the other components come from the core module.
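
For reference, the generated README walks you through building and running the topology. A typical invocation looks like the following (the jar, package and class names are examples based on the groupId/artifactId entered above; adjust them to your own choices). The -conf and -local flags are handled by the generated topology class: -local runs the crawl in local mode instead of submitting it to the cluster.

mvn clean package

storm jar target/stormcrawler-1.0-SNAPSHOT.jar com.mycompany.crawler.CrawlTopology -conf crawler-conf.yaml -local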

Getting help

The WIKI is a good place to start your investigations but if you are stuck please use the tag stormcrawler on StackOverflow or ask a question in the discussions section.

DigitalPebble Ltd provides commercial support and consulting for StormCrawler.

Thanks


YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.

We are very grateful to our sponsors for their continued support.

Comments
  • Project reorganization

    Project reorganization

    Given our discussion in #24, and previous discussions on Kafka spouts, HBase indexing, etc, we should think about reorganizing the project so that we have a core SDK and external SDK(s).

    I was thinking something like

    root
        pom.xml
        |-> crawler-core
            pom.xml
        |-> crawler-external
            pom.xml

    The external sub-project would include things that depend on external technologies and libraries.

    opened by jakekdodd 34
  • Java topology doesn't read configurations with Storm 2

    Java topology doesn't read configurations with Storm 2

    • [x] Bug report. If you’ve found a bug, please include a test if you can; it makes it a lot easier to fix things. Use the label 'bug' on the issue.

    I am getting the following:

    16:33:27.858 [Thread-35-spout-executor[12, 12]] ERROR c.d.s.e.p.AbstractSpout - Can't connect to ElasticSearch
    java.lang.IllegalArgumentException: hosts must not be null nor empty
            at org.elasticsearch.client.RestClient.builder(RestClient.java:173) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
            at com.digitalpebble.stormcrawler.elasticsearch.ElasticSearchConnection.getClient(ElasticSearchConnection.java:117) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
    

    Elasticsearch is up and running. I cannot find a 'hosts:' parameter anywhere, and the 'es-conf.yaml' parameters are set to localhost or http://localhost:9200
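
    For reference, the 'hosts' in that error message are typically derived from the address settings in es-conf.yaml, following the es.<component>.addresses naming visible elsewhere on this page. A sketch (adjust to your setup):

    es.status.addresses: "http://localhost:9200"
    es.indexer.addresses: "http://localhost:9200"
    # similar address keys exist for the other Elasticsearch components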

    bug 
    opened by AaronNGray 22
  • Batch PreparedStatements in SQL status updater bolt, fixes #610

    Batch PreparedStatements in SQL status updater bolt, fixes #610

    This PR changes the behaviour of the SQL StatusUpdaterBolt by batching the INSERTs, i.e. the discovered URLs, which are far more frequent than updates.

    @cruftex please let me know what you think. We can make it SQL-agnostic in a separate step.
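
    For readers unfamiliar with JDBC batching, here is a minimal, self-contained sketch of the idea (table and column names are made up, not the actual schema used by the SQL module): discovered URLs are queued on a PreparedStatement and flushed in a single round trip.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;

    public class BatchInsertSketch {
        public static void main(String[] args) throws Exception {
            // connection settings and schema are placeholders
            try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/crawl", "user", "pwd");
                 PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO urls (url, status, nextfetchdate) VALUES (?, ?, ?)")) {
                con.setAutoCommit(false);
                for (String url : new String[] { "https://example.com/a", "https://example.com/b" }) {
                    ps.setString(1, url);
                    ps.setString(2, "DISCOVERED");
                    ps.setTimestamp(3, new Timestamp(System.currentTimeMillis()));
                    ps.addBatch(); // queue the INSERT instead of executing it immediately
                }
                ps.executeBatch(); // one round trip for all the discovered URLs
                con.commit();
            }
        }
    }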

    SQL 
    opened by jnioche 22
  • Add unified way of initializing classes via string and configuring them.

    Add unified way of initializing classes via string and configuring them.

    Hello @jnioche,

    I extracted the code for initializing classes from string and configuring them from my fork https://github.com/FelixEngl/storm-crawler/tree/local_version. (This branch contains all fixes/changes that I made. I'll try to extract PR after PR until both are on the same level.)

    I don't know how far you got with #937, but this one has a more reasonable size, with the trade-off that it introduces some new warnings due to missing @Contract, @NotNull and @Nullable annotations in various sub-classes and sub-interfaces. These warnings will be fixed by either #937 or a future PR.
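
    For readers not following the fork, the general pattern being unified is roughly the following (a schematic sketch with made-up names, not the actual API introduced by this PR): load a class by its fully qualified name, check its type, instantiate it and hand it the Storm configuration.

    import java.util.Map;

    public class InstantiationSketch {

        // Hypothetical stand-in for a component that can receive the Storm config
        public interface Configurable {
            void configure(Map<String, Object> stormConf);
        }

        public static <T extends Configurable> T create(String className, Class<T> expected,
                Map<String, Object> conf) throws ReflectiveOperationException {
            Class<?> clazz = Class.forName(className);
            if (!expected.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(className + " does not implement " + expected.getName());
            }
            T instance = expected.cast(clazz.getDeclaredConstructor().newInstance());
            instance.configure(conf); // configure right after instantiation
            return instance;
        }
    }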

    Best Regards

    Felix

    Signed-off-by: Felix Engl [email protected]

    enhancement 
    opened by FelixEngl 15
  • [External][Solr] The Solr storage should use 'autoSoftCommit' and 'autoCommit'

    [External][Solr] The Solr storage should use 'autoSoftCommit' and 'autoCommit'

    Currently the Solr storage modules send hard commits to Solr.

    I don't think this is the right way to proceed. These commits put too much pressure on Solr. See the following screenshot.

    (screenshot: Solr metrics under commit pressure, 2015-08-12)

    Instead, these modules should let Solr commit the documents with 'autoCommit' and 'autoSoftCommit' sections in the 'solrconfig' file.
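
    For reference, the kind of solrconfig.xml sections the modules would rely on instead of issuing hard commits themselves (values are only examples, tune them to your load):

    <!-- hard commit: flush to stable storage regularly without opening a new searcher -->
    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

    <!-- soft commit: make documents visible to searches more frequently -->
    <autoSoftCommit>
      <maxTime>5000</maxTime>
    </autoSoftCommit>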

    external SOLR 
    opened by ludovic-boutros 15
  • Add a ScrollSpout to read all the documents from a shard

    Add a ScrollSpout to read all the documents from a shard

    Implements #688 and fixes #684

    This adds a ScrollSpout for ES as well as a mechanism to the AbstractStatusUpdaterBolt so that it stores a tuple without modifying any of its content.

    The following Flux illustrates its use

    name: "reindexer"
    
    includes:
        - resource: true
          file: "/crawler-default.yaml"
          override: false
    
        - resource: false
          file: "crawler-conf.yaml"
          override: true
    
        - resource: false
          file: "es-conf.yaml"
          override: true
    
    config:
      es.status2.addresses: "localhost"
      es.status2.index.name: "status2"
      es.status2.doc.type: "status"
      es.status2.routing: true
      es.status2.routing.fieldname: "key"
      es.status2.bulkActions: 500
      es.status2.flushInterval: "1s"
      es.status2.concurrentRequests: 5
      es.status2.settings:
        cluster.name: "elasticsearch"
      topology.max.spout.pending: 5000
      topology.workers: 1
    
    spouts:
      - id: "spout"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.ScrollSpout"
        parallelism: 10
    
    bolts:
      - id: "status"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
        parallelism: 4
        constructorArgs:
          - "status2"
    
    streams:
      - from: "spout"
        to: "status"
        grouping:
          streamId: "status"
          type: CUSTOM
          customClass:
            className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
            constructorArgs:
              - "byDomain"
    

    The target index (here 'status2') has to be initialised just like the source one. It can of course live on a separate cluster.

    It is a good idea to set "refresh_interval": "-1" in the configuration of the target index to speed up the writes. This can be set to any value afterwards when crawling with the new index.

    Copying the content of a status index can be useful e.g. when changing the number of shards or the way the documents are assigned to them - for instance if using domains with a different version of crawler-commons, see #684

    elasticsearch 
    opened by jnioche 14
  • Update to Elasticsearch 2.x

    Update to Elasticsearch 2.x

    Based on #257, uses ES 2.3.1

    The only limitation is that the node client does not work for now, but the transport one does.

    @w0mbat thanks for your work on #257 - I've built a new branch but yours was a great starting point. Any chance you could give this one a try?

    elasticsearch 
    opened by jnioche 14
  • Allowing the parse output to output more than one document

    Allowing the parse output to output more than one document

    This addresses https://github.com/DigitalPebble/storm-crawler/issues/117 to some extent. It also encapsulates the parsing data/metadata of a URL in a ParseResult. Each URL has exactly one ParseResult, which in turn holds at least one ParseData instance for the data of the "parent" URL. If subdocuments are extracted in any ParseFilter, the URLs and ParseData of each subdocument are added to the parent ParseResult. The ParserBolt then emits each ParseData contained in the ParseResult as a tuple: at the very least one ParseData, or any number if subdocuments are extracted. This also implies that there is no difference between the information extracted for the parent URL and for any subdocument; each ParseData gets emitted as a tuple.
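
    A schematic sketch of the data model described above (illustrative classes only, not the exact API of this PR): one ParseResult per fetched URL, holding one ParseData for the parent document plus one per extracted subdocument, each of which is then emitted as its own tuple.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ParseResultSketch {

        // per-document parse output (text, metadata, outlinks would live here)
        static class ParseData {
            String text;
        }

        // one ParseResult per fetched URL: the parent document plus any subdocuments
        static class ParseResult {
            final Map<String, ParseData> byUrl = new LinkedHashMap<>();

            ParseData get(String url) {
                return byUrl.computeIfAbsent(url, u -> new ParseData());
            }
        }

        public static void main(String[] args) {
            ParseResult result = new ParseResult();
            result.get("https://example.com/feed").text = "parent document";
            result.get("https://example.com/feed#item1").text = "subdocument added by a ParseFilter";
            // the ParserBolt would emit one tuple per ParseData held in the ParseResult
            result.byUrl.forEach((url, data) -> System.out.println("emit tuple for " + url));
        }
    }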

    I made a simple diagram (storm-parsing) to try to explain this :) I hope it makes things a little clearer than my explanation.

    opened by jorgelbg 14
  • Sitemaps parser

    Sitemaps parser

    See #38

    This adds a SiteMapParserBolt and a test class. It also reorganizes the test resources and makes the ParsingTester more generic.

    As discussed, this Bolt uses a non-default stream 'status' to output the newly discovered URLs and the default stream for documents that are not marked as being sitemaps.

    Could you please review this? Thanks!

    enhancement 
    opened by jnioche 14
  • OkHttp protocol: make connection pool configurable

    OkHttp protocol: make connection pool configurable

    OkHttp's ConnectionPool by default "holds up to 5 idle connections which will be evicted after 5 minutes of inactivity." A pool of this size is suitable for site crawls but not for larger crawls over many different sites.

    Note: in the current version (4.9.2) the connection pool is implemented as a linked queue, and searching for a pooled connection does not scale well. To scale to pool sizes beyond 1000, a set of clients must be used, each with its own connection pool.

    Notes:

    • so far, only partially tested: need to increase the pool size and run a test crawl to measure the impact
    • proxied connections are unchanged, that is, for every fetch a client is created anew and no connection pool is used. Depending on the proxy manager, it could make sense to define a connection pool ahead and pass it to the client builder. Since proxy information is included in the okhttp address (stored in the connection pool) it should be possible to pool proxied connections.
    • the documentation could be moved from crawler-default.yaml to the wiki
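
    A minimal sketch of what making the pool configurable could look like (the configuration key in the comment is a hypothetical name, not an existing setting):

    import java.util.concurrent.TimeUnit;

    import okhttp3.ConnectionPool;
    import okhttp3.OkHttpClient;

    public class PoolConfigSketch {
        public static void main(String[] args) {
            // values that would come from the crawler configuration,
            // e.g. a key such as okhttp.connection.pool.max.idle (hypothetical)
            int maxIdleConnections = 256;
            long keepAliveMinutes = 5;

            OkHttpClient client = new OkHttpClient.Builder()
                    .connectionPool(new ConnectionPool(maxIdleConnections, keepAliveMinutes, TimeUnit.MINUTES))
                    .build();

            System.out.println("idle connections: " + client.connectionPool().idleConnectionCount());
        }
    }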
    enhancement fetcher 
    opened by sebastian-nagel 13
  • Multi proxy support

    Multi proxy support

    enhancement core 
    opened by sam-ulrich1 13
  • If StormCrawler above 2.5 uses JDK 11, why are the archetype poms not updated to 11?

    If StormCrawler above 2.5 uses JDK 11, why are the archetype poms not updated to 11?

    The pom files in the archetype should be updated to Java 11 rather than JDK 8: https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/archetype/src/main/resources/archetype-resources/pom.xml

    archetype 
    opened by msghasan 14
  • Blocking fetcher thread

    Blocking fetcher thread

    Hi @jnioche !

    Thanks again for all your work! Now, let me describe our fetcher thread issue.

    Summary

    Our cluster has 6 worker nodes. We fetch more than 3 million URLs per day with our topology. It is deployed on 16 worker slots and uses 16 fetchers, one per worker slot.

    OkClient.HttpProtocol

    The worst issue was spotted with the OkClient.HttpProtocol. Sometimes, one of the worker nodes jumps to 100% CPU usage. For example, worker 5 in this case:

    (screenshot: CPU usage per worker node)

    On the StormCrawler dashboard, we can see the fetcher count increase up to 50 (our fetcher limit):

    (screenshot: fetcher count)

    Worse, in another case, all the topologies are impacted:

    (screenshot: all topologies impacted)

    All fetchers are impacted, and the topology runs slowly. The only way to fix the problem is to kill and redeploy the topology. During the kill phase, the logs confirm some blocked threads:

    2022-05-30 06:37:06.557 o.a.s.d.w.Worker ShutdownHook-shutdownFunc [INFO] Shutting down executors ...
    2022-05-30 06:37:07.028 o.a.s.e.ExecutorShutdown ShutdownHook-shutdownFunc [INFO] Shutting down executor fetcher:[30, 30]
    2022-05-30 06:37:07.077 c.d.s.b.FetcherBolt Thread-21-fetcher-executor[30, 30] [ERROR] Interrupted exception caught in execute method
    2022-05-30 06:37:07.077 c.d.s.b.FetcherBolt Thread-21-fetcher-executor[30, 30] [ERROR] Interrupted exception caught in execute method
    2022-05-30 06:37:07.077 c.d.s.b.FetcherBolt Thread-21-fetcher-executor[30, 30] [ERROR] Interrupted exception caught in execute method
    2022-05-30 06:37:07.077 c.d.s.b.FetcherBolt Thread-21-fetcher-executor[30, 30] [ERROR] Interrupted exception caught in execute method

    HttpClient.HttpProtocol

    We tried changing the protocol to fix this issue. The CPU has never reached 100% again, but periodically some fetcher threads are not released.

    (screenshot: fetcher thread count)

    After a few days, those “zombie” threads accumulate. We redeploy the topology often (for functional updates) and, obviously, a new deployment resets the thread count.

    For now, the issue is less critical than the OkClient one, but we are trying to understand it. Do you have any ideas, or have you seen a similar case?

    bug fetcher 
    opened by Mikwiss 3
  • Delete redirected pages

    Delete redirected pages

    From a user

    Links that were once pages and then turned into redirects are our issue. Our content management system auto-creates clean URLs. If the title of a page is changed, the clean URL changes and the old URL is redirected to the new one. The old URL stays in our index unless manually removed. When a link changes from FETCHED to REDIRECT, it would be ideal if the corresponding document were removed from the index.
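
    A rough sketch of the requested behaviour (hypothetical code, not an existing StormCrawler component): something listening to the status stream issues a delete for the indexed document whenever a URL moves to a redirection status.

    import org.elasticsearch.action.delete.DeleteRequest;

    import com.digitalpebble.stormcrawler.persistence.Status;

    public class DeletionSketch {

        // called for every (url, status) update coming out of the status stream;
        // docId would be the same identifier used when the page was indexed
        static DeleteRequest maybeDelete(String indexName, String docId, Status status) {
            // a page that used to be FETCHED and is now a redirect should disappear from the index
            if (status == Status.REDIRECTION) {
                return new DeleteRequest(indexName, docId);
            }
            return null; // nothing to delete
        }
    }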

    core 
    opened by jnioche 1
  • ES IndexerBolt - Fix behaviour of afterBulk

    ES IndexerBolt - Fix behaviour of afterBulk

    Hi @jnioche,

    I was looking into https://github.com/DigitalPebble/storm-crawler/pull/989#discussion_r918581042 and reviewed the old code in order to make sure that I get the intended behaviour. (see https://github.com/FelixEngl/storm-crawler/blob/834347e53f79376d3a79f125a6203c91d062e04f/external/elasticsearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java)

    Now I am wondering: shouldn't it be enough to only process the first encounter of a BulkResponseElement with a specific id, and otherwise just print the required log events and update the counters accordingly?

    Because the old code worked like this (if I got that right):

    :START afterBulk
    
    :ITERATION 1
    + waitAck ---------------+
    | "A" | [tuple1, tuple3] |
    | "B" | [tuple2]         |
    +------------------------+
    
    + bulk_response ---------------+
    | 1. (id: "A", state: SUCCESS) |
    | 2. (id: "B", state: SUCCESS) |
    | 3. (id: "A", state: FAILURE) |
    +------------------------------+
    
    response = bulk_response.removeFirst() : (id: "A", state: SUCCESS)
    tuples = waitAck.getIfPresent(response.id) : [tuple1, tuple3]
    for(tuple in tuples){
        // process all tuples as state: SUCCESS
        ...
    }
    waitAck.invalidate(response.id) // Immediate removal
    :ITERATION 1
    
    :ITERATION 2
    + waitAck -------+
    | "B" | [tuple2] |
    +----------------+
    
    + bulk_response ---------------+
    | 2. (id: "B", state: SUCCESS) |
    | 3. (id: "A", state: FAILURE) |
    +------------------------------+
    
    response = bulk_response.removeFirst() : (id: "B", state: SUCCESS)
    tuples = waitAck.getIfPresent(response.id) : [tuple2]
    for(tuple in tuples){
        // process all tuples as state: SUCCESS
        ...
    }
    waitAck.invalidate(response.id) // Immediate removal
    :ITERATION 2
    
    :ITERATION 3
    + waitAck -------+
    +----------------+
    
    + bulk_response ---------------+
    | 3. (id: "A", state: FAILURE) |
    +------------------------------+
    
    response = bulk_response.removeFirst() : (id: "A", state: FAILURE)
    tuples = waitAck.getIfPresent(response.id) : null
    LOG.warn("could not find unacked tuple for A")
    :ITERATION 3
    
    :STOP afterBulk
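
    In Java, the proposed "first encounter wins" behaviour could look roughly like this (a sketch following the same hypothetical shape as the pseudocode above, not the actual IndexerBolt code):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import com.google.common.cache.Cache;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.tuple.Tuple;
    import org.elasticsearch.action.bulk.BulkItemResponse;
    import org.elasticsearch.action.bulk.BulkResponse;

    public class AfterBulkSketch {

        // Only the first bulk-response item seen for a given id decides the fate of the
        // waiting tuples; later items with the same id would only be logged and counted.
        static void afterBulk(BulkResponse bulkResponse, Cache<String, List<Tuple>> waitAck,
                OutputCollector collector) {
            Set<String> alreadyHandled = new HashSet<>();
            for (BulkItemResponse item : bulkResponse.getItems()) {
                String id = item.getId();
                if (!alreadyHandled.add(id)) {
                    continue; // duplicate id: update counters / log only
                }
                List<Tuple> tuples = waitAck.getIfPresent(id);
                if (tuples == null) {
                    continue; // "could not find unacked tuple for <id>"
                }
                for (Tuple t : tuples) {
                    if (item.isFailed()) {
                        collector.fail(t);
                    } else {
                        collector.ack(t);
                    }
                }
                waitAck.invalidate(id); // remove the entry once it has been handled
            }
        }
    }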
    

    Best Regards

    Felix

    opened by FelixEngl 6
  • ConcurrentModificationException thrown by metrics in Fetcher executor

    ConcurrentModificationException thrown by metrics in Fetcher executor

    2022-07-15 09:57:16.851 o.a.s.e.e.ReportError Thread-43-fetcher-executor[15, 15] [ERROR] Error
    java.lang.RuntimeException: java.lang.RuntimeException: java.util.ConcurrentModificationException
    	at org.apache.storm.utils.Utils$1.run(Utils.java:411) ~[storm-client-2.4.0.jar:2.4.0]
    	at java.lang.Thread.run(Thread.java:829) [?:?]
    Caused by: java.lang.RuntimeException: java.util.ConcurrentModificationException
    	at org.apache.storm.executor.Executor.accept(Executor.java:301) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:113) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.JCQueue.consume(JCQueue.java:89) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:154) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:140) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.Utils$1.run(Utils.java:396) ~[storm-client-2.4.0.jar:2.4.0]
    	... 1 more
    Caused by: java.util.ConcurrentModificationException
    	at java.util.HashMap$HashIterator.nextNode(HashMap.java:1511) ~[?:?]
    	at java.util.HashMap$EntryIterator.next(HashMap.java:1544) ~[?:?]
    	at java.util.HashMap$EntryIterator.next(HashMap.java:1542) ~[?:?]
    	at org.apache.storm.metric.api.MultiCountMetric.getValueAndReset(MultiCountMetric.java:35) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.metric.api.MultiCountMetric.getValueAndReset(MultiCountMetric.java:18) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.Executor.metricsTick(Executor.java:339) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.bolt.BoltExecutor.tupleActionFn(BoltExecutor.java:200) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.Executor.accept(Executor.java:297) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.JCQueue.consumeImpl(JCQueue.java:113) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.JCQueue.consume(JCQueue.java:89) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:154) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.executor.bolt.BoltExecutor$1.call(BoltExecutor.java:140) ~[storm-client-2.4.0.jar:2.4.0]
    	at org.apache.storm.utils.Utils$1.run(Utils.java:396) ~[storm-client-2.4.0.jar:2.4.0]
    	... 1 more
    
    
    bug 
    opened by jnioche 0
Releases(2.7)
  • 2.7(Dec 20, 2022)

    What's Changed

    • Dependency upgrades #1016
    • Opensearch module in https://github.com/DigitalPebble/storm-crawler/pull/1011
    • Maven archetype for Opensearch
    • [WARC] Backward compatible storage of HTTP/2 headers by @sebastian-nagel in https://github.com/DigitalPebble/storm-crawler/pull/1010
    • Ignore empty fields indexer in https://github.com/DigitalPebble/storm-crawler/pull/1019
    • Handle single quotes in value of http-equiv="refresh" #1020

    Full Changelog: https://github.com/DigitalPebble/storm-crawler/compare/2.6...2.7

  • 2.6(Nov 28, 2022)

  • storm-crawler-2.5(Aug 31, 2022)

    In a nutshell

    • various dependency upgrades (JSoup, CrawlerCommons, Tika, Elasticsearch)
    • Java 11
    • bugfix AggregationSpout does not release IsInQuery boolean sometimes
    • various improvements to URLFrontier module

    In more detail

    • FEATURE-964: custom crawl delay per page by @juli-alvarez in https://github.com/DigitalPebble/storm-crawler/pull/967
    • Issue 970 HttpProtocol doesn't consider http.content.limit in test for filesize by @wowasa in https://github.com/DigitalPebble/storm-crawler/pull/972
    • Add ChannelManager for local channel management and constants to Spout.java by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/982
    • Fix error when spaces in path to test-resources of StatusBoltTest in ElasticSearch-Module by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/985
    • Add unit test basics for URLFrontier. by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/984
    • Fix starvation and busy waiting of StatusUpdaterBolt.java, add Constants. by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/983
    • Fix starvation and busy waiting of ES StatusUpdaterBolt (Fixes #986) by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/988
    • Fix starvation and busy waiting of ES IndexerBolt by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/989
    • HttpProtocol use the md protocol.set-headers to add custom header by url by @Mikwiss in https://github.com/DigitalPebble/storm-crawler/pull/993

    New Contributors

    • @wowasa made their first contribution in https://github.com/DigitalPebble/storm-crawler/pull/972

    Full Changelog: https://github.com/DigitalPebble/storm-crawler/compare/2.4...storm-crawler-2.5

  • 2.4(Apr 13, 2022)

    • Upgrade to Apache Storm 2.4
    • Upgrade to Elasticsearch 7.17.2
    • bugfix: Setting "maxDepth": 0 in urlfilter.json prevents ES seed injection #959
    • Allow compatibility.mode for rest client to connect to ES8+ #962

    Full Changelog: https://github.com/DigitalPebble/storm-crawler/compare/2.3...2.4

  • 2.3(Mar 21, 2022)

    https://digitalpebble.blogspot.com/2022/03/whats-new-in-stormcrawler-23.html

    What's Changed

    • Bump xercesImpl from 2.12.1 to 2.12.2 in /core by @dependabot in https://github.com/DigitalPebble/storm-crawler/pull/942
    • General Code Refactoring and Good Practices by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/937
    • Add unified way of initializing classes via string and configuring them. by @FelixEngl in https://github.com/DigitalPebble/storm-crawler/pull/943
    • Rewrote LinkParseFilter + added XPathFilter + tests for JSOUPFilters by @jnioche in https://github.com/DigitalPebble/storm-crawler/pull/953
    • ISSUE-954: Issue with the order of emit and emitOutlink for redirections in FetcherBolt by @juli-alvarez in https://github.com/DigitalPebble/storm-crawler/pull/955

    New Contributors

    • @FelixEngl made their first contribution in https://github.com/DigitalPebble/storm-crawler/pull/937

    Full Changelog: https://github.com/DigitalPebble/storm-crawler/compare/2.2...2.3

  • 2.2(Jan 11, 2022)

  • 1.18(May 5, 2021)

  • storm-crawler-1.17(Jul 20, 2020)

  • 1.16(Jan 14, 2020)

  • 1.15(Sep 19, 2019)

  • 1.12(Nov 22, 2018)

  • 1.11(Oct 18, 2018)

  • 1.10(Jun 14, 2018)

  • 1.9(May 25, 2018)

  • 1.8(Mar 19, 2018)

  • 1.7(Nov 28, 2017)

    Dependencies updates

    • crawler-commons 0.9 #513

    Core

    • (bugfix) ParserBolts should use outlinks from parsefilters #498
    • LD_JSON parsefilter #501
    • okhttp: store request and response headers verbatim in metadata #506
    • (bugfix) okhttp protocol does not store headers in metadata #507
    • HTTP clients should handle http.accept.language and http.accept #499
    • Selenium protocol follows redirections #514
    • RemoteDriverProtocol needs multiple instances #505
    • SitemapParserBolt should force mime-type based on the clue #515

    Elasticsearch

    • ES Spout: define filter query via config #502
    • Upgrade to ES 6.0 #517

    We recommend that all users move to this version. If you wish to remain on an older version of Elasticsearch, you can simply keep your existing version of the StormCrawler Elasticsearch module while upgrading StormCrawler core.

    This version improves the processing of sitemaps, via #515 and the use of crawler-commons 0.9, where we fixed the SAX parsing and extended its coverage. We also added improvements to our okhttp-based protocol implementation. If your crawl is a wide one with potentially any sort of content, then you should go for okhttp over the default httpclient implementation. See our comparison of protocol implementations on the WIKI.
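
    Switching between protocol implementations is a one-line configuration change; assuming the usual keys (check crawler-default.yaml for the authoritative names):

    http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
    https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"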

    Finally, if you want to extract semantic data represented in ld-json then you'll love #501.

  • 1.5.1(Jun 2, 2017)

    Minor release

    • Improvement FetcherBolt to limit max size of internal queues #470
    • Bugfix Can't get Sitemaps from robots.txt #471
    • Upgrade Tika 1.15 #473
Owner
DigitalPebble Ltd