Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Overview


A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in the distributed computing and information retrieval domains by bringing together several Apache projects: Spark, Kafka, Lucene/Solr, Tika, and pf4j. Sparkler is an extensible, highly scalable, and high-performance web crawler that is an evolution of Apache Nutch and runs on an Apache Spark cluster.

NOTE:

Sparkler is being proposed to the Apache Incubator. Review the proposal document and provide your suggestions here.

Notable features of Sparkler:

  • Higher performance and fault tolerance: the crawl pipeline has been redesigned to take advantage of the caching and fault-tolerance capabilities of Apache Spark.
  • Complex and near real-time analytics: the internal data structure is an indexed store powered by Apache Lucene that can answer complex queries in near real time. Apache Solr (standalone mode for a quick start, cloud mode to scale horizontally) exposes the crawler analytics via an HTTP API; see the example after this list. These analytics can be visualized with intuitive charts in the Admin dashboard (coming soon).
  • Streams out content in real time: optionally, Apache Kafka can be configured to receive the output content as soon as it becomes available.
  • JavaScript rendering: executes the JavaScript in web pages to produce the final state of the page. Setup is easy and painless, and it scales by distributing the work across Spark. Sessions and cookies are preserved for subsequent requests to a host.
  • Extensible plugin framework: Sparkler is designed to be modular. It supports plugins to extend and customize the runtime behaviour.
  • Universal parser: Apache Tika, the most popular content detection and analysis toolkit, which can deal with thousands of file formats, is used to discover links to outgoing web resources and to analyze fetched resources.
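
For instance, once a crawl is running, the Solr-backed crawldb can be queried directly over HTTP. A minimal sketch, assuming the Docker quick-start setup below (Solr on localhost:8983 with a crawldb core):

# Facet crawl records by crawl id, status, and discovery depth
curl "http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&facet.field=crawl_id&facet.field=status&facet.field=discover_depth"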

Quick Start: Running your first crawl job in minutes

To use Sparkler, install Docker and run the commands below:

# Step 0. Get this script
wget https://raw.githubusercontent.com/USCDataScience/sparkler/master/sparkler-core/bin/dockler.sh
# Step 1. Run the script - it starts docker container and forwards ports to host
bash dockler.sh 
# Step 2. Inject seed urls
/data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
# Step 3. Start the crawl job
/data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2     # crawl job id=1, top 100 URLs per iteration (-tn), 2 iterations (-i)
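
If the inject step fails with "Server refused connection at: http://localhost:8983/solr/crawldb" (see the first issue under Comments below), Solr is most likely not yet running inside the container. A minimal check, assuming the layout the Docker image prints in its startup banner:

# Start Solr inside the container (the image ships it under /data/solr)
/data/solr/bin/solr start -force
# Confirm Solr responds before injecting seeds (standard Solr core-admin API)
curl -s "http://localhost:8983/solr/admin/cores?action=STATUS"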

Running Sparkler with a seed URLs file:

1. Follow Steps 0-1.
2. Create a file named seed-urls.txt using the Emacs editor as follows:
       a. emacs sparkler/bin/seed-urls.txt
       b. copy and paste your URLs
       c. Ctrl+x Ctrl+s to save
       d. Ctrl+x Ctrl+c to quit the editor [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]

* Note: You can also use the Vim or Nano editors, or simply run: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt

3. Inject the seed URLs using the following command (assuming you are in the sparkler/bin directory):
$bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job; see the consolidated sketch below.

To crawl until all new URLs are exhausted, use -i -1. Example: /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
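
Putting the pieces together, a minimal end-to-end sketch of the seed-file workflow (assuming the Docker quick-start layout; the example URLs are placeholders):

cd /data/sparkler/bin
# Create a seed file (any editor works; echo is quickest)
echo -e "http://example1.com\nhttp://example2.com" > seed-urls.txt
# Inject the seeds under crawl job id 1
bash sparkler.sh inject -id 1 -sf seed-urls.txt
# Crawl until no new URLs remain
bash sparkler.sh crawl -id 1 -i -1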

Access the dashboard at http://localhost:8983/banana/ (forwarded from the Docker image). The dashboard should look like the one below:

Dashboard

Making Contributions:

Contact Us

Questions and suggestions are welcome on our mailing list: [email protected]. Alternatively, you may use the Slack channel to get help: http://irds.usc.edu/sparkler/#slack

Comments
  • Caught Server refused connection at: http://localhost:8983/solr/crawldb


    Issue Description

    It is very simple: the second command I ran from your guide didn't work.

    How to reproduce it

    I ran bash dockler.sh and the result was:

    root@DS1515:/volume3/Docker_Volume/Sparkler# bash dockler.sh
    Cant find docker image sparkler-local. Going to Fetch it
    Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
    latest: Pulling from uscdatascience/sparkler
    Digest: sha256:4395aa8e69a220cd3bf52ada94aa6dc2ed3e84919470a007faf9cf80f89308eb
    Status: Image is up to date for uscdatascience/sparkler:latest
    docker.io/uscdatascience/sparkler:latest
    Found image: 7bf3f592ca23
    Going to launch the shell inside sparkler's docker container.
    You can press CTRL-D to exit.
    You can rerun this script to resume.
    You can access solr at http://localhost:8983/solr when solr is running
    You can spark master UI at http://localhost:4041/ when spark master is running
    
    Some useful queries:
    
    - Get stats on groups, status, depth:
        http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&&facet.field=crawl_id&facet.field=status&facet.field=group&facet.field=discover_depth
    
    Inside docker, you can do the following:
    
    /data/solr/bin/solr - command line tool for administering solr
        start -force -> start solr
        stop -force -> stop solr
        status -force -> get status of solr
        restart -force -> restart solr
    
    /data/sparkler/bin/sparkler.sh - command line interface to sparkler
       inject - inject seed urls
       crawl - launch a crawl job
    

    As a second step, I ran /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news' and got:

    bash-4.2$ /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    2021-11-27 23:18:42 INFO  PluginService$:53 - Loading plugins...
    2021-11-27 23:18:42 INFO  PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
    2021-11-27 23:18:42 WARN  PluginService$:65 - 4 extra plugin(s) available but not activated: Set(fetcher-chrome, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
    2021-11-27 23:18:42 DEBUG PluginService$:68 - Loading urlfilter-regex
    2021-11-27 23:18:42 INFO  PluginService$:73 - Extensions found: []
    2021-11-27 23:18:42 DEBUG PluginService$:68 - Loading urlfilter-samehost
    2021-11-27 23:18:42 INFO  PluginService$:73 - Extensions found: []
    2021-11-27 23:18:42 INFO  PluginService$:82 - Recognised Plugins: Map()
    2021-11-27 23:18:42 INFO  Injector$:108 - Injecting 1 seeds
    2021-11-27 23:18:43 WARN  SolrProxy:93 - Caught Server refused connection at: http://localhost:8983/solr/crawldb while adding beans, trying to add one by one
    2021-11-27 23:18:43 WARN  SolrProxy:100 - (SKIPPED) Server refused connection at: http://localhost:8983/solr/crawldb while adding [!!!edu.usc.irds.sparkler.model.Resource@26a529dc=>java.util.IllegalFormatConversionException:f != java.util.HashMap!!!]
    2021-11-27 23:18:43 DEBUG SolrProxy:101 - Server refused connection at: http://localhost:8983/solr/crawldb
    org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
            at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.SolrClient.addBean(SolrClient.java:285) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.SolrClient.addBean(SolrClient.java:267) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResources(SolrProxy.scala:97) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:111) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:43) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:162) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.service.Injector.main(Injector.scala) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
            at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
            at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
            at java.lang.reflect.Method.invoke(Method.java:567) ~[?:?]
            at edu.usc.irds.sparkler.Main$.main(Main.scala:50) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.Main.main(Main.scala) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
    Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
            at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            ... 19 more
    Caused by: java.net.ConnectException: Connection refused
            at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
            at sun.nio.ch.Net.pollConnectNow(Net.java:579) ~[?:?]
            at sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542) ~[?:?]
            at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597) ~[?:?]
            at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:339) ~[?:?]
            at java.net.Socket.connect(Socket.java:603) ~[?:?]
            at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            ... 19 more
    Exception in thread "main" java.lang.reflect.InvocationTargetException
            at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.base/java.lang.reflect.Method.invoke(Method.java:567)
            at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
            at edu.usc.irds.sparkler.Main.main(Main.scala)
    Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
            at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672)
            at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
            at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
            at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
            at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:504)
            at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:479)
            at edu.usc.irds.sparkler.storage.solr.SolrProxy.commitCrawlDb(SolrProxy.scala:112)
            at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:112)
            at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
            at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
            at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:43)
            at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:162)
            at edu.usc.irds.sparkler.service.Injector.main(Injector.scala)
            ... 6 more
    Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
            at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
            at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
            at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
            at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
            at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
            at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
            at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
            at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
            at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
            ... 18 more
    Caused by: java.net.ConnectException: Connection refused
            at java.base/sun.nio.ch.Net.pollConnect(Native Method)
            at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:579)
            at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
            at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
            at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:339)
            at java.base/java.net.Socket.connect(Socket.java:603)
            at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
            at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
            ... 28 more
    2021-11-27 23:18:43 WARN  PluginService$:49 - Stopping all plugins... Runtime is about to exit.
    

    Environment and Version Information

    Please indicate relevant versions, including, if relevant:

    • Java Version: 1.8.0_275, but I thought it was provided inside your Docker image
    • Spark Version: I thought it was already installed inside your Docker image; if not, I haven't installed it
    • Operating System name and version: Docker installed on a Synology DS1515+. Running docker version returns:

      Client:
       Version:           20.10.3
       API version:       1.41
       Go version:        go1.15.6
       Git commit:        b35e731
       Built:             Fri Jun 18 08:25:45 2021
       OS/Arch:           linux/amd64
       Context:           default
       Experimental:      true

      Server:
       Engine:
        Version:          20.10.3
        API version:      1.41 (minimum version 1.12)
        Go version:       go1.15.6
        Git commit:       e7f7c95
        Built:            Fri Jun 18 08:26:10 2021
        OS/Arch:          linux/amd64
        Experimental:     false
       containerd:
        Version:          v1.4.3
        GitCommit:        b1dc45ec561bd867c4805eee786caab7cc83acae
       runc:
        Version:          v1.0.0-rc93
        GitCommit:        89783e1862a2cc04647ab15b6e88a0af3d66fac3
       docker-init:
        Version:          0.19.0
        GitCommit:        12b6a20

    Any external links for reference

    Nah, just tell me whether Java and Spark are inside your Docker image or not. If they are not and I have to install them, you can close this ticket.

    Contributing

    I'm willing to contribute

    opened by francesco1119 19
  • De-Duplicate documents in CrawlDB (Solr)


    Now that we have a more sophisticated definition of the id field (with timestamp included), we have to think about de-duplication of documents.

    I am opening a discussion channel here to define de-duplication. Some of the suggestions are:

    • Compare the SHA-256 hash of the raw_content, i.e. the signature field (but this still forces fetching of the duplicate document even though we do not store it)
    • Compare the url field

    We can refer here for the implementation.
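
    As a concrete starting point, duplicates by content signature can already be spotted with a plain facet query against the crawldb. A minimal sketch, assuming the signature field is populated and Solr runs on the quick-start port:

    # List signature values that occur more than once, i.e. candidate duplicates
    # (assumes a populated signature field in the crawldb core on localhost:8983)
    curl "http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&facet.field=signature&facet.mincount=2"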

    Discussion 
    opened by karanjeets 16
  • Error when injecting urls


    I am new to Sparkler. I began with the requirements provided and everything worked well. Then, when I started to inject some URLs, I ran into an issue.

    Here are the steps that I followed to inject URLs:

    1. cd to the root directory of the project
    2. docker run -it sparkler-local
    3. /data/solr/bin/solr
    4. /data/sparkler/bin/sparkler.sh
    5. java -jar sparkler-app-0.1-SNAPSHOT.jar inject -sf seed.txt

    The expected output is:

    2016-06-07 19:22:49 INFO  Injector$:70 [main] - Injecting 2 seeds
    >>jobId = sparkler-job-1465352569649
    
    

    But I got this output instead: Error: Unable to access jarfile sparkler-app-0.1-SNAPSHOT.jar

    I don't understand what caused this problem or how I can resolve it. I would be very grateful for an answer. I look forward to your reply.

    Thank you.

    opened by User12300 15
  • Mvn2sbt


    It works but needs some finishing off; still, to surface it here: we now have a working SBT build.

    Why move from Maven to SBT, you ask? Because IDEs can then load the code and work properly with the Java and Scala mix, rather than the crazy state we were in with the old build.

    opened by buggtb 13
  • Sparkler Configuration Update


    I have added validations and type checks to the Sparkler configuration. I have also added some TODO comments that I think deserve some thought. @thammegowda @karanjeets, please take a look and let me know your suggestions; I will make the changes. I also found something very interesting: there are multiple copies of the same kind of file present.

    ./resources/sparkler-default.yaml
    ./sparkler-app/target/test-classes/sparkler-default.yaml
    ./sparkler-app/src/test/resources/sparkler-default.yaml
    ./conf/sparkler-default.yaml
    ./sparkler-api/target/classes/sparkler-default.yaml
    

    There are multiple files like sparkler-default.yaml. #93: a bit of work.

    opened by SHASHANK-PRO-05 13
  • banana dashboard does not load the injected jobs


    Issue Description

    Hello. I am trying to run Sparkler using the Docker image. After I inject the jobs with an id and try to see the index and other behaviors in the Banana dashboard, it always says zero. I don't know why this happens.

    I use Ubuntu 16.04 LTS, Docker version 17.09.0-ce (build afdb6d4), JDK 1.8, and the latest Spark version. Thank you!

    opened by YehualashetGit 12
  • Working with remote spark.


    Has anyone tried this with a non-local Spark?

    I ask because when I try to run on a remote Spark I get class-mismatch errors:

    java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc serialVersionUID = -5447855329526097695, local class serialVersionUID = -2221986757032131007

    But when I check the versions: you use Spark 1.6.1, which requires Scala 2.10.x, yet the build uses Scala 2.11.x, and if you try to downgrade to 2.10 it doesn't compile.

    opened by buggtb 11
  • Sparkler Elasticsearch storage engine


    We are moving ahead with this project within the USC CSCI 401 Senior Capstone Program. My stakeholder needs are as follows:

    1. As a Crawler Administrator, I need to be able to specify and configure the Sparkler storage engine through configuration parameters.
    2. As a Data Analyst, I need access to a dashboard which displays crawl information.
    3. As a DevOps Engineer, I need to deploy Sparkler (configured with the Elasticsearch storage engine) via Docker and as a Helm Chart.
    4. As a Test and Quality Assurance Engineer, I need to integrate Sparkler tests for the Elasticsearch storage engine into my CI process.
    5. As a Development Lead, I need access to developer documentation covering the Elasticsearch storage engine for Sparkler.
    

    Sparkler committers, I wonder if it is possible for us to use the GitHub Projects feature to manage the project?

    Capstone Team, please reply with your Github ID here.

    opened by lewismc 10
  • Java null pointer error in fetch()


    Hi!

    I am encountering some errors. The program crashes when running 10 crawls and I get the following errors (I marked the key parts in bold). Can you help me figure out why?

    Best,

    1st

    2016-12-26 16:40:24 ERROR Executor:95 [Executor task launch worker-1] - Exception in task 3.0 in stage 1.0 (TID 8)
    org.openqa.selenium.WebDriverException:
    Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
    System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
    Driver info: driver.version: JBrowserDriver
            at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
            at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
            at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
            at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
            at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
            at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
            at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
            at org.apache.spark.scheduler.Task.run(Task.scala:89)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.EOFException
            at java.io.DataInputStream.readByte(DataInputStream.java:267)
            at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
            at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
            at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
            at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
            at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
            ... 20 more

    2nd

    2016-12-26 16:40:24 ERROR TaskSetManager:74 [task-result-getter-3] - Task 3 in stage 1.0 failed 1 times; aborting job
    Exception in thread "main" java.lang.reflect.InvocationTargetException
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
            at edu.usc.irds.sparkler.Main.main(Main.scala)
    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 8, localhost): org.openqa.selenium.WebDriverException:
    Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
    System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
    Driver info: driver.version: JBrowserDriver
            at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
            at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
            at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
            at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
            at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
            at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
            at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
            at org.apache.spark.scheduler.Task.run(Task.scala:89)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.EOFException
            at java.io.DataInputStream.readByte(DataInputStream.java:267)
            at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
            at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
            at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
            at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
            at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
            ... 20 more

    Driver stacktrace:
            at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
            at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
            at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
            at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
            at scala.Option.foreach(Option.scala:257)
            at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
            at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
            at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
            at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:139)
            at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
            at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:121)
            at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
            at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
            at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:211)
            at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
            ... 6 more
    Caused by: org.openqa.selenium.WebDriverException:
    Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
    System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
    Driver info: driver.version: JBrowserDriver
            at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
            at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
            at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
            at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
            at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
            at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
            at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
            at org.apache.spark.scheduler.Task.run(Task.scala:89)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.EOFException
            at java.io.DataInputStream.readByte(DataInputStream.java:267)
            at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
            at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
            at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
            at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
            at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
            ... 20 more

    3rd

    ERROR Utils:95 [Executor task launch worker-2] - Uncaught exception in thread Executor task launch worker-2
    java.lang.NullPointerException
            at org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:95)
            at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
            at org.apache.spark.scheduler.Task.run(Task.scala:93)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    2016-12-26 16:40:57 ERROR Executor:95 [Executor task launch worker-2] - Exception in task 1.0 in stage 1.0 (TID 6)
    java.util.NoSuchElementException: key not found: 6
            at scala.collection.MapLike$class.default(MapLike.scala:228)
            at scala.collection.AbstractMap.default(Map.scala:59)
            at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
            at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:322)
            at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
            at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
            at org.apache.spark.scheduler.Task.run(Task.scala:89)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Exception in thread "Executor task launch worker-2" java.lang.IllegalStateException: RpcEnv already stopped.
            at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
            at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
            at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
            at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
            at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Exception in thread "Executor task launch worker-4" java.lang.IllegalStateException: RpcEnv already stopped.
            at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
            at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
            at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
            at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
            at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)

    ...

    2016-12-26 16:42:35 DEBUG FetcherJBrowser:153 [FelixStartLevel] - Exception Connection refused Build info: version: 'unknown', revision: 'unknown', time: 'unknown' System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91' Driver info: driver.version: JBrowserDriver raised. The driver is either already closed or this is an unknown exception

    Process finished with exit code 1

    opened by arelaxend 9
  • Plugin for fetching pages using a headless browser


    • This is a very first version of the plugin
    • We are looking to improve it further by
      • restricting resources (like CSS, fonts) per the URL filter
      • adding resources already downloaded to the crawled data. For example, while loading a web page in a browser, the images on that page are also downloaded; we can send these downloaded images to the crawldb
    opened by smadha 9
  • Removing sparkler-app-0.2.0-SNAPSHOT.jar


    First, I had an issue with sparkler-app-0.2.0-SNAPSHOT.jar, so I deleted it; then I couldn't rebuild the project. Is there any particular reason to keep sparkler-app-0.2.0-SNAPSHOT.jar? FYI, removing the:

    <dependency>
        <groupId>edu.usc.irds.sparkler</groupId>
        <artifactId>sparkler-api</artifactId>
        <version>0.2.0-SNAPSHOT</version>
    </dependency>
    

    from sparkler-app/pom.xml fixes the issue. I'll do it myself, once I finish my work, if you confirm that it is no longer useful.

    opened by misterpilou 8
  • Bump json5 and react-scripts in /retired/sparkler-ui


    Bumps json5 to 2.2.3 and updates ancestor dependency react-scripts. These dependencies need to be updated together.

    Updates json5 from 0.5.1 to 2.2.3

    Release notes

    Sourced from json5's releases.

    v2.2.3

    v2.2.2

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295).

    v2.2.1

    • Fix: Removed dependence on minimist to patch CVE-2021-44906. (#266)

    v2.2.0

    • New: Accurate and documented TypeScript declarations are now included. There is no need to install @types/json5. (#236, #244)

    v2.1.3 [code, diff]

    • Fix: An out of memory bug when parsing numbers has been fixed. (#228, #229)

    v2.1.2

    • Fix: Bump minimist to v1.2.5. (#222)

    v2.1.1

    • New: package.json and package.json5 include a module property so bundlers like webpack, rollup and parcel can take advantage of the ES Module build. (#208)
    • Fix: stringify outputs \0 as \\x00 when followed by a digit. (#210)
    • Fix: Spelling mistakes have been fixed. (#196)

    v2.1.0

    • New: The index.mjs and index.min.mjs browser builds in the dist directory support ES6 modules. (#187)

    v2.0.1

    • Fix: The browser builds in the dist directory support ES5. (#182)

    v2.0.0

    • Major: JSON5 officially supports Node.js v6 and later. Support for Node.js v4 has been dropped. Since Node.js v6 supports ES5 features, the code has been rewritten in native ES5, and the dependence on Babel has been eliminated.

    • New: Support for Unicode 10 has been added.

    • New: The test framework has been migrated from Mocha to Tap.

    • New: The browser build at dist/index.js is no longer minified by default. A minified version is available at dist/index.min.js. (#181)

    • Fix: The warning has been made clearer when line and paragraph separators are

    ... (truncated)

    Changelog

    Sourced from json5's changelog.

    v2.2.3 [code, diff]

    v2.2.2 [code, diff]

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295).

    v2.2.1 [code, diff]

    • Fix: Removed dependence on minimist to patch CVE-2021-44906. (#266)

    v2.2.0 [code, diff]

    • New: Accurate and documented TypeScript declarations are now included. There is no need to install @types/json5. (#236, #244)

    v2.1.3 [code, diff]

    • Fix: An out of memory bug when parsing numbers has been fixed. (#228, #229)

    v2.1.2 [code, diff]

    • Fix: Bump minimist to v1.2.5. (#222)

    v2.1.1 [code, diff]

    ... (truncated)

    Commits
    • c3a7524 2.2.3
    • 94fd06d docs: update CHANGELOG for v2.2.3
    • 3b8cebf docs(security): use GitHub security advisories
    • f0fd9e1 docs: publish a security policy
    • 6a91a05 docs(template): bug -> bug report
    • 14f8cb1 2.2.2
    • 10cc7ca docs: update CHANGELOG for v2.2.2
    • 7774c10 fix: add proto to objects and arrays
    • edde30a Readme: slight tweak to intro
    • 97286f8 Improve example in readme
    • Additional commits viewable in compare view

    Updates react-scripts from 2.1.8 to 5.0.1

    Changelog

    Sourced from react-scripts's changelog.

    3.0.0 and Newer Versions

    Please refer to CHANGELOG.md for the newer versions.

    Commits


    dependencies javascript 
    opened by dependabot[bot] 1
  • Bump express from 4.16.4 to 4.18.2 in /retired/sparkler-ui


    Bumps express from 4.16.4 to 4.18.2.

    Release notes

    Sourced from express's releases.

    4.18.2

    4.18.1

    • Fix hanging on large stack of sync routes

    4.18.0

    ... (truncated)

    Changelog

    Sourced from express's changelog.

    4.18.2 / 2022-10-08

    4.18.1 / 2022-04-29

    • Fix hanging on large stack of sync routes

    4.18.0 / 2022-04-25

    ... (truncated)

    Commits


    dependencies javascript 
    opened by dependabot[bot] 1
  • Bump decode-uri-component from 0.2.0 to 0.2.2 in /retired/sparkler-ui


    Bumps decode-uri-component from 0.2.0 to 0.2.2.

    Release notes

    Sourced from decode-uri-component's releases.

    v0.2.2

    • Prevent overwriting previously decoded tokens 980e0bf

    https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.1...v0.2.2

    v0.2.1

    • Switch to GitHub workflows 76abc93
    • Fix issue where decode throws - fixes #6 746ca5d
    • Update license (#1) 486d7e2
    • Tidelift tasks a650457
    • Meta tweaks 66e1c28

    https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.0...v0.2.1

    Commits


    dependencies javascript 
    opened by dependabot[bot] 1
  • Bump certifi from 2020.4.5.1 to 2022.12.7 in /retired/sparkler-sce/webui


    Bumps certifi from 2020.4.5.1 to 2022.12.7.

    Commits


    dependencies python 
    opened by dependabot[bot] 1
  • Bump loader-utils and react-scripts in /retired/sparkler-ui


    Bumps loader-utils to 2.0.4 and updates ancestor dependency react-scripts. These dependencies need to be updated together.

    Updates loader-utils from 1.2.3 to 2.0.4

    Release notes

    Sourced from loader-utils's releases.

    v2.0.4

    2.0.4 (2022-11-11)

    Bug Fixes

    v2.0.3

    2.0.3 (2022-10-20)

    Bug Fixes

    • security: prototype pollution exploit (#217) (a93cf6f)

    v2.0.2

    2.0.2 (2021-11-04)

    Bug Fixes

    • base64 generation and unicode characters (#197) (8c2d24e)

    v2.0.1

    2.0.1 (2021-10-29)

    Bug Fixes

    v2.0.0

    2.0.0 (2020-03-17)

    ⚠ BREAKING CHANGES

    • minimum required Node.js version is 8.9.0 (#166) (c937e8c)
    • the getOptions method returns empty object on empty query (#167) (b595cfb)
    • Use md4 by default

    v1.4.2

    1.4.2 (2022-11-11)

    Bug Fixes

    ... (truncated)

    Changelog

    Sourced from loader-utils's changelog.

    2.0.4 (2022-11-11)

    Bug Fixes

    2.0.3 (2022-10-20)

    Bug Fixes

    • security: prototype pollution exploit (#217) (a93cf6f)

    2.0.2 (2021-11-04)

    Bug Fixes

    • base64 generation and unicode characters (#197) (8c2d24e)

    2.0.1 (2021-10-29)

    Bug Fixes

    2.0.0 (2020-03-17)

    ⚠ BREAKING CHANGES

    • minimum required Node.js version is 8.9.0 (#166) (c937e8c)
    • the getOptions method returns empty object on empty query (#167) (b595cfb)
    • Use md4 by default

    1.4.0 (2020-02-19)

    Features

    • the resourceQuery is passed to the interpolateName method (#163) (cd0e428)

    1.3.0 (2020-02-19)

    ... (truncated)

    Commits

    Updates react-scripts from 2.1.8 to 5.0.1

    Changelog

    Sourced from react-scripts's changelog.

    3.0.0 and Newer Versions

    Please refer to CHANGELOG.md for the newer versions.

    Commits


    dependencies javascript 
    opened by dependabot[bot] 1
  • Bump joblib from 0.14.1 to 1.2.0 in /retired/sparkler-sce/webui

    Bumps joblib from 0.14.1 to 1.2.0.

    Changelog

    Sourced from joblib's changelog.

    Release 1.2.0

    • Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported (see the sketch after this changelog excerpt). joblib/joblib#1327

    • Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256

    • Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263

    • Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with mmap_mode != None as the resulting numpy.memmap object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254 (a mmap_mode sketch follows the commit list below)

    • Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.

    • Vendor loky 3.3.0 which fixes several bugs including:

      • robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);

      • avoiding leaking worker processes in case of nested loky parallel calls;

      • reliably spawning the correct number of reusable workers.

    Release 1.1.0

    • Fix byte order inconsistency issue during deserialization using joblib.load in cross-endian environment: the numpy arrays are now always loaded to use the system byte order, independently of the byte order of the system that serialized the pickle. joblib/joblib#1181

    • Fix joblib.Memory bug with the ignore parameter when the cached function is a decorated function.

    ... (truncated)
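    To make the pre_dispatch fix above concrete: pre_dispatch is the joblib.Parallel argument that controls how many tasks are dispatched to workers ahead of time, and since 1.2.0 only simple numeric expressions are evaluated for it. A minimal sketch of the supported usage (the workload is illustrative):

    from joblib import Parallel, delayed

    # pre_dispatch limits how many tasks are queued ahead of the workers.
    # Since joblib 1.2.0 only basic numeric expressions such as "2*n_jobs"
    # are accepted here; arbitrary Python is no longer evaluated.
    squares = Parallel(n_jobs=2, pre_dispatch="2*n_jobs")(
        delayed(pow)(i, 2) for i in range(8)
    )
    print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]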

    Commits
    • 5991350 Release 1.2.0
    • 3fa2188 MAINT cleanup numpy warnings related to np.matrix in tests (#1340)
    • cea26ff CI test the future loky-3.3.0 branch (#1338)
    • 8aca6f4 MAINT: remove pytest.warns(None) warnings in pytest 7 (#1264)
    • 067ed4f XFAIL test_child_raises_parent_exits_cleanly with multiprocessing (#1339)
    • ac4ebd5 MAINT add back pytest warnings plugin (#1337)
    • a23427d Test child raises parent exits cleanly more reliable on macos (#1335)
    • ac09691 [MAINT] various test updates (#1334)
    • 4a314b1 Vendor loky 3.2.0 (#1333)
    • bdf47e9 Make test_parallel_with_interactively_defined_functions_default_backend timeo...
    • Additional commits viewable in compare view
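    The memory-alignment fix above matters mainly when arrays are memory-mapped at load time. A minimal sketch of that pattern, using the public joblib dump/load APIs with an illustrative file name:

    import numpy as np
    from joblib import dump, load

    data = np.arange(10)
    dump(data, "data.joblib")  # pickle containing a numpy array

    # mmap_mode != None maps the array buffer directly from disk; the 1.2.0
    # fix ensures that buffer is properly aligned for native code.
    mapped = load("data.joblib", mmap_mode="r")
    print(mapped.sum())  # 45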


    dependencies python 
    opened by dependabot[bot] 1
Owner
USC Information Retrieval & Data Science Group
Open Source Web Crawler for Java

crawler4j crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded…

Yasser Ganjisaffar 4.3k Jan 3, 2023
A scalable web crawler framework for Java.

Readme in Chinese A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction and persistence…

Yihua Huang 10.7k Jan 5, 2023
jQuery-like cross-driver interface in Java for Selenium WebDriver

seleniumQuery Feature-rich jQuery-like Java interface for Selenium WebDriver seleniumQuery is a feature-rich cross-driver Java library that brings a jQuery-like…

69 Nov 27, 2022
Apache Nutch is an extensible and scalable web crawler

Apache Nutch README For the latest information about Nutch, please visit our website at: https://nutch.apache.org/ and our wiki, at: https://cwiki.apa…

The Apache Software Foundation 2.5k Dec 31, 2022
A scalable, mature and versatile web crawler based on Apache Storm

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License…

DigitalPebble Ltd 776 Jan 2, 2023
A simple expressive web framework for java. Spark has a kotlin DSL https://github.com/perwendel/spark-kotlin

Spark - a tiny web framework for Java 8 Spark 2.9.3 is out!! Changeset <dependency> <groupId>com.sparkjava</groupId> <artifactId>spark-core</artifactId>…

Per Wendel 9.4k Dec 29, 2022
vʌvr (formerly called Javaslang) is a non-commercial, non-profit object-functional library that runs with Java 8+. It aims to reduce the lines of code and increase code quality.

Vavr is an object-functional language extension to Java 8, which aims to reduce the lines of code and increase code quality. It provides persistent collections…

vavr 5.1k Jan 3, 2023
Bank Statement Analyzer Application that currently runs in terminal with the commands: javac Application.java java Application [file-name].csv GUI coming soon...

Hayden Hanson 0 May 21, 2022
A virtual Linux shell environment application for Android OS. Runs Alpine Linux in QEMU system emulator. Termux app fork.

vShell (Virtual Shell) — a successor of Termux project which provides an alternate implementation of the Linux terminal emulator for Android OS.

2 Feb 1, 2022
Box86Launcher is a modified version of ptitSeb/box86 which runs the x86 version of WineHQ on Android natively

Box86Launcher Box86Launcher is a modified version of ptitSeb/box86 which runs the x86 version of WineHQ on Android natively. Unlike ExaGear or running Box86…

AkiraYuki 61 Jan 3, 2023
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine learning…

Oryx Project 1.8k Dec 28, 2022
Java port of Brainxyz's Artificial Life, a simple program to simulate primitive Artificial Life using simple rules of attraction or repulsion among atom-like particles, producing complex self-organizing life-like patterns.

ParticleSimulation simple Java port of Brainxyz's Artificial Life A simple program to simulate primitive Artificial Life using simple rules of attraction…

Koonts 3 Oct 5, 2022
Apache Spark - A unified analytics engine for large-scale data processing

Apache Spark Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine…

The Apache Software Foundation 34.7k Jan 2, 2023