Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Overview


A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in the distributed computing and information retrieval domains by bringing together several Apache projects: Spark, Kafka, Lucene/Solr, Tika, and pf4j. Sparkler is an extensible, highly scalable, and high-performance web crawler that is an evolution of Apache Nutch and runs on an Apache Spark cluster.

NOTE:

Sparkler is being proposed to the Apache Incubator. Review the proposal document and provide your suggestions here.

Notable features of Sparkler:

  • Higher performance and fault tolerance: the crawl pipeline has been redesigned to take advantage of the caching and fault-tolerance capabilities of Apache Spark.
  • Complex and near real-time analytics: the internal data structure is an indexed store powered by Apache Lucene that can answer complex queries in near real time. Apache Solr (standalone mode for a quick start, cloud mode to scale horizontally) exposes the crawler analytics via an HTTP API; see the example after this list. These analytics can be visualized with intuitive charts in the Admin dashboard (coming soon).
  • Streams out content in real time: optionally, Apache Kafka can be configured to receive the output content as soon as it becomes available.
  • JavaScript rendering: executes the JavaScript in web pages to produce the final state of the page. Setup is easy and painless, and it scales by distributing the work across Spark. Sessions and cookies are preserved for subsequent requests to a host.
  • Extensible plugin framework: Sparkler is designed to be modular. It supports plugins to extend and customize the runtime behaviour.
  • Universal parser: Apache Tika, the most popular content detection and analysis toolkit, which can deal with thousands of file formats, is used to discover links to outgoing web resources and to analyze fetched resources.
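
For instance, once a crawl is running, the Solr-backed crawldb can be queried directly over HTTP. A minimal sketch, assuming the Docker quick-start setup below (Solr on localhost:8983 with a crawldb core):

# Facet crawl records by crawl id, status, and discovery depth
curl "http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&facet.field=crawl_id&facet.field=status&facet.field=discover_depth"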

Quick Start: Running your first crawl job in minutes

To use Sparkler, install Docker and run the commands below:

# Step 0. Get this script
wget https://raw.githubusercontent.com/USCDataScience/sparkler/master/sparkler-core/bin/dockler.sh
# Step 1. Run the script - it starts docker container and forwards ports to host
bash dockler.sh 
# Step 2. Inject seed urls
/data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
# Step 3. Start the crawl job
/data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2     # crawl job id=1, top 100 URLs per iteration (-tn), 2 iterations (-i)
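
If the inject step fails with "Server refused connection at: http://localhost:8983/solr/crawldb" (see the first issue under Comments below), Solr is most likely not yet running inside the container. A minimal check, assuming the layout the Docker image prints in its startup banner:

# Start Solr inside the container (the image ships it under /data/solr)
/data/solr/bin/solr start -force
# Confirm Solr responds before injecting seeds (standard Solr core-admin API)
curl -s "http://localhost:8983/solr/admin/cores?action=STATUS"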

Running Sparkler with a seed URLs file:

1. Follow Steps 0-1.
2. Create a file named seed-urls.txt using the Emacs editor as follows:
       a. emacs sparkler/bin/seed-urls.txt
       b. copy and paste your URLs
       c. Ctrl+x Ctrl+s to save
       d. Ctrl+x Ctrl+c to quit the editor [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]

* Note: You can also use the Vim or Nano editors, or simply run: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt

3. Inject the seed URLs using the following command (assuming you are in the sparkler/bin directory):
$bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job; see the consolidated sketch below.

To crawl until all new URLs are exhausted, use -i -1. Example: /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
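
Putting the pieces together, a minimal end-to-end sketch of the seed-file workflow (assuming the Docker quick-start layout; the example URLs are placeholders):

cd /data/sparkler/bin
# Create a seed file (any editor works; echo is quickest)
echo -e "http://example1.com\nhttp://example2.com" > seed-urls.txt
# Inject the seeds under crawl job id 1
bash sparkler.sh inject -id 1 -sf seed-urls.txt
# Crawl until no new URLs remain
bash sparkler.sh crawl -id 1 -i -1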

Access the dashboard at http://localhost:8983/banana/ (forwarded from the Docker image). The dashboard should look like the one below:

Dashboard

Making Contributions:

Contact Us

Questions and suggestions are welcome on our mailing list: [email protected]. Alternatively, you may use the Slack channel to get help: http://irds.usc.edu/sparkler/#slack

Comments
  • Caught Server refused connection at: http://localhost:8983/solr/crawldb


    Issue Description

    It is very simple: the second command I ran from your guide didn't work.

    How to reproduce it

    I ran bash dockler.sh and the result was:

    root@DS1515:/volume3/Docker_Volume/Sparkler# bash dockler.sh
    Cant find docker image sparkler-local. Going to Fetch it
    Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
    latest: Pulling from uscdatascience/sparkler
    Digest: sha256:4395aa8e69a220cd3bf52ada94aa6dc2ed3e84919470a007faf9cf80f89308eb
    Status: Image is up to date for uscdatascience/sparkler:latest
    docker.io/uscdatascience/sparkler:latest
    Found image: 7bf3f592ca23
    Going to launch the shell inside sparkler's docker container.
    You can press CTRL-D to exit.
    You can rerun this script to resume.
    You can access solr at http://localhost:8983/solr when solr is running
    You can spark master UI at http://localhost:4041/ when spark master is running
    
    Some useful queries:
    
    - Get stats on groups, status, depth:
        http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&&facet.field=crawl_id&facet.field=status&facet.field=group&facet.field=discover_depth
    
    Inside docker, you can do the following:
    
    /data/solr/bin/solr - command line tool for administering solr
        start -force -> start solr
        stop -force -> stop solr
        status -force -> get status of solr
        restart -force -> restart solr
    
    /data/sparkler/bin/sparkler.sh - command line interface to sparkler
       inject - inject seed urls
       crawl - launch a crawl job
    

    As a second step, I ran /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news' and got:

    bash-4.2$ /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    2021-11-27 23:18:42 INFO  PluginService$:53 - Loading plugins...
    2021-11-27 23:18:42 INFO  PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
    2021-11-27 23:18:42 WARN  PluginService$:65 - 4 extra plugin(s) available but not activated: Set(fetcher-chrome, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
    2021-11-27 23:18:42 DEBUG PluginService$:68 - Loading urlfilter-regex
    2021-11-27 23:18:42 INFO  PluginService$:73 - Extensions found: []
    2021-11-27 23:18:42 DEBUG PluginService$:68 - Loading urlfilter-samehost
    2021-11-27 23:18:42 INFO  PluginService$:73 - Extensions found: []
    2021-11-27 23:18:42 INFO  PluginService$:82 - Recognised Plugins: Map()
    2021-11-27 23:18:42 INFO  Injector$:108 - Injecting 1 seeds
    2021-11-27 23:18:43 WARN  SolrProxy:93 - Caught Server refused connection at: http://localhost:8983/solr/crawldb while adding beans, trying to add one by one
    2021-11-27 23:18:43 WARN  SolrProxy:100 - (SKIPPED) Server refused connection at: http://localhost:8983/solr/crawldb while adding [!!!edu.usc.irds.sparkler.model.Resource@26a529dc=>java.util.IllegalFormatConversionException:f != java.util.HashMap!!!]
    2021-11-27 23:18:43 DEBUG SolrProxy:101 - Server refused connection at: http://localhost:8983/solr/crawldb
    org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
            at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.SolrClient.addBean(SolrClient.java:285) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at org.apache.solr.client.solrj.SolrClient.addBean(SolrClient.java:267) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResources(SolrProxy.scala:97) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:111) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:43) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:162) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.service.Injector.main(Injector.scala) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
            at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
            at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
            at java.lang.reflect.Method.invoke(Method.java:567) ~[?:?]
            at edu.usc.irds.sparkler.Main$.main(Main.scala:50) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
            at edu.usc.irds.sparkler.Main.main(Main.scala) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
    Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
            at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            ... 19 more
    Caused by: java.net.ConnectException: Connection refused
            at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
            at sun.nio.ch.Net.pollConnectNow(Net.java:579) ~[?:?]
            at sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542) ~[?:?]
            at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597) ~[?:?]
            at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:339) ~[?:?]
            at java.net.Socket.connect(Socket.java:603) ~[?:?]
            at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
            at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
            ... 19 more
    Exception in thread "main" java.lang.reflect.InvocationTargetException
            at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.base/java.lang.reflect.Method.invoke(Method.java:567)
            at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
            at edu.usc.irds.sparkler.Main.main(Main.scala)
    Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
            at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672)
            at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
            at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
            at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
            at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:504)
            at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:479)
            at edu.usc.irds.sparkler.storage.solr.SolrProxy.commitCrawlDb(SolrProxy.scala:112)
            at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:112)
            at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
            at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
            at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:43)
            at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:162)
            at edu.usc.irds.sparkler.service.Injector.main(Injector.scala)
            ... 6 more
    Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
            at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
            at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
            at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
            at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
            at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
            at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
            at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
            at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
            at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
            ... 18 more
    Caused by: java.net.ConnectException: Connection refused
            at java.base/sun.nio.ch.Net.pollConnect(Native Method)
            at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:579)
            at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
            at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
            at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:339)
            at java.base/java.net.Socket.connect(Socket.java:603)
            at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
            at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
            ... 28 more
    2021-11-27 23:18:43 WARN  PluginService$:49 - Stopping all plugins... Runtime is about to exit.
    

    Environment and Version Information

    Please indicate relevant versions, including, if relevant:

    • Java Version: 1.8.0_275, but I thought it was provided inside your Docker image
    • Spark Version: I thought it was already installed inside your Docker image; if not, I haven't installed it
    • Operating System name and version: Docker installed on a Synology DS1515+. Running docker version returns:

      Client:
       Version:           20.10.3
       API version:       1.41
       Go version:        go1.15.6
       Git commit:        b35e731
       Built:             Fri Jun 18 08:25:45 2021
       OS/Arch:           linux/amd64
       Context:           default
       Experimental:      true

      Server:
       Engine:
        Version:          20.10.3
        API version:      1.41 (minimum version 1.12)
        Go version:       go1.15.6
        Git commit:       e7f7c95
        Built:            Fri Jun 18 08:26:10 2021
        OS/Arch:          linux/amd64
        Experimental:     false
       containerd:
        Version:          v1.4.3
        GitCommit:        b1dc45ec561bd867c4805eee786caab7cc83acae
       runc:
        Version:          v1.0.0-rc93
        GitCommit:        89783e1862a2cc04647ab15b6e88a0af3d66fac3
       docker-init:
        Version:          0.19.0
        GitCommit:        12b6a20

    Any external links for reference

    Nah, just tell me whether Java and Spark are inside your Docker image or not. If they are not and I have to install them, you can close this ticket.

    Contributing

    I'm willing to contribute

    opened by francesco1119 19
  • De-Duplicate documents in CrawlDB (Solr)


    Now that we have a more sophisticated definition of the id field (with timestamp included), we have to think about de-duplication of documents.

    I am opening a discussion channel here to define de-duplication. Some of the suggestions are:

    • Compare the SHA-256 hash of the raw_content, i.e. the signature field (but this still forces fetching of the duplicate document even though we do not store it)
    • Compare the url field

    We can refer here for the implementation.
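
    As a concrete starting point, duplicates by content signature can already be spotted with a plain facet query against the crawldb. A minimal sketch, assuming the signature field is populated and Solr runs on the quick-start port:

    # List signature values that occur more than once, i.e. candidate duplicates
    # (assumes a populated signature field in the crawldb core on localhost:8983)
    curl "http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&facet.field=signature&facet.mincount=2"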

    Discussion 
    opened by karanjeets 16
  • Error when injecting urls


    I am new to Sparkler. I began with the requirements provided and everything worked well. Then, when I started to inject some URLs, I ran into an issue.

    Here are the steps that I followed to inject URLs:

    1. cd to the root directory of the project
    2. docker run -it sparkler-local
    3. /data/solr/bin/solr
    4. /data/sparkler/bin/sparkler.sh
    5. java -jar sparkler-app-0.1-SNAPSHOT.jar inject -sf seed.txt

    The expected output is:

    2016-06-07 19:22:49 INFO  Injector$:70 [main] - Injecting 2 seeds
    >>jobId = sparkler-job-1465352569649
    
    

    But I got this output instead: Error: Unable to access jarfile sparkler-app-0.1-SNAPSHOT.jar

    I don't understand what caused this problem or how I can resolve it. I would be very grateful for an answer. I look forward to your reply.

    Thank you.

    opened by User12300 15
  • Mvn2sbt


    It works but needs some finishing off; still, to surface it here: we now have a working SBT build.

    Why move from Maven to SBT, you ask? Because IDEs can then load the code and work properly with the Java and Scala mix, rather than the crazy state we were in with the old build.

    opened by buggtb 13
  • Sparkler Configuration Update


    I have added validations and type checks to the Sparkler configuration. I have also added some TODO comments that I think deserve some thought. @thammegowda @karanjeets, please take a look and let me know your suggestions; I will make the changes. I also found something very interesting: there are multiple copies of the same kind of file present.

    ./resources/sparkler-default.yaml
    ./sparkler-app/target/test-classes/sparkler-default.yaml
    ./sparkler-app/src/test/resources/sparkler-default.yaml
    ./conf/sparkler-default.yaml
    ./sparkler-api/target/classes/sparkler-default.yaml
    

    There are multiple files like sparkler-default.yaml. #93: a bit of work.

    opened by SHASHANK-PRO-05 13
  • banana dashboard does not load the injected jobs


    Issue Description

    Hello. I am trying to run Sparkler using the Docker image. After I inject the jobs with an id and try to see the index and other behaviors in the Banana dashboard, it always says zero. I don't know why this happens.

    I use Ubuntu 16.04 LTS, Docker version 17.09.0-ce (build afdb6d4), JDK 1.8, and the latest Spark version. Thank you!

    opened by YehualashetGit 12
  • Working with remote spark.


    Has anyone tried this with a non-local Spark?

    I ask because when I try to run on a remote Spark I get class-mismatch errors:

    java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc serialVersionUID = -5447855329526097695, local class serialVersionUID = -2221986757032131007

    But when I check the versions: you use Spark 1.6.1, which requires Scala 2.10.x, yet the build uses Scala 2.11.x, and if you try to downgrade to 2.10 it doesn't compile.

    opened by buggtb 11
  • Sparkler Elasticsearch storage engine


    We are moving ahead with this project within the USC CSCI 401 Senior Capstone Program. My stakeholder needs are as follows:

    1. As a Crawler Administrator, I need to be able to specify and configure the Sparkler storage engine through configuration parameters.
    2. As a Data Analyst, I need access to a dashboard which displays crawl information.
    3. As a DevOps Engineer, I need to deploy Sparkler (configured with the Elasticsearch storage engine) via Docker and as a Helm Chart.
    4. As a Test and Quality Assurance Engineer, I need to integrate Sparkler tests for the Elasticsearch storage engine into my CI process.
    5. As a Development Lead, I need access to developer documentation covering the Elasticsearch storage engine for Sparkler.
    

    Sparkler committers, I wonder if it is possible for us to use the GitHub Projects feature to manage the project?

    Capstone Team, please reply with your Github ID here.

    opened by lewismc 10
  • Java null pointer error in fetch()


    Hi!

    I am encountering some errors. The program crashes when running 10 crawls and I get the following errors (I marked the key parts in bold). Can you help me figure out why?

    Best,

    1st

    2016-12-26 16:40:24 ERROR Executor:95 [Executor task launch worker-1] - Exception in task 3.0 in stage 1.0 (TID 8)
    org.openqa.selenium.WebDriverException:
    Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
    System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
    Driver info: driver.version: JBrowserDriver
            at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
            at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
            at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
            at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
            at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
            at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
            at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
            at org.apache.spark.scheduler.Task.run(Task.scala:89)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.EOFException
            at java.io.DataInputStream.readByte(DataInputStream.java:267)
            at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
            at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
            at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
            at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
            at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
            ... 20 more

    2nd

    2016-12-26 16:40:24 ERROR TaskSetManager:74 [task-result-getter-3] - Task 3 in stage 1.0 failed 1 times; aborting job
    Exception in thread "main" java.lang.reflect.InvocationTargetException
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
            at edu.usc.irds.sparkler.Main.main(Main.scala)
    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 8, localhost): org.openqa.selenium.WebDriverException:
    Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
    System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
    Driver info: driver.version: JBrowserDriver
            at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
            at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
            at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
            at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
            at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
            at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
            at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
            at org.apache.spark.scheduler.Task.run(Task.scala:89)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.EOFException
            at java.io.DataInputStream.readByte(DataInputStream.java:267)
            at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
            at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
            at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
            at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
            at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
            ... 20 more

    Driver stacktrace:
            at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
            at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
            at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
            at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
            at scala.Option.foreach(Option.scala:257)
            at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
            at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
            at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
            at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:139)
            at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
            at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:121)
            at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
            at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
            at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:211)
            at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
            ... 6 more
    Caused by: org.openqa.selenium.WebDriverException:
    Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
    System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
    Driver info: driver.version: JBrowserDriver
            at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
            at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
            at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
            at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
            at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
            at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
            at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
            at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
            at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
            at org.apache.spark.scheduler.Task.run(Task.scala:89)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.EOFException
            at java.io.DataInputStream.readByte(DataInputStream.java:267)
            at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
            at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
            at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
            at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
            at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
            at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
            ... 20 more

    3rd

    ERROR Utils:95 [Executor task launch worker-2] - Uncaught exception in thread Executor task launch worker-2
    java.lang.NullPointerException
            at org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:95)
            at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
            at org.apache.spark.scheduler.Task.run(Task.scala:93)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    2016-12-26 16:40:57 ERROR Executor:95 [Executor task launch worker-2] - Exception in task 1.0 in stage 1.0 (TID 6)
    java.util.NoSuchElementException: key not found: 6
            at scala.collection.MapLike$class.default(MapLike.scala:228)
            at scala.collection.AbstractMap.default(Map.scala:59)
            at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
            at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:322)
            at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
            at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
            at org.apache.spark.scheduler.Task.run(Task.scala:89)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Exception in thread "Executor task launch worker-2" java.lang.IllegalStateException: RpcEnv already stopped.
            at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
            at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
            at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
            at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
            at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Exception in thread "Executor task launch worker-4" java.lang.IllegalStateException: RpcEnv already stopped.
            at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
            at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
            at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
            at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
            at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)

    ...

    2016-12-26 16:42:35 DEBUG FetcherJBrowser:153 [FelixStartLevel] - Exception Connection refused Build info: version: 'unknown', revision: 'unknown', time: 'unknown' System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91' Driver info: driver.version: JBrowserDriver raised. The driver is either already closed or this is an unknown exception

    Process finished with exit code 1

    opened by arelaxend 9
  • Plugin for fetching pages using a headless browser


    • This is a very first version of the plugin
    • We are looking to improve it further by
      • restricting resources (like CSS, fonts) per the URL filter
      • adding resources already downloaded to the crawled data. For example, while loading a web page in a browser, the images on that page are also downloaded; we can send these downloaded images to the crawldb
    opened by smadha 9
  • Removing sparkler-app-0.2.0-SNAPSHOT.jar


    First, I had an issue with sparkler-app-0.2.0-SNAPSHOT.jar, so I deleted it; then I couldn't rebuild the project. Is there any particular reason to keep sparkler-app-0.2.0-SNAPSHOT.jar? FYI, removing the:

    <dependency>
        <groupId>edu.usc.irds.sparkler</groupId>
        <artifactId>sparkler-api</artifactId>
        <version>0.2.0-SNAPSHOT</version>
    </dependency>
    

    from sparkler-app/pom.xml fixes the issue. I'll do it myself, once I finish my work, if you confirm that it is no longer useful.

    opened by misterpilou 8
  • Bump json5 and react-scripts in /retired/sparkler-ui


    Bumps json5 to 2.2.3 and updates ancestor dependency react-scripts. These dependencies need to be updated together.

    Updates json5 from 0.5.1 to 2.2.3

    Release notes

    Sourced from json5's releases.

    v2.2.3

    v2.2.2

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295).

    v2.2.1

    • Fix: Removed dependence on minimist to patch CVE-2021-44906. (#266)

    v2.2.0

    • New: Accurate and documented TypeScript declarations are now included. There is no need to install @types/json5. (#236, #244)

    v2.1.3 [code, diff]

    • Fix: An out of memory bug when parsing numbers has been fixed. (#228, #229)

    v2.1.2

    • Fix: Bump minimist to v1.2.5. (#222)

    v2.1.1

    • New: package.json and package.json5 include a module property so bundlers like webpack, rollup and parcel can take advantage of the ES Module build. (#208)
    • Fix: stringify outputs \0 as \\x00 when followed by a digit. (#210)
    • Fix: Spelling mistakes have been fixed. (#196)

    v2.1.0

    • New: The index.mjs and index.min.mjs browser builds in the dist directory support ES6 modules. (#187)

    v2.0.1

    • Fix: The browser builds in the dist directory support ES5. (#182)

    v2.0.0

    • Major: JSON5 officially supports Node.js v6 and later. Support for Node.js v4 has been dropped. Since Node.js v6 supports ES5 features, the code has been rewritten in native ES5, and the dependence on Babel has been eliminated.

    • New: Support for Unicode 10 has been added.

    • New: The test framework has been migrated from Mocha to Tap.

    • New: The browser build at dist/index.js is no longer minified by default. A minified version is available at dist/index.min.js. (#181)

    • Fix: The warning has been made clearer when line and paragraph separators are

    ... (truncated)

    Changelog

    Sourced from json5's changelog.

    v2.2.3 [code, diff]

    v2.2.2 [code, diff]

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295).

    v2.2.1 [code, diff]

    • Fix: Removed dependence on minimist to patch CVE-2021-44906. (#266)

    v2.2.0 [code, diff]

    • New: Accurate and documented TypeScript declarations are now included. There is no need to install @types/json5. (#236, #244)

    v2.1.3 [code, diff]

    • Fix: An out of memory bug when parsing numbers has been fixed. (#228, #229)

    v2.1.2 [code, diff]

    • Fix: Bump minimist to v1.2.5. (#222)

    v2.1.1 [code, diff]

    ... (truncated)

    Commits
    • c3a7524 2.2.3
    • 94fd06d docs: update CHANGELOG for v2.2.3
    • 3b8cebf docs(security): use GitHub security advisories
    • f0fd9e1 docs: publish a security policy
    • 6a91a05 docs(template): bug -> bug report
    • 14f8cb1 2.2.2
    • 10cc7ca docs: update CHANGELOG for v2.2.2
    • 7774c10 fix: add proto to objects and arrays
    • edde30a Readme: slight tweak to intro
    • 97286f8 Improve example in readme
    • Additional commits viewable in compare view

    Updates react-scripts from 2.1.8 to 5.0.1

    Changelog

    Sourced from react-scripts's changelog.

    3.0.0 and Newer Versions

    Please refer to CHANGELOG.md for the newer versions.

    Commits


    dependencies javascript 
    opened by dependabot[bot] 1
  • Bump express from 4.16.4 to 4.18.2 in /retired/sparkler-ui


    Bumps express from 4.16.4 to 4.18.2.

    Release notes

    Sourced from express's releases.

    4.18.2

    4.18.1

    • Fix hanging on large stack of sync routes

    4.18.0

    ... (truncated)

    Changelog

    Sourced from express's changelog.

    4.18.2 / 2022-10-08

    4.18.1 / 2022-04-29

    • Fix hanging on large stack of sync routes

    4.18.0 / 2022-04-25

    ... (truncated)

    Commits


    dependencies javascript 
    opened by dependabot[bot] 1
  • Bump decode-uri-component from 0.2.0 to 0.2.2 in /retired/sparkler-ui


    Bumps decode-uri-component from 0.2.0 to 0.2.2.

    Release notes

    Sourced from decode-uri-component's releases.

    v0.2.2

    • Prevent overwriting previously decoded tokens 980e0bf

    https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.1...v0.2.2

    v0.2.1

    • Switch to GitHub workflows 76abc93
    • Fix issue where decode throws - fixes #6 746ca5d
    • Update license (#1) 486d7e2
    • Tidelift tasks a650457
    • Meta tweaks 66e1c28

    https://github.com/SamVerschueren/decode-uri-component/compare/v0.2.0...v0.2.1

    Commits


    dependencies javascript 
    opened by dependabot[bot] 1
  • Bump certifi from 2020.4.5.1 to 2022.12.7 in /retired/sparkler-sce/webui


    Bumps certifi from 2020.4.5.1 to 2022.12.7.

    Commits


    dependencies python 
    opened by dependabot[bot] 1
  • Bump loader-utils and react-scripts in /retired/sparkler-ui


    Bumps loader-utils to 2.0.4 and updates ancestor dependency react-scripts. These dependencies need to be updated together.

    Updates loader-utils from 1.2.3 to 2.0.4

    Release notes

    Sourced from loader-utils's releases.

    v2.0.4

    2.0.4 (2022-11-11)

    Bug Fixes

    v2.0.3

    2.0.3 (2022-10-20)

    Bug Fixes

    • security: prototype pollution exploit (#217) (a93cf6f)

    v2.0.2

    2.0.2 (2021-11-04)

    Bug Fixes

    • base64 generation and unicode characters (#197) (8c2d24e)

    v2.0.1

    2.0.1 (2021-10-29)

    Bug Fixes

    v2.0.0

    2.0.0 (2020-03-17)

    ⚠ BREAKING CHANGES

    • minimum required Node.js version is 8.9.0 (#166) (c937e8c)
    • the getOptions method returns empty object on empty query (#167) (b595cfb)
    • Use md4 by default

    v1.4.2

    1.4.2 (2022-11-11)

    Bug Fixes

    ... (truncated)

    Changelog

    Sourced from loader-utils's changelog.

    2.0.4 (2022-11-11)

    Bug Fixes

    2.0.3 (2022-10-20)

    Bug Fixes

    • security: prototype pollution exploit (#217) (a93cf6f)

    2.0.2 (2021-11-04)

    Bug Fixes

    • base64 generation and unicode characters (#197) (8c2d24e)

    2.0.1 (2021-10-29)

    Bug Fixes

    2.0.0 (2020-03-17)

    ⚠ BREAKING CHANGES

    • minimum required Node.js version is 8.9.0 (#166) (c937e8c)
    • the getOptions method returns empty object on empty query (#167) (b595cfb)
    • Use md4 by default

    1.4.0 (2020-02-19)

    Features

    • the resourceQuery is passed to the interpolateName method (#163) (cd0e428)

    1.3.0 (2020-02-19)

    ... (truncated)

    Commits

    Updates react-scripts from 2.1.8 to 5.0.1

    Changelog

    Sourced from react-scripts's changelog.

    3.0.0 and Newer Versions

    Please refer to CHANGELOG.md for the newer versions.

    Commits


    dependencies javascript 
    opened by dependabot[bot] 1
  • Bump joblib from 0.14.1 to 1.2.0 in /retired/sparkler-sce/webui

    Bumps joblib from 0.14.1 to 1.2.0.

    Changelog

    Sourced from joblib's changelog.

    Release 1.2.0

    • Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported (see the sketch after this changelog excerpt). joblib/joblib#1327

    • Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256

    • Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263

    • Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with mmap_mode != None as the resulting numpy.memmap object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254 (a mmap_mode sketch follows the commit list below)

    • Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.

    • Vendor loky 3.3.0 which fixes several bugs including:

      • robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);

      • avoiding leaking worker processes in case of nested loky parallel calls;

      • reliably spawning the correct number of reusable workers.

    Release 1.1.0

    • Fix byte order inconsistency issue during deserialization using joblib.load in cross-endian environment: the numpy arrays are now always loaded to use the system byte order, independently of the byte order of the system that serialized the pickle. joblib/joblib#1181

    • Fix joblib.Memory bug with the ignore parameter when the cached function is a decorated function.

    ... (truncated)
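    To make the pre_dispatch fix above concrete: pre_dispatch is the joblib.Parallel argument that controls how many tasks are dispatched to workers ahead of time, and since 1.2.0 only simple numeric expressions are evaluated for it. A minimal sketch of the supported usage (the workload is illustrative):

    from joblib import Parallel, delayed

    # pre_dispatch limits how many tasks are queued ahead of the workers.
    # Since joblib 1.2.0 only basic numeric expressions such as "2*n_jobs"
    # are accepted here; arbitrary Python is no longer evaluated.
    squares = Parallel(n_jobs=2, pre_dispatch="2*n_jobs")(
        delayed(pow)(i, 2) for i in range(8)
    )
    print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]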

    Commits
    • 5991350 Release 1.2.0
    • 3fa2188 MAINT cleanup numpy warnings related to np.matrix in tests (#1340)
    • cea26ff CI test the future loky-3.3.0 branch (#1338)
    • 8aca6f4 MAINT: remove pytest.warns(None) warnings in pytest 7 (#1264)
    • 067ed4f XFAIL test_child_raises_parent_exits_cleanly with multiprocessing (#1339)
    • ac4ebd5 MAINT add back pytest warnings plugin (#1337)
    • a23427d Test child raises parent exits cleanly more reliable on macos (#1335)
    • ac09691 [MAINT] various test updates (#1334)
    • 4a314b1 Vendor loky 3.2.0 (#1333)
    • bdf47e9 Make test_parallel_with_interactively_defined_functions_default_backend timeo...
    • Additional commits viewable in compare view
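    The memory-alignment fix above matters mainly when arrays are memory-mapped at load time. A minimal sketch of that pattern, using the public joblib dump/load APIs with an illustrative file name:

    import numpy as np
    from joblib import dump, load

    data = np.arange(10)
    dump(data, "data.joblib")  # pickle containing a numpy array

    # mmap_mode != None maps the array buffer directly from disk; the 1.2.0
    # fix ensures that buffer is properly aligned for native code.
    mapped = load("data.joblib", mmap_mode="r")
    print(mapped.sum())  # 45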


    dependencies python 
    opened by dependabot[bot] 1
Owner
USC Information Retrieval & Data Science Group
Open Source Web Crawler for Java

crawler4j crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded…

Yasser Ganjisaffar 4.3k Jan 3, 2023
A scalable web crawler framework for Java.

Readme in Chinese A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction and persistence…

Yihua Huang 10.7k Jan 5, 2023
jQuery-like cross-driver interface in Java for Selenium WebDriver

seleniumQuery Feature-rich jQuery-like Java interface for Selenium WebDriver seleniumQuery is a feature-rich cross-driver Java library that brings a jQuery-like…

69 Nov 27, 2022
Apache Nutch is an extensible and scalable web crawler

Apache Nutch README For the latest information about Nutch, please visit our website at: https://nutch.apache.org/ and our wiki, at: https://cwiki.apa…

The Apache Software Foundation 2.5k Dec 31, 2022
A scalable, mature and versatile web crawler based on Apache Storm

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License…

DigitalPebble Ltd 776 Jan 2, 2023
A simple expressive web framework for java. Spark has a kotlin DSL https://github.com/perwendel/spark-kotlin

Spark - a tiny web framework for Java 8 Spark 2.9.3 is out!! Changeset <dependency> <groupId>com.sparkjava</groupId> <artifactId>spark-core</artifactId>…

Per Wendel 9.4k Dec 29, 2022
vʌvr (formerly called Javaslang) is a non-commercial, non-profit object-functional library that runs with Java 8+. It aims to reduce the lines of code and increase code quality.

Vavr is an object-functional language extension to Java 8, which aims to reduce the lines of code and increase code quality. It provides persistent collections…

vavr 5.1k Jan 3, 2023
Bank Statement Analyzer Application that currently runs in terminal with the commands: javac Application.java java Application [file-name].csv GUI coming soon...

Hayden Hanson 0 May 21, 2022
A virtual Linux shell environment application for Android OS. Runs Alpine Linux in QEMU system emulator. Termux app fork.

vShell (Virtual Shell) — a successor of Termux project which provides an alternate implementation of the Linux terminal emulator for Android OS.

2 Feb 1, 2022
Box86Launcher is a modified version of ptitSeb/box86 which runs the x86 version of WineHQ on Android natively

Box86Launcher Box86Launcher is a modified version of ptitSeb/box86 which runs the x86 version of WineHQ on Android natively. Unlike ExaGear or running Box86…

AkiraYuki 61 Jan 3, 2023
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine learning…

Oryx Project 1.8k Dec 28, 2022
Java port of Brainxyz's Artificial Life, a simple program to simulate primitive Artificial Life using simple rules of attraction or repulsion among atom-like particles, producing complex self-organizing life-like patterns.

ParticleSimulation simple Java port of Brainxyz's Artificial Life A simple program to simulate primitive Artificial Life using simple rules of attraction…

Koonts 3 Oct 5, 2022
Apache Spark - A unified analytics engine for large-scale data processing

Apache Spark Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine…

The Apache Software Foundation 34.7k Jan 2, 2023