Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

Overview

Apache Zeppelin

Documentation: User Guide
Mailing Lists: User and Dev mailing list
Continuous Integration: core frontend rat
Contributing: Contribution Guide
Issue Tracker: Jira
License: Apache 2.0

Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

Core features:

  • Web-based notebook-style editor
  • Built-in Apache Spark support

To learn more about Zeppelin, visit our website https://zeppelin.apache.org

Getting Started

Install binary package

Please go to the install guide to install Apache Zeppelin from the binary package.

Build from source

Please check Build from source to build Zeppelin from source.

Comments
  • R Interpreter for Zeppelin

    This is the initial PR for an R Interpreter for Zeppelin. There's still some work to be done (e.g., tests), but it's usable, and it brings to Zeppelin features from R such as its library of statistics and machine learning packages, as well as advanced interactive visualizations. So I'd like to open it up for others to comment and/or become involved.

    Summary:

    • There are two interpreters: one emulates a REPL; the other uses knitr to weave markdown and formatted R output. The two interpreters share a single execution environment.
    • Visualizations: Besides R's own graphics, this also supports interactive visualizations with googleVis and rCharts. I am working on htmlwidgets (almost done) with the author of that package, and a next-step project is to get Shiny/ggvis working. Sometimes, a visualization won't load until the page is reloaded. I'm not sure why this is.
    • Licensing: To talk to R, this integrates code forked from rScala. rScala was released with a BSD-license option, and the author's permission was obtained.
    • Spark: Getting R to share a single Spark context with the Spark interpreter group is going to be a project. For right now, the R interpreters live in their own "r" interpreter group, and new Spark contexts are created on startup.
    • Zeppelin Context: Not yet integrated, in significant part because there's no ZeppelinContext to talk to until it lives in the Spark interpreter group.
    • Documentation: A notebook is included that demonstrates what the interpreter does and how to use it.
    • Tests: Working on it...

    P.S.: This is my first PR on a project of this size; let me know what I messed up and I'll try to fix it ASAP.

    opened by elbamos 168
  • R and SparkR Support [WIP]

    What is this PR for?

    Implement R and SparkR interpreters as part of the Spark interpreter group. It also implements R and Scala binding (in both directions).

    What type of PR is it?

    [Feature]

    Todos

    • [ ] - Documentation
    • [ ] - Unit test (if relevant, as we depend on R being available on the host)
    • [ ] - Assess licensing (a priori OK, as we don't deliver anything outside ASL2, but we should correctly phrase the NOTICE, as we depend on non-ASL2-compatible licenses (R) to run the interpreter).

    Is there a relevant Jira issue?

    https://issues.apache.org/jira/browse/ZEPPELIN-156 SparkR support

    How should this be tested?

    You need R available on the host running the notebook.

    For Centos: yum install R R-devel
    For Ubuntu: apt-get install r-base r-cran-rserve
    

    Install additional R packages:

    curl https://cran.r-project.org/src/contrib/Archive/rscala/rscala_1.0.6.tar.gz -o /tmp/rscala_1.0.6.tar.gz
    R CMD INSTALL /tmp/rscala_1.0.6.tar.gz
    R -e "install.packages('ggplot2', repos = 'http://cran.us.r-project.org')"
    R -e "install.packages('knitr', repos = 'http://cran.us.r-project.org')"
    
    • Build + launch Zeppelin and test the R note.

    Note: you need rscala_1.0.6 (otherwise, you need to build with -Drscala.version=...)

    Screenshots (if appropriate)

    [Screenshots: Simple R; Plot; Scala R Binding; R Scala Binding; SparkR]

    Questions:

    • Do the license files need an update? To be checked... (cf. R needs to be available to make this interpreter operational).
    • Are there breaking changes for older versions? No
    • Does this need documentation? Yes
    opened by echarles 80
  • [ZEPPELIN-116] Add Apache Mahout Interpreter

    What is this PR for?

    This PR adds Mahout functionality for the Spark Interpreter.

    What type of PR is it?

    Improvement

    Todos

    • [x] Implement Mahout Interpreter in Spark
    • [x] Add Unit Tests
    • [x] Add Documentation
    • [x] Add Example Notebook

    What is the Jira issue?

    https://issues.apache.org/jira/browse/ZEPPELIN-116

    How should this be tested?

    Open a Spark Notebook with Mahout enabled and run a few simple commands using the R-Like DSL and Spark Distributed Context (Mahout Specific)

    Screenshots (if appropriate)

    Questions:

    • Do the license files need an update? No
    • Are there breaking changes for older versions? No
    • Does this need documentation? Yes
    opened by rawkintrevo 71
  • JDBC generic interpreter

    You only need to add the JDBC connector jar to the classpath and add the particular properties for each database to the interpreter. In the file zeppelin-daemon.sh, add: ZEPPELIN_CLASSPATH+=":${ZEPPELIN_HOME}/jdbc/jdbc/connector jar"

    opened by vgmartinez 54
  • Add support to run Spark interpreter on a Kubernetes cluster

    What is this PR for?

    The goal of this PR is to execute Spark notebooks on Kubernetes in cluster mode, so that the Spark driver runs inside the Kubernetes cluster, based on https://github.com/apache-spark-on-k8s/spark. Zeppelin uses spark-submit to start RemoteInterpreterServer, which executes notebooks on Spark. Kubernetes-specific spark-submit parameters (driver, executor, init container, and shuffle images) should be set in the SPARK_SUBMIT_OPTIONS environment variable. When the Spark interpreter is configured with a Kubernetes-specific master URL (k8s://https....), RemoteInterpreterServer is launched inside a Spark driver pod on Kubernetes, so the Zeppelin server has to be able to connect to the remote server. In a Kubernetes cluster, the best solution for this is to create a K8S service for RemoteInterpreterServer. This is the reason for SparkK8RemoteInterpreterManagerProcess, which extends RemoteInterpreterManagerProcess: it creates the Kubernetes service, mapping the port of RemoteInterpreterServer in the driver pod, and connects to this service once the Spark driver pod is in the Running state.
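
As a rough illustration of the launch decision described above, the sketch below picks a process manager from the master URL. The helper names `is_k8s_master` and `choose_process_manager` are hypothetical and the logic is an assumption; only the two class names and the k8s:// prefix come from the PR text, and Zeppelin's actual implementation is in Java and may differ.

```python
def is_k8s_master(master_url: str) -> bool:
    """Return True when the Spark master URL uses the k8s:// scheme."""
    return master_url.strip().startswith("k8s://")


def choose_process_manager(master_url: str) -> str:
    """Name the process manager that would handle this master URL.

    Hypothetical helper: the returned strings are the class names mentioned
    in the PR description, not a call into Zeppelin itself.
    """
    if is_k8s_master(master_url):
        # The driver runs in its own pod, so a Kubernetes service is needed
        # for the Zeppelin server to reach RemoteInterpreterServer.
        return "SparkK8RemoteInterpreterManagerProcess"
    return "RemoteInterpreterManagerProcess"
```

Any master URL without the k8s:// prefix (for example a local or YARN master) falls through to the regular process manager in this sketch.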

    Design considerations: As described in spark-interpreter-k8s.md, the Zeppelin server runs inside the Kubernetes cluster. We can choose where to run the Zeppelin server; the benefit of running it inside K8S is that we don't have to deal with authentication. However, it is not enough to start only the Zeppelin server inside the Kubernetes cluster, as by default Zeppelin will start spark-submit in the same pod and run every Spark job locally. The scope of this PR is to run spark-submit (apache-spark-on-k8s version) properly configured with Docker images etc., so that the Spark driver is started in a separate pod in the cluster, along with separate pods for Spark executors. This way we benefit from dynamic scaling of executors inside the Kubernetes cluster, while all the scheduling, pod allocation, and resource management is done by the Kubernetes scheduler.

    See below how this runs and is used:

    The cluster: [figure: Spark/Zeppelin cluster]

    The flow: [figure: Zeppelin Flow]

    What type of PR is it?

    Feature

    What is the Jira issue?

    • https://issues.apache.org/jira/browse/ZEPPELIN-3020

    How should this be tested?

    Unit and functional tests - running notebooks on Spark on K8S.

    Questions:

    • Do the license files need an update?
    • Are there breaking changes for older versions?
    • Does this need documentation?
    opened by matyix 53
  • [ZEPPELIN-2598] Securing Zeppelin with OpenID Connect

    What is this PR for?

    Integrates OpenID Connect login into Zeppelin, leveraging Shiro (already present) and Pac4J (which needs to be on the classpath). The modifications made here should not affect any existing mechanisms; they simply integrate and enable new ones.

    What type of PR is it?

    [Improvement]

    What is the Jira issue?

    [ZEPPELIN-2598]

    Questions:

    • Do the license files need an update?
    • Are there breaking changes for older versions?
    • Does this need documentation?
    opened by andreaTP 52
  • [ZEPPELIN-1210] Run interpreter per user

    What is this PR for?

    Enables each user to run the same interpreter.

    What type of PR is it?

    [Improvement]

    What is the Jira issue?

    https://issues.apache.org/jira/browse/ZEPPELIN-1210

    How should this be tested?

    1. Enable Shiro authentication
    2. Check "per user" in the interpreter tab
    3. Run different paragraphs as different users
      1. Run %spark sc.version as each user; you will see res0: ... in both paragraphs
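
The expected result in step 3 (each user sees res0) follows from every user getting a separate interpreter session. The sketch below is a toy model of that idea, not Zeppelin code: `ToyRepl` and `PerUserSessions` are hypothetical names, and the result numbering only imitates how a Scala REPL labels values.

```python
from typing import Callable, Dict


class ToyRepl:
    """Stand-in for an interpreter session; numbers results like a Scala REPL."""

    def __init__(self) -> None:
        self._next_index = 0

    def run(self, value: str) -> str:
        # Each evaluation gets the next resN label within this session only.
        name = f"res{self._next_index}"
        self._next_index += 1
        return f"{name}: {value}"


class PerUserSessions:
    """Hands out one session per user, creating it on first use."""

    def __init__(self, factory: Callable[[], ToyRepl] = ToyRepl) -> None:
        self._factory = factory
        self._sessions: Dict[str, ToyRepl] = {}

    def session_for(self, user: str) -> ToyRepl:
        if user not in self._sessions:
            self._sessions[user] = self._factory()
        return self._sessions[user]
```

With a shared session the second user would see res1; with per-user sessions both users start from res0, which is what the test steps check for.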

    Screenshots (if appropriate)

    Questions:

    • Do the license files need an update? No
    • Are there breaking changes for older versions? No
    • Does this need documentation? No
    opened by jongyoul 50
  • [ZEPPELIN-672] Add Datatables instead of normal tables

    What is this PR for?

    It is an extension of #6 #714. It allows users to export data in a paragraph to a TSV/CSV file.

    What type of PR is it?

    Feature

    Todos

    • [x] - Improves the current Table features like Search, Fixed Headers, Sorting

    Is there a relevant Jira issue?

    ZEPPELIN-672

    How should this be tested?

    1. Create a paragraph with data in %table view
    2. Click on TSV/CSV button to export CSV/TSV file
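
To make the export step concrete, here is a minimal sketch of turning table text into CSV or TSV. It assumes the paragraph output is already stripped of its %table prefix and uses tabs between columns and newlines between rows; the function name `table_to_delimited` is hypothetical, and the PR itself does this in the web front end rather than in Python.

```python
import csv
import io


def table_to_delimited(table_text: str, delimiter: str = ",") -> str:
    """Convert tab-separated table text into delimited text (CSV by default)."""
    # Split into rows on newlines and into cells on tabs.
    rows = [line.split("\t") for line in table_text.strip("\n").split("\n")]
    buffer = io.StringIO()
    # csv.writer quotes cells that contain the delimiter or quote characters.
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing delimiter="\t" yields TSV. Using a CSV writer instead of a plain join keeps cells that contain commas intact.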

    Screenshots (if appropriate)

    Questions:

    • Do the license files need an update? The MIT license for Datatables needs to be added.
    • Are there breaking changes for older versions? No
    • Does this need documentation? No
    opened by ankurmitujjain 50
  • [ZEPPELIN-2582][DOCS] docs for interpreter binding modes

    What is this PR for?

    Updated interpreter_binding_mode.md, since users are sometimes confused about what this mode means and there is already an open JIRA issue. This documentation will be helpful to Zeppelin users.

    disclaimer: content was copied from here with author's consent.

    What type of PR is it?

    [Documentation]

    Todos

    DONE

    What is the Jira issue?

    ZEPPELIN-2582

    How should this be tested?

    1. Set up rvm 2.1.0+ and install the required bundles for docs/
    2. bundle exec jekyll serve --watch
    3. http://localhost:4000/usage/interpreter/interpreter_binding_mode.html

    Screenshots (if appropriate)

    Questions:

    • Do the license files need an update? - NO
    • Are there breaking changes for older versions? - NO
    • Does this need documentation? - This PR is about docs.
    opened by 1ambda 49
  • [ZEPPELIN-707]Automatically adds %.* of previous paragraph's typing

    What is this PR for?

    Automatically adds %pyspark to the code area when a new paragraph is created, if the previous paragraph starts with %pyspark.
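
The behaviour can be sketched as a small text transformation. This is an assumption-laden illustration, not the PR's code (which lives in the web front end): the function name `initial_text_for_new_paragraph` is hypothetical, and ending the pre-filled text with a newline is a guess at the intended cursor position.

```python
import re

# Matches a leading interpreter directive such as "%pyspark" or "%spark.sql".
_DIRECTIVE = re.compile(r"^\s*(%[\w.]+)")


def initial_text_for_new_paragraph(previous_text: str) -> str:
    """Return the text to pre-fill in a new paragraph.

    If the previous paragraph starts with an interpreter directive, carry it
    over followed by a newline; otherwise start with an empty paragraph.
    """
    match = _DIRECTIVE.match(previous_text or "")
    return f"{match.group(1)}\n" if match else ""
```

A paragraph that does not begin with a directive leaves the new paragraph empty, so the default interpreter still applies.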

    What type of PR is it?

    New Feature

    What is the Jira issue?

    https://issues.apache.org/jira/browse/ZEPPELIN-707

    How should this be tested?

    Unit test and run-time checking.

    Screenshots (if appropriate)

    • Default interpreter [screenshot: zeppelin-707]
    • While a paragraph runs, go to line end [screenshot: zeppelin-707-2]
    opened by mwkang 49
  • ZEPPELIN-157: Adding Map Visualization for Zeppelin

    • [x] Listing Map charting Libraries
    • [x] Checking compatible license
    • [x] Adding chart library (leafletjs)
    • [x] Adding new button for chart mapping
    • [x] Loading GIS Map in Zeppelin
    • [x] Removing the setting for mapping chart
    • [x] Adding sample data set (loading)
      • [x] Get the sample GIS data and read it
      • [x] Add it as in the tutorial, where it can be run from a clean environment
    • [x] Generalization data validation model
      • [x] Add 'validator' for the Map graph input format as service
    • [x] Test for this
    opened by Madhuka 48
  • CI broken

    What is this PR for?

    At the moment the CI is broken. This pull request is intended to be a common basis to fix the error quickly.

    What type of PR is it?

    Bug Fix

    Todos

    • [ ] - Fix CI

    How should this be tested?

    • via current CI

    Questions:

    • Do the license files need an update? No
    • Are there breaking changes for older versions? No
    • Does this need documentation? No
    opened by Reamer 4
  • Bump json5 from 1.0.1 to 1.0.2 in /zeppelin-web-angular

    Bumps json5 from 1.0.1 to 1.0.2.

    Release notes

    Sourced from json5's releases.

    v1.0.2

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295). This has been backported to v1. (#298)
    Changelog

    Sourced from json5's changelog.

    Unreleased [code, diff]

    v2.2.3 [code, diff]

    v2.2.2 [code, diff]

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295).

    v2.2.1 [code, diff]

    • Fix: Removed dependence on minimist to patch CVE-2021-44906. (#266)

    v2.2.0 [code, diff]

    • New: Accurate and documented TypeScript declarations are now included. There is no need to install @types/json5. (#236, #244)

    v2.1.3 [code, diff]

    • Fix: An out of memory bug when parsing numbers has been fixed. (#228, #229)

    v2.1.2 [code, diff]

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies javascript 
    opened by dependabot[bot] 0
  • Bump jettison from 1.4.0 to 1.5.2

    Bumps jettison from 1.4.0 to 1.5.2.

    Release notes

    Sourced from jettison's releases.

    Jettison 1.5.2

    What's Changed

    Full Changelog: https://github.com/jettison-json/jettison/compare/jettison-1.5.1...jettison-1.5.2

    Jettison 1.5.1

    What's Changed

    Full Changelog: https://github.com/jettison-json/jettison/compare/jettison-1.5.0...jettison-1.5.1

    Commits
    • 6dc73a0 [maven-release-plugin] prepare release jettison-1.5.2
    • 19ae19f Fixing StackOverflow error
    • 325b51b Bump woodstox-core from 6.2.8 to 6.4.0
    • 81d3786 [maven-release-plugin] prepare for next development iteration
    • bdb3982 [maven-release-plugin] prepare release jettison-1.5.1
    • 1268b75 Prevent infinite loop when a /* comment is not terminated
    • cff9f28 Create codeql-analysis.yml
    • 395f862 Stack Overflow fix on malformed JSON
    • a5d2223 [maven-release-plugin] prepare for next development iteration
    • e1bf529 [maven-release-plugin] prepare release jettison-1.5.0
    • Additional commits viewable in compare view

    dependencies java 
    opened by dependabot[bot] 0
  • [MINOR] Support aarch for frontend plugin

    What is this PR for?

    Fixes a build error when using an aarch JDK on macOS. Our version of node doesn't support the amd64 arch, but the old plugin tries to download the arm version.

    What type of PR is it?

    Bug Fix

    Todos

    • [x] - Update frontend-maven-plugin

    What is the Jira issue?

    N/A

    How should this be tested?

    • The build passes with an arm JDK on macOS

    Screenshots (if appropriate)

    Questions:

    • Do the license files need an update? No
    • Are there breaking changes for older versions? No
    • Does this need documentation? No
    opened by jongyoul 5
  • Bump express from 4.17.1 to 4.18.2 in /zeppelin-web-angular

    Bumps express from 4.17.1 to 4.18.2.

    Release notes

    Sourced from express's releases.

    4.18.2

    4.18.1

    • Fix hanging on large stack of sync routes

    4.18.0

    ... (truncated)

    Changelog

    Sourced from express's changelog.

    4.18.2 / 2022-10-08

    4.18.1 / 2022-04-29

    • Fix hanging on large stack of sync routes

    4.18.0 / 2022-04-25

    ... (truncated)

    Commits

    dependencies javascript 
    opened by dependabot[bot] 0
  • Bump express from 4.16.4 to 4.17.3 in /zeppelin-web

    Bumps express from 4.16.4 to 4.17.3.

    Release notes

    Sourced from express's releases.

    4.17.3

    4.17.2

    4.17.1

    • Revert "Improve error message for null/undefined to res.status"

    4.17.0

    • Add express.raw to parse bodies into Buffer
    • Add express.text to parse bodies into string

    ... (truncated)

    Changelog

    Sourced from express's changelog.

    4.17.3 / 2022-02-16

    4.17.2 / 2021-12-16

    4.17.1 / 2019-05-25

    ... (truncated)

    Commits

    dependencies javascript 
    opened by dependabot[bot] 0
Owner
The Apache Software Foundation
Apache Druid: a high performance real-time analytics database.

Website | Documentation | Developer Mailing List | User Mailing List | Slack | Twitter | Download Apache Druid Druid is a high performance real-time a

The Apache Software Foundation 12.3k Jan 9, 2023
The official home of the Presto distributed SQL query engine for big data

Presto Presto is a distributed SQL query engine for big data. See the User Manual for deployment instructions and end user documentation. Requirements

Presto 14.3k Jan 5, 2023
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

Apache Gobblin Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems. Ca

The Apache Software Foundation 2.1k Jan 4, 2023
A scalable, mature and versatile web crawler based on Apache Storm

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under Apache Li

DigitalPebble Ltd 776 Jan 2, 2023
Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more

IMPORTANT NOTE!!! Storm has Moved to Apache. The official Storm git repository is now hosted by Apache, and is mirrored on github here: https://github

Nathan Marz 8.9k Dec 26, 2022
OpenRefine is a free, open source power tool for working with messy data and improving it

OpenRefine OpenRefine is a Java-based power tool that allows you to load data, understand it, clean it up, reconcile it, and augment it with data comi

OpenRefine 9.2k Jan 1, 2023
A platform for visualization and real-time monitoring of data workflows

Status This project is no longer maintained. Ambrose Twitter Ambrose is a platform for visualization and real-time monitoring of MapReduce data workfl

Twitter 1.2k Dec 31, 2022
Netflix's distributed Data Pipeline

Suro: Netflix's Data Pipeline Suro is a data pipeline service for collecting, aggregating, and dispatching large volume of application events includin

Netflix, Inc. 772 Dec 9, 2022
Hadoop library for large-scale data processing, now an Apache Incubator project

Apache DataFu Follow @apachedatafu Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by

LinkedIn's Attic 589 Apr 1, 2022
SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams.

SAMOA: Scalable Advanced Massive Online Analysis. This repository is discontinued. The development of SAMOA has moved over to the Apache Software Foun

Yahoo Archive 424 Dec 28, 2022
Program finds average number of words in each comment given a large data set by use of hadoop's map reduce to work in parallel efficiently.

Finding average number of words in all the comments in a data set ?? Mapper Function In the mapper function we first tokenize entire data and then fin

Aleezeh Usman 3 Aug 23, 2021
Access paged data as a "stream" with async loading while maintaining order

DataStream What? DataStream is a simple piece of code to access paged data and interface it as if it's a single "list". It only keeps track of queued

Thomas 1 Jan 19, 2022
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Elephant Bird About Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats,

Twitter 1.1k Jan 5, 2023
Stream summarizer and cardinality estimator.

Description A Java library for summarizing data in streams for which it is infeasible to store all events. More specifically, there are classes for es

AddThis 2.2k Dec 30, 2022
Machine Learning Platform and Recommendation Engine built on Kubernetes

Update January 2018 Seldon Core open sourced. Seldon Core focuses purely on deploying a wide range of ML models on Kubernetes, allowing complex runtim

Seldon 1.5k Dec 15, 2022
Desktop app to browse and administer your MongoDB cluster

UMONGO, the MongoDB GUI UMONGO, the MongoDB GUI About This version of UMongo is provided as free software by EdgyTech LLC. UMongo is open source, and

Antoine Girbal 583 Nov 11, 2022
Spring Data Redis extensions for better search, documents models, and more

Object Mapping (and more) for Redis! Redis OM Spring extends Spring Data Redis to take full advantage of the power of Redis. Project Stage Snapshot Is

Redis 303 Dec 29, 2022
FLiP: StreamNative: Cloud-Native: Streaming Analytics Using Apache Flink SQL on Apache Pulsar

StreamingAnalyticsUsingFlinkSQL FLiP: StreamNative: Cloud-Native: Streaming Analytics Using Apache Flink SQL on Apache Pulsar Running on NVIDIA XAVIER

Timothy Spann 5 Dec 19, 2021
Maven notebook

Maven notebook Quick topics Maven is a standard tool used for building and managing any Java-based project. Maven buils projects using what's called p

eze 2 Feb 10, 2022
JVM version of Pact. Enables consumer driven contract testing, providing a mock service and DSL for the consumer project, and interaction playback and verification for the service provider project.

pact-jvm JVM implementation of the consumer driven contract library pact. From the Ruby Pact website: Define a pact between service consumers and prov

Pact Foundation 962 Dec 31, 2022