A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

Overview

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grain data deletions)

Highlights

  • Battle tested at scale: runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
  • Feature rich: supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, and more.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrating external vendor APIs (e.g. Salesforce, Dynamics) with data stores (HDFS, Couchbase, etc.)
  • Enforcing data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data-processing tasks to Spark, Hive, etc.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. The report will be generated at build/rat/rat-report.html.

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. Skip tests and build the distribution: run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The distribution will be created in the build/gobblin-distribution/distributions directory; or
  3. Run tests and build the distribution (requires Maven): run ./gradlew build. The distribution will be created in the same build/gobblin-distribution/distributions directory.

Comments
  • Improve database history store performance

    Changes:

    1. Switched database scripts to use flyway. This enables us to keep track of migration scripts, ensure that users can upgrade, and verify that the correct version is installed.
    2. Changed tests from using Derby to using embedded MySQL. This allows our tests to verify the same code, migrations, etc. as will be used in production.
    3. Added a check in DatabaseJobHistoryStore that verifies that the DB is the correct version. If it is not, a warning will be logged and all operations on the store will become no-ops. This ensures that we don't break if users upgrade the code, but forget to upgrade the database.
    4. Changed the Travis build to have the dependencies needed to run embedded MySQL, split the JDK 7/JDK 8 build jobs, and enabled caching. This allows easier identification of problems, improves performance, etc.
    5. Changed DatabaseJobHistoryStore to use batching and upserts to greatly increase performance (see the sketch after this list). Also stopped sending an epoch value for startTime/endTime when they are not specified. Lastly, removed job properties from task properties to reduce data duplication. All this is to improve performance and data correctness.
    6. Changed the admin UI to show '-' for startTime/endTime when not set
    7. Added a step at the end of the Travis build to output test failures to the console. This makes it easier to determine why the build failed.
    8. Enabled support for compression in the JobExecutionInfoServer
    9. Reduced the amount of data requested by Admin UI
    10. Rewrote the DatabaseJobHistoryStore querying code to greatly reduce the number of queries
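
    As a rough illustration of the batching + upsert pattern from item 5, here is a minimal sketch assuming a MySQL-style table; the table and column names are illustrative, not the actual job history schema.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class BatchUpsertSketch {
      /**
       * Illustrative only: send many rows in a single batch and let the database
       * insert-or-update them, instead of issuing one SELECT plus INSERT/UPDATE
       * round trip per row.
       */
      public static void upsertJobStates(Connection conn, List<String[]> rows) throws Exception {
        String sql = "INSERT INTO job_execution (job_id, state) VALUES (?, ?) "
            + "ON DUPLICATE KEY UPDATE state = VALUES(state)";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
          for (String[] row : rows) {
            stmt.setString(1, row[0]);
            stmt.setString(2, row[1]);
            stmt.addBatch();
          }
          stmt.executeBatch();
        }
      }
    }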

    Job History Store Management

    Migrate an unmanaged database (First Time)

    ./historystore-manager.sh migrate -Durl="jdbc:mysql://<SERVER>/gobblin" -Duser="<USER>" -Dpassword="<PASSWORD>" -DbaselineOnMigrate=true
    

    Migrate a database (Subsequent Time)

    ./historystore-manager.sh migrate -Durl="jdbc:mysql://<SERVER>/gobblin" -Duser="<USER>" -Dpassword="<PASSWORD>"
    

    Get database migration status

    ./historystore-manager.sh info -Durl="jdbc:mysql://<SERVER>/gobblin" -Duser="<USER>" -Dpassword="<PASSWORD>"
    
    enhancement 
    opened by kadaan 49
  • Oracle Source/Extractor

    Created OracleSource.java and OracleExtractor.java with heavy inspiration from MySqlSource.java and MySqlExtractor.java.

    The classes are largely the same as the MySql ones with the exception of getSchemaMetadata() which needed to be modified to account for Oracle reserved words such as format and comment. Also timestamp formats had to be slightly refactored.

    Further, different strategies for handling primitive values such as booleans in Oracle needed to be accounted for. This resulted in overriding getSchema(). The only difference is that Oracle treats booleans as varchar2(1). I created a simple casting utility castToBoolean() which I added to SqlQueryUtils.java.

    Since getSchema() requires the private static final Gson instance from JDBCExtractor, I created a getter for access. It is accessed in OracleExtractor's getSchema() method like so: this.getGson()
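
    A minimal sketch of what such a casting utility could look like (the actual method added to SqlQueryUtils.java may differ), mapping Oracle's varchar2(1) flag values to a Java boolean:

    public final class OracleBooleanCastSketch {
      private OracleBooleanCastSketch() {}

      /**
       * Hypothetical helper: Oracle has no native BOOLEAN column type, so flags
       * are commonly stored as VARCHAR2(1) values such as "1"/"0" or "Y"/"N".
       * Interpret those values as a Java boolean.
       */
      public static boolean castToBoolean(String value) {
        if (value == null) {
          return false;
        }
        String v = value.trim();
        return v.equalsIgnoreCase("Y") || v.equalsIgnoreCase("T")
            || v.equals("1") || v.equalsIgnoreCase("true");
      }
    }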

    I am able to successfully complete table_append and table_full jobs using this new source and extractor. I did a clean assemble, and it did not break the build. I believe I stayed consistent with documentation and coding format.

    Please consider my pull request, and let me know if I need to change anything.

    Thanks! Brian Vanover

    P.S. One thing is that in order to use the Oracle source and extractor, ojdbc6.jar must be added as a dependency. I simply copied the jar into gobblin-dist/lib, but it may be worthwhile to add it as a dependency. I just wasn't sure how to modify the build scripts to include it.

    opened by ghost 37
  • Add support for tunneling JDBC traffic over an HTTP proxy

    Frequently data stores to be accessed by Gobblin reside outside data centers. In these cases, outbound access from a data center typically needs to go through a gateway proxy for security purposes. In some cases this is an HTTP proxy. However, some protocols like JDBC don't support the concept of "proxies", let alone HTTP proxies, and hence a solution is needed to enable this.

    This Pull Request provides a method of tunneling JDBC connections over an HTTP proxy. Note that it is currently only implemented for JDBC, but can be extended to work with any other TCP-based protocol.

    The way this works for JDBC is:

    1. When a JdbcProvider is invoked, it checks if the WorkUnit has a proxy host and port defined.
    2. If so, it extracts the remote host and port from the connectionUrl.
    3. It then creates a Tunnel instance configured with the remote host and port. The Tunnel starts a thread that listens on an arbitrary port on localhost.
    4. The JdbcProvider then changes the connectionUrl to replace the remote host and port with the localhost and port and passes it on to the driver.
    5. Hence when the driver creates a JDBC connection, it connects to the Tunnel socket. The Tunnel then connects to the remote host through the proxy via an HTTP CONNECT request.
    6. If established successfully, the Tunnel then just relays bytes back and forth.
    7. When the JDBC data source is closed down, the Tunnel is shut down as well.

    The Tunnel can accept as many connections as the JdbcExtractor opens. The Tunnel uses NIO to minimize resource usage.
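
    For readers unfamiliar with HTTP CONNECT tunneling, here is a minimal sketch of the handshake a tunnel performs against the proxy (illustrative only: the actual implementation in this PR is NIO-based and multiplexes connections, and the class and method names below are hypothetical):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class HttpConnectSketch {
      /**
       * Connect to the HTTP proxy and ask it to CONNECT to the remote database
       * host. After a 200 response the socket is a raw byte pipe to the remote
       * host, so JDBC traffic can be relayed over it unchanged.
       */
      public static Socket openTunnel(String proxyHost, int proxyPort,
          String remoteHost, int remotePort) throws IOException {
        Socket socket = new Socket(proxyHost, proxyPort);
        OutputStream out = socket.getOutputStream();
        String request = "CONNECT " + remoteHost + ":" + remotePort + " HTTP/1.1\r\n"
            + "Host: " + remoteHost + ":" + remotePort + "\r\n\r\n";
        out.write(request.getBytes(StandardCharsets.US_ASCII));
        out.flush();

        // Read the proxy response one byte at a time up to the blank line that
        // ends the headers, so no bytes destined for the remote host are consumed.
        InputStream in = socket.getInputStream();
        ByteArrayOutputStream header = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
          header.write(b);
          byte[] h = header.toByteArray();
          int n = h.length;
          if (n >= 4 && h[n - 4] == '\r' && h[n - 3] == '\n'
              && h[n - 2] == '\r' && h[n - 1] == '\n') {
            break;
          }
        }
        String response = new String(header.toByteArray(), StandardCharsets.US_ASCII);
        if (!response.startsWith("HTTP/1.1 200") && !response.startsWith("HTTP/1.0 200")) {
          socket.close();
          throw new IOException("Proxy refused CONNECT: " + response);
        }
        return socket;
      }
    }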

    opened by kunalkandekar 25
  • Add zookeeper based job lock for gobblin yarn

    @sahilTakiar Here is the ZookeeperBasedJobLock that we discussed in #754. The only part I'm a little unsure about is the cancellation of a running job if the connection to ZK is lost. I think it is needed to ensure that duplicate jobs are not running, but I'm not sure if the means of cancellation is correct.

    enhancement 
    opened by kadaan 21
  • Add inline Hive registration to Gobblin job

    Added a BaseDataPublisherWithHiveRegistration which records the paths that need to be registered as a state property. After JobContext commits the data, it will retrieve the paths from the state and register them in Hive.
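
    A rough sketch of that flow (the class and method names below are illustrative stand-ins, not the actual BaseDataPublisherWithHiveRegistration API):

    import java.util.ArrayList;
    import java.util.List;

    public class HiveRegistrationSketch {
      /** Hypothetical stand-in for whatever performs the actual Hive registration. */
      public interface HiveRegistrar {
        void register(String path);   // e.g. create/alter the table or partition for this path
      }

      private final List<String> pathsStateProperty = new ArrayList<>();

      /** Publisher side: remember each published path in a state property. */
      public void recordPublishedPath(String path) {
        pathsStateProperty.add(path);
      }

      /** After JobContext commits the data: read the paths back and register them. */
      public void registerAfterCommit(HiveRegistrar registrar) {
        for (String path : pathsStateProperty) {
          registrar.register(path);
        }
      }
    }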

    opened by zliu41 20
  • Persist state and commit data on a per-dataset basis

    1. A WorkUnit can be specified to belong to a dataset group by using a new config property, dataset.urn (see the sketch after this list).
    2. Job/task state will be persisted on a per-dataset basis into different state store files, one per dataset. All current dataset state store files will be loaded in the next job run.
    3. Commit of data will be done on a per-dataset basis, allowing datasets to be committed independently.
    4. For existing jobs that do not have a dataset concept or only deal with a single dataset, the logic of job/task state persistence and data commit remains the same.
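
    A minimal sketch of how a source might tag its work units with that property (the WorkUnitProps class below is a hypothetical stand-in for Gobblin's WorkUnit, used only to keep the example self-contained):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;

    public class DatasetUrnTaggingSketch {
      /** Hypothetical stand-in for a Gobblin WorkUnit: a bag of string properties. */
      static class WorkUnitProps extends HashMap<String, String> {}

      /**
       * Tag each work unit with the dataset it belongs to. Work units sharing the
       * same dataset.urn are persisted into the same dataset state-store file and
       * committed together, independently of other datasets.
       */
      public static List<WorkUnitProps> createWorkUnits(List<String> datasetUrns) {
        List<WorkUnitProps> workUnits = new ArrayList<>();
        for (String urn : datasetUrns) {
          WorkUnitProps workUnit = new WorkUnitProps();
          workUnit.put("dataset.urn", urn);   // e.g. one URN per Kafka topic or DB table
          workUnits.add(workUnit);
        }
        return workUnits;
      }
    }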

    Signed-off-by: Yinan Li [email protected]

    opened by liyinan926 19
  • JDBC Writer

    Pull request for JDBC Writer. Will combine to one commit after the review.

    Design doc: https://docs.google.com/a/linkedin.com/document/d/1pSWZtgNssxaV1c19xdd1rKft9JaXM2qmqDwwbIb0scM/edit?usp=sharing

    opened by jinhyukchang 18
  • Separate publisher filesystem from writer filesystem

    Support for publishing data to a different filesystem from where data is written. The use case of this is writing the files to HDFS, but publishing them to S3.

    opened by kadaan 18
  • Adding Support for Complex Watermark Types

    This is the first pull request for adding complex watermark types to Gobblin. This will replace the legacy system of tracking watermarks. The old system was decentralized and depended on passing custom configuration parameters between executions via a WorkUnitState.

    The new implementation contains a new interface called Watermark, which extends the Comparable and Copyable interfaces. It contains only one method, increment(Object record), which defines how to increment the watermark for a given record.

    Another class called WatermarkInterval contains the logic for maintaining low and high watermarks. The corresponding changes to WorkUnit and WorkUnitState have been made. Since this pull request mainly focuses on defining the interfaces, no migration of the framework has been done yet.
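
    Reconstructed from the description above (a sketch only; the actual signatures in the PR may differ, and Copyable here is a local stand-in for Gobblin's interface):

    /** A watermark that can be compared, copied, and advanced record by record. */
    public interface Watermark extends Comparable<Watermark>, Copyable<Watermark> {

      /** Advance this watermark to account for the given record. */
      void increment(Object record);
    }

    /** Minimal stand-in for the Copyable interface, to keep the sketch self-contained. */
    interface Copyable<T> {
      T copy();
    }

    /** Holds the low and high watermarks for a unit of work, per the description. */
    class WatermarkInterval {
      private final Watermark lowWatermark;
      private final Watermark highWatermark;

      WatermarkInterval(Watermark lowWatermark, Watermark highWatermark) {
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
      }

      Watermark getLowWatermark() { return lowWatermark; }
      Watermark getHighWatermark() { return highWatermark; }
    }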

    opened by sahilTakiar 18
  • Base interfaces for value auditing

    Added base interfaces for value auditing. The PR mainly has interfaces for the following (a sketch of these interfaces follows the list):

    • RowSelectionPolicy - An interface to decide if a row needs to be audited
    • ColumnProjectionPolicy - An interface that projects certain columns/fields of an input {@link GenericRecord} to generate a new {@link GenericRecord} that can be audited.
    • AuditSink - An interface for persisting value audits
    • ValueAuditGenerator - The class that implements value-based auditing. The class captures the values of certain columns from the rows in the dataset using the {@link ColumnProjectionPolicy}. This is done for every row or for a sample of the rows, as defined by the {@link RowSelectionPolicy}.
    • The module that drives auditing will create an instance of ValueAuditGenerator for each Avro file being written and call ValueAuditGenerator.audit(GenericRecord).
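
    Roughly, the interfaces described above might look like the following (a sketch reconstructed from the bullet points; the actual signatures in the PR may differ). GenericRecord is Avro's org.apache.avro.generic.GenericRecord.

    import org.apache.avro.generic.GenericRecord;

    /** Decides whether a given row should be audited (e.g. always, or a sample). */
    interface RowSelectionPolicy {
      boolean shouldSelectRow(GenericRecord row);
    }

    /** Projects selected columns/fields of an input record into a new record to audit. */
    interface ColumnProjectionPolicy {
      GenericRecord project(GenericRecord inputRecord);
    }

    /** Persists value audits. */
    interface AuditSink {
      void write(GenericRecord auditRecord);
    }

    /** Drives value auditing: select rows, project their columns, write the audit. */
    class ValueAuditGenerator {
      private final RowSelectionPolicy rowSelectionPolicy;
      private final ColumnProjectionPolicy columnProjectionPolicy;
      private final AuditSink auditSink;

      ValueAuditGenerator(RowSelectionPolicy rowSelectionPolicy,
          ColumnProjectionPolicy columnProjectionPolicy, AuditSink auditSink) {
        this.rowSelectionPolicy = rowSelectionPolicy;
        this.columnProjectionPolicy = columnProjectionPolicy;
        this.auditSink = auditSink;
      }

      /** Called once per record written to the Avro file being audited. */
      void audit(GenericRecord record) {
        if (rowSelectionPolicy.shouldSelectRow(record)) {
          auditSink.write(columnProjectionPolicy.project(record));
        }
      }
    }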

    The complete Javadoc has been generated and hosted through GitHub Pages here, to make it easy to navigate the detailed documentation.

    I am adding some unit tests while this is being reviewed. @chavdar can you review?

    opened by pcadabam-zz 17
  • [GOBBLIN-207] Use hadoop filesystem to download job data

    This PR switches from downloading the job package using FileUtils.copyURLToFile() to using the Hadoop FileSystem's copyToLocalFile(). This opens up new choices of where to store the job package and allows for better authN support.
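
    For context, the FileSystem call in question looks roughly like this (a sketch only; the actual paths and configuration handling in the PR differ):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class JobPackageDownloadSketch {
      /**
       * Resolve the job package URI against whichever Hadoop FileSystem implements
       * its scheme (hdfs://, s3a://, file://, ...) and copy it to local disk.
       * Going through the FileSystem abstraction, rather than FileUtils.copyURLToFile,
       * is what opens up new storage choices and lets Hadoop authentication apply.
       */
      public static void downloadJobPackage(String jobPackageUri, String localDir) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(jobPackageUri), conf);
        fs.copyToLocalFile(new Path(jobPackageUri), new Path(localDir));
      }
    }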

    opened by kadaan 16
  • [GOBBLIN-1755] Support extended ACLs and sticky bit for file based distcp

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-1755

    Description

    • [ ] Here are some details about my PR, including screenshots (if applicable):
    • [ ] Added new preserve attributes for extended ACLs and sticky bit support
    • [ ] Updated FileAwareInputStreamDataWriter to set ACLs and sticky bit

    Tests

    • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason:
    • [ ] build successfully

    Commits

    • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by meethngala 1
  • Add error reporting when attempting to resolve flow configs

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-1759

    Description

    • [ ] Here are some details about my PR, including screenshots (if applicable): Adds more detailed logging when attempting to resolve flow configuration. Logs errors for both variable substitution and flow template mismatches/errors.

    Tests

    • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason: Changes tests to check the number of errors returned, or that no errors were reported, instead of just true or false during the flow compilation process.

    Commits

    • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by AndyJiang99 1
  • Fix the type of default values to match what Schema.parse() does.

    When calling field.defaultVal() (or equivalent compat helper methods) under Avro 1.9.2+, we get back a type that's not consistent with what Schema.parse() creates internally. This causes two Schemas constructed in these two different ways (Schema.parse vs Schema.createRecord) to be considered unequal, even though their toString() representations are identical.

    Fix this situation by calling parseJsonToObject(), which results in the default value being interpreted similarly to Schema.parse().

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-XXX

    Description

    • [ ] Here are some details about my PR, including screenshots (if applicable):

    Tests

    • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason:

    Commits

    • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by srramach 1
  • [GOBBLIN-1749] Add dependency for handling xz-compressed Avro file

    • Add dependency on xz for handling xz-compressed Avro files

    • Fix unit test to ensure all codecs are correctly supported

    • Update AvroHdfsDataWriter's document for covering all compression codecs

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [x] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-1749

    Description

    • [x] Here are some details about my PR, including screenshots (if applicable):

    After upgrading Avro to 1.9.2, reading and writing xz-compressed Avro files fails by default. This PR fixes it.

    Tests

    • [x] My PR adds the following unit tests OR does not need testing for this extremely good reason:

    I updated AvroHdfsDataWriterTest to ensure that all codecs are supported

    Commits

    • [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by sekikn 1
  • [GOBBLIN-1744] Improve handling of null value edge cases when querying Helix

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [X] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-1744] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-1744

    Description

    • [X] Here are some details about my PR, including screenshots (if applicable):

    I added Helix null checks in two separate spots:

    (1) HelixAssignedParticipantCheck


    In production, we've seen the Helix assigned participant check fail due to Helix issues rather than an actual split brain. When Helix returns null, this actually means that the data does not exist. This is an unexpected case and we can assume that Helix itself is having issues (i.e. not a Gobblin-side issue).

    I am adding this log because if the Helix assigned participant check fails, it is most likely a Helix issue, but the exact cause is not immediately obvious. I've added the two likely scenarios that on-call has most often seen internally as the root cause.

    (2) HelixUtils#getWorkflowIdsFromJobNames(HelixManager helixManager, Collection jobNames)


    kafka-streaming-replanner-tracking INFO - Caused by: java.lang.NullPointerException
    kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.cluster.HelixUtils.getWorkflowIdsFromJobNames(HelixUtils.java:327)
    kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.prototype.kafka.replanner.GobblinHelixClusterJob.getWorkflowIdFromJobName(GobblinHelixClusterJob.java:142)
    kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.prototype.kafka.replanner.GobblinHelixClusterJob.getJobActive(GobblinHelixClusterJob.java:175)
    kafka-streaming-replanner-tracking INFO - ... 54 more
    

    This is a similar case where Helix returns a null value. This can be caused when this util is called during a replanner run / restart of the Helix workflow. It can also be caused by a Helix data consistency issue. The code doesn't expect a null and will fail with an NPE. It is much better to fail gracefully and leave a descriptive log. We do not want to fail loudly, because the job can exist in other workflows, in which case we want to proceed with checking the other workflows gracefully.

    I've also made a small change to how we fetch the workflow configs. In the original implementation, we call the getWorkflowConfig() again after getting the workflow map. I think this is causing some weird inconsistent state. We get a workflow and then if it is somehow deleted by the time we call getWorkflowConfig(), then we end up with a null value.

    There's really no reason to split up the call since they do the same thing and we are just being wasteful by making extra zookeeper reads.
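
    As a rough illustration of the "fail gracefully" pattern described above (the class and method below are hypothetical, not the actual HelixUtils code):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.logging.Logger;

    public final class WorkflowIdLookupSketch {
      private static final Logger LOG = Logger.getLogger(WorkflowIdLookupSketch.class.getName());

      /**
       * Map each requested job name to the id of a workflow containing it,
       * skipping workflows whose job listing came back null (e.g. deleted during
       * a replan) instead of throwing a NullPointerException.
       */
      public static Map<String, String> workflowIdsForJobs(
          Map<String, Set<String>> workflowToJobs, Set<String> jobNames) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : workflowToJobs.entrySet()) {
          Set<String> jobs = entry.getValue();
          if (jobs == null) {
            LOG.warning("Workflow " + entry.getKey() + " has no job listing; skipping it");
            continue;
          }
          for (String jobName : jobNames) {
            if (jobs.contains(jobName)) {
              result.put(jobName, entry.getKey());
            }
          }
        }
        return result;
      }
    }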

    Tests

    • [X] My PR adds the following unit tests OR does not need testing for this extremely good reason:

    HelixAssignedParticipantCheckTest

    The existing test in helix assigned participant check triggers this because it returns a null participant from mock helix. I've attached the corresponding log output that we should see.

    2022-11-17 18:23:02 PST ERROR [pool-35-thread-1] org.apache.gobblin.cluster.HelixAssignedParticipantCheck 143 - The current assigned participant is null. This implies that 
    		(a)Helix failed to write to zookeeper, which is often caused by lack of compression leading / exceeding zookeeper jute max buffer size (Default 1MB)
    		(b)Helix reassigned the task (unlikely if this current task has been running without issue. Helix does not have code for reassigning "running" tasks)
    Note: This logic is true as of Helix version 1.0.2 and ZK version 3.6
    

    This test is not run in CI, so I attached a screenshot below.

    HelixUtilsTest

    I've also added tests in HelixUtilsTest. I mock Helix API responses and replicate a workflow -> job -> task DAG.

    ClusterIntegrationTest

    The cluster integration test exercises the replanner and job cleanup. The workflowIdFromJobName is used in the replanner to check which Helix workflow to restart, so I am pretty sure I didn't break anything. Not sure if this integration test runs in CI, so I've added a screenshot of the tests passing.

    Commits

    • [X] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by homatthew 1
  • add a config to set MR jobs jar path in CompactionSource

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-XXX

    Description

    • [ ] Here are some details about my PR, including screenshots (if applicable):

    Tests

    • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason:

    Commits

    • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by arjun4084346 0

Releases

  • gobblin_0.11.0

Owner

The Apache Software Foundation