A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

Overview

Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

  • Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
  • Data Organization within the lake (e.g. compaction, partitioning, deduplication)
  • Lifecycle Management of data within the lake (e.g. data retention)
  • Compliance Management of data across the ecosystem (e.g. fine-grain data deletions)

Highlights

  • Battle tested at scale: runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
  • Feature rich: supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, and more.
  • Supports stream and batch execution modes
  • Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

  • Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
  • Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
  • Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
  • Integrating external vendor APIs (e.g. Salesforce, Dynamics) with data stores (HDFS, Couchbase, etc.)
  • Enforcing data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

  • A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data-processing tasks to Spark, Hive, etc.
  • A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
  • A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

  • Java >= 1.8

If building the distribution with tests turned on:

  • Maven version 3.5.3

Instructions to run Apache RAT (Release Audit Tool)

  1. Extract the archive file to your local directory.
  2. Run ./gradlew rat. The report will be generated at build/rat/rat-report.html.

Instructions to build the distribution

  1. Extract the archive file to your local directory.
  2. Skip tests and build the distribution: run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The distribution will be created in the build/gobblin-distribution/distributions directory; or
  3. Run tests and build the distribution (requires Maven): run ./gradlew build. The distribution will be created in the same build/gobblin-distribution/distributions directory.

Comments
  • Improve database history store performance

    Changes:

    1. Switched database scripts to use flyway. This enables us to keep track of migration scripts, ensure that users can upgrade, and verify that the correct version is installed.
    2. Changed tests from using Derby to using embedded MySQL. This allows our tests to verify the same code, migrations, etc. as will be used in production.
    3. Added a check in DatabaseJobHistoryStore that verifies that the DB is the correct version. If it is not, a warning will be logged and all operations on the store will become no-ops. This ensures that we don't break if users upgrade the code, but forget to upgrade the database.
    4. Changed the Travis build to have the dependencies needed to run embedded MySQL, split the JDK 7/JDK 8 build jobs, and enabled caching. This allows easier identification of problems, improves performance, etc.
    5. Changed DatabaseJobHistoryStore to use batching and upserts to greatly increase performance (see the sketch after this list). Also stopped sending an epoch value for startTime/endTime when they are not specified. Lastly, removed job properties from task properties to reduce data duplication. All this is to improve performance and data correctness.
    6. Changed the admin UI to show '-' for startTime/endTime when not set
    7. Added a step at the end of the Travis build to output test failures to the console. This makes it easier to determine why the build failed.
    8. Enabled support for compression in the JobExecutionInfoServer
    9. Reduced the amount of data requested by Admin UI
    10. Rewrote the DatabaseJobHistoryStore querying code to greatly reduce the number of queries
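
    As a rough illustration of the batching + upsert pattern from item 5, here is a minimal sketch assuming a MySQL-style table; the table and column names are illustrative, not the actual job history schema.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class BatchUpsertSketch {
      /**
       * Illustrative only: send many rows in a single batch and let the database
       * insert-or-update them, instead of issuing one SELECT plus INSERT/UPDATE
       * round trip per row.
       */
      public static void upsertJobStates(Connection conn, List<String[]> rows) throws Exception {
        String sql = "INSERT INTO job_execution (job_id, state) VALUES (?, ?) "
            + "ON DUPLICATE KEY UPDATE state = VALUES(state)";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
          for (String[] row : rows) {
            stmt.setString(1, row[0]);
            stmt.setString(2, row[1]);
            stmt.addBatch();
          }
          stmt.executeBatch();
        }
      }
    }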

    Job History Store Management

    Migrate an unmanaged database (First Time)

    ./historystore-manager.sh migrate -Durl="jdbc:mysql://<SERVER>/gobblin" -Duser="<USER>" -Dpassword="<PASSWORD>" -DbaselineOnMigrate=true
    

    Migrate a database (Subsequent Time)

    ./historystore-manager.sh migrate -Durl="jdbc:mysql://<SERVER>/gobblin" -Duser="<USER>" -Dpassword="<PASSWORD>"
    

    Get database migration status

    ./historystore-manager.sh info -Durl="jdbc:mysql://<SERVER>/gobblin" -Duser="<USER>" -Dpassword="<PASSWORD>"
    
    enhancement 
    opened by kadaan 49
  • Oracle Source/Extractor

    Created OracleSource.java and OracleExtractor.java with heavy inspiration from MySqlSource.java and MySqlExtractor.java.

    The classes are largely the same as the MySql ones with the exception of getSchemaMetadata() which needed to be modified to account for Oracle reserved words such as format and comment. Also timestamp formats had to be slightly refactored.

    Further, different strategies for handling primitive values such as booleans in Oracle needed to be accounted for. This resulted in overriding getSchema(). The only difference is that Oracle treats booleans as varchar2(1). I created a simple casting utility castToBoolean() which I added to SqlQueryUtils.java.

    Since getSchema() requires the private static final Gson instance from JDBCExtractor, I created a getter for access. It is accessed in OracleExtractor's getSchema() method like so: this.getGson()
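
    A minimal sketch of what such a casting utility could look like (the actual method added to SqlQueryUtils.java may differ), mapping Oracle's varchar2(1) flag values to a Java boolean:

    public final class OracleBooleanCastSketch {
      private OracleBooleanCastSketch() {}

      /**
       * Hypothetical helper: Oracle has no native BOOLEAN column type, so flags
       * are commonly stored as VARCHAR2(1) values such as "1"/"0" or "Y"/"N".
       * Interpret those values as a Java boolean.
       */
      public static boolean castToBoolean(String value) {
        if (value == null) {
          return false;
        }
        String v = value.trim();
        return v.equalsIgnoreCase("Y") || v.equalsIgnoreCase("T")
            || v.equals("1") || v.equalsIgnoreCase("true");
      }
    }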

    I am able to successfully complete table_append and table_full jobs using this new source and extractor. I did a clean assemble, and it did not break the build. I believe I stayed consistent with documentation and coding format.

    Please consider my pull request, and let me know if I need to change anything.

    Thanks! Brian Vanover

    P.S. One thing is that in order to use the Oracle source and extractor, ojdbc6.jar must be added as a dependency. I simply copied the jar into gobblin-dist/lib, but it may be worthwhile to add it as a dependency. I just wasn't sure how to modify the build scripts to include it.

    opened by ghost 37
  • Add support for tunneling JDBC traffic over an HTTP proxy

    Frequently data stores to be accessed by Gobblin reside outside data centers. In these cases, outbound access from a data center typically needs to go through a gateway proxy for security purposes. In some cases this is an HTTP proxy. However, some protocols like JDBC don't support the concept of "proxies", let alone HTTP proxies, and hence a solution is needed to enable this.

    This Pull Request provides a method of tunneling JDBC connections over an HTTP proxy. Note that it is currently only implemented for JDBC, but can be extended to work with any other TCP-based protocol.

    The way this works for JDBC is:

    1. When a JdbcProvider is invoked, it checks if the WorkUnit has a proxy host and port defined.
    2. If so, it extracts the remote host and port from the connectionUrl.
    3. It then creates a Tunnel instance configured with the remote host and port. The Tunnel starts a thread that listens on an arbitrary port on localhost.
    4. The JdbcProvider then changes the connectionUrl to replace the remote host and port with the localhost and port and passes it on to the driver.
    5. Hence when the driver creates a JDBC connection, it connects to the Tunnel socket. The Tunnel then connects to the remote host through the proxy via an HTTP CONNECT request.
    6. If established successfully, the Tunnel then just relays bytes back and forth.
    7. When the JDBC data source is closed down, the Tunnel is shut down as well.

    The Tunnel can accept as many connections as the JdbcExtractor opens. The Tunnel uses NIO to minimize resource usage.
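
    For readers unfamiliar with HTTP CONNECT tunneling, here is a minimal sketch of the handshake a tunnel performs against the proxy (illustrative only: the actual implementation in this PR is NIO-based and multiplexes connections, and the class and method names below are hypothetical):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class HttpConnectSketch {
      /**
       * Connect to the HTTP proxy and ask it to CONNECT to the remote database
       * host. After a 200 response the socket is a raw byte pipe to the remote
       * host, so JDBC traffic can be relayed over it unchanged.
       */
      public static Socket openTunnel(String proxyHost, int proxyPort,
          String remoteHost, int remotePort) throws IOException {
        Socket socket = new Socket(proxyHost, proxyPort);
        OutputStream out = socket.getOutputStream();
        String request = "CONNECT " + remoteHost + ":" + remotePort + " HTTP/1.1\r\n"
            + "Host: " + remoteHost + ":" + remotePort + "\r\n\r\n";
        out.write(request.getBytes(StandardCharsets.US_ASCII));
        out.flush();

        // Read the proxy response one byte at a time up to the blank line that
        // ends the headers, so no bytes destined for the remote host are consumed.
        InputStream in = socket.getInputStream();
        ByteArrayOutputStream header = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
          header.write(b);
          byte[] h = header.toByteArray();
          int n = h.length;
          if (n >= 4 && h[n - 4] == '\r' && h[n - 3] == '\n'
              && h[n - 2] == '\r' && h[n - 1] == '\n') {
            break;
          }
        }
        String response = new String(header.toByteArray(), StandardCharsets.US_ASCII);
        if (!response.startsWith("HTTP/1.1 200") && !response.startsWith("HTTP/1.0 200")) {
          socket.close();
          throw new IOException("Proxy refused CONNECT: " + response);
        }
        return socket;
      }
    }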

    opened by kunalkandekar 25
  • Add zookeeper based job lock for gobblin yarn

    @sahilTakiar Here is the ZookeeperBasedJobLock that we discussed in #754. The only part I'm a little unsure about is the cancellation of a running job if the connection to ZK is lost. I think it is needed to ensure that duplicate jobs are not running, but I'm not sure if the means of cancellation is correct.

    enhancement 
    opened by kadaan 21
  • Add inline Hive registration to Gobblin job

    Added a BaseDataPublisherWithHiveRegistration which records the paths that need to be registered as a state property. After JobContext commits the data, it will retrieve the paths from the state and register them in Hive.
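
    A rough sketch of that flow (the class and method names below are illustrative stand-ins, not the actual BaseDataPublisherWithHiveRegistration API):

    import java.util.ArrayList;
    import java.util.List;

    public class HiveRegistrationSketch {
      /** Hypothetical stand-in for whatever performs the actual Hive registration. */
      public interface HiveRegistrar {
        void register(String path);   // e.g. create/alter the table or partition for this path
      }

      private final List<String> pathsStateProperty = new ArrayList<>();

      /** Publisher side: remember each published path in a state property. */
      public void recordPublishedPath(String path) {
        pathsStateProperty.add(path);
      }

      /** After JobContext commits the data: read the paths back and register them. */
      public void registerAfterCommit(HiveRegistrar registrar) {
        for (String path : pathsStateProperty) {
          registrar.register(path);
        }
      }
    }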

    opened by zliu41 20
  • Persist state and commit data on a per-dataset basis

    1. A WorkUnit can be specified to belong to a dataset group by using a new config property, dataset.urn (see the sketch after this list).
    2. Job/task state will be persisted on a per-dataset basis into different state store files, one per dataset. All current dataset state store files will be loaded in the next job run.
    3. Commit of data will be done on a per-dataset basis, allowing datasets to be committed independently.
    4. For existing jobs that do not have a dataset concept or only deal with a single dataset, the logic of job/task state persistence and data commit remains the same.
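
    A minimal sketch of how a source might tag its work units with that property (the WorkUnitProps class below is a hypothetical stand-in for Gobblin's WorkUnit, used only to keep the example self-contained):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;

    public class DatasetUrnTaggingSketch {
      /** Hypothetical stand-in for a Gobblin WorkUnit: a bag of string properties. */
      static class WorkUnitProps extends HashMap<String, String> {}

      /**
       * Tag each work unit with the dataset it belongs to. Work units sharing the
       * same dataset.urn are persisted into the same dataset state-store file and
       * committed together, independently of other datasets.
       */
      public static List<WorkUnitProps> createWorkUnits(List<String> datasetUrns) {
        List<WorkUnitProps> workUnits = new ArrayList<>();
        for (String urn : datasetUrns) {
          WorkUnitProps workUnit = new WorkUnitProps();
          workUnit.put("dataset.urn", urn);   // e.g. one URN per Kafka topic or DB table
          workUnits.add(workUnit);
        }
        return workUnits;
      }
    }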

    Signed-off-by: Yinan Li [email protected]

    opened by liyinan926 19
  • JDBC Writer

    Pull request for JDBC Writer. Will combine to one commit after the review.

    Design doc: https://docs.google.com/a/linkedin.com/document/d/1pSWZtgNssxaV1c19xdd1rKft9JaXM2qmqDwwbIb0scM/edit?usp=sharing

    opened by jinhyukchang 18
  • Separate publisher filesystem from writer filesystem

    Support for publishing data to a different filesystem from where data is written. The use case of this is writing the files to HDFS, but publishing them to S3.

    opened by kadaan 18
  • Adding Support for Complex Watermark Types

    This is the first pull request for adding complex watermark types to Gobblin. This will replace the legacy system of tracking watermarks. The old system was decentralized and depended on passing custom configuration parameters between executions via a WorkUnitState.

    The new implementation contains a new interface called Watermark, which extends the Comparable and Copyable interfaces. It contains only one method, increment(Object record), which defines how to increment the watermark for a given record.

    Another class called WatermarkInterval contains the logic for maintaining low and high watermarks. The corresponding changes to WorkUnit and WorkUnitState have been made. Since this pull request mainly focuses on defining the interfaces, no migration of the framework has been done yet.
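
    Reconstructed from the description above (a sketch only; the actual signatures in the PR may differ, and Copyable here is a local stand-in for Gobblin's interface):

    /** A watermark that can be compared, copied, and advanced record by record. */
    public interface Watermark extends Comparable<Watermark>, Copyable<Watermark> {

      /** Advance this watermark to account for the given record. */
      void increment(Object record);
    }

    /** Minimal stand-in for the Copyable interface, to keep the sketch self-contained. */
    interface Copyable<T> {
      T copy();
    }

    /** Holds the low and high watermarks for a unit of work, per the description. */
    class WatermarkInterval {
      private final Watermark lowWatermark;
      private final Watermark highWatermark;

      WatermarkInterval(Watermark lowWatermark, Watermark highWatermark) {
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
      }

      Watermark getLowWatermark() { return lowWatermark; }
      Watermark getHighWatermark() { return highWatermark; }
    }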

    opened by sahilTakiar 18
  • Base interfaces for value auditing

    Added base interfaces for value auditing. The PR mainly has interfaces for the following (a sketch of these interfaces follows the list):

    • RowSelectionPolicy - An interface to decide if a row needs to be audited
    • ColumnProjectionPolicy - An interface that projects certain columns/fields of an input {@link GenericRecord} to generate a new {@link GenericRecord} that can be audited.
    • AuditSink - An interface for persisting value audits
    • ValueAuditGenerator - The class that implements value-based auditing. The class captures the values of certain columns from the rows in the dataset using the {@link ColumnProjectionPolicy}. This is done for every row or for a sample of the rows, as defined by the {@link RowSelectionPolicy}.
    • The module that drives auditing will create an instance of ValueAuditGenerator for each Avro file being written and call ValueAuditGenerator.audit(GenericRecord).
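
    Roughly, the interfaces described above might look like the following (a sketch reconstructed from the bullet points; the actual signatures in the PR may differ). GenericRecord is Avro's org.apache.avro.generic.GenericRecord.

    import org.apache.avro.generic.GenericRecord;

    /** Decides whether a given row should be audited (e.g. always, or a sample). */
    interface RowSelectionPolicy {
      boolean shouldSelectRow(GenericRecord row);
    }

    /** Projects selected columns/fields of an input record into a new record to audit. */
    interface ColumnProjectionPolicy {
      GenericRecord project(GenericRecord inputRecord);
    }

    /** Persists value audits. */
    interface AuditSink {
      void write(GenericRecord auditRecord);
    }

    /** Drives value auditing: select rows, project their columns, write the audit. */
    class ValueAuditGenerator {
      private final RowSelectionPolicy rowSelectionPolicy;
      private final ColumnProjectionPolicy columnProjectionPolicy;
      private final AuditSink auditSink;

      ValueAuditGenerator(RowSelectionPolicy rowSelectionPolicy,
          ColumnProjectionPolicy columnProjectionPolicy, AuditSink auditSink) {
        this.rowSelectionPolicy = rowSelectionPolicy;
        this.columnProjectionPolicy = columnProjectionPolicy;
        this.auditSink = auditSink;
      }

      /** Called once per record written to the Avro file being audited. */
      void audit(GenericRecord record) {
        if (rowSelectionPolicy.shouldSelectRow(record)) {
          auditSink.write(columnProjectionPolicy.project(record));
        }
      }
    }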

    The complete Javadoc has been generated and hosted through GitHub Pages here, to make it easy to navigate the detailed documentation.

    I am adding some unit tests while this is being reviewed. @chavdar can you review?

    opened by pcadabam-zz 17
  • [GOBBLIN-207] Use hadoop filesystem to download job data

    This PR switches from downloading the job package using FileUtils.copyURLToFile() to using the Hadoop FileSystem's copyToLocalFile(). This opens up new choices of where to store the job package and allows for better authN support.
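
    For context, the FileSystem call in question looks roughly like this (a sketch only; the actual paths and configuration handling in the PR differ):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class JobPackageDownloadSketch {
      /**
       * Resolve the job package URI against whichever Hadoop FileSystem implements
       * its scheme (hdfs://, s3a://, file://, ...) and copy it to local disk.
       * Going through the FileSystem abstraction, rather than FileUtils.copyURLToFile,
       * is what opens up new storage choices and lets Hadoop authentication apply.
       */
      public static void downloadJobPackage(String jobPackageUri, String localDir) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(jobPackageUri), conf);
        fs.copyToLocalFile(new Path(jobPackageUri), new Path(localDir));
      }
    }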

    opened by kadaan 16
  • [GOBBLIN-1755] Support extended ACLs and sticky bit for file based distcp

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-1755

    Description

    • [ ] Here are some details about my PR, including screenshots (if applicable):
    • [ ] Added new preserve attributes for extended ACLs and sticky bit support
    • [ ] Updated FileAwareInputStreamDataWriter to set ACLs and sticky bit

    Tests

    • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason:
    • [ ] build successfully

    Commits

    • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by meethngala 1
  • Add error reporting when attempting to resolve flow configs

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-1759

    Description

    • [ ] Here are some details about my PR, including screenshots (if applicable): Adds more detailed logging when attempting to resolve flow configuration. Logs errors for both variable substitution and flow template mismatches/errors.

    Tests

    • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason: Changes tests to check the number of errors returned, or that no errors were reported, instead of just true or false during the flow compilation process.

    Commits

    • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by AndyJiang99 1
  • Fix the type of default values to match what Schema.parse() does.

    When calling field.defaultVal() (or equivalent compat helper methods) under Avro 1.9.2+, we get back a type that's not consistent with what Schema.parse() creates internally. This causes two Schemas constructed in these two different ways (Schema.parse vs Schema.createRecord) to be considered unequal, even though their toString() representations are identical.

    Fix this situation by calling parseJsonToObject(), which results in the default value being interpreted similarly to Schema.parse().

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-XXX

    Description

    • [ ] Here are some details about my PR, including screenshots (if applicable):

    Tests

    • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason:

    Commits

    • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by srramach 1
  • [GOBBLIN-1749] Add dependency for handling xz-compressed Avro file

    • Add dependency on xz for handling xz-compressed Avro files

    • Fix unit test to ensure all codecs are correctly supported

    • Update AvroHdfsDataWriter's document for covering all compression codecs

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [x] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-1749

    Description

    • [x] Here are some details about my PR, including screenshots (if applicable):

    After upgrading Avro to 1.9.2, reading and writing xz-compressed Avro files fails by default. This PR fixes it.

    Tests

    • [x] My PR adds the following unit tests OR does not need testing for this extremely good reason:

    I updated AvroHdfsDataWriterTest to ensure that all codecs are supported

    Commits

    • [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by sekikn 1
  • [GOBBLIN-1744] Improve handling of null value edge cases when querying Helix

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [X] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-1744] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-1744

    Description

    • [X] Here are some details about my PR, including screenshots (if applicable):

    I added Helix null checks in two separate spots:

    (1) HelixAssignedParticipantCheck


    In production, we've seen the Helix assigned participant check fail due to Helix issues rather than an actual split brain. When Helix returns null, this actually means that the data does not exist. This is an unexpected case and we can assume that Helix itself is having issues (i.e. not a Gobblin-side issue).

    I am adding this log because if the Helix assigned participant check fails, it is most likely a Helix issue, but the exact cause is not immediately obvious. I've added the two likely scenarios that on-call has most often seen internally as the root cause.

    (2) HelixUtils#getWorkflowIdsFromJobNames(HelixManager helixManager, Collection jobNames)


    kafka-streaming-replanner-tracking INFO - Caused by: java.lang.NullPointerException
    kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.cluster.HelixUtils.getWorkflowIdsFromJobNames(HelixUtils.java:327)
    kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.prototype.kafka.replanner.GobblinHelixClusterJob.getWorkflowIdFromJobName(GobblinHelixClusterJob.java:142)
    kafka-streaming-replanner-tracking INFO - at org.apache.gobblin.prototype.kafka.replanner.GobblinHelixClusterJob.getJobActive(GobblinHelixClusterJob.java:175)
    kafka-streaming-replanner-tracking INFO - ... 54 more
    

    This is a similar case where Helix returns a null value. This can be caused when this util is called during a replanner run / restart of the Helix workflow. It can also be caused by a Helix data consistency issue. The code doesn't expect a null and will fail with an NPE. It is much better to fail gracefully and leave a descriptive log. We do not want to fail loudly, because the job can exist in other workflows, in which case we want to proceed with checking the other workflows gracefully.

    I've also made a small change to how we fetch the workflow configs. In the original implementation, we call the getWorkflowConfig() again after getting the workflow map. I think this is causing some weird inconsistent state. We get a workflow and then if it is somehow deleted by the time we call getWorkflowConfig(), then we end up with a null value.

    There's really no reason to split up the call since they do the same thing and we are just being wasteful by making extra zookeeper reads.
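
    As a rough illustration of the "fail gracefully" pattern described above (the class and method below are hypothetical, not the actual HelixUtils code):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.logging.Logger;

    public final class WorkflowIdLookupSketch {
      private static final Logger LOG = Logger.getLogger(WorkflowIdLookupSketch.class.getName());

      /**
       * Map each requested job name to the id of a workflow containing it,
       * skipping workflows whose job listing came back null (e.g. deleted during
       * a replan) instead of throwing a NullPointerException.
       */
      public static Map<String, String> workflowIdsForJobs(
          Map<String, Set<String>> workflowToJobs, Set<String> jobNames) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : workflowToJobs.entrySet()) {
          Set<String> jobs = entry.getValue();
          if (jobs == null) {
            LOG.warning("Workflow " + entry.getKey() + " has no job listing; skipping it");
            continue;
          }
          for (String jobName : jobNames) {
            if (jobs.contains(jobName)) {
              result.put(jobName, entry.getKey());
            }
          }
        }
        return result;
      }
    }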

    Tests

    • [X] My PR adds the following unit tests OR does not need testing for this extremely good reason:

    HelixAssignedParticipantCheckTest

    The existing test in helix assigned participant check triggers this because it returns a null participant from mock helix. I've attached the corresponding log output that we should see.

    2022-11-17 18:23:02 PST ERROR [pool-35-thread-1] org.apache.gobblin.cluster.HelixAssignedParticipantCheck 143 - The current assigned participant is null. This implies that 
    		(a)Helix failed to write to zookeeper, which is often caused by lack of compression leading / exceeding zookeeper jute max buffer size (Default 1MB)
    		(b)Helix reassigned the task (unlikely if this current task has been running without issue. Helix does not have code for reassigning "running" tasks)
    Note: This logic is true as of Helix version 1.0.2 and ZK version 3.6
    

    This test is not run in CI, so I attached a screenshot below.

    HelixUtilsTest

    I've also added tests in HelixUtilsTest. I mock Helix API responses and replicate a workflow -> job -> task DAG.

    ClusterIntegrationTest

    The cluster integration test exercises the replanner and job cleanup. The workflowIdFromJobName is used in the replanner to check which Helix workflow to restart, so I am pretty sure I didn't break anything. Not sure if this integration test runs in CI, so I've added a screenshot of the tests passing.

    Commits

    • [X] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by homatthew 1
  • add a config to set MR jobs jar path in CompactionSource

    Dear Gobblin maintainers,

    Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

    JIRA

    • [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
      • https://issues.apache.org/jira/browse/GOBBLIN-XXX

    Description

    • [ ] Here are some details about my PR, including screenshots (if applicable):

    Tests

    • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason:

    Commits

    • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
      1. Subject is separated from body by a blank line
      2. Subject is limited to 50 characters
      3. Subject does not end with a period
      4. Subject uses the imperative mood ("add", not "adding")
      5. Body wraps at 72 characters
      6. Body explains "what" and "why", not "how"
    opened by arjun4084346 0

Releases

  • gobblin_0.11.0

Owner

The Apache Software Foundation