39 Repositories
Apache Druid Druid is a high performance real-time a
Apache Zeppelin Documentation: User Guide Mailing Lists: User and Dev mailing list Continuous Integration: Contributing: Contribution Guide Issue Trac
Presto Presto is a distributed SQL query engine for big data. See the User Manual for deployment instructions and end user documentation. Requirements
Apache Flink Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities. Learn more about Flin
Elephant Bird About Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats,
Apache Gobblin Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems. Ca
Master Branch: Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processi
StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under Apache Li
OpenRefine OpenRefine is a Java-based power tool that allows you to load data, understand it, clean it up, reconcile it, and augment it with data comi
DeepDive As of 2017, DeepDive project is in maintenance mode and no longer under active development. The user community remains active, but the origin
Status This project is no longer maintained. Ambrose Twitter Ambrose is a platform for visualization and real-time monitoring of MapReduce data workfl
Description A Java library for summarizing data in streams for which it is infeasible to store all events. More specifically, there are classes for es
Apache Hudi Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets
HdrHistogram HdrHistogram: A High Dynamic Range (HDR) Histogram This repository currently includes a Java implementation of HdrHistogram. C, C#/.NET,
Heron is a realtime analytics platform developed by Twitter. It has a wide array of architectural improvements over its predecessor. Heron in Apache
Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine l
Apache Hive (TM) The Apache Hive (TM) data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storag
Welcome to Impala Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters. Impala is a modern, massively-distri
SAMOA: Scalable Advanced Massive Online Analysis. This repository is discontinued. The development of SAMOA has moved over to the Apache Software Foun
IMPORTANT NOTE!!! Storm has Moved to Apache. The official Storm git repository is now hosted by Apache, and is mirrored on github here: https://github
ghidra-scripts A collection of my Ghidra scripts. iOS FOX: This script locates all calls to objc_msgSend family functions, tries to infer the actual m
Elasticsearch Hadoop Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Apache Hive, Apache Pig, Apach
Update January 2018 Seldon Core open sourced. Seldon Core focuses purely on deploying a wide range of ML models on Kubernetes, allowing complex runtim
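DubboPOC Apache Dubbo vulnerability POCs, continuously updated. CVE-2019-17564 CVE-2020-1948 CVE-2020-1948 bypass CVE-2021-25641 CVE-2021-30179 others Disclaimer: This project is for learning purposes only; any direct or indirect consequences caused by unauthorized testing and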
Introduction to the MR4C repo About MR4C MR4C is an implementation framework that allows you to run native code within the Hadoop execution framework.
Suro: Netflix's Data Pipeline Suro is a data pipeline service for collecting, aggregating, and dispatching large volume of application events includin
Caution: H2O-3 is now the current H2O! Please visit https://github.com/h2oai/h2o-3 H2O H2O makes Hadoop do math! H2O scales statistics, machine learni
Apache Kylin Apache Kylin is an open source Distributed Analytics Engine to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop supp
Hadoop-MapReduce-Anagram-Solver The implementation consists of a program that utilizes the Hadoop Map-Reduce framework to identify the anagrams of the
UMONGO, the MongoDB GUI About This version of UMongo is provided as free software by EdgyTech LLC. UMongo is open source, and
Gomule-d2r GoMule enabled for D2R Original GoMule App http://gomule.sourceforge.net/ all credits go to Gohanman, Randall, Silospen, collaborators, ...
Oculus is an Archived Project Oculus is no longer actively maintained. Your mileage with patches may vary. Oculus Oculus is the anomaly correlation co
Apache DataFu Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by
Flink CDC Connectors is a set of source connectors for Apache Flink, ingesting changes from different databases using change data capture (CDC). The Flink CDC Connectors integrates Debezium as the engine to capture data changes.
Anurag000-rgb/MapReduce-Repetation_Counting MapReduce code for counting numbers in Java. It would be better to write this in Apache Spark using Scala rather than in Apache MapReduce.
PageRank implementation in Hadoop. Use kiwenalu/hadoop-cluster-docker (set the cluster size to 5) for running the JAR. Load dataset to memory using script
DataStream What? DataStream is a simple piece of code to access paged data and interface it as if it's a single "list". It only keeps track of queued
Finding average number of words in all the comments in a data set 📝 Mapper Function In the mapper function we first tokenize entire data and then fin
All the files have been commented for your ease. You may also add further comments of your own. For further queries contact me at : chhxnsh