20 Java Hadoop Libraries

20 Repositories

PageRank implementation in Hadoop

PageRank implementation in Hadoop. Use kiwenalu/hadoop-cluster-docker (set the cluster size to 5) to run the JAR. Load the dataset into memory using script

Jan 24, 2022

Program that uses Hadoop MapReduce to identify the anagrams among the words in a file

Hadoop-MapReduce-Anagram-Solver The implementation consists of a program that utilizes the Hadoop Map-Reduce framework to identify the anagrams of the

Dec 4, 2022
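The anagram entry above describes a map/reduce grouping problem. The following is a minimal sketch of that idea in plain Java, without the Hadoop API, assuming the common approach of using a word's sorted letters as the grouping key. The repository's actual mapper and reducer are not shown in the excerpt, so the class name `AnagramSketch` and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the repository's code.
public class AnagramSketch {

    // "Map" step: emit the word's letters in sorted order as the key.
    // All anagrams of a word share the same key.
    static String sortedKey(String word) {
        char[] letters = word.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // "Shuffle" and "reduce" steps: group words by key, then keep only
    // the groups that contain at least two words.
    static Map<String, List<String>> groupAnagrams(List<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(sortedKey(word), k -> new ArrayList<>()).add(word);
        }
        groups.values().removeIf(group -> group.size() < 2);
        return groups;
    }

    public static void main(String[] args) {
        List<String> words = List.of("listen", "silent", "enlist", "hadoop", "tinsel");
        System.out.println(groupAnagrams(words));
    }
}
```

In a real Hadoop job the grouping by key would be done by the framework's shuffle phase rather than by an in-memory map; the sketch only shows why the sorted-letter key brings anagrams together.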

Apache Nutch is an extensible and scalable web crawler

Apache Nutch README For the latest information about Nutch, please visit our website at: https://nutch.apache.org/ and our wiki, at: https://cwiki.apa

Dec 31, 2022

Apache Drill is a distributed MPP query layer for self describing data

Apache Drill Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage sys

Jan 7, 2023

Program that finds the average number of words per comment in a large data set, using Hadoop MapReduce to process the data in parallel.

Finding average number of words in all the comments in a data set 📝 Mapper Function In the mapper function we first tokenize entire data and then fin

Aug 23, 2021
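The entry above mentions a mapper that tokenizes the data, but the excerpt is cut off before the details. As a rough sketch under stated assumptions, the plain-Java code below (no Hadoop API) splits each comment on whitespace in a "map" step and averages the counts in a "reduce" step. The class name `AverageWordsSketch`, the method names, and the whitespace tokenization rule are all assumptions, not the repository's implementation.

```java
import java.util.List;

// Hypothetical illustration only; not the repository's code.
public class AverageWordsSketch {

    // "Map" step: tokenize one comment on whitespace and return its word count.
    static int countWords(String comment) {
        String trimmed = comment.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // "Reduce" step: sum the per-comment counts and divide by the number of comments.
    static double averageWords(List<String> comments) {
        if (comments.isEmpty()) {
            return 0.0; // avoid dividing by zero when there is no input
        }
        long total = 0;
        for (String comment : comments) {
            total += countWords(comment);
        }
        return (double) total / comments.size();
    }

    public static void main(String[] args) {
        List<String> comments = List.of("great post", "I did not like this at all");
        System.out.println(averageWords(comments));
    }
}
```

In a distributed setting each mapper would typically emit a partial sum and a comment count so that the reducer can combine them; the single-process loop here stands in for that combination.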

In this task, we had to write a MapReduce program to analyze the sentiment of a keyword from a list of comments. This was done using Hadoop HDFS.

All the files are commented for ease of reading. You may also add further comments if you wish. For further queries contact me at : chhxnsh

Aug 14, 2021

Real-time Query for Hadoop; mirror of Apache Impala

Welcome to Impala Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters. Impala is a modern, massively-distri

Dec 28, 2022

The official home of the Presto distributed SQL query engine for big data

Presto Presto is a distributed SQL query engine for big data. See the User Manual for deployment instructions and end user documentation. Requirements

Jan 5, 2023

Elasticsearch real-time search and analytics natively integrated with Hadoop

Elasticsearch Hadoop Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Apache Hive, Apache Pig, Apach

Dec 22, 2022

Apache Hive

Apache Hive (TM) The Apache Hive (TM) data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storag

Dec 28, 2022

Hadoop library for large-scale data processing, now an Apache Incubator project

Apache DataFu Follow @apachedatafu Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by

Apr 1, 2022

Google MR4C: an implementation framework that allows you to run native code within the Hadoop execution framework. License: GNU Lesser 3.

Introduction to the MR4C repo About MR4C MR4C is an implementation framework that allows you to run native code within the Hadoop execution framework.

Dec 9, 2022

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Elephant Bird About Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats,

Jan 5, 2023

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

H2O H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Fl

Jan 2, 2023

The official home of the Presto distributed SQL query engine for big data

Presto Presto is a distributed SQL query engine for big data. See the User Manual for deployment instructions and end user documentation. Requirements

Dec 30, 2022

Model import and deployment framework for retraining models (PyTorch, TensorFlow, Keras) and deploying them in JVM microservice environments, mobile devices, IoT, and Apache Spark

The Eclipse Deeplearning4J (DL4J) ecosystem is a set of projects intended to support all the needs of a JVM based deep learning application. This mean

Dec 30, 2022

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Trino is a fast distributed SQL query engine for big data analytics. See the User Manual for deployment instructions and end user documentation. Devel

Dec 31, 2022

Apache Calcite

Apache Calcite Apache Calcite is a dynamic data management framework. It contains many of the pieces that comprise a typical database management syste

Dec 31, 2022

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

Apache ORC ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with

Jan 2, 2023
