This program computes the average number of words per comment in a large data set, using Hadoop MapReduce to process the data in parallel.

Overview

The goal is to find the average number of words across all comments in a data set.

📝 Mapper Function

In the mapper function, we first tokenize the input and look for the first occurrence of 'Text="', which marks the beginning of a comment. We then count the words in the comment until the closing '"' is found, which marks its end.
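The per-record logic described above can be sketched in plain Java, without the Hadoop classes, so it can be run on its own. This is a minimal sketch and not the project's actual mapper: the class name CommentWordCounter, the method countWords, and the sample row format are hypothetical, and it locates the marker with indexOf rather than tokenizing the whole record first.

```java
// Plain-Java sketch of the mapper's per-record logic (no Hadoop dependencies).
// Class and method names are hypothetical, chosen for illustration only.
class CommentWordCounter {
    // Marker that opens the comment body, as described in the README.
    static final String START_MARKER = "Text=\"";

    // Returns the number of whitespace-separated words in the comment,
    // or -1 if the record contains no Text attribute.
    static int countWords(String record) {
        int start = record.indexOf(START_MARKER);
        if (start < 0) {
            return -1; // no comment in this record
        }
        int bodyStart = start + START_MARKER.length();
        int end = record.indexOf('"', bodyStart);
        if (end < 0) {
            // Assumption: an unterminated comment runs to the end of the record.
            end = record.length();
        }
        String body = record.substring(bodyStart, end).trim();
        if (body.isEmpty()) {
            return 0; // empty comment
        }
        return body.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Hypothetical input row; the real data set's layout may differ.
        String row = "<row Id=\"1\" Text=\"Hadoop makes this easy\" />";
        System.out.println(countWords(row)); // prints 4
    }
}
```

In an actual Hadoop job, the value returned here would be written to the context together with the shared key instead of being printed.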

📊 Reducer Function

The length of each comment is sent to the reducer under a single shared key, 'key'. The reducer sums the values and counts how many there are, which gives the total number of comments. Dividing the sum by the number of comments yields the average, which is returned to main and displayed.
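The reducer's arithmetic can be illustrated the same way. The following is a minimal plain-Java sketch under stated assumptions, not the project's reducer: the class name AverageReducerSketch and the method average are hypothetical, the values arrive as an int array rather than a Hadoop Iterable, and returning 0.0 for an empty input is an assumption the README does not specify.

```java
// Plain-Java sketch of the reducer's sum-and-count step (no Hadoop dependencies).
// Class and method names are hypothetical, chosen for illustration only.
class AverageReducerSketch {
    // Averages the per-comment word counts that all share the single key.
    static double average(int[] wordCounts) {
        long sum = 0;   // total number of words across all comments
        long count = 0; // number of values seen, i.e. number of comments
        for (int words : wordCounts) {
            sum += words;
            count++;
        }
        if (count == 0) {
            return 0.0; // assumption: no comments means an average of 0
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        int[] lengths = {4, 10, 7}; // made-up word counts for three comments
        System.out.println(average(lengths)); // prints 7.0
    }
}
```

Because every value is emitted under one key, a single reducer call sees all the counts, which is what makes a global average possible in one pass.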

Files included:

The source code is in the .java files, and a complete .jar file is also included.

Screenshots of output

Below are screenshots of a test run:

  • INPUT - In the picture below, 11 rows were given as input so that the average length reported by Hadoop MapReduce could be checked manually.

task2-testoutput

  • OUTPUT - As we can see, the total number of words across all comments is divided by the total number of comments, giving the answer 33.

task2-testinput

Owner
Aleezeh Usman