Program that uses Hadoop Map-Reduce to identify the anagrams of the words of a file

Overview

Hadoop-MapReduce-Anagram-Solver

This project is a program that uses the Hadoop Map-Reduce framework to identify the anagrams among the words of a file.

Author: Nikolas Petrou, MSc in Data Science

But what is an anagram?

An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, using every original letter exactly once.

For example:

  • Refills→fillers
  • Relayed→layered
  • Rentals→antlers
  • Rebuild→builder

Data

Specifically, this task focuses on finding the anagrams among the words of the following file: https://raw.githubusercontent.com/pmichaud/rpbench/master/files/unixdict.txt

You can download the aforementioned UNIX dictionary file and upload it to your own HDFS filesystem, for example with wget followed by hdfs dfs -put.

Implementation

Examples of desired output:

  • 2 hasn't,shan't
  • 2 cascara,caracas
  • 2 ramada,armada

The main idea of the solution is to assign the same Key to all words that are rearrangements of one another. The ideal Key for each word read during the mapping phase is therefore a Text object containing the word's letters sorted alphabetically. For example, the words declaim and decimal both use the key acdeilm. Because Map-Reduce delivers all values that share a key to the same reduce call, every group of anagrams ends up together in the reducer.
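The key computation described above can be sketched in plain Java. This is a minimal illustration, not the code from Anagram.java: the class name AnagramKey and the method sortedKey are hypothetical, and the sketch assumes the input words are already lower-case (as in unixdict.txt), so no case normalization is done. In a real mapper the returned string would be wrapped in a Text object.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; the names are not taken from Anagram.java.
final class AnagramKey {

    private AnagramKey() {}

    // Returns the characters of the word in ascending order.
    // Assumption: the word is already lower-case, so no normalization is applied.
    static String sortedKey(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    public static void main(String[] args) {
        // Both words from the example above map to the same key.
        System.out.println(sortedKey("declaim")); // acdeilm
        System.out.println(sortedKey("decimal")); // acdeilm
    }
}
```

Sorting the characters is only one possible canonical form; a letter-count signature would group the same words, but the sorted string is short and directly usable as a key.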

The program's output is located in the part-r-00000 file, and the code is located in the Anagram.java file. The code is thoroughly commented to explain the whole implementation in detail.
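To show how output lines of the form "2 ramada,armada" can arise, the following sketch mimics the map, shuffle and reduce steps in memory with plain Java collections. It does not use the Hadoop API and is not the code from Anagram.java. The class and method names are hypothetical, and several details are assumptions: groups with a single word are skipped, the count and the word list are separated by one space, and words appear in input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map/shuffle/reduce idea; hypothetical names, no Hadoop API.
final class AnagramGroupingSketch {

    private AnagramGroupingSketch() {}

    // "Map" step: the key of a word is its characters in sorted order.
    static String sortedKey(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Groups the words by key and formats one line per group of two or more words.
    static List<String> findAnagrams(List<String> words) {
        // "Shuffle" step: collect all words that share a key (TreeMap keeps keys sorted).
        Map<String, List<String>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(sortedKey(word), k -> new ArrayList<>()).add(word);
        }

        // "Reduce" step: emit "<count> <word1>,<word2>,..." for each group.
        // Assumption: words without any anagram are not reported.
        List<String> lines = new ArrayList<>();
        for (List<String> group : groups.values()) {
            if (group.size() > 1) {
                lines.add(group.size() + " " + String.join(",", group));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("armada", "hasn't", "ramada", "shan't", "zebra");
        for (String line : findAnagrams(words)) {
            System.out.println(line);
        }
    }
}
```

In an actual Hadoop job the grouping is done by the framework between the map and reduce phases, so the reducer would only need the formatting part of this sketch.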

Helpful Material-Links

If you are not very familiar with the Hadoop Map-Reduce framework, the following resources explain some basic concepts, as well as some of the ideas behind this task:

Fundamentals of MapReduce with MapReduce Example

Creating Custom Hadoop Writable Data Type

MSc in Data Science Programme

Owner
Nikolas Petrou
M.Sc. Data Science student, University of Cyprus (UCY) Research Assistant at the Laboratory of Internet Computing (LInC) B.Sc degree in Computer Science