PageRank implementation in hadoop

Overview

PageRank implementation in hadoop

Use kiwenalu/hadoop-cluster-docker (set cluster size for 5) for running JAR.

Load dataset to memory using script

    ./prepare-input.sh

Run mapreduce script

    ./run-pagerank.sh

Copy output to machine from hdfs:

    hdfs dfs -get output/result/part-r-00000 result.txt

Algorithm

We are given with graph transition matrix M.

Our goal is to find vector V, where V(i) is the probability that we reach page i.

The final formula is V = M * V, so we can approximate V starting with V_0, and then

   V_{i+1} = normalized (M * V_{i})

Results

On cluster with 5 nodes, total time for 20 iterations is approximately 700 seconds.

You might also like...

Hadoop library for large-scale data processing, now an Apache Incubator project

Apache DataFu Follow @apachedatafu Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by

Apr 1, 2022

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop

Elasticsearch Hadoop Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Apache Hive, Apache Pig, Apach

Dec 22, 2022

Real-time Query for Hadoop; mirror of Apache Impala

Welcome to Impala Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters. Impala is a modern, massively-distri

Dec 28, 2022

In this task, we had to write a MapReduce program to analyze the sentiment of a keyword from a list of comments. This was done using Hadoop HDFS.

All the files have been commented for your ease. Furthermore you may also add further comments if you may. For further queries contact me at : chhxnsh

Aug 14, 2021

Program finds average number of words in each comment given a large data set by use of hadoop's map reduce to work in parallel efficiently.

Program finds average number of words in each comment given a large data set by use of hadoop's map reduce to work in parallel efficiently.

Finding average number of words in all the comments in a data set 📝 Mapper Function In the mapper function we first tokenize entire data and then fin

Aug 23, 2021

Program that uses Hadoop Map-Reduce to identify the anagrams of the words of a file

Program that uses Hadoop Map-Reduce to identify the anagrams of the words of a file

Hadoop-MapReduce-Anagram-Solver The implementation consists of a program that utilizes the Hadoop Map-Reduce framework to identify the anagrams of the

Dec 4, 2022

Backport of functionality based on JSR-310 to Java SE 6 and 7. This is NOT an implementation of JSR-310.

ThreeTen backport project JSR-310 provides a new date and time library for Java SE 8. This project is the backport to Java SE 6 and 7. See the main ho

Jan 8, 2023

Java JsonPath implementation

Jayway JsonPath A Java DSL for reading JSON documents. Jayway JsonPath is a Java port of Stefan Goessner JsonPath implementation. News 10 Dec 2020 - R

Jan 4, 2023

FizzBuzz Enterprise Edition is a no-nonsense implementation of FizzBuzz made by serious businessmen for serious business purposes.

FizzBuzzEnterpriseEdition Enterprise software marks a special high-grade class of software that makes careful use of relevant software architecture de

Dec 31, 2022

The Java gRPC implementation. HTTP/2 based RPC

gRPC-Java - An RPC library and framework gRPC-Java works with JDK 7. gRPC-Java clients are supported on Android API levels 16 and up (Jelly Bean and l

Jan 1, 2023

MessagePack serializer implementation for Java / msgpack.org[Java]

MessagePack for Java MessagePack is a binary serialization format. If you need a fast and compact alternative of JSON, MessagePack is your friend. For

Dec 31, 2022

a pug implementation written in Java (formerly known as jade)

Attention: jade4j is now pug4j In alignment with the javascript template engine we renamed jade4j to pug4j. You will find it under https://github.com/

Oct 16, 2022

An implementation of darcy-web that uses Selenium WebDriver as the automation library backend.

darcy-webdriver An implementation of darcy-ui and darcy-web that uses Selenium WebDriver as the automation library backend. maven dependency gr

Aug 22, 2020

Hashids algorithm v1.0.0 implementation in Java

Hashids.java A small Java class to generate YouTube-like hashes from one or many numbers. Ported from javascript hashids.js by Ivan Akimov What is it?

Jan 5, 2023

a pug implementation written in Java (formerly known as jade)

Attention: jade4j is now pug4j In alignment with the javascript template engine we renamed jade4j to pug4j. You will find it under https://github.com/

Oct 16, 2022

A novel implementation of the Raft consensus algorithm

Copycat Copycat has moved! Copycat 2.x is now atomix-raft and includes a variety of improvements to Copycat 1.x: Multiple state machines per cluster M

Dec 6, 2022

Bloofi: A java implementation of multidimensional Bloom filters

Bloofi: A java implementation of multidimensional Bloom filters Bloom filters are probabilistic data structures commonly used for approximate membersh

Nov 2, 2022

High performance Java implementation of a Cuckoo filter - Apache Licensed

Cuckoo Filter For Java This library offers a similar interface to Guava's Bloom filters. In most cases it can be used interchangeably and has addition

Dec 30, 2022

Java port of a concurrent trie hash map implementation from the Scala collections library

About This is a Java port of a concurrent trie hash map implementation from the Scala collections library. It is almost a line-by-line conversion from

Oct 31, 2022
Owner
Maksym Zub
Maksym Zub
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Elephant Bird About Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats,

Twitter 1.1k Jan 5, 2023
Hadoop library for large-scale data processing, now an Apache Incubator project

Apache DataFu Follow @apachedatafu Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by

LinkedIn's Attic 589 Apr 1, 2022
:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop

Elasticsearch Hadoop Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Apache Hive, Apache Pig, Apach

elastic 1.9k Dec 22, 2022
Real-time Query for Hadoop; mirror of Apache Impala

Welcome to Impala Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters. Impala is a modern, massively-distri

Cloudera 27 Dec 28, 2022
In this task, we had to write a MapReduce program to analyze the sentiment of a keyword from a list of comments. This was done using Hadoop HDFS.

All the files have been commented for your ease. Furthermore you may also add further comments if you may. For further queries contact me at : chhxnsh

Hassan Shahzad 5 Aug 14, 2021
Program finds average number of words in each comment given a large data set by use of hadoop's map reduce to work in parallel efficiently.

Finding average number of words in all the comments in a data set ?? Mapper Function In the mapper function we first tokenize entire data and then fin

Aleezeh Usman 3 Aug 23, 2021
Program that uses Hadoop Map-Reduce to identify the anagrams of the words of a file

Hadoop-MapReduce-Anagram-Solver The implementation consists of a program that utilizes the Hadoop Map-Reduce framework to identify the anagrams of the

Nikolas Petrou 2 Dec 4, 2022
Google Mr4c GNU Lesser 3 Google Mr4c MR4C is an implementation framework that allows you to run native code within the Hadoop execution framework. License: GNU Lesser 3, .

Introduction to the MR4C repo About MR4C MR4C is an implementation framework that allows you to run native code within the Hadoop execution framework.

Google 911 Dec 9, 2022
Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

Apache ORC ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with

The Apache Software Foundation 576 Jan 2, 2023
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Elephant Bird About Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats,

Twitter 1.1k Jan 5, 2023