Mirror of Apache Mahout

Overview

Welcome to Apache Mahout!

The goal of the Apache Mahout™ project is to build an environment for quickly creating scalable, performant machine learning applications.

For additional information about Mahout, visit the Mahout Home Page.

Setting up your Environment

Whether you are using the Mahout shell, running command-line jobs, or using it as a library to build apps, you will need to set up several environment variables. Edit your environment in ~/.bash_profile for Mac or ~/.bashrc for many Linux distributions, and add the following:

export MAHOUT_HOME=/path/to/mahout
export MAHOUT_LOCAL=true  # for running standalone on your dev machine;
                          # unset MAHOUT_LOCAL for running on a cluster

You will need $JAVA_HOME, and if you are running on Spark, you will also need $SPARK_HOME.

Using Mahout as a Library

Running any application that uses Mahout will require installing a binary or source version and setting the environment. To compile from source:

  • mvn -DskipTests clean install
  • To run tests, do mvn test
  • To set up your IDE, do mvn eclipse:eclipse or mvn idea:idea

To use Mahout with Maven or sbt, add the appropriate dependency to your pom.xml or build.sbt following the templates below.

To use the Samsara environment you'll need to include both the engine-neutral math-scala dependency:

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-math-scala</artifactId>
    <version>${mahout.version}</version>
</dependency>

and a dependency for back-end engine translation, e.g.:

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-spark</artifactId>
    <version>${mahout.version}</version>
</dependency>
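
If you use sbt, a roughly equivalent build.sbt entry is sketched below (the _2.10 artifact suffix and the version value are assumptions; check the artifacts published for your release):

// build.sbt sketch; substitute the version and Scala suffix for your release
val mahoutVersion = "0.13.0"  // example value standing in for ${mahout.version}

libraryDependencies ++= Seq(
  "org.apache.mahout" % "mahout-math-scala_2.10" % mahoutVersion,  // engine-neutral math
  "org.apache.mahout" % "mahout-spark_2.10"      % mahoutVersion   // Spark back-end translation
)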

Building From Source

Prerequisites:

  • Linux environment (preferably Ubuntu 16.04.x). Note: currently, only the JVM-only build will work on a Mac.
  • gcc > 4.x
  • NVIDIA card (installed with OpenCL drivers alongside the usual GPU drivers)

Downloads

Install Java 1.7+ in an easily accessible directory (for this example, ~/java/): http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Create a directory ~/apache/.

Download Apache Maven 3.3.9 and un-tar/gunzip to ~/apache/apache-maven-3.3.9/: https://maven.apache.org/download.cgi

Download and un-tar/gunzip Hadoop 2.4.1 to ~/apache/hadoop-2.4.1/: https://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/

Download and un-tar/gunzip spark-1.6.3-bin-hadoop2.4 to ~/apache/: http://spark.apache.org/downloads.html. Choose release: Spark 1.6.3 (Nov 07 2016); choose a package type: Pre-Built for Hadoop 2.4.

Install ViennaCL 1.7.0+. If running Ubuntu 16.04+:

sudo apt-get install libviennacl-dev

Otherwise, if your distribution's package manager does not have a viennacl-dev package >1.7.0, clone it directly and copy the headers into a directory on the include path used when compiling Mahout:

mkdir ~/tmp
cd ~/tmp && git clone https://github.com/viennacl/viennacl-dev.git
cp -r viennacl-dev/viennacl/ /usr/local/   # may require sudo
cp -r viennacl-dev/CL/ /usr/local/         # may require sudo

Ensure that the OpenCL 1.2+ drivers are all installed (packaged with most consumer-grade NVIDIA drivers; higher-end cards may differ).

Clone the Mahout repository into ~/apache:

git clone https://github.com/apache/mahout.git

Configuration

When building Mahout for a Spark backend, we need four environment variables set:

    export MAHOUT_HOME=/home/<user>/apache/mahout
    export HADOOP_HOME=/home/<user>/apache/hadoop-2.4.1
    export SPARK_HOME=/home/<user>/apache/spark-1.6.3-bin-hadoop2.4    
    export JAVA_HOME=/home/<user>/java/jdk-1.8.121

Mahout on Spark regularly uses one more environment variable: the IP of the Spark cluster's master node (usually the node hosting the session user).

To use four local cores (the Spark master need not be running):

export MASTER=local[4]

To use all available local cores (again, the Spark master need not be running):

export MASTER=local[*]

To point to a cluster with Spark running:

export MASTER=spark://master.ip.address:7077

We then add these to the PATH:

    export PATH=$PATH:$MAHOUT_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$JAVA_HOME/bin

These exports get appended to the user's ~/.bashrc file.
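
When using Mahout as a library rather than through the shell, the same MASTER variable can be honored when creating the distributed context. A minimal sketch, assuming the org.apache.mahout.sparkbindings helpers (the local[*] fallback and the app name are our own illustrative choices):

import org.apache.mahout.sparkbindings._

// Sketch: build a Spark-backed Mahout context from the MASTER env variable.
implicit val sdc: SparkDistributedContext = mahoutSparkContext(
  masterUrl = sys.env.getOrElse("MASTER", "local[*]"),  // assumed fallback
  appName   = "MahoutEnvCheck")                         // hypothetical name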

Building Mahout with Apache Maven

Currently, Mahout has three builds. From the $MAHOUT_HOME directory, we may issue the commands to build each using Maven profiles.

JVM only:

mvn clean install -DskipTests

JVM with native OpenMP Level 2 and Level 3 matrix/vector multiplication:

mvn clean install -Pviennacl-omp -Phadoop2 -DskipTests

JVM with native OpenMP and OpenCL for Level 2 and Level 3 matrix/vector multiplication (GPU errors fall back to OpenMP; currently, only a single GPU per node is supported):

mvn clean install -Pviennacl -Phadoop2 -DskipTests

Testing the Mahout Environment

Mahout provides an extension to the spark-shell that is good for getting to know the language, testing partition loads, prototyping algorithms, etc.

To launch the shell in local mode with two threads, simply do the following:

$ MASTER=local[2] mahout spark-shell

After a very verbose startup, a Mahout welcome screen will appear:

Loading /home/andy/sandbox/apache-mahout-distribution-0.13.0/bin/load-shell.scala...
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@3ca1f0a4

                 _                 _
 _ __ ___   __ _| |__   ___  _   _| |_
| '_ ` _ \ / _` | '_ \ / _ \| | | | __|
| | | | | | (_| | | | | (_) | |_| | |_
|_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0


That file does not exist


scala>

At the scala> prompt, enter:

scala> :load /home/<andy>/apache/mahout/examples/bin/SparseSparseDrmTimer.mscala

This will load a matrix multiplication timer function definition. To run the matrix timer:

        scala> timeSparseDRMMMul(1000,1000,1000,1,.02,1234L)
            {...} res3: Long = 16321

Note: the 14.1 release is missing a class required for this; it will be fixed in 14.2. We can see that the JVM-only version is slow, hence our motivation for GPU and native multithreading support.

To understand what the timer performs under the hood, we may examine the .mscala (Mahout Scala) code, which is both fully functional Scala and the Mahout R-like DSL for tensor algebra:

def timeSparseDRMMMul(m: Int, n: Int, s: Int, para: Int, pctDense: Double = .20, seed: Long = 1234L): Long = {
  // Build an m x s DRM whose blocks are filled with ~pctDense random non-zeros.
  val drmA = drmParallelizeEmpty(m, s, para).mapBlock() {
    case (keys, block: Matrix) =>
      val R = scala.util.Random
      R.setSeed(seed)
      val blockB = new SparseRowMatrix(block.nrow, block.ncol)
      blockB := { x => if (R.nextDouble < pctDense) R.nextDouble else x }
      (keys -> blockB)
  }

  // Build an s x n DRM the same way, with a different seed.
  val drmB = drmParallelizeEmpty(s, n, para).mapBlock() {
    case (keys, block: Matrix) =>
      val R = scala.util.Random
      R.setSeed(seed + 1)
      val blockB = new SparseRowMatrix(block.nrow, block.ncol)
      blockB := { x => if (R.nextDouble < pctDense) R.nextDouble else x }
      (keys -> blockB)
  }

  var time = System.currentTimeMillis()

  // %*% only builds the logical plan; computation is deferred.
  val drmC = drmA %*% drmB

  // Trigger the actual computation.
  drmC.numRows()

  time = System.currentTimeMillis() - time

  time
}
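
As a further quick check of the environment, a few in-core R-like operations can be entered directly at the scala> prompt (a minimal sketch; dense, t, and %*% come from the scalabindings imports shown in the welcome screen):

val a = dense((1, 2, 3), (3, 4, 5))  // 2 x 3 in-core matrix
val b = a.t                          // transpose, giving 3 x 2
val c = a %*% b                      // 2 x 2 product via the R-like DSL
println(c)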

For more information, please see the following references:

http://mahout.apache.org/users/environment/in-core-reference.html

http://mahout.apache.org/users/environment/out-of-core-reference.html

http://mahout.apache.org/users/sparkbindings/play-with-shell.html

http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html

Note that due to an intermittent out-of-memory bug in a Flink-based test, we have disabled Flink in the binary releases. To use Flink, please uncomment the line in the root pom.xml in the <modules> block, so it reads <module>flink</module>.

Examples

For examples of how to use Mahout, see the examples directory, located in examples/bin.

For information on how to contribute, visit the How to Contribute Page.

Legal

Please see the NOTICE.txt included in this directory for more information.

Comments
  • MAHOUT-1615: drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

    SparkContext.sequenceFile(...) will yield the same key per partition for Text-keyed sequence files if a new copy of the key is not created when mapping to an RDD. This patch checks for Text keys and creates a copy of each key if necessary.
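
    A hedged sketch of the pitfall (illustrative, not the actual patch; assumes a live SparkContext sc and a sequence-file path): Hadoop record readers reuse one Text instance per partition, so each key must be copied when mapping to an RDD.

    import org.apache.hadoop.io.Text
    import org.apache.mahout.math.VectorWritable

    // Without the copy, every tuple in a partition shares the same mutated key.
    val rdd = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
      .map { case (k, v) => (new Text(k), v) }  // deep-copy the reused key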

    opened by andrewpalumbo 53
  • MAHOUT-1493 Port Naive Bayes to Scala DSL

    Created this PR to review the patch provided by Christoph Viebig and team, students of TU Berlin. It is based on Sebastian's code from before the DSL was abstracted away from Spark.

    opened by andrewpalumbo 29
  • MAHOUT-1616 Hadoop client

    I changed the Hadoop-related dependencies in the project root and mrlegacy to hadoop-client.

    Now, to build Mahout against Hadoop 1.2.1, run mvn clean package. For another version of Hadoop, try mvn clean package -Dhadoop.version=YOUR_HADOOP_VERSION.

    Tests pass, but I currently don't have access to an actual cluster. Could somebody please test this on a cluster and report the results?

    This would allow us to automatically support multiple hadoop versions, including vendors', and would result in simpler poms. If it works, I will change mahout-spark module accordingly.

    opened by gcapan 28
  • Sparse-vector implementation based on fastutil (MAHOUT-1640)

    The collections currently used by Mahout to implement sparse vectors are extremely slow. The proposed patch (localized to RandomAccessSparseVector) uses fastutil's maps and the speed improvements in vector benchmarks are very significant. It would be interesting to see whether these improvements percolate to high-level classes using sparse vectors.

    I had to patch two unit tests (an off-by-one bug and an overfitting bug; both were exposed by the different order in which key/values were returned by iterators).

    Some more speed might be gained by using the standard java.util.Map.Entry interface everywhere instead of Element.
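
    As an illustration of the access pattern such benchmarks exercise, here is a sketch over Mahout's RandomAccessSparseVector (not the MAHOUT-1640 patch itself):

    import org.apache.mahout.math.RandomAccessSparseVector

    val v = new RandomAccessSparseVector(1000000)  // sparse vector, cardinality 1M
    v.setQuick(42, 3.14)
    v.setQuick(99999, 2.71)

    // Iterating the non-zero elements is the hot path the backing map dominates.
    var sum = 0.0
    val it = v.nonZeroes().iterator()
    while (it.hasNext) sum += it.next().get()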

    opened by vigna 27
  • MAHOUT-1603: Tweaks for Spark 1.0.x

    For folks who (like me) got tired of waiting for Mahout data frames support and would like to run Spark SQL expressions directly in the Mahout Spark shell.

    (you can thank me later)

    PR'ing against a side branch for now, since people are probably testing against 0.9.x on master.

    opened by dlyubimov 26
  • [MAHOUT-1894] Add Support for Spark 2.x

    As long as we're sticking to Scala 2.10, running Mahout on Spark 2.x is simply a matter of:

    mvn clean package -Dspark.version=2.0.2 or mvn clean package -Dspark.version=2.1.0

    The trouble comes with the shell...

    I checked Apache Zeppelin to see how they handle multiple spark/scala versions... a brief preview of the descent into hell that is having a shell that handles multiple spark/scala versions

    So I took an alternate route. I dropped the Mahout shell altogether, changed the mahout bin file to load the Spark shell directly, and passed a Scala script that takes care of our imports.
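
    A sketch of what such an init script might contain (the file name is hypothetical; sc2sdc wraps the shell's SparkContext, per the sparkbindings package):

    // mahout-shell-init.scala (hypothetical), passed to spark-shell via -i
    import org.apache.mahout.math._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    // Wrap the shell's SparkContext (sc) in a Mahout distributed context.
    implicit val sdc: SparkDistributedContext = sc2sdc(sc)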

    When building, there is a single deprecation warning regarding the sqlContext and how it is created in the spark-bindings.

    I think we should add binaries for Spark 2.0 and Spark 2.1 as a matter of convenience and the Zeppelin integration.

    opened by rawkintrevo 20
  • General cleanup of Scala Similarity and Drivers

    Many small cleanup changes:

    • simplified drivers
    • removed any reference to cross-indicator and most references to indicator
    • shortened long lines
    • cleaned up comments and scaladoc annotations
    • replaced old use of o.a.m.math.Pair with Scala tuples

    Tested spark-itemsimilarity on a cluster but not the naive Bayes stuff.

    Decided not to remove scopt. Removing it would be more trouble than it's worth given how small the lib is. It still may be worth using case classes for options to get rid of verbose casts but doesn't seem pressing.

    This PR doesn't touch the job.jar assembly in the spark module. I have a pared-down version of that waiting for other refactoring commits from @dlyubimov.

    No other work planned on this PR

    opened by pferrel 20
  • MAHOUT-1636

    Started out simplifying driver code and making changes to all drivers to support that. Then I ran into the fat job.jar issue of MAHOUT-1636, so I created a slimmed-down version of the old job.jar by adding excludes to job.xml and changing the name to "dependencies.jar".

    The new jar works for spark-itemsimilarity and spark-row-similarity but needs to be tested for the naive Bayes drivers.

    The dependencies.jar still contains a lot of stuff from mrlegacy: some is in external projects, like Jackson, that can be excluded with this mechanism, but there is also a lot of Mahout code that is unneeded in this jar. This latter case would require some other mechanism than a simple clause in the assembly XML file.

    I believe the new dependencies.jar is the only thing that needs to be on the classpath when running spark drivers or the spark-shell. I haven't changed this but it is a further refinement we can try.

    opened by pferrel 19
  • MAHOUT-1999: Automating Multi-Artifact Build Process

    Purpose of PR:

    Automating Multi-Artifact Release Process

    Check by clearing the .m2 cache, then running: mvn clean install -Pall-scala,all-spark,viennacl,viennacl-omp

    Important ToDos

    Please mark each with an "x"

    • [x] A JIRA ticket exists (if not, please create this first: https://issues.apache.org/jira/browse/ZEPPELIN/)
    • [x] Title of PR is "MAHOUT-XXXX Brief Description of Changes" where XXXX is the JIRA number.
    • [x] Created unit tests where appropriate
    • [x] Added licenses correct on newly added files
    • [x] Assigned JIRA to self
    • [x] Added documentation in scala docs/java docs, and to website
    • [x] Successfully built and ran all unit tests, verified that all tests pass locally.

    If all of these things aren't complete, but you still feel it is appropriate to open a PR, please add [WIP] after MAHOUT-XXXX before the description, e.g. "MAHOUT-XXXX [WIP] Description of Change"

    Does this change break earlier versions?

    Is this the beginning of a larger project for which a feature branch should be made?

    opened by rawkintrevo 18
  • MAHOUT-1795: Build math & spark bindings under scala 2.11

    The shell isn't building, so it is only enabled for 2.10.

    Unfortunately this means that there's a 2x2 build matrix with hadoop/scala versions. The only way to resolve this was to switch to activating profiles through properties, so if you were previously using -Phadoop1, you'll want to use -Dhadoop1. The default config without any options is effectively -Phadoop2,scala-2.10.

    opened by mikekap 16
  • MAHOUT-1570, sub-PR: a suggestion: let's unify all key class tag extractors.

    Unifying "keyClassTag" of checkpoints and "classTagK" of logical operators, and elevating "keyClassTag" into the DrmLike[] trait. No more logical forks.
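
    A simplified sketch of the shape of the change (the real trait carries more members):

    import scala.reflect.ClassTag

    // One key ClassTag accessor shared by logical and physical operators,
    // instead of separate extractors per node type.
    trait DrmLike[K] {
      def keyClassTag: ClassTag[K]
    }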

    opened by dlyubimov 16
  • Bump hadoop-yarn-server-common from 2.10.0 to 2.10.2

    Bumps hadoop-yarn-server-common from 2.10.0 to 2.10.2.

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump hadoop-common from 2.10.0 to 3.2.3

    Bumps hadoop-common from 2.10.0 to 3.2.3.

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.



    dependencies 
    opened by dependabot[bot] 0
  • [NO-JIRA] Fix resource leak and other potential bugs

    Purpose of PR:

    This fixes a number of bugs found by the Muse static analysis platform, which I work on. In particular, this pull request removes a resource leak. This is not a new feature and does not have an associated Jira issue or associated unit tests.

    The complete results of the analysis can be found on the Muse console. I recommend enabling the Muse platform on the repository so that new bugs are caught at the pull request stage (see an example) before they are merged. Muse is free forever for open source software.

    opened by js-musedev 0
  • MAHOUT-2096 [WIP] next() Called On Possible Empty iterator()

    Purpose of PR:

    Initially there was no check for the empty iterator when calling the next() function, which could result in an exception.
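
    A sketch of the defensive pattern described (illustrative, not the actual patched code):

    // Guard next() behind hasNext instead of calling it unconditionally.
    def firstOrNone[T](it: java.util.Iterator[T]): Option[T] =
      if (it.hasNext) Some(it.next()) else None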

    Important ToDos

    Please mark each with an "x"

    • [x] A JIRA ticket exists (if not, please create this first: https://issues.apache.org/jira/browse/mahout/)
    • [x] Title of PR is "MAHOUT-XXXX Brief Description of Changes" where XXXX is the JIRA number.
    • [ ] Created unit tests where appropriate
    • [ ] Added licenses correct on newly added files
    • [ ] Assigned JIRA to self
    • [ ] Added documentation in scala docs/java docs, and to website
    • [x] Successfully built and ran all unit tests, verified that all tests pass locally.

    If all of these things aren't complete, but you still feel it is appropriate to open a PR, please add [WIP] after MAHOUT-XXXX before the description, e.g. "MAHOUT-XXXX [WIP] Description of Change"

    Does this change break earlier versions? - No

    Is this the beginning of a larger project for which a feature branch should be made? - Not sure

    opened by balashashanka 2
  • [WIP]MAHOUT-1974 (dense cuda multiplication)

    Purpose of PR:

    Please give a short description of what this PR is for.

    Important ToDos

    Please mark each with an "x"

    • [x] A JIRA ticket exists (if not, please create this first: https://issues.apache.org/jira/browse/MAHOUT/)
    • [x] Title of PR is "MAHOUT-XXXX Brief Description of Changes" where XXXX is the JIRA number.
    • [x] Created unit tests where appropriate
    • [ ] Added licenses correct on newly added files
    • [ ] Assigned JIRA to self
    • [ ] Added documentation in scala docs/java docs, and to website
    • [ ] Successfully built and ran all unit tests, verified that all tests pass locally.

    If all of these things aren't complete, but you still feel it is appropriate to open a PR, please add [WIP] after MAHOUT-XXXX before the description, e.g. "MAHOUT-XXXX [WIP] Description of Change"

    Does this change break earlier versions?

    Is this the beginning of a larger project for which a feature branch should be made?

    opened by andrewpalumbo 4
Owner
The Apache Software Foundation
Mirror of Apache SystemML

Apache SystemDS Overview: SystemDS is a versatile system for the end-to-end data science lifecycle from data integration, cleaning, and feature engine

The Apache Software Foundation 940 Dec 25, 2022
Mirror of Apache Qpid

We have moved to using individual Git repositories for the Apache Qpid components and you should look to those for new development. This Subversion re

The Apache Software Foundation 125 Dec 29, 2022
Now redundant weka mirror. Visit https://github.com/Waikato/weka-trunk for the real deal

weka (mirror) Computing and Mathematical Sciences at the University of Waikato now has an official github organization including a read-only git mirro

Benjamin Petersen 313 Dec 16, 2022
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine l

Oryx Project 1.8k Dec 28, 2022
Apache Spark - A unified analytics engine for large-scale data processing

Apache Spark Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an op

The Apache Software Foundation 34.7k Jan 2, 2023
Apache Flink

Apache Flink Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities. Learn more about Flin

The Apache Software Foundation 20.4k Jan 5, 2023
Model import deployment framework for retraining models (pytorch, tensorflow,keras) deploying in JVM Micro service environments, mobile devices, iot, and Apache Spark

The Eclipse Deeplearning4J (DL4J) ecosystem is a set of projects intended to support all the needs of a JVM based deep learning application. This mean

Eclipse Foundation 12.7k Dec 30, 2022
Firestorm is a Remote Shuffle Service, and provides the capability for Apache Spark applications to store shuffle data on remote servers

What is Firestorm Firestorm is a Remote Shuffle Service, and provides the capability for Apache Spark applications to store shuffle data on remote ser

Tencent 246 Nov 29, 2022
Flink/Spark Connectors for Apache Doris(Incubating)

Apache Doris (incubating) Connectors The repository contains connectors for Apache Doris (incubating) Flink Doris Connector More information about com

The Apache Software Foundation 30 Dec 7, 2022
Word Count in Apache Spark using Java

Word Count in Apache Spark using Java

Arjun Gautam 2 Feb 24, 2022
Mirror of Apache Deltaspike

Apache DeltaSpike Documentation Mailing Lists Contribution Guide JIRA Apache License v2.0 Apache DeltaSpike is a suite of portable CDI Extensions inte

The Apache Software Foundation 141 Jan 1, 2023
Mirror of Apache Kafka

Apache Kafka See our web site for details on the project. You need to have Java installed. We build and test Apache Kafka with Java 8, 11 and 15. We s

The Apache Software Foundation 23.9k Jan 5, 2023
Mirror of Apache RocketMQ

Apache RocketMQ Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level c

The Apache Software Foundation 18.5k Dec 28, 2022
Mirror of Apache ActiveMQ

Welcome to Apache ActiveMQ Apache ActiveMQ is a high performance Apache 2.0 licensed Message Broker and JMS 1.1 implementation. Getting Started To hel

The Apache Software Foundation 2.1k Jan 2, 2023
Mirror of Apache ActiveMQ Artemis

ActiveMQ Artemis This file describes some minimum 'stuff one needs to know' to get started coding in this project. Source For details about the modify

The Apache Software Foundation 824 Dec 26, 2022
Mirror of Apache Storm

Master Branch: Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processi

The Apache Software Foundation 6.4k Dec 26, 2022
Mirror of Apache SIS

============================================= Welcome to Apache SIS <http://sis.apache.org> ============================================= SIS is a Ja

The Apache Software Foundation 81 Dec 26, 2022
Mirror of Apache Cassandra

Apache Cassandra Apache Cassandra is a highly-scalable partitioned row store. Rows are organized into tables with a required primary key. Partitioning

The Apache Software Foundation 7.7k Jan 1, 2023