SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams.

Last update: Dec 28, 2022

Overview

SAMOA: Scalable Advanced Massive Online Analysis.

This repository is discontinued. The development of SAMOA has moved over to the Apache Software Foundation. Please subscribe to the dev mailing list to participate in the development, and use Jira to report bugs and propose new features. The new repository is mirrored on GitHub.

SAMOA is a platform for mining big data streams. It is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.

SAMOA enables the development of new ML algorithms without dealing with the complexity of the underlying stream processing engines (SPEs, such as Apache Storm and Apache S4). SAMOA is also extensible: new SPEs can be integrated into the framework. Together, these features let SAMOA users write a distributed streaming ML algorithm once and execute it on multiple SPEs.
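The "write once, execute on multiple SPEs" idea can be illustrated with a minimal sketch. The names below (`StreamProcessor`, `Engine`, `SequentialEngine`, `ThreadedEngine`, `RunningMean`, `EngineDemo`) are hypothetical and are not SAMOA's actual API. The two toy engines only stand in for real SPEs such as Storm or S4.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Runs the same algorithm on two different toy engines.
final class EngineDemo {
    static double meanOn(Engine engine, List<Double> stream) {
        RunningMean algorithm = new RunningMean();
        engine.run(stream, algorithm);
        return algorithm.mean();
    }

    public static void main(String[] args) {
        List<Double> stream = List.of(1.0, 2.0, 3.0, 4.0);
        System.out.println(meanOn(new SequentialEngine(), stream));
        System.out.println(meanOn(new ThreadedEngine(), stream));
    }
}

// Hypothetical abstraction: the algorithm only sees this interface.
interface StreamProcessor {
    void process(double value);
}

// Hypothetical engine contract: deliver every event of a stream to a processor.
interface Engine {
    void run(List<Double> stream, StreamProcessor processor);
}

// Toy engine 1: delivers events on the caller's thread.
final class SequentialEngine implements Engine {
    public void run(List<Double> stream, StreamProcessor processor) {
        for (double v : stream) {
            processor.process(v);
        }
    }
}

// Toy engine 2: delivers events through a single worker thread.
final class ThreadedEngine implements Engine {
    public void run(List<Double> stream, StreamProcessor processor) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        for (double v : stream) {
            worker.submit(() -> processor.process(v));
        }
        worker.shutdown();
        try {
            worker.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}

// The "algorithm": an incremental mean, written once against StreamProcessor.
final class RunningMean implements StreamProcessor {
    private long count = 0;
    private double mean = 0.0;

    public synchronized void process(double value) {
        count++;
        mean += (value - mean) / count;
    }

    public synchronized double mean() {
        return mean;
    }
}
```

`RunningMean` never refers to either engine, so adding a third engine in this sketch would not require touching the algorithm.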

Build

###Storm mode

Simply clone the repository and install SAMOA.

git clone git@github.com:yahoo/samoa.git
cd samoa
mvn -Pstorm package

The deployable jar for SAMOA will be in target/SAMOA-Storm-0.0.1-SNAPSHOT.jar.

###S4 mode

If you want to compile SAMOA for S4, you will need to install the S4 dependencies manually as explained in Executing SAMOA with Apache S4.

Once the dependencies are installed, clone the repository and install SAMOA.

git clone git@github.com:yahoo/samoa.git
cd samoa
mvn -Ps4 package

###Local mode

If you want to test SAMOA in a local environment, simply clone the repository and install SAMOA.

git clone git@github.com:yahoo/samoa.git
cd samoa
mvn package

The deployable jar for SAMOA will be in target/SAMOA-Local-0.0.1-SNAPSHOT.jar.

Slides

SAMOA Slides

G. De Francisci Morales. SAMOA: A Platform for Mining Big Data Streams. Keynote talk at RAMSS ’13: 2nd International Workshop on Real-Time Analysis and Mining of Social Streams, WWW, Rio de Janeiro, 2013.

License

The use and distribution terms for this software are covered by the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).

Comments

Integrating sentinel into samoa

Hi, I'm Amir Rahnama (@ambodi).

As discussed with Gianmarco De Francisci Morales (@gdfm), this pull request integrates Sentinel (https://github.com/ambodi/sentinel) into SAMOA. We also want to add more components to it and make it complete.

opened by amir-rahnama 8

Improvements for Samza engine

Configurable HDFS home directory for SAMOA. If it is not set, the default directory is $HOME/.samoa (specifically: /user//.samoa)

Add Kryo registration for 2 AMRules-related classes (Perceptron, TargetMean)

Change source delay: from between instances to between batches of instances

Allow Samza EPI without an output stream

opened by caseyvu 3
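
The change from a per-instance delay to a per-batch delay can be sketched as follows. This is only an illustration under stated assumptions: `DelayedSource`, `emit`, and the `Sleeper` hook are hypothetical names, not classes from SAMOA or Samza, and the sleep is abstracted so the example runs instantly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical source that pauses once per batch rather than once per instance.
final class DelayedSource {
    // Hook standing in for Thread.sleep so the pause can be observed or replaced.
    interface Sleeper {
        void sleep(long millis);
    }

    // Splits the instances into batches of batchSize and pauses between
    // consecutive batches. Returns the emitted batches in order.
    static <T> List<List<T>> emit(List<T> instances, int batchSize,
                                  long delayMillis, Sleeper sleeper) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < instances.size(); start += batchSize) {
            int end = Math.min(start + batchSize, instances.size());
            batches.add(new ArrayList<>(instances.subList(start, end)));
            if (end < instances.size()) {
                sleeper.sleep(delayMillis); // one delay between consecutive batches
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] totalDelay = {0};
        List<List<Integer>> batches =
            emit(List.of(1, 2, 3, 4, 5), 2, 100, millis -> totalDelay[0] += millis);
        System.out.println(batches + " total delay ms: " + totalDelay[0]);
    }
}
```

With five instances and a batch size of two, the sketch pauses twice (200 ms in total), whereas a delay between individual instances would pause four times.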
Automated tests for various algorithms and platforms

This patch adds a module that contains test-related classes. Tests can be defined and configured through templates. We use the test module to implement tests for several classification algorithms for local, threads and storm platforms (for now).

Also includes a fix for handling last event in stream.

Introduces the surefire plugin for tests, so we can get proper test output files and configure test VMs. (we might still have to customize travis for maven's VM parameters)

opened by matthieumorel 2

AMRules

Centralized and distributed AMRules implementations. The HAMR learner includes a combiner processor that combines the result streams from the model aggregator and the default rule learner.

opened by caseyvu 1

API changes

Abstract classes for topology components and their respective implementations in the Simple, S4 and Storm engines

Option for delay between input instances:

changes in EntranceProcessor interface

currently only implemented in PrequentialEvaluation

JUnit tests for simple engine

opened by caseyvu 1

Publish snapshots

Fixes #42. This pull request should automatically publish snapshot artifacts on the Sonatype OSS repository every time a pull-request is merged. I probably need to add some secure environment variable in travis.yml with my Sonatype OSS credentials (so this is just a test). E.g.:

env: global: secure:

opened by gdfm 1

Add custom serialization for samoa events when using S4

Pass a custom module defining a custom serializer that registers SAMOA event classes and serializers.

This could be made configurable by passing a configuration file for the event/serializer mapping and registering using reflection.

opened by matthieumorel 1

Owner

Yahoo Archive
