SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams.

SAMOA: Scalable Advanced Massive Online Analysis.

This repository is discontinued. The development of SAMOA has moved over to the Apache Software Foundation. Please subscribe to the dev mailing list to participate in the development, and use Jira to report bugs and propose new features. The new repository is mirrored on GitHub.

SAMOA is a platform for mining big data streams. It is a distributed streaming machine learning (ML) framework that provides a programming abstraction for distributed streaming ML algorithms.

SAMOA enables the development of new ML algorithms without dealing with the complexity of the underlying stream processing engines (SPEs, such as Apache Storm and Apache S4). SAMOA is also extensible: new SPEs can be integrated into the framework. Together, these features let SAMOA users code a distributed streaming ML algorithm once and execute it on multiple SPEs.

Build

### Storm mode

Simply clone the repository and install SAMOA.

git clone git@github.com:yahoo/samoa.git
cd samoa
mvn -Pstorm package

The deployable jar for SAMOA will be in target/SAMOA-Storm-0.0.1-SNAPSHOT.jar.

### S4 mode

If you want to compile SAMOA for S4, you will need to install the S4 dependencies manually as explained in Executing SAMOA with Apache S4.

Once the dependencies are installed, clone the repository and install SAMOA.

git clone git@github.com:yahoo/samoa.git
cd samoa
mvn -Ps4 package

### Local mode

If you want to test SAMOA in a local environment, simply clone the repository and install SAMOA.

git clone git@github.com:yahoo/samoa.git
cd samoa
mvn package

The deployable jar for SAMOA will be in target/SAMOA-Local-0.0.1-SNAPSHOT.jar.
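The three build modes differ only in the Maven profile passed to `mvn`. As a convenience, that mapping can be captured in a small shell function. This is a minimal sketch, not something shipped with SAMOA: the function name `samoa_mvn_args` is hypothetical, and it assumes only the profiles named above (`storm`, `s4`, and no profile for local mode).

```shell
#!/bin/sh
# Hypothetical helper (not part of SAMOA): print the Maven arguments
# for a given build mode, using the profiles named in this README.
samoa_mvn_args() {
    case "$1" in
        storm) echo "-Pstorm package" ;;   # Storm mode
        s4)    echo "-Ps4 package" ;;      # S4 mode (S4 dependencies must already be installed)
        local) echo "package" ;;           # Local mode uses no profile
        *)     echo "unknown mode: $1 (expected storm, s4, or local)" >&2
               return 1 ;;
    esac
}

# Example, run from inside the cloned repository:
#   mvn $(samoa_mvn_args storm)
```

The function only prints arguments and never invokes Maven itself, so it can be checked without triggering a build.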

Slides

SAMOA Slides

G. De Francisci Morales. "SAMOA: A Platform for Mining Big Data Streams." Keynote talk at RAMSS ’13: 2nd International Workshop on Real-Time Analysis and Mining of Social Streams, WWW, Rio De Janeiro, 2013.

License

The use and distribution terms for this software are covered by the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).

Comments
  • Integrating sentinel into samoa

    Hi, I'm Amir Rahnama (@ambodi).

    As discussed with Gianmarco De Francisci Morales (@gdfm), this pull request integrates Sentinel (https://github.com/ambodi/sentinel) into SAMOA. We also want to add more components to it and make it complete.

    opened by amir-rahnama 8
  • Improvements for Samza engine

    1. Configurable HDFS home directory for SAMOA. If it is not set, the default directory is $HOME/.samoa (specifically: /user//.samoa)
    2. Add kryo registration for 2 AMRules-related classes (Perceptron, TargetMean)
    3. Change source delay: from between instances to between batches of instances
    4. Allow Samza EPI without an output stream
    opened by caseyvu 3
  • Automated tests for various algorithms and platforms

    This patch adds a module that contains test-related classes. Tests can be defined and configured through templates. We use the test module to implement tests for several classification algorithms for local, threads and storm platforms (for now).

    Also includes a fix for handling last event in stream.

    Introduces the surefire plugin for tests, so we can get proper test output files and configure test VMs. (we might still have to customize travis for maven's VM parameters)

    opened by matthieumorel 2
  • AMRules

    Centralized and distributed AMRules implementations. The HAMR learner includes a combiner processor that combines the result streams from the model aggregator and the default rule learner.

    opened by caseyvu 1
  • API changes

    1. Abstract classes for topology components & their respective implementation in Simple, S4 and Storm Engine

    2. Option for delay between input instances:

    • changes in EntranceProcessor interface
    • currently only implemented in PrequentialEvaluation
    3. JUnit tests for simple engine
    opened by caseyvu 1
  • Publish snapshots

    Fixes #42. This pull request should automatically publish snapshot artifacts on the Sonatype OSS repository every time a pull-request is merged. I probably need to add some secure environment variable in travis.yml with my Sonatype OSS credentials (so this is just a test). E.g.:

    env: global: secure:

    opened by gdfm 1
  • Add custom serialization for samoa events when using S4

    • pass a custom module defining a custom serializer, that registers samoa event classes and serializers
    • could be made configurable by passing a configuration file for events/serializers mapping, and registering using reflection
    opened by matthieumorel 1