Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more

Nathan Marz

Last update: Dec 26, 2022

Related tags

Big data storm

Overview

IMPORTANT NOTE!!!

Storm has Moved to Apache. The official Storm git repository is now hosted by Apache, and is mirrored on github here:

https://github.com/apache/incubator-storm

Contributing

Source code contributions can be submitted either by sumitting a pull request or by creating an issue in JIRA and attaching patches.

Migrating Git Repos from nathanmarz/storm to apache/incubator-storm

If you have an existing fork/clone of nathanmarz/storm, you can migrate to apache/incubator-storm by doing the following:

Create a new fork of apache/incubator-storm

Point your existing clone to the new fork:

 git remote remove origin
 git remote add origin [email protected]:username/incubator-storm.git

Issue Tracking

The official issue tracker for Storm is Apache JIRA:

https://issues.apache.org/jira/browse/STORM

User Mailing List

Storm users should send messages and subscribe to [email protected].

You can subscribe to this list by sending an email to [email protected]. Likewise, you can cancel a subscription by sending an email to [email protected].

You can view the archives of the mailing list here.

Developer Mailing List

Storm developers should send messages and subscribe to [email protected].

You can subscribe to this list by sending an email to [email protected]. Likewise, you can cancel a subscription by sending an email to [email protected].

You can view the archives of the mailing list here.

Which list should I send/subscribe to?

If you are using a pre-built binary distribution of Storm, then chances are you should send questions, comments, storm-related announcements, etc. to [email protected].

If you are building storm from source, developing new features, or otherwise hacking storm source code, then [email protected] is more appropriate.

What will happen with [email protected]?

All existing messages will remain archived there, and can be accessed/searched here.

New messages sent to [email protected] will either be rejected/bounced or replied to with a message to direct the email to the appropriate Apache-hosted group.

Comments

Ensure we don't overflow the backoff value.

The first attempt to fix this (213102b36f890) did not correctly address the issue. The 32 bit signed integer frequently overflows, resulting in a bad value for Random.nextInt().

See previous pull request @ https://github.com/nathanmarz/storm/pull/713

Here's the exception:

2013-10-30 16:30:04 b.s.m.n.Client [INFO] Reconnect ... [26]
2013-10-30 16:30:04 b.s.m.n.Client [INFO] Reconnect ... [29]
2013-10-30 16:30:04 b.s.m.n.Client [INFO] Reconnect ... [27]
2013-10-30 16:30:04 b.s.m.n.Client [INFO] Reconnect ... [30]
2013-10-30 16:30:04 STDIO [ERROR] Oct 30, 2013 4:30:04 PM org.jboss.netty.channel.DefaultChannelPipeline
WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x27388d8c] EXCEPTION: java.net.ConnectException: Connection refused: i-0e040e75/10.58.82.253:31002)
java.lang.IllegalArgumentException: n must be positive
    at java.util.Random.nextInt(Random.java:300)
    at backtype.storm.messaging.netty.Client.getSleepTimeMs(Client.java:97)
    at backtype.storm.messaging.netty.Client.reconnect(Client.java:78)
    at backtype.storm.messaging.netty.StormClientHandler.exceptionCaught(StormClientHandler.java:108)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:377)
    at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
    at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:110)
    at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
2013-10-30 16:30:05 b.s.m.n.Client [INFO] Reconnect ... [29]
2013-10-30 16:30:06 b.s.m.n.Client [INFO] Reconnect ... [28]

opened by brndnmtthws 32

ShellSpout, async ShellBolt, python implementation

The _readTaskIds/_readTuple trampoline/queue stuff in the Python implementation seems funny.. not sure about that — I am not fluent in Python. Still working on a node.js version where the lack of synchronization around task ids after emits is no problem.

I haven't yet tested reliable spouts. Will do that sometime soon.

I made a small change to the protocol, expecting and "end\n" after the pid as well, to be consistent. I could easily add back in the special case.

opened by tomjack 15
Replace 0MQ with Netty

This is required in order to release from Apache. Because 0MQ/JZMQ is LGPL-licensed, it can't be included as a dependency in Apache releases.

This patch simply removes JZMQ/0MQ and replaces it with the netty transport implementation.

We can still make the 0MQ transport available to users who want to use it. But it will have to be a separate project that's not part of Apache. I will follow up with creating a repo that has the 0MQ transport code in isolation.

opened by ptgoetz 13
Apply Thrift SASL client/server framework for authentication/authorization/audit

For issue #397, enclosed please find the basic authentication/authorization/audit framework based on Thrift SASL implementation. Nimbus integration and DRPC integration will be pushed as follow-up pull requests.

Authentication: - Support SASL authentication framework - Implemented SASL mechanisms: Anonymous, Kerberos, Digest - JAAS configuration

Authorization: - Enable plugins for Nimbus authorization against topology submission/killing/activation/deactivation/rebalance. - Plugin will have access to request context such as remote IP address, user name/principal etc.

Audit: - Simple log of operations performed by principals

opened by anfeng 13
Use exec instead of forked child process for JVM

Using os.system() for creating the JVM forks a child of the Python process. This makes it difficult for process managers like daemontools or runit to kill/restart the process at admins' command: the process manager tracks the PID of the Python process instead of that of the JVM; and terminating the Python process unfortunately doesn't kill its JVM child.

Some process managers let you configure the number of times the service forks, or try to detect forks, but it's a bit of a rabbit-hole. Better to make the service not fork in the first place.

For that reason I've replaced the call to os.system() with os.execvp(), which replaces the Python process with the JVM, and thus lets the process manager track the right PID.

os.execvp() requires that you give command-line arguments as an array of strings, rather than one long whitespace-separated string, so I had to change some of the surrounding code accordingly. Note this has the pleasant side-effect that Storm should now also run if you install it in a directory that has spaces in its path name.

opened by ept 12
Plugin mechanism for messaging layer (Issue #372)
This pull request proposes a plugin mechanism for messaging layer.

We are introducing a new Storm config parameter, storm.messaging.transport, to specify a desired plugin. For now, defaults.yaml states that zmq is such a transport.

A messaging plugin needs to implements a simple interface, ITransport, and should have a default constructor and implements, newContext(), to return an IContext. public interface ITransport { public IContext newContext(); }

IContext is defined to have the following methods, where prepare(storm_conf) is invoked immediately after IContext is constructed by Storm core (TransportFactory). public interface IContext { public void prepare(Map storm_conf);

public IConnection bind(String storm_id, int port); public IConnection connect(String storm_id, String host, int port); public void term();

};

IConnection is defined as below, where TaskMessage is a simple Java class contains task ID and message. public interface IConnection {
public TaskMessage recv(); public TaskMessage recv_with_flags(int flags); public void send(int task, byte[] message); public void close(); }

I have modified zmq.clj to implements all these interfaces. local.clj has been revised to implement IContext and IConnection.

worker.clj has been revised slightly to leverage the plugin mechanism. It does not knows ZMQ anymore :-)
opened by anfeng 10
Transition away from Java serialization for storing state on disk or in Zookeeper

Addresses issue #419. Specifically, transitions Supervisor to clojure serialization, StormTopology to thrift, and maybe some other things I'm forgetting.

Also my first go-around with clojure. Any code written in clojure should be considered suspect.

opened by hausdorff 10
Nimbus storage

Nimbus storage abstraction implementation, this is required for Nimbus HA implementation (#360).

I'd run only "lein test" and some handmade sanity tests on my laptop.

opened by Frostman 10
logviewer tries to determine log directory via logback configuration

This introduces a new config variable, logviewer.appender.name, and modifies the logviewer process to locate the root log directory based on the the FileAppender named in the conf. Previously the logviewer assumed that the log directory was {storm.home}/logs.

If the named appender is not found or is not a FileAppender then the logviewer immediately throws a RuntimeException.

This issue is mentioned on the mailing list here: https://groups.google.com/d/msg/storm-user/IKRtIkqQfqc/OUDvyU6mxTMJ

opened by strongh 9
Netty based implementation for storm messaging
This pull request implements Storm messaging layer in Netty.

Client and server could be started in any order

Messages will be delivered in batch if available

Large messages are supported

The basic functionality has been verified via a simple topology test. I have also included some unit test to show its functionality.
opened by anfeng 9
Fix race condition on zeromq sockets between the threads that send tuple...

Hi Nathan, we are facing the same issue as this one: http://groups.google.com/group/storm-user/browse_thread/thread/932da776159fd063 but we do not see java.lang.VerifyError as Dane did.

As we investigate the issue, we find that the cause of the problem is mainly due to the race condition on using and managing socket connections between the sender threads and the refresher thread. If a worker finds that some of its downstream connections have dead, and refreshes the dead connections, the closed socket may contain NULL values and hence the crash happens.

A patch is attached. How do you think about it?

opened by herberteuler 8
Provide Additional String substitutions for *.worker.childopts

Prividing additional substitution strings for the jvm opts provided to workers launched by this supervisor - "%ID%", "%WORKER-ID%", "%STORM-ID%" and "%WORKER-PORT%".

opened by kishorvpatil 0
storm topology online update
This PR is for #540. The core code modification is about 300 lines and many others are generated by thrift.

main points:

interface changes (compatible with old versions) 1.1 zk add :topology-version and :update-duration-sec fields to StormBase:status map 1.2 zk add :topology-version to executor heartbeat 1.3 worker local state add versions dir to storm worker's current running topology-version 1.4 nimbus add updateTopology interface 1.5 add topology-version field to storm.thrift three struct: TopologySummary ExecutorSummary TopologyInfo

update process 2.1 storm client run "storm jar xxxx -c topology.update=true" to invoke topology update process 2.2 storm client upload new jar file to nimbus 2.3 storm client call nimbus updateTopology interface 2.4 nimbus check the new topology and replace stormdist/storm-id dir 2.5 nimbus update StormBase in zk, set :topology-version(for destination version) and :update-duration-sec(for all workers update process duration) fields to StormBase:status map 2.6 supervisors check zk StormBase and do the update work if topology's local version is not the same with zk version 2.6.1 sync-supervisor download the latest code from nimbus to local stormdist/topology-version dir 2.6.2 each supervisor schedule the topology's worker update at a rand(expect-max-update-time) time point 2.6.3 sync-process check local worker version, if it is not the same with sync-supervisor downloaded version and update time point reached, set worker state to a new :update state 2.6.4 sync-process kill workers in :update state as normally 2.6.5 sync-process restart killed worker as normally, expect that read topology and conf from stormdist/topology-version dir 2.6.6 new worker heartbeat to zk with new topology-version, it can be displayed on web ui to check update progress
opened by xiaokang 0
Enable DRPC request to be sent via HTTP and/or Thrift
Currently, DRPC requests could only be sent via Thrift API. We have seen various users asking for HTTP interface.

This pull request enable one to configure drpc to be sent via HTTP and/or thrift via storm.yaml: drpc.port: <THRIFT_PORT> (default: 3772) drpc.http.port: <HTTP_PORT> (default: unavailable)

When drpc server is started, it will look into these configuration parameters to decide whether Thrift port and/or HTTP port should be binded.

DRPC HTTP request will be received via GET via the following URI:

http://<server>:<HTTP_PORT>/drpc/<Func>/<Args>

http://<server>:<HTTP_PORT>/drpc/<Func>/

http://<server>:<HTTP_PORT>/drpc/<Func>
opened by anfeng 1

Owner

Nathan Marz

GitHub http://storm-project.net

Stream summarizer and cardinality estimator.

Description A Java library for summarizing data in streams for which it is infeasible to store all events. More specifically, there are classes for es

2.2k Dec 30, 2022

Access paged data as a "stream" with async loading while maintaining order

DataStream What? DataStream is a simple piece of code to access paged data and interface it as if it's a single "list". It only keeps track of queued

1 Jan 19, 2022

Hadoop library for large-scale data processing, now an Apache Incubator project

Apache DataFu Follow @apachedatafu Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by

589 Apr 1, 2022

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

Apache Zeppelin Documentation: User Guide Mailing Lists: User and Dev mailing list Continuous Integration: Contributing: Contribution Guide Issue Trac

5.9k Jan 8, 2023

A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.

Apache Gobblin Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems. Ca

2.1k Jan 4, 2023

Netflix's distributed Data Pipeline

Suro: Netflix's Data Pipeline Suro is a data pipeline service for collecting, aggregating, and dispatching large volume of application events includin

772 Dec 9, 2022

The official home of the Presto distributed SQL query engine for big data

Presto Presto is a distributed SQL query engine for big data. See the User Manual for deployment instructions and end user documentation. Requirements

14.3k Jan 5, 2023

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Elephant Bird About Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats,

1.1k Jan 5, 2023

OpenRefine is a free, open source power tool for working with messy data and improving it

OpenRefine OpenRefine is a Java-based power tool that allows you to load data, understand it, clean it up, reconcile it, and augment it with data comi

9.2k Jan 1, 2023

Machine Learning Platform and Recommendation Engine built on Kubernetes

Update January 2018 Seldon Core open sourced. Seldon Core focuses purely on deploying a wide range of ML models on Kubernetes, allowing complex runtim

1.5k Dec 15, 2022

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop

Elasticsearch Hadoop Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Apache Hive, Apache Pig, Apach

1.9k Dec 22, 2022

A platform for visualization and real-time monitoring of data workflows

Status This project is no longer maintained. Ambrose Twitter Ambrose is a platform for visualization and real-time monitoring of MapReduce data workfl

1.2k Dec 31, 2022

Desktop app to browse and administer your MongoDB cluster

UMONGO, the MongoDB GUI UMONGO, the MongoDB GUI About This version of UMongo is provided as free software by EdgyTech LLC. UMongo is open source, and

583 Nov 11, 2022

A scalable, mature and versatile web crawler based on Apache Storm

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under Apache Li

776 Jan 2, 2023

Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter

Heron is a realtime analytics platform developed by Twitter. It has a wide array of architectural improvements over it's predecessor. Heron in Apache

3.6k Dec 28, 2022

Distributed, masterless, high performance, fault tolerant data processing

Onyx What is it? a masterless, cloud scale, fault tolerant, high performance distributed computation system batch and stream hybrid processing model e

2k Dec 30, 2022

A fault tolerant, protocol-agnostic RPC system

Finagle Status This project is used in production at Twitter (and many other organizations), and is being actively developed and maintained. Releases

8.5k Jan 4, 2023

A reactive Java framework for building fault-tolerant distributed systems

Atomix Website | Javadoc | Slack | Google Group A reactive Java framework for building fault-tolerant distributed systems Please see the website for f

2.3k Dec 29, 2022

Stream Processing and Complex Event Processing Engine

Siddhi Core Libraries Siddhi is a cloud native Streaming and Complex Event Processing engine that understands Streaming SQL queries in order to captur

1.4k Jan 6, 2023

Table-Computing (Simplified as TC) is a distributed light weighted, high performance and low latency stream processing and data analysis framework. Milliseconds latency and 10+ times faster than Flink for complicated use cases.

Table-Computing Welcome to the Table-Computing GitHub. Table-Computing (Simplified as TC) is a distributed light weighted, high performance and low la

34 Oct 14, 2022