A platform for visualization and real-time monitoring of data workflows

Overview

Status

This project is no longer maintained.


Twitter Ambrose is a platform for visualization and real-time monitoring of MapReduce data workflows. It presents a global view of all the MapReduce jobs derived from your workflow after planning and optimization. As jobs are submitted for execution on your Hadoop cluster, Ambrose updates its visualization to reflect the latest job status.

Ambrose provides the following in a web UI:

  • A workflow progress bar depicting percent completion of the entire workflow
  • A table view of all workflow jobs, along with their current state
  • A graph diagram which depicts job dependencies and metrics
    • Visual weighting of jobs based on resource consumption
    • Visual weighting of job dependencies based on data volume
  • Script view with line highlighting (Pig only)

Ambrose is built using the following front-end technologies:

Ambrose is designed to support any workflow runtime. See the following section for supported runtimes.
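
As a purely illustrative sketch of that idea (the names below are not part of the Ambrose API), a runtime back-end amounts to listening for job state changes in the workflow engine and exposing aggregate progress to the UI, for example:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: a back-end tracks per-job state reported by the runtime
    // (e.g. from a Pig listener or Hive hook) and derives workflow-level progress.
    public class WorkflowStatusTracker {

      enum JobState { PENDING, RUNNING, COMPLETE, FAILED }

      private final Map<String, JobState> jobStates = new ConcurrentHashMap<>();

      // Called by the runtime whenever a job changes state.
      public void onJobStateChange(String jobId, JobState newState) {
        jobStates.put(jobId, newState);
      }

      // Percent completion across all jobs, as shown in the workflow progress bar.
      public int percentComplete() {
        if (jobStates.isEmpty()) {
          return 0;
        }
        long done = jobStates.values().stream()
            .filter(s -> s == JobState.COMPLETE).count();
        return (int) (100 * done / jobStates.size());
      }
    }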

Follow @Ambrose on Twitter to stay in touch!

Supported runtimes

Examples

Below is a screenshot of the Ambrose workflow UI. The interface presents multiple responsive "views" of a single workflow. Just beneath the toolbar at the top of the window is a workflow progress bar that tracks overall completion of the workflow. Below the progress bar is a graph diagram that depicts the workflow's jobs and their dependencies. Below the graph diagram is a table of workflow jobs.

All views react to mouseover and click events on a job, regardless of the view on which the event is triggered. For example, moving your mouse over the first row of the table will highlight that job's table row along with the job's node in the graph diagram. Clicking on a job in any view will select it, updating the highlighting of that job in all views. Clicking the same job again will deselect it.

Ambrose workflow screenshot

Quickstart

To get started with Ambrose, first clone the Ambrose GitHub repository:

git clone https://github.com/twitter/ambrose.git
cd ambrose

Next, you can try running the Ambrose demo on your local machine. The demo starts a local web server which serves the front-end client resources and sample data. Start the demo with the following command and then browse to http://localhost:8080/workflow.html?localdata=large:

./bin/ambrose-demo

To run Ambrose with a Pig script, you'll need to build the Ambrose Pig distribution:

mvn package

You can then run the following commands to execute script.pig with an embedded web server which hosts the Ambrose web application:

cd pig/target/ambrose-pig-$VERSION-bin/ambrose-pig-$VERSION
AMBROSE_PORT=8080 ./bin/pig-ambrose -f script.pig

Note that the pig-ambrose script calls the pig script present in your local installation of Pig, so make sure $PIG_HOME/bin is in your path. Now, browse to http://localhost:8080/web/workflow.html to see the progress of your script with the Ambrose workflow UI.

Maven repository

Ambrose releases can be found in the Maven Central Repository under the com.twitter.ambrose group ID.

How to contribute

Bug fixes, features, and documentation improvements are welcome! Please fork the project and send us a pull request on GitHub. You can submit issues on GitHub as well.

Here are some high-level goals we'd love to see contributions for:

  • Improve the front-end client
  • Add other visualization options
  • Create a new back-end for a different runtime environment
  • Create a standalone Ambrose server that's not embedded in the workflow client

Versioning

For transparency and insight into our release cycle, releases will be numbered with the following format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

  • Breaking backwards compatibility bumps the major
  • New additions without breaking backwards compatibility bumps the minor
  • Bug fixes and misc changes bump the patch

For more information on semantic versioning, please visit http://semver.org/.

Authors

License

Copyright 2013 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

Comments
  • Cascading Support (cleaner version)

    Cascading Support (cleaner version)

    I have made a pull request in Cascading for the changes I made to support Ambrose, but I have not received a reply yet: https://github.com/cwensel/cascading/pull/26

    opened by Ahmed--Mohsen 12
  • Hive runtime for Twitter Ambrose

    Hive runtime for Twitter Ambrose

    Hi,

    I'd like to add Hive runtime support for this great tool. Some notes about the implementation:

    • The job progress notification is done using Hive's (0.11.0+) Hook interfaces (see the sketch after this list)
    • One Hive statement represents one workflow. All statements in a script are executed sequentially; those that result in an MR job are visualized. At the end, all workflows can be replayed
    • Currently Ambrose uses Hadoop 0.22. Note that Hive is not compatible with this version, therefore I set Hadoop 0.23.x for Ambrose-hive (also tested with Hadoop 1.x.x)
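
    As a rough illustration of the hook mechanism mentioned above (the class name below is hypothetical; the actual Ambrose-hive hooks ship in the hive module), a Hive ExecuteWithHookContext implementation registered through hive.exec.pre.hooks / hive.exec.post.hooks has this shape:

        import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
        import org.apache.hadoop.hive.ql.hooks.HookContext;

        // Minimal sketch only: logs which statement is starting or finishing.
        // The real Ambrose-hive hooks publish job state derived from the query plan.
        public class LoggingExecHook implements ExecuteWithHookContext {

          @Override
          public void run(HookContext hookContext) throws Exception {
            // HookContext exposes the query plan, from which the MR jobs of the
            // current statement (i.e. one Ambrose workflow) can be derived.
            String queryId = hookContext.getQueryPlan().getQueryId();
            System.out.println(hookContext.getHookType() + " for query " + queryId);
          }
        }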

    More information can be found in hive/README.md. I hope you'll find this extension useful; any suggestions, remarks, etc. are very welcome.

    regards, Lorand

    opened by lbendig 9
  • Supporting Cascading Platform

    Supporting Cascading Platform

    Hello Bill, Andy, Nicolas. Our work focused on extending Cascading to allow it to be supported by Ambrose. The members of the team who added this extension are Ahmed Eshra and Ahmed Mohsen. We did this work as part of a project for the class taught at Alexandria University by Iman Elghandour: http://www.alexeng.edu.eg/~ielghand/teaching/cs713/. We hope our code qualifies to be added as part of Ambrose :) We are also working on some extensions to the UI with other classmates.

    Sorry for the delay, as we worked on some issues in our code.

    opened by engeshra 9
  • Pig script view

    Pig script view

    Added a draggable, resizable div to display the script for Pig. When you mouse over a node or a row in the table, it auto-scrolls to the relevant section and highlights the aliases for you: blue for mapper, yellow for reducer, and green for combiner. The div is closable, and you can clear the colors with the refresh button.

    opened by gzhangT 8
  • Add hraven integration

    Add hraven integration

    Another take on https://github.com/twitter/ambrose/pull/117, with a lot of improvements:

    • getClusters in the dashboard page based on hRavenClusters.properties
    • get appId consistent with hRaven (using hRaven's APIs)

    This code still suffers from a Jackson deserialization problem, which will be solved independently of this PR.

    opened by aniket486 6
  • Hackweek

    Hackweek

    Most significant changes:

    • All client JavaScript code has been refactored / rewritten
    • All Azkaban graph layout Java code has been removed
    • The existing DAG view has been replaced

    I've tested these updates extensively with the demo data, but only with a few actual scripts via Pig local mode. However, stability should already be improved, as the Azkaban code would throw NPEs for some job graphs (those with disconnected components).

    Still TBD:

    • The chord and graph views don't resize with changes to viewport dimensions; the page must be reloaded to properly resize those views.
    • The view colors and other theme-related params should be centralized within the core Ambrose object.
    opened by sagemintblue 6
  • Updates 3rdparty lib versions, fixes pig module

    Updates 3rdparty lib versions, fixes pig module

    I've updated a bunch of 3rd-party lib versions and attempted to fix the pig module, given Pig internal API changes.

    @billonahill, @julienledem if you have any feedback on the pig updates, I'd love to hear it.

    opened by sagemintblue 5
  • Adds text to clarify current dashboard filter values

    Adds text to clarify current dashboard filter values

    Addresses #82 by adding a line of text between the "Workflows" h2 and the table of workflows itself, which conforms to the pattern:

    Workflows on $cluster with status $status and user $user.

    $status and $user default to "any".

    There may be more elegant ways to work this into the design, but the dropdown menus within the navbar don't look good at all if I include their current value in the text label and dynamically change it when the filter value is updated. The resulting shifting of other navbar elements is not so nice.

    opened by sagemintblue 5
  • Show elapsed time support for Ambrose-Hive

    Show elapsed time support for Ambrose-Hive

    The elapsed time feature didn't work for the Hive runtime. AmbroseHiveStatPublisher now instantiates MapReduceJobState with args where the elapsed time is initialized and maintained. Further fixes/improvements:

    • counterGroupMap was missing from event file
    • new demo example (with a prettier DAG)
    • resolve internal join alias name
    • HiveJobTest: add testing metrics in event json
    • minor indentation issues
    opened by lbendig 4
  • Refactors client materials moving them from

    Refactors client materials moving them from "pig" to "client" module

    Also:

    • Replaces ./bin/ambrose-package with mvn package
    • Adds Eclipse Jetty to ambrose-pig deps and removes references to Hadoop installation materials from the pig-ambrose invocation script
    • Adds demo pig script and data to ambrose-pig
    • Updates ambrose-demo script to work with new client web resources location
    • Updates README.md quick start text
    opened by sagemintblue 4
  • Add shutdown method to StatsWriteService

    Add shutdown method to StatsWriteService

    Currently, a cascading driver program that is integrated with an Ambrose listener never actually terminates. The only way to shut down an HRavenStatsWriteService is via a shutdown hook, e.g.:

        // Register a JVM shutdown hook that stops the write service when the driver exits
        Runtime.getRuntime().addShutdownHook(new Thread() {
          @Override
          public void run() {
            shutdown(); // gracefully stops the write service's executor pool
          }
        });
    

    This works in Pig, which explicitly calls System.exit(run(...)), where run drives the entire Pig program. Scalding does not have such an explicit System.exit, so the executor pool hangs forever waiting for instructions.

    This PR adds a StatsWriteService#shutdownWriteService method to gracefully shut down the write service, and adds a call to this method in the cascading onComplete method, which is called when all work for the flow is finished.
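
    A minimal sketch of that shape, assuming a Cascading FlowListener and using a stand-in interface for the method named above (the listener class name is hypothetical):

        import cascading.flow.Flow;
        import cascading.flow.FlowListener;

        // Sketch only: shows where a graceful shutdown call could live.
        public class AmbroseFlowListener implements FlowListener {

          // Stand-in mirroring the StatsWriteService#shutdownWriteService method
          // described in this PR; the real interface lives in the Ambrose core module.
          interface StatsWriteService {
            void shutdownWriteService();
          }

          private final StatsWriteService statsWriteService;

          public AmbroseFlowListener(StatsWriteService statsWriteService) {
            this.statsWriteService = statsWriteService;
          }

          @Override
          public void onStarting(Flow flow) { }

          @Override
          public void onStopping(Flow flow) { }

          @Override
          public void onCompleted(Flow flow) {
            // All work for the flow is finished: release the writer's executor
            // threads so the driver JVM can exit without a shutdown hook.
            statsWriteService.shutdownWriteService();
          }

          @Override
          public boolean onThrowable(Flow flow, Throwable throwable) {
            statsWriteService.shutdownWriteService();
            return false; // not handled here; let Cascading rethrow the failure
          }
        }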

    opened by amatsukawa 3