A platform for visualization and real-time monitoring of data workflows

Overview

Status

This project is no longer maintained.


Twitter Ambrose is a platform for visualization and real-time monitoring of MapReduce data workflows. It presents a global view of all the MapReduce jobs derived from your workflow after planning and optimization. As jobs are submitted for execution on your Hadoop cluster, Ambrose updates its visualization to reflect the latest job status.

Ambrose provides the following in a web UI:

  • A workflow progress bar depicting percent completion of the entire workflow
  • A table view of all workflow jobs, along with their current state
  • A graph diagram which depicts job dependencies and metrics
    • Visual weighting of jobs based on resource consumption
    • Visual weighting of job dependencies based on data volume
  • Script view with line highlighting (Pig only)

Ambrose is built using the following front-end technologies:

Ambrose is designed to support any workflow runtime. See the following section for supported runtimes.
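
As a purely illustrative sketch of that idea (the names below are not part of the Ambrose API), a runtime back-end amounts to listening for job state changes in the workflow engine and exposing aggregate progress to the UI, for example:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: a back-end tracks per-job state reported by the runtime
    // (e.g. from a Pig listener or Hive hook) and derives workflow-level progress.
    public class WorkflowStatusTracker {

      enum JobState { PENDING, RUNNING, COMPLETE, FAILED }

      private final Map<String, JobState> jobStates = new ConcurrentHashMap<>();

      // Called by the runtime whenever a job changes state.
      public void onJobStateChange(String jobId, JobState newState) {
        jobStates.put(jobId, newState);
      }

      // Percent completion across all jobs, as shown in the workflow progress bar.
      public int percentComplete() {
        if (jobStates.isEmpty()) {
          return 0;
        }
        long done = jobStates.values().stream()
            .filter(s -> s == JobState.COMPLETE).count();
        return (int) (100 * done / jobStates.size());
      }
    }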

Follow @Ambrose on Twitter to stay in touch!

Supported runtimes

Examples

Below is a screenshot of the Ambrose workflow UI. The interface presents multiple responsive "views" of a single workflow. Just beneath the toolbar at the top of the window is a workflow progress bar that tracks overall completion of the workflow. Below the progress bar is a graph diagram that depicts the workflow's jobs and their dependencies. Below the graph diagram is a table of workflow jobs.

All views react to mouseover and click events on a job, regardless of the view on which the event is triggered. For example, moving your mouse over the first row of the table will highlight that job's table row along with the job's node in the graph diagram. Clicking on a job in any view will select it, updating the highlighting of that job in all views. Clicking the same job again will deselect it.

Ambrose workflow screenshot

Quickstart

To get started with Ambrose, first clone the Ambrose GitHub repository:

git clone https://github.com/twitter/ambrose.git
cd ambrose

Next, you can try running the Ambrose demo on your local machine. The demo starts a local web server which serves the front-end client resources and sample data. Start the demo with the following command and then browse to http://localhost:8080/workflow.html?localdata=large:

./bin/ambrose-demo

To run Ambrose with a Pig script, you'll need to build the Ambrose Pig distribution:

mvn package

You can then run the following commands to execute script.pig with an embedded web server which hosts the Ambrose web application:

cd pig/target/ambrose-pig-$VERSION-bin/ambrose-pig-$VERSION
AMBROSE_PORT=8080 ./bin/pig-ambrose -f script.pig

Note that the pig-ambrose script calls the pig script present in your local installation of Pig, so make sure $PIG_HOME/bin is in your path. Now, browse to http://localhost:8080/web/workflow.html to see the progress of your script with the Ambrose workflow UI.

Maven repository

Ambrose releases can be found in the Maven Central Repository under the com.twitter.ambrose group ID.

How to contribute

Bug fixes, features, and documentation improvements are welcome! Please fork the project and send us a pull request on GitHub. You can submit issues on GitHub as well.

Here are some high-level goals we'd love to see contributions for:

  • Improve the front-end client
  • Add other visualization options
  • Create a new back-end for a different runtime environment
  • Create a standalone Ambrose server that's not embedded in the workflow client

Versioning

For transparency and insight into our release cycle, releases will be numbered with the following format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

  • Breaking backwards compatibility bumps the major
  • New additions without breaking backwards compatibility bumps the minor
  • Bug fixes and misc changes bump the patch

For more information on semantic versioning, please visit http://semver.org/.

Authors

License

Copyright 2013 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

Comments
  • Cascading Support (cleaner version)

    Cascading Support (cleaner version)

    I have made a pull request in Cascading for the changes I made to support Ambrose, but I have not received a reply yet: https://github.com/cwensel/cascading/pull/26

    opened by Ahmed--Mohsen 12
  • Hive runtime for Twitter Ambrose

    Hive runtime for Twitter Ambrose

    Hi,

    I'd like to add Hive runtime support for this great tool. Some notes about the implementation:

    • The job progress notification is done using Hive's (0.11.0+) Hook interfaces (see the sketch after this list)
    • One Hive statement represents one workflow. All statements in a script are executed sequentially; those that result in an MR job are visualized. At the end, all workflows can be replayed
    • Currently Ambrose uses Hadoop 0.22. Note that Hive is not compatible with this version, therefore I set Hadoop 0.23.x for Ambrose-hive (also tested with Hadoop 1.x.x)
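
    As a rough illustration of the hook mechanism mentioned above (the class name below is hypothetical; the actual Ambrose-hive hooks ship in the hive module), a Hive ExecuteWithHookContext implementation registered through hive.exec.pre.hooks / hive.exec.post.hooks has this shape:

        import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
        import org.apache.hadoop.hive.ql.hooks.HookContext;

        // Minimal sketch only: logs which statement is starting or finishing.
        // The real Ambrose-hive hooks publish job state derived from the query plan.
        public class LoggingExecHook implements ExecuteWithHookContext {

          @Override
          public void run(HookContext hookContext) throws Exception {
            // HookContext exposes the query plan, from which the MR jobs of the
            // current statement (i.e. one Ambrose workflow) can be derived.
            String queryId = hookContext.getQueryPlan().getQueryId();
            System.out.println(hookContext.getHookType() + " for query " + queryId);
          }
        }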

    More information can be found in hive/README.md. I hope you'll find this extension useful; any suggestions, remarks, etc. are very welcome.

    regards, Lorand

    opened by lbendig 9
  • Supporting Cascading Platform

    Supporting Cascading Platform

    Hello Bill, Andy, Nicolas. Our work focused on extending Cascading to allow it to be supported by Ambrose. The members of the team who added this extension are Ahmed Eshra and Ahmed Mohsen. We did this work as part of a project for the class taught at Alexandria University by Iman Elghandour: http://www.alexeng.edu.eg/~ielghand/teaching/cs713/. We hope our code qualifies to be added as part of Ambrose :) We are also working on some extensions to the UI with other classmates.

    Sorry for the delay, as we worked on some issues in our code.

    opened by engeshra 9
  • Pig script view

    Pig script view

    Added a draggable, resizable div to display the script for Pig. When you mouse over a node or a row in the table, it auto-scrolls to the relevant section and highlights the aliases for you: blue for mapper, yellow for reducer, and green for combiner. The div is closable, and you can clear the colors with the refresh button.

    opened by gzhangT 8
  • Add hraven integration

    Add hraven integration

    Another take on https://github.com/twitter/ambrose/pull/117, with a lot of improvements:

    • getClusters in the dashboard page based on hRavenClusters.properties
    • get appId consistent with hRaven (using hRaven's APIs)

    This code still suffers from a Jackson deserialization problem, which will be solved independently of this PR.

    opened by aniket486 6
  • Hackweek

    Hackweek

    Most significant changes:

    • All client JavaScript code has been refactored / rewritten
    • All Azkaban graph layout Java code has been removed
    • The existing DAG view has been replaced

    I've tested these updates extensively with the demo data, but only with a few actual scripts via Pig local mode. However, stability should already be improved, as the Azkaban code would throw NPEs for some job graphs (those with disconnected components).

    Still TBD:

    • The chord and graph views don't resize with changes to viewport dimensions; the page must be reloaded to properly resize those views.
    • The view colors and other theme-related params should be centralized within the core Ambrose object.
    opened by sagemintblue 6
  • Updates 3rdparty lib versions, fixes pig module

    Updates 3rdparty lib versions, fixes pig module

    I've updated a bunch of 3rd-party lib versions and attempted to fix the pig module, given Pig internal API changes.

    @billonahill, @julienledem if you have any feedback on the pig updates, I'd love to hear it.

    opened by sagemintblue 5
  • Adds text to clarify current dashboard filter values

    Adds text to clarify current dashboard filter values

    Addresses #82 by adding a line of text between the "Workflows" h2 and the table of workflows itself, which conforms to the pattern:

    Workflows on $cluster with status $status and user $user.

    $status and $user default to "any".

    There may be more elegant ways to work this into the design, but the dropdown menus within the navbar don't look good at all if I include their current value in the text label and dynamically change it when the filter value is updated. The resulting shifting of other navbar elements is not so nice.

    opened by sagemintblue 5
  • Show elapsed time support for Ambrose-Hive

    Show elapsed time support for Ambrose-Hive

    The elapsed time feature didn't work for the Hive runtime. AmbroseHiveStatPublisher now instantiates MapReduceJobState with args where the elapsed time is initialized and maintained. Further fixes/improvements:

    • counterGroupMap was missing from event file
    • new demo example (with a prettier DAG)
    • resolve internal join alias name
    • HiveJobTest: add testing metrics in event json
    • minor indentation issues
    opened by lbendig 4
  • Refactors client materials moving them from

    Refactors client materials moving them from "pig" to "client" module

    Also:

    • Replaces ./bin/ambrose-package with mvn package
    • Adds Eclipse Jetty to ambrose-pig deps and removes references to Hadoop installation materials from the pig-ambrose invocation script
    • Adds demo pig script and data to ambrose-pig
    • Updates ambrose-demo script to work with new client web resources location
    • Updates README.md quick start text
    opened by sagemintblue 4
  • Add shutdown method to StatsWriteService

    Add shutdown method to StatsWriteService

    Currently, a cascading driver program that is integrated with an Ambrose listener never actually terminates. The only way to shut down an HRavenStatsWriteService is via a shutdown hook, e.g.:

        // Register a JVM shutdown hook that stops the write service when the driver exits
        Runtime.getRuntime().addShutdownHook(new Thread() {
          @Override
          public void run() {
            shutdown(); // gracefully stops the write service's executor pool
          }
        });
    

    This works in Pig, which explicitly calls System.exit(run(...)), where run drives the entire Pig program. Scalding does not have such an explicit System.exit, so the executor pool hangs forever waiting for instructions.

    This PR adds a StatsWriteService#shutdownWriteService method to gracefully shut down the write service, and adds a call to this method in the cascading onComplete method, which is called when all work for the flow is finished.
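
    A minimal sketch of that shape, assuming a Cascading FlowListener and using a stand-in interface for the method named above (the listener class name is hypothetical):

        import cascading.flow.Flow;
        import cascading.flow.FlowListener;

        // Sketch only: shows where a graceful shutdown call could live.
        public class AmbroseFlowListener implements FlowListener {

          // Stand-in mirroring the StatsWriteService#shutdownWriteService method
          // described in this PR; the real interface lives in the Ambrose core module.
          interface StatsWriteService {
            void shutdownWriteService();
          }

          private final StatsWriteService statsWriteService;

          public AmbroseFlowListener(StatsWriteService statsWriteService) {
            this.statsWriteService = statsWriteService;
          }

          @Override
          public void onStarting(Flow flow) { }

          @Override
          public void onStopping(Flow flow) { }

          @Override
          public void onCompleted(Flow flow) {
            // All work for the flow is finished: release the writer's executor
            // threads so the driver JVM can exit without a shutdown hook.
            statsWriteService.shutdownWriteService();
          }

          @Override
          public boolean onThrowable(Flow flow, Throwable throwable) {
            statsWriteService.shutdownWriteService();
            return false; // not handled here; let Cascading rethrow the failure
          }
        }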

    opened by amatsukawa 3