Apache Aurora - A Mesos framework for long-running services, cron jobs, and ad-hoc jobs

Overview


NOTE: The Apache Aurora project has been moved into the Apache Attic. A fork led by members of the former Project Management Committee (PMC) can be found at https://github.com/aurora-scheduler

Apache Aurora lets you use an Apache Mesos cluster as a private cloud. It supports running long-running services, cron jobs, and ad-hoc jobs. Aurora aims to make it extremely quick and easy to take a built application and run it on machines in a cluster, with an emphasis on reliability. It provides basic operations to manage services running in a cluster, such as rolling upgrades.

Put concisely, Aurora is like a distributed monit or distributed supervisord that you can instruct to do things like "run 100 of these, somewhere, forever."
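For context, jobs are described in Aurora's Python-based configuration DSL. The sketch below is a hypothetical hello-world config, assuming the usual tutorial structure; names and values are illustrative, and `Process`/`Task`/`Resources`/`Service` are provided by Aurora's config loader rather than plain Python:

```python
# Hypothetical hello_world.aurora sketch: one process, wrapped in a task,
# run as a long-running service that Aurora keeps alive and reschedules.
hello = Process(
  name = 'hello',
  cmdline = 'while true; do echo hello world; sleep 10; done')

task = Task(
  processes = [hello],
  resources = Resources(cpu = 1.0, ram = 128*MB, disk = 64*MB))

jobs = [Service(
  task = task,
  cluster = 'devcluster',
  role = 'www-data',
  environment = 'devel',
  name = 'hello')]
```

Given a config like this, the scheduler keeps the requested number of instances running, rescheduling them elsewhere in the cluster on failure.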

Features

Aurora is built for users and operators.

  • User-facing Features:

    • Management of long-running services
    • Cron scheduling
    • Resource quotas: provide guaranteed resources for specific applications
    • Rolling job updates, with automatic rollback
    • Multi-user support
    • Sophisticated DSL: supports templating, allowing you to establish common patterns and avoid redundant configurations
    • Service registration: announce services in ZooKeeper for discovery by clients

  • Under the hood, to help you rest easy:

    • Preemption: important services can 'steal' resources when they need it
    • High-availability: resists machine failures and disk failures
    • Scalable: proven to work in data center-sized clusters, with hundreds of users and thousands of jobs
    • Instrumented: a wealth of information makes it easy to monitor and debug

When and when not to use Aurora

Aurora can take over for most uses of software like monit and Chef. Aurora manages applications, while those tools remain useful for managing Aurora and Mesos themselves.

However, if you have very specific scheduling requirements, or are building a system that looks like a scheduler itself, you may want to explore developing your own framework.

Companies using Aurora

Are you using Aurora too? Let us know, or submit a patch to join the list!

Getting Help

If you have questions that aren't answered in our documentation, you can reach out to one of our mailing lists. We're also often available on Slack: #aurora on mesos.slack.com. Invites to our Slack channel may be requested via mesos-slackin.herokuapp.com

You can also file bugs/issues in our GitHub repo.

License

Except as otherwise noted, this software is licensed under the Apache License, Version 2.0.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • Upgrade dev environment to Mesos 1.6.1

    Upgrade dev environment to Mesos 1.6.1

    • Upgraded Mesos dependencies to 1.6.1
    • Uploaded apache-aurora/dev-environment 0.0.17 to Vagrant Cloud.
    • Upgraded Docker2ACI to the latest version.
    • Upgraded Go to the latest version and changed the download location to match recent changes by Google.

    Mesos Info:

    • Mesos 1.6 changelog: https://mesos.apache.org/blog/mesos-1-6-1-released/
    • Mesos update instructions: https://mesos.apache.org/documentation/latest/upgrades/#upgrading-from-1-5-x-to-1-6-x

    Testing done:

    • Ran integration tests
    0.22.0 
    opened by ridv 12
  • update thermos_profile cmdline

    update thermos_profile cmdline

    Description:

    The code for creating the .thermos_profile did not work for me. It failed with the error "EOF command not found." After making this change it worked. I also added a second line to .thermos_profile to give an example of how it is done.
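The change concerns a shell heredoc; "EOF command not found" usually means the terminating delimiter was indented, mistyped, or preceded by stray characters. A minimal sketch of the corrected pattern — the exported variables here are illustrative placeholders, not the tutorial's actual values:

```shell
# Append environment exports to .thermos_profile via a heredoc.
# The closing EOF must appear alone, unindented, on its own line.
cat >> "$HOME/.thermos_profile" <<EOF
export RESULT=hello
export ANOTHER_VAR=world
EOF
```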

    Testing Done:

    Viewed change with GitHub preview. Ran changed code on vagrant VM created in the tutorial.

    0.22.0 
    opened by rcuza 6
  • Staggered (Variable batch) Updates

    Staggered (Variable batch) Updates

    Adding support for variable group sizes when executing an update.

    Design doc for this change is here: https://docs.google.com/document/d/1xGk4ueH8YlmJCk6hQJh85u4to4M1VQD0l630IOchvgY/edit#heading=h.lg3hty82f5cz

    Testing done:

    • Jenkins integration test passed.
    • End to end tests passed.
    • Started an update, stopped the aurora-scheduler, checked out 0.21.0, vagrant ssh'd, ran `aurorabuild all`, and monitored the job.
    • Started an update and restarted the aurora-scheduler at random intervals between 3 and 13 seconds. Verified group size remained correct.

    Feedback from https://reviews.apache.org/r/66192/ addressed.

    UI changes: (screenshots omitted).

    Tests added for UI, Thrift change, and Pystachio changes.

    Moving this over to GitHub since ReviewBoard decided to act up.
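A sketch of the variable-batch semantics this adds, under the assumption (per the design doc) that group sizes are consumed in order and the last size repeats for any remaining instances; the function name and structure are hypothetical, not the actual implementation:

```python
# Partition instance IDs into update groups of varying sizes.
# Once the explicit sizes are exhausted, the last size repeats.
def variable_batches(instance_ids, group_sizes):
    batches, i = [], 0
    for size in group_sizes:
        if i >= len(instance_ids):
            break
        batches.append(instance_ids[i:i + size])
        i += size
    while i < len(instance_ids):
        batches.append(instance_ids[i:i + group_sizes[-1]])
        i += group_sizes[-1]
    return batches

variable_batches(list(range(10)), [1, 2, 3])
# → [[0], [1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

This lets an operator canary an update on a single instance, then widen the blast radius batch by batch.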

    0.22.0 
    opened by ridv 5
  • Auto Pause for Batch based update strategies

    Auto Pause for Batch based update strategies

    Description:

    Completing the work outlined in the following proposal: https://docs.google.com/document/d/1xGk4ueH8YlmJCk6hQJh85u4to4M1VQD0l630IOchvgY/edit

    Gives Batch and VariableBatch update strategies the ability to automatically pause themselves before starting a new batch.
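A minimal sketch of the intended behavior, assuming the strategy pauses before starting each batch and waits for an operator to resume (names and structure are hypothetical, not Aurora's internals):

```python
# Apply an update batch by batch, auto-pausing before each batch
# so an operator can inspect health and explicitly resume.
def run_batches(batches, apply_batch, wait_for_resume):
    for batch in batches:
        wait_for_resume()  # auto-pause point before starting the batch
        apply_batch(batch)
```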

    Testing Done:

    • End to end tests (an unrelated end-to-end test failed during my testing)
    • Integration tests

    To do before merge:

    Add end to end test.

    0.22.0 
    opened by ridv 4
  • Memoize loader.load and loader.load_json

    Memoize loader.load and loader.load_json

    Problem: when reusing the aurora.client.cli.context to process multiple jobkeys from a config_file, the client currently reloads and processes the config file for each jobkey. For large complicated aurora files this can take ~40s to process which makes it expensive to do this every time for each jobkey.

    Solution: memoize the loader.load() on the config file path so that we only load and process each config file once.

    Result: we only need to load and process each config file once. For config files with ~50 job keys and a 40s load time, this significantly reduces the overall time spent inspecting the jobs.
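The technique can be sketched with a cache keyed on the config file path; `load_config` below is an illustrative stand-in for `loader.load`, not Aurora's actual API:

```python
from functools import lru_cache

# Memoize on the file path so repeated lookups for different jobkeys
# in the same config file pay the parse cost only once.
@lru_cache(maxsize=None)
def load_config(path):
    with open(path) as f:
        return f.read()  # stand-in for the expensive config parse
```

With ~50 jobkeys in one file, the first call pays the ~40s parse and the remaining calls hit the cache.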

    0.22.0 
    opened by dknig1b 4
  • Gradle upgrade to 4.10.2

    Gradle upgrade to 4.10.2

    Description:

    • Upgrading to the latest version of Gradle 4.x.
    • Updating the SpotBugs plugin to make it compatible.
    • Renaming a webui Gradle task because it collided with a reserved function name.
    • Marking gradle/wrapper/gradle-wrapper.jar as a binary file; otherwise the line endings get changed and the JAR file breaks.

    Testing Done:

    • Built project
    opened by ridv 3
  • Add observer flag to disable resource metric collection

    Add observer flag to disable resource metric collection

    Description:

    Add observer command line option --disable_task_resource_collection to disable the collection of CPU, memory, and disk metrics for observed tasks. This is useful in setups where metrics cannot be gathered reliably (e.g. when using PID namespaces) or when collection is expensive due to hundreds of active tasks per host.

    Sometimes the hosts are also tightly packed with many small tasks (e.g. ~130 active tasks and ~1000 finished tasks). Even with very relaxed scrape settings of --task_process_collection_interval_secs=3000 and --task_disk_collection_interval_secs=3000 it can take between 150ms-2500ms to render the observer landing page /main. This patch reduces this to about 100ms-150ms.

    There is no immediate downside as metrics reporting is broken anyway due to the PID namespacing: We are running our Mesos agents with enabled PID namespaces (i.e. --isolation='namespaces/ipc,namespaces/pid,...'). In that mode, the PID of the same process is different within the container and outside of it. This breaks the assumption of Thermos that the executor can checkpoint a PID to disk that then can be used by the Observer to show live resource statistics for that PID.

    Testing Done:

    Running in production for over a year now.

    0.22.0 
    opened by StephanErb 3
  • Make schema_helpers.py python3 compatible

    Make schema_helpers.py python3 compatible

    Description:

    This fixes 2 errors in the schema_helpers.py file to make the code Python 3 compatible.
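    The PR doesn't enumerate the two errors; as a generic illustration of the kind of incompatibility such fixes address (this example is hypothetical, not taken from schema_helpers.py):

```python
# dict.iteritems() was removed in Python 3; dict.items() works in both
# Python 2 and 3 (returning a list in 2 and a view in 3).
def sorted_pairs(d):
    return sorted(d.items())

sorted_pairs({'b': 2, 'a': 1})  # → [('a', 1), ('b', 2)]
```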

    0.22.0 
    opened by philipp-sontag-by 3
  •  Fix sandbox permission errors with Mesos 1.6.0

    Fix sandbox permission errors with Mesos 1.6.0

    This change makes Aurora compatible with Mesos 1.6.0 which has changed the permission of sandbox folders (see https://issues.apache.org/jira/browse/MESOS-8332).

    Please have a look at the individual commits for further details and descriptions.

    0.22.0 
    opened by StephanErb 3
  • add mesos role feature

    add mesos role feature

    Problems

    We are the eBay platform team. Previously, we used Marathon to create Jenkins master instances in dedicated VMs and receive resource offers from those same dedicated VMs. For details, please refer to http://www.ebaytechblog.com/2014/04/04/delivering-ebays-ci-solution-with-apache-mesos-part-i/#.VNQUuC6_SPU

    Now we have found Aurora to be more stable and powerful, and we are moving from Marathon to Aurora. During the move, we found that Aurora currently has no Mesos role support, but we need Mesos roles to solve the problem described in the section "Frameworks stopped receiving offers after a while" of the URL above.

    Here is a snippet of the problem description:

    We noticed this occurred after we used Marathon to create the initial set of CI masters. As those CI masters started registering themselves as frameworks, Marathon stopped receiving any offers from Mesos; essentially, no new CI masters could be launched. Let’s start with Marathon. In the DRF model, it was unfair to treat Marathon in the same bucket/role alongside hundreds of connected Jenkins frameworks. After launching all these Jenkins frameworks, Marathon had a large resource share and Mesos would aggressively offer resources to frameworks that were using little or no resources. Marathon was placed last in priority and got starved out.

    We decided to define a dedicated Mesos role for Marathon and to have all of the Mesos slaves that were reserved for Jenkins master instances support that Mesos role. Jenkins frameworks were left with the default role “*”. This solved the problem – Mesos offered resources per role and hence Marathon never got starved out. A framework with a special role will get resource offers from both slaves supporting that special role and also from the default role “*”. However, since we were using placement constraints, Marathon accepted resource offers only from slaves that supported both the role and the placement constraints.

    Solution

    So we added a role feature to the source code to solve the problem in the same way: when accepting a resource offer, Aurora sends back the needed resources to Mesos with the Mesos role from the resource offer.

    How to configure the Mesos role: add the command-line option --mesos_role=${Mesos role name} when starting the Aurora scheduler.

    We changed the test cases according to the code change. Each changed test case is green.

    opened by zhanglong2015 3
  • Aurora 475 Remove Copyright Apache Software Foundation from the source files license header

    Aurora 475 Remove Copyright Apache Software Foundation from the source files license header

    https://reviews.apache.org/r/21882/

    Some of the files have "Copyright 2013 Apache Software Foundation" in the license header comment. This is not the right license header per ASF rules [1]. I believe the year is needed only for the NOTICE file. [1] http://www.apache.org/legal/src-headers.html

    opened by hsaputra 3
  • Upgrading Mesos dependency to 1.7.2

    Upgrading Mesos dependency to 1.7.2

    Description:

    Upgrading Aurora's dependencies on Mesos to 1.7.2

    Addresses #104

    Testing Done:

    Unit tests. End to end tests fail when creating unified containers; specifically, this same issue shows up: https://issues.apache.org/jira/browse/AURORA-1781

    Initial investigation showed that this is because the /etc/groups file doesn't exist in the taskfs directory of the sandbox, which causes the CHROOTed groupadd to fail. May need some help investigating this a bit deeper.

    Creating this as a draft while work is done to fix the end to end tests.

    opened by ridv 0
  • [Discussion] Port End to End tests away from bash

    [Discussion] Port End to End tests away from bash

    Working on trying to fix the end to end test has reminded me how painful the process has become for debugging end to end tests when something goes wrong.

    The current end to end tests stand at 1000+ lines of bash. Running individual tests involves modifying the script, and adding new tests is a very involved process.

    Ideally we should rewrite these tests in a language that's more apt for the job and easier to contribute to.

    I'll throw out some languages to get the discussion started in order of preference:

    • Python
    • Golang
    • Node
    • Ruby

    Theoretically, we could use thrift bindings directly instead of the aurora client, which would decrease coverage for the client itself, but the client already has its own suite of tests.

    I realize porting this code might be a big undertaking, but in my opinion it needs to be done sooner rather than later.

    Looking forward to hearing everyone's opinions.

    opened by ridv 5
  • Aurora pause abnormally in variable batch update strategy

    Aurora pause abnormally in variable batch update strategy

    When the variable batch size is set to [5,5] with auto-pause set to true, it should pause only twice, once for each batch. In practice, however, Aurora pauses while a batch update is still in progress, pausing 4 times in total.

    How to reproduce: variable batch size [5,5], auto-pause set to true, SLA set to 70%.

    opened by chungjin 2
  • Upgrading dependencies to mitigate vulnerabilities

    Upgrading dependencies to mitigate vulnerabilities

    Description:

    A bot recently reported a large number of vulnerabilities that we inherited from our dependencies.

    Creating a draft PR while I verify that these dependency upgrades do not have a negative impact.

    Components upgraded:

    • Curator
    • Zookeeper
    • Shiro
    • Netty
    • Asynchttpclient
    • Quartz
    • Gradle
    • Gradle plugins
    • Jackson
    • Guice
    • Guava
    • Multiple React components

    Testing Done:

    TODO

    We should run a few end to end test runs to confirm everything is good.

    After we merge this PR we need to create a PR for packaging which upgrades the gradle version there.

    opened by ridv 0