Apache Aurora - A Mesos framework for long-running services, cron jobs, and ad-hoc jobs

Overview


NOTE: The Apache Aurora project has been moved into the Apache Attic. A fork led by members of the former Project Management Committee (PMC) can be found at https://github.com/aurora-scheduler

Apache Aurora lets you use an Apache Mesos cluster as a private cloud. It supports running long-running services, cron jobs, and ad-hoc jobs. Aurora aims to make it extremely quick and easy to take a built application and run it on machines in a cluster, with an emphasis on reliability. It provides basic operations to manage services running in a cluster, such as rolling upgrades.

Put concisely, Aurora is like a distributed monit or distributed supervisord that you can instruct to do things like "run 100 of these, somewhere, forever."
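For context, jobs are described in Aurora's Python-based configuration DSL. The sketch below is a hypothetical hello-world config, assuming the usual tutorial structure; names and values are illustrative, and `Process`/`Task`/`Resources`/`Service` are provided by Aurora's config loader rather than plain Python:

```python
# Hypothetical hello_world.aurora sketch: one process, wrapped in a task,
# run as a long-running service that Aurora keeps alive and reschedules.
hello = Process(
  name = 'hello',
  cmdline = 'while true; do echo hello world; sleep 10; done')

task = Task(
  processes = [hello],
  resources = Resources(cpu = 1.0, ram = 128*MB, disk = 64*MB))

jobs = [Service(
  task = task,
  cluster = 'devcluster',
  role = 'www-data',
  environment = 'devel',
  name = 'hello')]
```

Given a config like this, the scheduler keeps the requested number of instances running, rescheduling them elsewhere in the cluster on failure.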

Features

Aurora is built for users and operators.

  • User-facing Features:

    • Management of long-running services
    • Cron scheduling
    • Resource quotas: provide guaranteed resources for specific applications
    • Rolling job updates, with automatic rollback
    • Multi-user support
    • Sophisticated DSL: supports templating, allowing you to establish common patterns and avoid redundant configurations
    • Service registration: announce services in ZooKeeper for discovery by clients

  • Under the hood, to help you rest easy:

    • Preemption: important services can 'steal' resources when they need it
    • High-availability: resists machine failures and disk failures
    • Scalable: proven to work in data center-sized clusters, with hundreds of users and thousands of jobs
    • Instrumented: a wealth of information makes it easy to monitor and debug

When and when not to use Aurora

Aurora can take over for most uses of software like monit and Chef. Aurora manages applications, while those tools remain useful for managing Aurora and Mesos themselves.

However, if you have very specific scheduling requirements, or are building a system that looks like a scheduler itself, you may want to explore developing your own framework.

Companies using Aurora

Are you using Aurora too? Let us know, or submit a patch to join the list!

Getting Help

If you have questions that aren't answered in our documentation, you can reach out to one of our mailing lists. We're also often available on Slack: #aurora on mesos.slack.com. Invites to our Slack channel may be requested via mesos-slackin.herokuapp.com

You can also file bugs/issues in our GitHub repo.

License

Except as otherwise noted, this software is licensed under the Apache License, Version 2.0.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • Upgrade dev environment to Mesos 1.6.1

    Upgrade dev environment to Mesos 1.6.1

    • Upgraded Mesos dependencies to 1.6.1
    • Uploaded apache-aurora/dev-environment 0.0.17 to Vagrant Cloud.
    • Upgraded Docker2ACI to the latest version.
    • Upgraded Go to the latest version and changed the download location to match recent changes by Google.

    Mesos Info:

    • Mesos 1.6 changelog: https://mesos.apache.org/blog/mesos-1-6-1-released/
    • Mesos update instructions: https://mesos.apache.org/documentation/latest/upgrades/#upgrading-from-1-5-x-to-1-6-x

    Testing done:

    • Ran integration tests
    0.22.0 
    opened by ridv 12
  • update thermos_profile cmdline

    update thermos_profile cmdline

    Description:

    The code for creating the .thermos_profile did not work for me. It failed with the error "EOF command not found." After making this change it worked. I also added a second line to .thermos_profile to give an example of how it is done.
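The change concerns a shell heredoc; "EOF command not found" usually means the terminating delimiter was indented, mistyped, or preceded by stray characters. A minimal sketch of the corrected pattern — the exported variables here are illustrative placeholders, not the tutorial's actual values:

```shell
# Append environment exports to .thermos_profile via a heredoc.
# The closing EOF must appear alone, unindented, on its own line.
cat >> "$HOME/.thermos_profile" <<EOF
export RESULT=hello
export ANOTHER_VAR=world
EOF
```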

    Testing Done:

    Viewed change with GitHub preview. Ran changed code on vagrant VM created in the tutorial.

    0.22.0 
    opened by rcuza 6
  • Staggered (Variable batch) Updates

    Staggered (Variable batch) Updates

    Adding support for variable group sizes when executing an update.

    Design doc for this change is here: https://docs.google.com/document/d/1xGk4ueH8YlmJCk6hQJh85u4to4M1VQD0l630IOchvgY/edit#heading=h.lg3hty82f5cz

    Testing done:

    • Jenkins integration test passed.
    • End to end tests passed.
    • Started an update, stopped the aurora-scheduler, checked out 0.21.0, vagrant ssh'd, ran `aurorabuild all`, and monitored the job.
    • Started an update and restarted the aurora-scheduler at random intervals between 3 and 13 seconds. Verified group size remained correct.

    Feedback from https://reviews.apache.org/r/66192/ addressed.

    UI changes: (screenshots omitted).

    Tests added for UI, Thrift change, and Pystachio changes.

    Moving this over to GitHub since ReviewBoard decided to act up.
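A sketch of the variable-batch semantics this adds, under the assumption (per the design doc) that group sizes are consumed in order and the last size repeats for any remaining instances; the function name and structure are hypothetical, not the actual implementation:

```python
# Partition instance IDs into update groups of varying sizes.
# Once the explicit sizes are exhausted, the last size repeats.
def variable_batches(instance_ids, group_sizes):
    batches, i = [], 0
    for size in group_sizes:
        if i >= len(instance_ids):
            break
        batches.append(instance_ids[i:i + size])
        i += size
    while i < len(instance_ids):
        batches.append(instance_ids[i:i + group_sizes[-1]])
        i += group_sizes[-1]
    return batches

variable_batches(list(range(10)), [1, 2, 3])
# → [[0], [1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

This lets an operator canary an update on a single instance, then widen the blast radius batch by batch.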

    0.22.0 
    opened by ridv 5
  • Auto Pause for Batch based update strategies

    Auto Pause for Batch based update strategies

    Description:

    Completing the work outlined in the following proposal: https://docs.google.com/document/d/1xGk4ueH8YlmJCk6hQJh85u4to4M1VQD0l630IOchvgY/edit

    Gives Batch and VariableBatch update strategies the ability to automatically pause themselves before starting a new batch.
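A minimal sketch of the intended behavior, assuming the strategy pauses before starting each batch and waits for an operator to resume (names and structure are hypothetical, not Aurora's internals):

```python
# Apply an update batch by batch, auto-pausing before each batch
# so an operator can inspect health and explicitly resume.
def run_batches(batches, apply_batch, wait_for_resume):
    for batch in batches:
        wait_for_resume()  # auto-pause point before starting the batch
        apply_batch(batch)
```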

    Testing Done:

    • End to end tests (an unrelated end-to-end test failed during my testing)
    • Integration tests

    To do before merge:

    Add end to end test.

    0.22.0 
    opened by ridv 4
  • Memoize loader.load and loader.load_json

    Memoize loader.load and loader.load_json

    Problem: when reusing the aurora.client.cli.context to process multiple jobkeys from a config_file, the client currently reloads and processes the config file for each jobkey. For large complicated aurora files this can take ~40s to process which makes it expensive to do this every time for each jobkey.

    Solution: memoize the loader.load() on the config file path so that we only load and process each config file once.

    Result: we only need to load and process each config file once. For config files with ~50 job keys and a 40s load time, this significantly reduces the overall time spent inspecting the jobs.
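The technique can be sketched with a cache keyed on the config file path; `load_config` below is an illustrative stand-in for `loader.load`, not Aurora's actual API:

```python
from functools import lru_cache

# Memoize on the file path so repeated lookups for different jobkeys
# in the same config file pay the parse cost only once.
@lru_cache(maxsize=None)
def load_config(path):
    with open(path) as f:
        return f.read()  # stand-in for the expensive config parse
```

With ~50 jobkeys in one file, the first call pays the ~40s parse and the remaining calls hit the cache.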

    0.22.0 
    opened by dknig1b 4
  • Gradle upgrade to 4.10.2

    Gradle upgrade to 4.10.2

    Description:

    • Upgrading to the latest version of Gradle 4.x.
    • Updating the SpotBugs plugin to make it compatible.
    • Renaming a webui Gradle task because it collided with a reserved function name.
    • Marking gradle/wrapper/gradle-wrapper.jar as a binary file; otherwise the line endings get changed and the JAR file breaks.

    Testing Done:

    • Built project
    opened by ridv 3
  • Add observer flag to disable resource metric collection

    Add observer flag to disable resource metric collection

    Description:

    Add observer command line option --disable_task_resource_collection to disable the collection of CPU, memory, and disk metrics for observed tasks. This is useful in setups where metrics cannot be gathered reliably (e.g. when using PID namespaces) or when collection is expensive due to hundreds of active tasks per host.

    Sometimes the hosts are also tightly packed with many small tasks (e.g. ~130 active tasks and ~1000 finished tasks). Even with very relaxed scrape settings of --task_process_collection_interval_secs=3000 and --task_disk_collection_interval_secs=3000 it can take between 150ms-2500ms to render the observer landing page /main. This patch reduces this to about 100ms-150ms.

    There is no immediate downside as metrics reporting is broken anyway due to the PID namespacing: We are running our Mesos agents with enabled PID namespaces (i.e. --isolation='namespaces/ipc,namespaces/pid,...'). In that mode, the PID of the same process is different within the container and outside of it. This breaks the assumption of Thermos that the executor can checkpoint a PID to disk that then can be used by the Observer to show live resource statistics for that PID.

    Testing Done:

    Running in production for over a year now.

    0.22.0 
    opened by StephanErb 3
  • Make schema_helpers.py python3 compatible

    Make schema_helpers.py python3 compatible

    Description:

    This fixes 2 errors in the schema_helpers.py file to make the code Python 3 compatible.
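    The PR doesn't enumerate the two errors; as a generic illustration of the kind of incompatibility such fixes address (this example is hypothetical, not taken from schema_helpers.py):

```python
# dict.iteritems() was removed in Python 3; dict.items() works in both
# Python 2 and 3 (returning a list in 2 and a view in 3).
def sorted_pairs(d):
    return sorted(d.items())

sorted_pairs({'b': 2, 'a': 1})  # → [('a', 1), ('b', 2)]
```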

    0.22.0 
    opened by philipp-sontag-by 3
  •  Fix sandbox permission errors with Mesos 1.6.0

    Fix sandbox permission errors with Mesos 1.6.0

    This change makes Aurora compatible with Mesos 1.6.0 which has changed the permission of sandbox folders (see https://issues.apache.org/jira/browse/MESOS-8332).

    Please have a look at the individual commits for further details and descriptions.

    0.22.0 
    opened by StephanErb 3
  • add mesos role feature

    add mesos role feature

    Problems

    We are the eBay platform team. Previously, we used Marathon to create Jenkins master instances in dedicated VMs and receive resource offers from those same dedicated VMs. For details, please refer to http://www.ebaytechblog.com/2014/04/04/delivering-ebays-ci-solution-with-apache-mesos-part-i/#.VNQUuC6_SPU

    Now we have found Aurora to be more stable and powerful, and we are moving from Marathon to Aurora. During the move, we found that Aurora currently has no Mesos role support, but we need Mesos roles to solve the problem described in the section "Frameworks stopped receiving offers after a while" of the URL above.

    Here is a snippet of the problem description:

    We noticed this occurred after we used Marathon to create the initial set of CI masters. As those CI masters started registering themselves as frameworks, Marathon stopped receiving any offers from Mesos; essentially, no new CI masters could be launched. Let’s start with Marathon. In the DRF model, it was unfair to treat Marathon in the same bucket/role alongside hundreds of connected Jenkins frameworks. After launching all these Jenkins frameworks, Marathon had a large resource share and Mesos would aggressively offer resources to frameworks that were using little or no resources. Marathon was placed last in priority and got starved out.

    We decided to define a dedicated Mesos role for Marathon and to have all of the Mesos slaves that were reserved for Jenkins master instances support that Mesos role. Jenkins frameworks were left with the default role “*”. This solved the problem – Mesos offered resources per role and hence Marathon never got starved out. A framework with a special role will get resource offers from both slaves supporting that special role and also from the default role “*”. However, since we were using placement constraints, Marathon accepted resource offers only from slaves that supported both the role and the placement constraints.

    Solution

    So we added a role feature to the source code to solve the problem in the same way: when accepting a resource offer, Aurora sends back the needed resources to Mesos with the Mesos role from the resource offer.

    How to configure the Mesos role: add the command-line option --mesos_role=${Mesos role name} when starting the Aurora scheduler.

    We changed the test cases according to the code change. Each changed test case is green.

    opened by zhanglong2015 3
  • Aurora 475 Remove Copyright Apache Software Foundation from the source files license header

    Aurora 475 Remove Copyright Apache Software Foundation from the source files license header

    https://reviews.apache.org/r/21882/

    Some of the files have "Copyright 2013 Apache Software Foundation" in the license header comment. This is not the right license header per ASF rules [1]. I believe the year is needed only for the NOTICE file. [1] http://www.apache.org/legal/src-headers.html

    opened by hsaputra 3
  • Upgrading Mesos dependency to 1.7.2

    Upgrading Mesos dependency to 1.7.2

    Description:

    Upgrading Aurora's dependencies on Mesos to 1.7.2

    Addresses #104

    Testing Done:

    Unit tests. End to end tests fail when creating unified containers; specifically, this same issue shows up: https://issues.apache.org/jira/browse/AURORA-1781

    Initial investigation showed that this is because the /etc/groups file doesn't exist in the taskfs directory of the sandbox, which causes the CHROOTed groupadd to fail. May need some help investigating this a bit deeper.

    Creating this as a draft while work is done to fix the end to end tests.

    opened by ridv 0
  • [Discussion] Port End to End tests away from bash

    [Discussion] Port End to End tests away from bash

    Working on trying to fix the end to end test has reminded me how painful the process has become for debugging end to end tests when something goes wrong.

    The current end to end tests stand at 1000+ lines of bash. Running individual tests involves modifying the script, and adding new tests is a very involved process.

    Ideally we should rewrite these tests in a language that's more apt for the job and easier to contribute to.

    I'll throw out some languages to get the discussion started in order of preference:

    • Python
    • Golang
    • Node
    • Ruby

    Theoretically, we could use thrift bindings directly instead of the aurora client, which would decrease coverage for the client itself, but the client already has its own suite of tests.

    I realize porting this code might be a big undertaking, but in my opinion it needs to be done sooner rather than later.

    Looking forward to hearing everyone's opinions.

    opened by ridv 5
  • Aurora pause abnormally in variable batch update strategy

    Aurora pause abnormally in variable batch update strategy

    When the variable batch size is set to [5,5] with auto-pause set to true, it should pause only twice, once for each batch. In practice, however, Aurora pauses while a batch update is still in progress, pausing 4 times in total.

    How to reproduce: variable batch size [5,5], auto-pause set to true, SLA set to 70%.

    opened by chungjin 2
  • Upgrading dependencies to mitigate vulnerabilities

    Upgrading dependencies to mitigate vulnerabilities

    Description:

    A bot recently reported a large number of vulnerabilities that we inherited from our dependencies.

    Creating a draft PR while I verify that these dependency upgrades do not have a negative impact.

    Components upgraded:

    • Curator
    • Zookeeper
    • Shiro
    • Netty
    • Asynchttpclient
    • Quartz
    • Gradle
    • Gradle plugins
    • Jackson
    • Guice
    • Guava
    • Multiple React components

    Testing Done:

    TODO

    We should run a few end to end test runs to confirm everything is good.

    After we merge this PR we need to create a PR for packaging which upgrades the gradle version there.

    opened by ridv 0