A reactive Java framework for building fault-tolerant distributed systems

Last update: Dec 29, 2022

Overview

Atomix



Please see the website for full documentation.

Atomix 3.0 is a fully featured framework for building fault-tolerant distributed systems. It provides a set of high-level primitives commonly needed for building scalable and fault-tolerant distributed systems. These primitives include:

Cluster management and failure detection
Direct and publish-subscribe messaging
Distributed coordination primitives built on a novel implementation of the Raft consensus protocol
Scalable data primitives built on a multi-primary protocol
Synchronous and asynchronous Java APIs
Standalone agent
REST API
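The list above mentions both synchronous and asynchronous Java APIs. One common way to offer both is to implement each primitive asynchronously and wrap it in a blocking facade that waits with a timeout. The stack trace in one of the issues below (BlockingDistributedLock.complete throwing a timeout) suggests Atomix follows a similar pattern, but the JDK-only sketch here is hypothetical: AsyncCounter, BlockingCounter and the exception mapping are illustrative names, not Atomix classes.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical asynchronous primitive: every operation returns a future. */
class AsyncCounter {
    private final AtomicLong value = new AtomicLong();

    CompletableFuture<Long> incrementAndGet() {
        // A real primitive would submit a command to the cluster; here it completes in-process.
        return CompletableFuture.supplyAsync(value::incrementAndGet);
    }
}

/** Hypothetical blocking facade that waits on the async primitive with a timeout. */
class BlockingCounter {
    private final AsyncCounter delegate;
    private final Duration timeout;

    BlockingCounter(AsyncCounter delegate, Duration timeout) {
        this.delegate = delegate;
        this.timeout = timeout;
    }

    long incrementAndGet() {
        return complete(delegate.incrementAndGet());
    }

    private <T> T complete(CompletableFuture<T> future) {
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Surface timeouts as an unchecked exception so callers of the sync API need no checked handling.
            throw new IllegalStateException("operation timed out after " + timeout, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```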

Acknowledgements

Atomix is developed as part of the ONOS project at the Open Networking Foundation (ONF). The Atomix project thanks ONF for its ongoing support!

YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.

Comments

failed to join the cluster on example

When running the GitHub example, we get this error:

java -jar examples/leader-election/target/atomix-leader-election.jar logs/server3 localhost:5002 localhost:5000 localhost:5001

23:41:37.336 [copycat-server-localhost/127.0.0.1:5002] INFO i.a.c.server.state.ServerContext - Server started successfully!
Exception in thread "main" java.util.concurrent.CompletionException: java.lang.IllegalStateException: failed to join the cluster
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:769)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1969)
        at io.atomix.copycat.server.state.ServerState.join(ServerState.java:597)
        at io.atomix.copycat.server.state.ServerState.lambda$join$57(ServerState.java:591)
        at io.atomix.copycat.server.state.ServerState$$Lambda$40/812535648.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1969)
        at io.atomix.catalyst.transport.NettyConnection.lambda$handleResponseFailure$5(NettyConnection.java:172)
        at io.atomix.catalyst.transport.NettyConnection$$Lambda$49/129084479.run(Unknown Source)
        at io.atomix.catalyst.util.concurrent.Runnables.lambda$logFailure$12(Runnables.java:20)
        at io.atomix.catalyst.util.concurrent.Runnables$$Lambda$11/399534175.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: failed to join the cluster

opened by hvandenb 28
Adding members after cluster initialization?
Let's say I start a single node that defines only itself as a member. If I then start a second node that lists the first as its member, will the first node accept it? This wasn't clear from the documentation.

Or do all members need to be configured with exactly the same set of members in order to start up?

Also, two more quick things.

The snapshot isn't deployed to Maven Central; please run mvn deploy to publish it.

I'm thinking about using this to replicate a SQLite database. Would the file event log be my best bet, i.e. writing the INSERTs and other statements to the file log? Or would I have to use the state machine resource instead?

Thanks.

Edit: I was finally able to install this locally by cloning the repository and running mvn clean install -DskipTests, since there were a lot of test failures.
opened by dessalines 24
Improve log compaction/snapshot timing and implement a hard limit on compaction
This PR significantly refactors how and when log compaction occurs in the Raft cluster. It implements three conditions for taking snapshots and compacting logs:

When the load on a Raft node is low, it takes snapshots and compacts logs whenever possible, at a somewhat leisurely pace

When the load on a Raft node is high but the node is running out of disk space, it takes snapshots and compacts logs ASAP, hopefully before hitting the hard limit

When a Raft node doesn't have enough disk space to allocate any more full segments, it stops writes altogether and blocks until there's enough disk space
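The three conditions above can be read as a small decision function. In the hypothetical sketch below, the CompactionMode enum, the load threshold and the "running out of space" horizon are assumptions made for illustration; the PR does not name them, and it does not say what happens under high load with plenty of disk, so the sketch assumes compaction is simply deferred.

```java
/** Hypothetical compaction modes corresponding to the three conditions in the PR (plus "defer"). */
enum CompactionMode { LEISURELY, URGENT, BLOCK_WRITES, NONE }

/** Hypothetical policy; thresholds are illustrative assumptions, not values from the PR. */
class CompactionPolicy {
    private final double highLoadThreshold;   // load at or above this counts as "high"
    private final long lowSpaceHorizonMillis; // disk projected to fill sooner than this counts as "running out"

    CompactionPolicy(double highLoadThreshold, long lowSpaceHorizonMillis) {
        this.highLoadThreshold = highLoadThreshold;
        this.lowSpaceHorizonMillis = lowSpaceHorizonMillis;
    }

    CompactionMode decide(double load, long usableBytes, long maxSegmentBytes, long millisUntilDiskFull) {
        if (usableBytes < maxSegmentBytes) {
            // Not enough space for another full segment: stop writes until compaction frees space.
            return CompactionMode.BLOCK_WRITES;
        }
        if (load < highLoadThreshold) {
            // Low load: compact whenever possible, at a leisurely pace.
            return CompactionMode.LEISURELY;
        }
        if (millisUntilDiskFull < lowSpaceHorizonMillis) {
            // High load but the disk is filling up: compact as soon as possible.
            return CompactionMode.URGENT;
        }
        // High load with plenty of disk (assumed behaviour): defer compaction.
        return CompactionMode.NONE;
    }
}
```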

To determine whether a node is running out of disk space, disk usage is sampled periodically to estimate the rate at which disk space is being consumed. Monitoring the total available disk space, rather than the rate at which the log is growing, ensures that other nodes and processes on the same machine that also use the disk are taken into account.
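A minimal sketch of estimating the consumption rate from periodic samples of free space. In practice each sample could come from java.io.File.getUsableSpace() on the data directory, which reflects usage by every process on that disk. The class name, the sample window and the simple first-to-last linear rate are assumptions for illustration, not the PR's implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical estimator of how fast free disk space is being consumed. */
class DiskUsageEstimator {
    private static final class Sample {
        final long timeMillis;
        final long freeBytes;

        Sample(long timeMillis, long freeBytes) {
            this.timeMillis = timeMillis;
            this.freeBytes = freeBytes;
        }
    }

    private final int windowSize;
    private final Deque<Sample> samples = new ArrayDeque<>();

    DiskUsageEstimator(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Records a sample, e.g. dataDir.getUsableSpace() taken on a timer. */
    void sample(long timeMillis, long freeBytes) {
        samples.addLast(new Sample(timeMillis, freeBytes));
        if (samples.size() > windowSize) {
            samples.removeFirst();
        }
    }

    /** Bytes consumed per millisecond across the window (0 if free space is stable or growing). */
    double consumptionRate() {
        if (samples.size() < 2) {
            return 0.0;
        }
        Sample first = samples.peekFirst();
        Sample last = samples.peekLast();
        long elapsed = last.timeMillis - first.timeMillis;
        if (elapsed <= 0) {
            return 0.0;
        }
        return Math.max(0.0, (double) (first.freeBytes - last.freeBytes) / elapsed);
    }

    /** Projected milliseconds until the disk is full, or Long.MAX_VALUE if it is not shrinking. */
    long millisUntilFull() {
        double rate = consumptionRate();
        if (rate == 0.0) {
            return Long.MAX_VALUE;
        }
        return (long) (samples.peekLast().freeBytes / rate);
    }
}
```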

To block writes to the cluster when the disk fills up, the JournalWriter will throw an exception when the journal needs to roll over to a new segment but the segment cannot be allocated according to the configured maximum segment size. When this condition occurs, writes to the leader synchronously await log compaction, and AppendRequests to replicate entries to followers are rejected.
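The roll-over check described above can be sketched as follows, assuming segments are preallocated at their maximum size. JournalFullException, SegmentedJournal and the in-memory byte accounting are hypothetical stand-ins for illustration, not Atomix's JournalWriter; a leader would catch the exception and wait for compaction before retrying.

```java
/** Hypothetical exception signalling that a new segment cannot be allocated. */
class JournalFullException extends RuntimeException {
    JournalFullException(String message) {
        super(message);
    }
}

/** Hypothetical segmented journal that refuses to roll over when a full segment won't fit. */
class SegmentedJournal {
    private final long maxSegmentBytes;
    private long usableBytes;         // free disk space; stands in for File.getUsableSpace()
    private long currentSegmentBytes; // bytes already written to the current segment
    private int segmentCount;

    SegmentedJournal(long maxSegmentBytes, long usableBytes) {
        this.maxSegmentBytes = maxSegmentBytes;
        this.usableBytes = usableBytes;
        allocateSegment(); // the first segment must fit as well
    }

    void append(int entryBytes) {
        if (entryBytes > maxSegmentBytes) {
            throw new IllegalArgumentException("entry larger than a segment");
        }
        if (currentSegmentBytes + entryBytes > maxSegmentBytes) {
            // Roll over; throws if a full segment cannot be allocated, so the caller can await compaction.
            allocateSegment();
        }
        currentSegmentBytes += entryBytes;
    }

    /** Called after compaction deletes old segments and frees disk space. */
    void release(long freedBytes) {
        usableBytes += freedBytes;
    }

    int segmentCount() {
        return segmentCount;
    }

    private void allocateSegment() {
        if (usableBytes < maxSegmentBytes) {
            throw new JournalFullException("cannot allocate a " + maxSegmentBytes
                + "-byte segment; only " + usableBytes + " bytes usable");
        }
        usableBytes -= maxSegmentBytes; // preallocated at maximum size (assumption)
        currentSegmentBytes = 0;
        segmentCount++;
    }
}
```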
opened by kuujo 19
Accept raft role listeners at raft partition server

This feature allows the user to add Raft role change listeners to the RaftPartitionGroup.

It would be nice to have this. We have a use case where we need to know who the leader is and listen for role changes (whether we become the leader or step down). We want to start processing on the leader node and stop processing on followers.
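The requested behaviour (start work when a node becomes leader, stop it when it steps down) could be wired up roughly as below. Role, RoleListener, RoleChangeNotifier and LeaderOnlyProcessor are hypothetical names for illustration; the listener API this PR adds to RaftPartitionGroup may look different.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical Raft roles, reduced to what the listener needs. */
enum Role { FOLLOWER, CANDIDATE, LEADER }

/** Hypothetical role-change listener. */
interface RoleListener {
    void onRoleChange(Role newRole);
}

/** Hypothetical source of role changes, standing in for a Raft partition server. */
class RoleChangeNotifier {
    private final List<RoleListener> listeners = new CopyOnWriteArrayList<>();
    private Role role = Role.FOLLOWER;

    void addRoleListener(RoleListener listener) {
        listeners.add(listener);
    }

    void transition(Role newRole) {
        if (newRole == role) {
            return; // only notify on actual changes
        }
        role = newRole;
        for (RoleListener listener : listeners) {
            listener.onRoleChange(newRole);
        }
    }
}

/** Starts processing on the leader and stops it on every other role. */
class LeaderOnlyProcessor implements RoleListener {
    private volatile boolean running;

    @Override
    public void onRoleChange(Role newRole) {
        if (newRole == Role.LEADER && !running) {
            running = true;  // start consuming work here
        } else if (newRole != Role.LEADER && running) {
            running = false; // stop consuming work here
        }
    }

    boolean isRunning() {
        return running;
    }
}
```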

opened by Zelldon 15
ReferencePool serialization exception

Hi, my Copycat version is 0.6.SNAPSHOT. I'm trying to start two servers on the same machine, from two different directories and on two different ports, using these commands:

server1:
c:\JDK64\1.8.0.45\bin\java -jar copycat-server.jar 1:my_local_host:8080 2:my_local_host:8090

server2:
c:\JDK64\1.8.0.45\bin\java -jar copycat-server.jar 1:my_local_host:8090 2:my_local_host:8080

I got this exception:

15/08/31 09:23:24 ERROR concurrent.SingleThreadContext: An uncaught exception occurred
net.kuujo.copycat.io.serializer.SerializationException: failed to serialize Java object
        at net.kuujo.copycat.io.serializer.Serializer.writeSerializable(Serializer.java:625)
        at net.kuujo.copycat.io.serializer.Serializer.writeObject(Serializer.java:549)
        at net.kuujo.copycat.io.transport.NettyConnection.writeRequest(NettyConnection.java:214)
        at net.kuujo.copycat.io.transport.NettyConnection.lambda$send$6(NettyConnection.java:288)
        at net.kuujo.copycat.io.transport.NettyConnection$$Lambda$42/1838865044.run(Unknown Source)
        at net.kuujo.copycat.util.concurrent.SingleThreadContext$1.lambda$execute$9(SingleThreadContext.java:28)
        at net.kuujo.copycat.util.concurrent.SingleThreadContext$1$$Lambda$6/1018547642.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: net.kuujo.copycat.util.ReferencePool
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at net.kuujo.copycat.io.serializer.Serializer.writeSerializable(Serializer.java:620)
        ... 13 more
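The root cause in the trace is standard Java serialization behaviour: ObjectOutputStream throws NotSerializableException when any object reachable from the one being written does not implement Serializable. The JDK-only sketch below reproduces that with a hypothetical Pool class standing in for ReferencePool, and shows the usual remedy of marking such a field transient (or keeping it out of the serialized object entirely). It illustrates the mechanism only, not the fix Copycat adopted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a non-serializable helper such as a reference pool. */
class Pool { }

/** A message that accidentally carries a non-serializable field. */
class BrokenRequest implements Serializable {
    private static final long serialVersionUID = 1L;
    final String payload = "hello";
    final Pool pool = new Pool(); // serialization walks into this field and fails
}

/** The same message with the helper excluded from serialization. */
class FixedRequest implements Serializable {
    private static final long serialVersionUID = 1L;
    final String payload = "hello";
    final transient Pool pool = new Pool(); // skipped by ObjectOutputStream
}

class SerializationCheck {
    /** Returns null on success, or the IOException thrown while writing the object. */
    static IOException trySerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return null;
        } catch (IOException e) {
            return e;
        }
    }
}
```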

opened by mikkazan 15
ResourceFactories and Segmented ClassLoaders

With the recent changes that introduce ResourceFactories, I'm having trouble creating a custom resource. The custom resource and its classes are defined in bundle A, and the Atomix classes are in bundle B. They don't share class loaders.

As a result, the deserialization code in https://github.com/atomix/atomix/blob/master/resource/src/main/java/io/atomix/resource/ResourceType.java#L83 fails with a ClassNotFoundException.

@kuujo: Is it possible to make this compatible with an OSGi environment where there are multiple class loaders?
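A general JDK technique for multi-class-loader environments such as OSGi is to let the caller supply the ClassLoader used during deserialization by overriding ObjectInputStream.resolveClass. The sketch below shows only that technique; it is not how Atomix's ResourceType deserialization works, and ClassLoaderObjectInputStream and ClassLoaderRoundTrip are hypothetical names. In an OSGi setting the loader would typically be the custom resource bundle's class loader, for example obtained via getClassLoader() on one of its classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;

/** ObjectInputStream that resolves classes through an explicitly supplied ClassLoader. */
class ClassLoaderObjectInputStream extends ObjectInputStream {
    private final ClassLoader loader;

    ClassLoaderObjectInputStream(InputStream in, ClassLoader loader) throws IOException {
        super(in);
        this.loader = loader;
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc) throws IOException, ClassNotFoundException {
        try {
            // Look the class up in the supplied loader (e.g. bundle A's) first.
            return Class.forName(desc.getName(), false, loader);
        } catch (ClassNotFoundException e) {
            // Fall back to the default resolution (handles primitive types, among others).
            return super.resolveClass(desc);
        }
    }
}

/** Hypothetical helper: serialize, then deserialize through the supplied loader. */
class ClassLoaderRoundTrip {
    static Object roundTrip(Serializable value, ClassLoader loader) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ClassLoaderObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()), loader)) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```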
enhancement

opened by madjam 13
Fix node catch up after log compaction

Hey,

I hope this PR conforms to your contribution guidelines; unfortunately, I couldn't find them.

This PR fixes the problem described in #978 (it may be more of a quick fix; you might want to look into why this segment is closed twice). I was able to reproduce the problem with a unit test in RaftTest. After fixing the issue, I would like to assert in the test that all nodes have the same state and log, but I don't know how to test that. Then again, it may be enough that the join completes without any problems.
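On asserting that all nodes end up with the same log: one hedged way to write such a check is to compare entries only over the index range every node still retains, because compaction may have removed different prefixes on different nodes. NodeLog and LogConsistency below are hypothetical test utilities with entries reduced to strings; they are not part of RaftTest.

```java
import java.util.List;

/** Hypothetical view of one node's log after compaction: entries start at firstIndex. */
class NodeLog {
    final long firstIndex;      // index of entries.get(0); earlier entries were compacted away
    final List<String> entries; // serialized entries in index order

    NodeLog(long firstIndex, List<String> entries) {
        this.firstIndex = firstIndex;
        this.entries = entries;
    }

    long lastIndex() {
        return firstIndex + entries.size() - 1;
    }

    String get(long index) {
        return entries.get((int) (index - firstIndex));
    }
}

class LogConsistency {
    /** True if all logs end at the same index and agree on every index all of them still hold. */
    static boolean consistent(List<NodeLog> logs) {
        boolean sameLastIndex = logs.stream().mapToLong(NodeLog::lastIndex).distinct().count() <= 1;
        if (!sameLastIndex) {
            return false; // some node has not caught up yet
        }
        long from = logs.stream().mapToLong(l -> l.firstIndex).max().orElse(0);
        long to = logs.stream().mapToLong(NodeLog::lastIndex).min().orElse(-1);
        for (long i = from; i <= to; i++) {
            String expected = logs.get(0).get(i);
            for (NodeLog log : logs) {
                if (!log.get(i).equals(expected)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```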

closes #978

opened by Zelldon 12
Add member location provider abstraction for pluggable discovery of cluster members

This PR is an attempt at a solution for #659. The implementation adds a MemberLocationProvider abstraction that can be configured on the Atomix instance. The provider is a simple ListenerService that triggers JOIN/LEAVE events containing the Address of the member. The ClusterMembershipService then connects to that Address to exchange higher-level Member information.

The default implementation, of course, is a Netty multicast-based location provider, which is enabled via withMulticastEnabled.
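A rough sketch of the abstraction described above: a provider publishes JOIN/LEAVE events carrying a member's address, and a membership service listens to them. Every name here (LocationEvent, LocationProvider, StaticLocationProvider, MembershipTracker) is hypothetical and addresses are plain host:port strings; the PR's actual MemberLocationProvider and ListenerService signatures may differ.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical location event: a member address joined or left. */
final class LocationEvent {
    enum Type { JOIN, LEAVE }

    final Type type;
    final String address; // e.g. "10.0.0.5:5679"

    LocationEvent(Type type, String address) {
        this.type = type;
        this.address = address;
    }
}

/** Hypothetical pluggable discovery mechanism. */
interface LocationProvider {
    void addListener(Consumer<LocationEvent> listener);
}

/** Hypothetical provider driven by explicit calls, e.g. from static config or a test. */
class StaticLocationProvider implements LocationProvider {
    private final List<Consumer<LocationEvent>> listeners = new CopyOnWriteArrayList<>();

    @Override
    public void addListener(Consumer<LocationEvent> listener) {
        listeners.add(listener);
    }

    void join(String address) { post(new LocationEvent(LocationEvent.Type.JOIN, address)); }

    void leave(String address) { post(new LocationEvent(LocationEvent.Type.LEAVE, address)); }

    private void post(LocationEvent event) {
        listeners.forEach(l -> l.accept(event));
    }
}

/** Hypothetical membership service tracking the addresses it should connect to. */
class MembershipTracker {
    final Set<String> addresses = ConcurrentHashMap.newKeySet();

    MembershipTracker(LocationProvider provider) {
        provider.addListener(event -> {
            // A real service would connect here and exchange Member information.
            if (event.type == LocationEvent.Type.JOIN) {
                addresses.add(event.address);
            } else {
                addresses.remove(event.address);
            }
        });
    }
}
```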

opened by kuujo 12

BlockingDistributedLock throws 100% of the time in some scenarios

On my laptop, I can reproduce the error below with the following steps (I’m using the latest 2.1.0-SNAPSHOT version):

Start Atomix with 4 PERSISTENT nodes as follows:

        // members holds all 4 PERSISTENT members; local and dataDir are defined elsewhere
        final Atomix atomix = Atomix.builder()
                .withManagementGroup(RaftPartitionGroup.builder("system")
                        .withNumPartitions(1)
                        .withMembers(members.stream().toArray(Member[]::new))
                        .withDataDirectory(dataDir)
                        .build())
                .withPartitionGroups(RaftPartitionGroup.builder("raft")
                        .withNumPartitions(1)
                        .withMembers(members.stream().toArray(Member[]::new))
                        .withDataDirectory(dataDir)
                        .build())
                .withMembers(members.stream().toArray(Member[]::new))
                .withLocalMember(local)
                .withClusterName("Cluster name")
                .build();

Take 1 node down.
Take 1 more node down (now 2 of the 4 nodes are up).
Start one of the previously stopped nodes.
Any node attempting to take a lock gets the following exception:

io.atomix.primitive.PrimitiveException$Timeout: null
        at io.atomix.core.lock.impl.BlockingDistributedLock.complete(BlockingDistributedLock.java:77)
        at io.atomix.core.lock.impl.BlockingDistributedLock.tryLock(BlockingDistributedLock.java:57)
        at com.example.runOneIteration(xScheduler.java:55)
        at com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:193)
        at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
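For reference, Raft can only commit operations while a majority of the group's members is available. Assuming the Raft partition group spans all 4 members from the snippet above, the arithmetic is: with 2 of 4 nodes up there is no quorum, so timeouts are expected; once the third node is back, a majority exists again, so continued timeouts at that point are unexpected. Quorum is a hypothetical helper written for this illustration.

```java
/** Majority-quorum arithmetic for a Raft group (hypothetical helper). */
class Quorum {
    /** Smallest number of members that forms a majority of clusterSize. */
    static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    static boolean hasQuorum(int availableMembers, int clusterSize) {
        return availableMembers >= majority(clusterSize);
    }
}
```

Note that a 4-member group tolerates only one failure, the same as a 3-member group, because its majority is 3.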

opened by rroller 12

Use phi accrual failure detectors for Raft elections and session timeouts

This PR refactors how leadership elections and session expirations are handled in Raft.

It adds a phi accrual failure detector that is used to determine when to start a new election. To avoid multiple servers starting an election at the same time, randomized timers are used to check the current phi value. The Raft election timeout is used as a fallback, so an election is started no later than that timeout.
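A minimal phi accrual failure detector is sketched below. It models heartbeat inter-arrival times as exponentially distributed, which gives phi = elapsed / (mean * ln 10); the PR does not state which distribution it uses, and the class name, window size, threshold and fallback method are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a phi accrual failure detector (exponential inter-arrival model). */
class PhiAccrualDetector {
    private final int windowSize;
    private final Deque<Long> intervals = new ArrayDeque<>();
    private long sum;
    private long lastHeartbeatMillis = -1;

    PhiAccrualDetector(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Records a heartbeat from the leader received at nowMillis. */
    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            long interval = nowMillis - lastHeartbeatMillis;
            intervals.addLast(interval);
            sum += interval;
            if (intervals.size() > windowSize) {
                sum -= intervals.removeFirst();
            }
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** phi = -log10(P(still no heartbeat after elapsed)), with P = exp(-elapsed / mean). */
    double phi(long nowMillis) {
        if (intervals.isEmpty()) {
            return 0.0; // not enough history yet
        }
        double mean = Math.max(1.0, (double) sum / intervals.size());
        double elapsed = nowMillis - lastHeartbeatMillis;
        return elapsed / (mean * Math.log(10.0));
    }

    /**
     * Meant to be polled from a randomized timer: start an election once phi crosses the
     * threshold, or unconditionally once the election timeout has elapsed (the fallback).
     */
    boolean shouldStartElection(long nowMillis, double threshold, long electionTimeoutMillis) {
        return phi(nowMillis) >= threshold || nowMillis - lastHeartbeatMillis >= electionTimeoutMillis;
    }
}
```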

Sessions are also expired using phi accrual failure detectors. This is done by the leader sending heartbeats to clients. New sessions are opened with a minimum timeout, and the leader sends heartbeats to the clients at the rate of the minimum session timeout. Sending heartbeats from the leader to clients also ensures clients resolve new leaders as soon as possible. To account for the period during which the old leader's crash was being detected and a new leader was being elected, nodes track the last heartbeat time and subtract that time from session timeouts. This means sessions can be expired via the failure detector immediately after a leader change if the client can't be reached by the leader.

opened by kuujo 11
WIP: Distributed group recovery

Let's address some of the Atomix client's consistency guarantees, taking DistributedGroup as an example. I create a new group and join it. Then the client's session expires for some reason. The default recoveryStrategy for the Atomix client is RECOVERY (it cannot be overridden right now).

When the client successfully recovers, it gets a brand-new session ID. The problem is that it is no longer a member of the distributed group.

Another problem is that the members of the distributed group are cached. The group calls sync() when the resource opens, but afterwards updates its state only from onJoin and onLeave events. So with a new session ID, it doesn't receive those events on reconnect, and it doesn't do any "resync" either.

I think the Atomix client should be able to handle this. Unfortunately, I'm not sure how to address it properly, so for now I'm adding a test case that reproduces the issue.
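One possible shape for the recovery discussed here: when the client recovers with a new session, it re-joins the group under that session and replaces its cached member list with a fresh read, instead of relying on join/leave events it missed while disconnected. GroupServer and RecoveringGroupClient are hypothetical in-memory stand-ins written for this illustration, not Atomix APIs.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical server-side group state: member id -> owning session id. */
class GroupServer {
    private final Map<String, Long> members = new ConcurrentHashMap<>();

    void join(String memberId, long sessionId) {
        members.put(memberId, sessionId);
    }

    /** Members owned by an expired session are removed. */
    void expireSession(long sessionId) {
        members.values().removeIf(s -> s == sessionId);
    }

    Set<String> members() {
        return new HashSet<>(members.keySet());
    }
}

/** Hypothetical client that re-joins and resyncs its cache after recovering. */
class RecoveringGroupClient {
    private final GroupServer server;
    private final String memberId;
    private Set<String> cachedMembers;
    private long sessionId;

    RecoveringGroupClient(GroupServer server, String memberId, long sessionId) {
        this.server = server;
        this.memberId = memberId;
        this.sessionId = sessionId;
        server.join(memberId, sessionId);
        cachedMembers = server.members(); // initial sync() when the resource opens
    }

    /** Invoked after the old session expired and a new one was registered. */
    void onRecovered(long newSessionId) {
        sessionId = newSessionId;
        server.join(memberId, newSessionId); // membership was tied to the old session
        cachedMembers = server.members();    // resync: events were missed while disconnected
    }

    Set<String> cachedMembers() {
        return cachedMembers;
    }

    long sessionId() {
        return sessionId;
    }
}
```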

What do you guys think?

Thanks, D.

opened by dmvk 11
raft.getCluster().getMember(*).memberId() could result in an NPE
Expected behavior

At line 178 of io.atomix.protocols.raft.roles.ActiveRole, the return value of raft.getCluster().getMember(raft.getLastVotedFor()) should be guarded by a null check to avoid a NullPointerException:

    protected VoteResponse handleVote(VoteRequest request) {
      //...
      else {
        log.debug("Rejected {}: already voted for {}", request, raft.getCluster().getMember(raft.getLastVotedFor()).memberId()); // NPE risk
        //...
      }
    }

Actual behavior & Steps to reproduce

raft.getCluster().getMember(raft.getLastVotedFor()) can return null when the last vote has no corresponding DefaultRaftMember, yet the result is dereferenced unconditionally.

Minimal yet complete reproducer code (or URL to code)

    protected VoteResponse handleVote(VoteRequest request) {
      //...
      else {
        // member is null when the last vote has no corresponding DefaultRaftMember
        DefaultRaftMember member = raft.getCluster().getMember(raft.getLastVotedFor());
        log.debug("Rejected {}: already voted for {}", request, member == null ? null : member.memberId());
        //...
      }
    }
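The same null-safe lookup can also be written with java.util.Optional. The sketch below uses a hypothetical Member class and a plain map keyed by string ids in place of the Raft cluster context, so it illustrates the idiom only, not the exact ActiveRole code.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for DefaultRaftMember. */
class Member {
    private final String memberId;

    Member(String memberId) {
        this.memberId = memberId;
    }

    String memberId() {
        return memberId;
    }
}

class VoteLogging {
    /** Returns the voted-for member's id, or null if there is no vote or no such member. */
    static String votedForId(Map<String, Member> cluster, String lastVotedFor) {
        return Optional.ofNullable(lastVotedFor)
            .map(cluster::get)
            .map(Member::memberId)
            .orElse(null);
    }
}
```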

Environment

Atomix: default master


⚠️ Please verify that your issue still occurs on the latest version of Atomix before reporting.
The documentation is currently a work in progress and is not yet complete.

Have you searched the CLOSED issues already? How about checking in with the Atomix Google Group?
opened by zhaoyangyingmu 0

Leak on server shutdown while still awaiting other nodes to join

Expected behavior

Restarting Atomix configured to use N Raft nodes, while it is not yet connected to any of them, should not leak the Atomix instance.

Actual behavior

DefaultRaftServer::shutdown() does not close its RaftContext or clean up the RaftContext::threadContext tasks.
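The leak pattern described here is a common one: tasks still scheduled on a context's executor keep references to the server, and transitively to the Atomix instance, reachable after shutdown. A hedged sketch of a shutdown that also closes the context is below; ServerContext and ServerSketch are hypothetical stand-ins for RaftContext and DefaultRaftServer built on a plain ScheduledExecutorService, not the fix from the PR.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a Raft context that owns a single-threaded scheduler. */
class ServerContext implements AutoCloseable {
    final ScheduledExecutorService threadContext = Executors.newSingleThreadScheduledExecutor();

    void schedule(Runnable task, long periodMillis) {
        threadContext.scheduleAtFixedRate(task, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        // Drop pending tasks so they no longer keep the server reachable.
        threadContext.shutdownNow();
    }
}

/** Hypothetical server whose shutdown also closes its context. */
class ServerSketch {
    final ServerContext context = new ServerContext();

    void start() {
        // Periodic join retry while waiting for the other nodes; the method reference captures 'this'.
        context.schedule(this::retryJoin, 50);
    }

    private void retryJoin() {
        // Would send a join request to the other configured members.
    }

    /** Closes the context and waits for its thread to finish; returns true once terminated. */
    boolean shutdown() {
        context.close();
        try {
            return context.threadContext.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```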

Steps to reproduce

Create a single Raft node (configured to expect N nodes in total), start it, and stop it right afterwards, waiting for the stop to complete. After running a full GC, the Atomix instance is still leaking (see the screenshots below).

Minimal yet complete reproducer code (or URL to code)

   public static Atomix createAtomix(String localMemberId,
                                     File dataDirectory,
                                     Map<String, Address> nodes) {
      final Address localAddress = nodes.get(localMemberId);
      if (localAddress == null) {
         throw new IllegalArgumentException("the local member id should be included in the node map");
      }
      final AtomixBuilder atomixBuilder = Atomix.builder().withMemberId(localMemberId).withAddress(localAddress);
      atomixBuilder
         .withMembershipProvider(BootstrapDiscoveryProvider.builder()
                                    .withNodes(
                                       nodes.entrySet().stream()
                                          .map(entry-> Node.builder()
                                             .withId(entry.getKey())
                                             .withAddress(entry.getValue())
                                             .build())
                                          .collect(Collectors.toList())).build());
      // Profile.consensus(members) is a shortcut for this, but it leaves no configuration choices
      atomixBuilder
         .withManagementGroup(
            RaftPartitionGroup.builder("system")
               .withNumPartitions(1)
               .withMembers(nodes.keySet())
               .withStorageLevel(StorageLevel.DISK)
               .withDataDirectory(new File(dataDirectory, "management"))
               .build())
         .withPartitionGroups(
            RaftPartitionGroup.builder("data")
               .withNumPartitions(1)
               .withMembers(nodes.keySet())
               .withStorageLevel(StorageLevel.DISK)
               .withDataDirectory(new File(dataDirectory, "data"))
               .build());
      return atomixBuilder.build();
   }

   @Test
   public void atomixLeak() {
      File f = new File("./atomix");
      f.deleteOnExit();
      final String localId = "a";
      final Address localAddress = Address.from("localhost:7070");
      final Map<String, Address> nodes = new HashMap<>(3);
      nodes.put(localId, localAddress);
      nodes.put("b", Address.from("localhost:7071"));
      nodes.put("c", Address.from("localhost:7072"));
      Atomix atomix = createAtomix("a", f, nodes);
      try {
         // wait a bit in order to get the RaftServer::start called
         atomix.start().get(2, TimeUnit.SECONDS);
         Assert.fail();
      } catch (TimeoutException te) {
         try {
            atomix.stop().join();
         } catch (Throwable t) {
            Assert.fail();
         }
      } catch (Throwable t) {
         Assert.fail();
      }
   }

It's important to take a heap snapshot after atomix.stop().join() has completed. I'm working out how to implement a minimal reproducer using just RaftServer for the PR I've sent to fix this.

Atomix: 3.2.0-SNAPSHOT

opened by franz1981 1

Nodes are constantly joining and leaving with DnsDiscoveryProvider

I'm new to Atomix. I tried to run embedded Atomix inside my services on Kubernetes. I configured DnsDiscoveryProvider, and I see in the logs that nodes constantly join the cluster and then leave after a while. I have a question about these lines:

https://github.com/atomix/atomix/blob/842276c02541267a61bca47f82c2e9d740245fcb/cluster/src/main/java/io/atomix/cluster/discovery/DnsDiscoveryProvider.java#L142

and

https://github.com/atomix/atomix/blob/842276c02541267a61bca47f82c2e9d740245fcb/cluster/src/main/java/io/atomix/cluster/discovery/DnsDiscoveryProvider.java#L150

newNodeIds contains only the newly discovered nodes. In the second line (inside the enclosing loop), every previously discovered node is checked against newNodeIds, and if it is not there, an event is emitted saying that the node has left the cluster. But the check cannot use newNodeIds, because it contains only the nodes discovered in this round, not the nodes that were already discovered earlier. As a result, the loop removes every previously discovered node.
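For comparison, a common way to turn periodic DNS lookups into join/leave events is to diff the full set of currently resolved nodes against the set known from the previous lookup, rather than against only the newly discovered ones. The sketch below shows that diff with plain strings as node ids; DiscoveryDiff is a hypothetical helper, not a patch to DnsDiscoveryProvider.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical join/leave computation for periodic DNS-based discovery. */
class DiscoveryDiff {
    private final Set<String> known = new HashSet<>();

    final Set<String> joined = new HashSet<>();
    final Set<String> left = new HashSet<>();

    /** @param resolved every node id returned by the latest DNS lookup, not just the new ones */
    void update(Set<String> resolved) {
        joined.clear();
        left.clear();
        for (String id : resolved) {
            if (!known.contains(id)) {
                joined.add(id); // newly discovered
            }
        }
        for (String id : known) {
            if (!resolved.contains(id)) {
                left.add(id); // compared against ALL resolved ids
            }
        }
        known.clear();
        known.addAll(resolved);
    }
}
```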

opened by kmrozek-shareablee 0
Error is not printed when a Channel cannot connect to an address
Expected behavior

In the getChannel() method of the ChannelPool.java class, if a channel's future completes with an error, that error is not printed in the log.

Line 134: LOGGER.debug("Failed to connect to {}", channel.remoteAddress(), e);
Line 104: LOGGER.debug("Failed to connect to {}", address, error);

Actual behavior

These lines should have an additional {} to print the error:
Line 134: LOGGER.debug("Failed to connect to {} due to error {}", channel.remoteAddress(), e);
Line 104: LOGGER.debug("Failed to connect to {} due to error {}", address, error);

opened by ashishmittal19 0
Is the repository open to take PRs for open issues?

Hi, I am fascinated by the Atomix project. I see a number of open issues (53 at the time of writing). I just wanted to know whether the repository is open to PRs for any of these issues. If so, is there a contributing guide?

opened by the123saurav 0

ONOS requests to Atomix time out

Expected behavior

My Atomix and ONOS run on the same device. I compiled onos-2.2.4 with atomix-3.1.8 (OpenJDK 11). When I tested with 2000 devices, ONOS queried some stats messages from Atomix and I got many timeouts, which eventually caused ONOS to run out of memory (OOM).

Actual behavior

When testing, I get this exception:

2020-10-27T19:47:46,355 | DEBUG | raft-partition-group-raft-6 | RaftSessionConnection            | 129 - io.atomix.utils - 3.1.8 | SessionClient{29}{type=AtomicCounterType{name=atomic-counter}, name=sys-clock-counter} - CommandRequest{session=29, sequence=1106859, operation=PrimitiveOperation{id=DefaultOperationId{id=incrementAndGet, type=COMMAND}, value=null}} failed! Reason: {}
java.util.concurrent.TimeoutException: Request type raft-partition-1-command timed out in 5000 milliseconds
        at io.atomix.cluster.messaging.impl.AbstractClientConnection$Callback.timeout(AbstractClientConnection.java:159) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
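The exception comes from a per-request timeout: the client registers a callback for each outgoing request and fails it if no reply arrives within the deadline (5000 ms here). A simplified, hypothetical version of that mechanism is sketched below using JDK classes only; PendingRequests, send and reply are illustrative names, and this is not the AbstractClientConnection code.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical correlation of requests and replies with a per-request timeout. */
class PendingRequests implements AutoCloseable {
    private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    /** Registers a request and schedules its timeout; the returned future completes on reply or timeout. */
    CompletableFuture<String> send(long requestId, String type, long timeoutMillis) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(requestId, future);
        timer.schedule(() -> {
            CompletableFuture<String> f = pending.remove(requestId);
            if (f != null) { // no reply yet: fail the request
                f.completeExceptionally(new TimeoutException(
                    "Request type " + type + " timed out in " + timeoutMillis + " milliseconds"));
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        return future;
    }

    /** Completes the request if its reply arrives before the timeout fires; late replies are dropped. */
    void reply(long requestId, String payload) {
        CompletableFuture<String> f = pending.remove(requestId);
        if (f != null) {
            f.complete(payload);
        }
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

Under this model, any reply that arrives after the deadline is simply discarded, so a slow leader or a congested connection shows up as timeouts even if the operation eventually succeeds.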

I added a log statement to try to find the problem, like this:

final class RemoteServerConnection extends AbstractServerConnection {
  private static final byte[] EMPTY_PAYLOAD = new byte[0];
  private final Logger log = LoggerFactory.getLogger(getClass());

  private final Channel channel;

  RemoteServerConnection(HandlerRegistry handlers, Channel channel) {
    super(handlers);
    this.channel = channel;
  }

  @Override
  public void reply(ProtocolRequest message, ProtocolReply.Status status, Optional<byte[]> payload) {
    ProtocolReply response = new ProtocolReply(
        message.id(),
        payload.orElse(EMPTY_PAYLOAD),
        status);
    log.info("RemoteServerConnection reply, message subject {} message type {} message id {} message status {}", message.subject(), message.type(), message.id(), status.name());
    channel.writeAndFlush(response, channel.voidPromise());
  }
}

Then I found something strange: if this log statement is at INFO level, there are fewer timeouts, but if it is at another level or not added at all, there are more timeouts. So, I have two questions: