Efficient reliable UDP unicast, UDP multicast, and IPC message transport

Last update: Jan 9, 2023

Overview

Aeron

Efficient reliable UDP unicast, UDP multicast, and IPC message transport. Java and C++ clients are available in this repository, and a .NET client is available from a 3rd party. All three clients can exchange messages across machines, or on the same machine via IPC, very efficiently. Message streams can be recorded by the Archive module to persistent storage for later, or real-time, replay. Aeron Cluster provides support for fault-tolerant services as replicated state machines based on the Raft consensus algorithm.

Performance is the key focus. A design goal for Aeron is to be the highest throughput with the lowest and most predictable latency of any messaging system. Aeron integrates with Simple Binary Encoding (SBE) for the best possible message encoding and decoding performance. Many of the data structures used in the creation of Aeron have been factored out to the Agrona project.

For details of usage, protocol specification, FAQ, etc. please check out the Wiki.

If you prefer to watch a video, try Aeron Messaging from StrangeLoop 2014. Performance and features have advanced quite a bit since then, but the basic design still applies.

For the latest version information and changes see the Change Log with Java downloads at Maven Central.

Commercial support, training, and development on Aeron are available from [email protected]. Premium features, such as Solarflare ef_vi transport bindings for a further 40-60% reduction in latency and security with ATS (Aeron Transport Security) for encrypted communications, are available to customers on commercial support.

How do I use Aeron?

How does Aeron work?

How do I hack on Aeron?

Build

Java Build

Build the project with Gradle using this build.gradle file.

You require Java 8 or later to build Aeron:

JDK 8 or later. Java versions before 1.8.0_65 are very buggy and can cause tests to fail.

Full clean and build of all modules

    $ ./gradlew

C++ Build

You require the following to build the C++ API for Aeron:

CMake 3.6.1 or higher
C++11 supported compiler for the supported platform
C11 supported compiler for the supported platform
Requirements to build HdrHistogram_c.
JDK 8 or later to compile the SBE schema definitions used by the archive client.

Note: Aeron support is available for 64-bit Linux, OSX, and Windows.

For convenience, the cppbuild script does a full clean, build, and test of all targets as a Release build.

    $ ./cppbuild/cppbuild

For those comfortable with CMake, a clean, build, and test looks like:

    $ mkdir -p cppbuild/Debug
    $ cd cppbuild/Debug
    $ cmake ../..
    $ cmake --build . --clean-first
    $ ctest

C Media Driver

By default, the C Media Driver is built as part of the C++ Build. However, it can be disabled via the CMake option BUILD_AERON_DRIVER being set to OFF.

Note: The C Media Driver is supported on Mac and Linux; the Windows version is experimental.

For dependencies and other information, see the README.

Documentation

If you have doxygen installed and want to build the Doxygen doc, there is a nice doc target that can be used.

    $ make doc

Packaging

If you would like a packaged version of the compiled API, there is the package target that uses CPack. If the doc was built before packaging, it will be included. Packages created are "TGZ;STGZ", but this can be changed by running cpack directly.

    $ make package

Running Samples

Start up a media driver which will create the data and conductor directories. On Linux, this will probably be in /dev/shm/aeron or /tmp/aeron.

    $ java -cp aeron-samples/build/libs/samples.jar io.aeron.driver.MediaDriver

Alternatively, specify the data and conductor directories. The following example uses the shared memory 'directory' on Linux, but you could just as easily point to the regular filesystem.

    $ java -cp aeron-samples/build/libs/samples.jar -Daeron.dir=/dev/shm/aeron io.aeron.driver.MediaDriver

You can run the BasicSubscriber from a command line. On Linux, this will be pointing to the /dev/shm shared memory directory, so be sure your MediaDriver is doing the same!

    $ java -cp aeron-samples/build/libs/samples.jar io.aeron.samples.BasicSubscriber

You can run the BasicPublisher from a command line. On Linux, this will be pointing to the /dev/shm shared memory directory, so be sure your MediaDriver is doing the same!

    $ java -cp aeron-samples/build/libs/samples.jar io.aeron.samples.BasicPublisher

You can run the AeronStat utility from a command line to read system counters:

    $ java -cp aeron-samples/build/libs/samples.jar io.aeron.samples.AeronStat

Media Driver Packaging

The Media Driver is packaged by the default build into an application that can be found here:

aeron-driver/build/distributions/aeron-driver-${VERSION}.zip

Troubleshooting

On Linux, the subscriber sample throws an exception:
```
 java.lang.InternalError(a fault occurred in a recent unsafe memory access operation in compiled Java code)
```
This is actually an out of disk space issue.

To resolve it, make sure you have enough disk space.

In the samples, on Linux, this will probably be either at /dev/shm/aeron or /tmp/aeron (depending on your settings).

See this thread for a similar problem.

Note: if you are trying to run this inside a Linux Docker, be aware that, by default, Docker only allocates 64 MB to the shared memory space at /dev/shm. However, the samples will quickly outgrow this.

You can work around this issue by using the --shm-size argument for docker run or shm_size in docker-compose.yaml.
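Before starting the samples, it can help to confirm how much space the filesystem behind the Aeron directory actually has. The following is a minimal sketch using only the JDK. The class name `ShmSpaceCheck` is hypothetical, the default path `/dev/shm/aeron` is an assumption taken from the text above (your driver may use a different directory), and the 64 MB threshold is simply Docker's default `/dev/shm` size mentioned in the note.

```java
import java.io.File;

final class ShmSpaceCheck
{
    /** Returns the usable bytes on the filesystem that holds the given directory. */
    static long usableBytes(final String dir)
    {
        File file = new File(dir);
        // The Aeron directory may not exist yet, so walk up to the nearest existing parent.
        while (file != null && !file.exists())
        {
            file = file.getParentFile();
        }

        return file == null ? 0L : file.getUsableSpace();
    }

    public static void main(final String[] args)
    {
        // Assumption: the driver uses /dev/shm/aeron unless -Daeron.dir says otherwise.
        final String dir = System.getProperty("aeron.dir", "/dev/shm/aeron");
        final long usable = usableBytes(dir);
        final long dockerDefaultBytes = 64L * 1024 * 1024;

        System.out.println(dir + " usable bytes: " + usable);
        if (usable <= dockerDefaultBytes)
        {
            System.out.println("Warning: the samples are likely to outgrow this space");
        }
    }
}
```

Run it with the same `-Daeron.dir` setting as the MediaDriver so that both look at the same filesystem.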

License (See LICENSE file for full license)

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments

Subscription intermittently stops delivering messages

Hi, similar issue to #234, but with slightly different behaviour, so I'm raising a new issue. One of the subscriptions on a specific channel causes all of the others to stall and results in back pressure on the publication.

Here is the AeronStat output for stream -1762877036. Note that registrationId 1167 shows a sub-pos much smaller than the others.

AeronStat

983:               65,184 - pub-lmt: 853 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
984:               32,416 - snd-pos: 853 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1099:               29,920 - sub-pos: 1428 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1134:               29,920 - sub-pos: 460 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1135:               29,920 - sub-pos: 428 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1136:               29,920 - sub-pos: 309 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1137:               29,920 - sub-pos: 241 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1140:               29,920 - sub-pos: 197 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1141:               29,920 - sub-pos: 76 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1227:               29,920 - sub-pos: 1332 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1435:               29,920 - sub-pos: 718 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1436:               29,920 - sub-pos: 776 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1437:                  384 - sub-pos: 1167 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1438:               32,416 - rcv-hwm: 1270 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1464:               29,920 - sub-pos: 1293 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1712:               29,920 - sub-pos: 1962 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1761:               29,920 - sub-pos: 1671 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
1973:               29,920 - sub-pos: 1836 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
2039:               29,920 - sub-pos: 1928 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
2060:               29,920 - sub-pos: 2038 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1
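With this many counters, the lagging subscriber is easier to spot programmatically. The sketch below parses lines in the format shown above and reports the registrationId with the lowest sub-pos. The class `SubPosLag` and its regular expression are hypothetical helpers written for this output format only; they are not part of AeronStat.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubPosLag
{
    // Matches "<counterId>: <value> - sub-pos: <registrationId> ..." and captures value and registrationId.
    private static final Pattern SUB_POS =
        Pattern.compile("^\\s*\\d+:\\s+([\\d,]+) - sub-pos: (\\d+) .*$");

    /** Returns the registrationId with the lowest sub-pos, or -1 if no sub-pos line is present. */
    static long slowestRegistrationId(final String[] lines)
    {
        long minPosition = Long.MAX_VALUE;
        long slowest = -1;

        for (final String line : lines)
        {
            final Matcher matcher = SUB_POS.matcher(line);
            if (matcher.matches())
            {
                // Counter values are printed with thousands separators, e.g. "29,920".
                final long position = Long.parseLong(matcher.group(1).replace(",", ""));
                if (position < minPosition)
                {
                    minPosition = position;
                    slowest = Long.parseLong(matcher.group(2));
                }
            }
        }

        return slowest;
    }

    public static void main(final String[] args)
    {
        final String suffix = " 880428294 -1762877036 aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1";
        final String[] lines =
        {
            "984:               32,416 - snd-pos: 853" + suffix,
            "1436:               29,920 - sub-pos: 776" + suffix,
            "1437:                  384 - sub-pos: 1167" + suffix,
        };

        System.out.println("slowest registrationId: " + slowestRegistrationId(lines)); // prints 1167
    }
}
```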

I have some custom monitoring of my handler so that I can track if this subscription is still being polled. You can see the streamId and registrationId matches the problematic subscription and pollCount is increasing, which means I'm confident our polling code hasn't crashed.

$>bean com.lmax:name=tfx-group__InstrumentNotificationServicePublic,service=cfdx,type=AeronReceiver 
#bean is set to com.lmax:name=tfx-group__InstrumentNotificationServicePublic,service=cfdx,type=AeronReceiver
$>get *
#mbean = com.lmax:name=tfx-group__InstrumentNotificationServicePublic,service=cfdx,type=AeronReceiver:
RegistrationId = 1167;
TopicName = tfx-group::InstrumentNotificationServicePublic;
StreamId = -1762877036;
PollCount = 441575;
Uri = aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1;

$>get *
#mbean = com.lmax:name=tfx-group__InstrumentNotificationServicePublic,service=cfdx,type=AeronReceiver:
RegistrationId = 1167;
TopicName = tfx-group::InstrumentNotificationServicePublic;
StreamId = -1762877036;
PollCount = 691272;
Uri = aeron:udp?group=239.193.0.1:10000|interface=172.29.1.1;

Polling code:

    public int poll()
    {
        final int poll = subscription.poll(handler, 20);
        ++pollCount;
        return poll;
    }

The LogInspector output for the appropriate subscription image shows that Aeron thinks the term is unused and dirty, i.e. the first term shows no messages but has not been cleaned, while the remaining 2 terms are all zeros. The LogInspector suggests that there are no messages in that term.

 ======================================================================
 Fri Jun 17 03:38:17 UTC 2016 Inspection dump for /dev/shm/aeron-tradex/images/UDP-ac1d0101-0-efc10001-10000-347A4506-96ECA194-4F6.logbuffer
 ======================================================================
 Time of last SM: Thu Jan 01 00:00:00 UTC 1970
 Initial term id: -1970406852
    Active index: 0
     Term length: 65536
      MTU length: 4096

 default Data Header{frame_length=0 version=0 flags=11000000 type=1 frame_length=0 term_offset=0 session_id=880428294 stream_id=-1762877036 term_id=-1970406852 reserved_value=0}

 Index 0 Term Meta Data status=CLEAN termOffset=0 termId=0
 Index 1 Term Meta Data status=CLEAN termOffset=0 termId=0
 Index 2 Term Meta Data status=CLEAN termOffset=0 termId=0
 %n======================================================================
 Index 0 Term Data

 Data Header{frame_length=0 version=0 flags=00000000 type=0 frame_length=0 term_offset=0 session_id=0 stream_id=0 term_id=0 reserved_value=0}
 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 00002C00000000C001008001000006457A3494A1EC963CFA8D8A0000000000000000000000080000000000000000000000000000000000000000000000000000000020000000
 00C00100C001000006457A3494A1EC963CFA8D8A00000000000000002C00000000C00100E001000006457A3494A1EC963CFA8D8A000000000000000000000008000000000000
 ...

enhancement

opened by mikeb01 67

Question: Publication Unblock

Can you please confirm that the following is indeed an issue?

In IpcPublication::checkForBlockedPublisher,

because initially we set

timeOfLastConsumerPositionChange = 0;

and

consumerPosition = producerPosition(); lastConsumerPosition = consumerPosition;

When we transition from having no subscription to having one, it triggers a call to LogBufferUnblocker.unblock even if the publication is not blocked.

In my use case, the first claimed message is forcefully published by unblock when in fact it should be done via commit.

Maybe setting the timeOfLastConsumerPositionChange=currenttime when a new subscription is added would prevent this.

I can also increase the unblock timeout. However, it would have to be a very high value, given that the difference between the current time and timeOfLastConsumerPositionChange = 0 could be very large.
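The effect described in this question can be illustrated with a simplified model. This is not Aeron's IpcPublication code: the class `BlockedCheckModel`, its fields, and the 10 second timeout are hypothetical and only mirror the reasoning above, namely that a last-change time of 0 makes the elapsed time look enormous on the first check.

```java
final class BlockedCheckModel
{
    private final long unblockTimeoutNs;
    private long lastConsumerPosition;
    private long timeOfLastConsumerPositionChangeNs;

    BlockedCheckModel(final long unblockTimeoutNs, final long initialPosition, final long initialTimeNs)
    {
        this.unblockTimeoutNs = unblockTimeoutNs;
        this.lastConsumerPosition = initialPosition;
        this.timeOfLastConsumerPositionChangeNs = initialTimeNs;
    }

    /** Returns true if this model would attempt an unblock at nowNs. */
    boolean shouldUnblock(final long consumerPosition, final long producerPosition, final long nowNs)
    {
        if (consumerPosition != lastConsumerPosition)
        {
            // The consumer made progress, so restart the timer.
            lastConsumerPosition = consumerPosition;
            timeOfLastConsumerPositionChangeNs = nowNs;
            return false;
        }

        // The consumer has not moved while the producer is ahead of it.
        return producerPosition > consumerPosition &&
            nowNs - timeOfLastConsumerPositionChangeNs > unblockTimeoutNs;
    }

    public static void main(final String[] args)
    {
        final long timeoutNs = 10_000_000_000L; // hypothetical 10 s unblock timeout
        final long nowNs = 3_600_000_000_000L;  // the clock reads one hour when the first subscription appears

        final BlockedCheckModel startsAtZero = new BlockedCheckModel(timeoutNs, 0L, 0L);
        final BlockedCheckModel startsAtNow = new BlockedCheckModel(timeoutNs, 0L, nowNs);

        // A first message has just been claimed, so the producer is ahead of the consumer.
        System.out.println("initial time 0   -> unblock: " + startsAtZero.shouldUnblock(0L, 64L, nowNs)); // true
        System.out.println("initial time now -> unblock: " + startsAtNow.shouldUnblock(0L, 64L, nowNs));  // false
    }
}
```

In this model, initializing the timestamp to the current time when the subscription is added gives the publisher a full timeout period to commit, which is the change suggested above.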

opened by goglusid 53
With many streams over a single channel, Publications report NOT_CONNECTED

I have around 400 streams over a single channel. Some of the streams work fine, but the publication for a number of them report NOT_CONNECTED when doing an offer. The system works without issue when I have a channel per stream.

I've attached a couple of log files filtered down to one of the streams that is not working (streamId: -1523337006). The app.txt is our application logs showing successful creation of the publication and subscription. The cnc-stat.txt is the relevant lines from the AeronStat output. media-driver-admin.txt is all of the related messages from the media driver logs with 'aeron.event.log=admin'.

app.txt cnc-stat.txt media-driver-admin.txt

opened by mikeb01 49
Duplicate Messages Received for Long Running Processes
What we found

We created a test client application that will connect two peers over UDP and send an incrementing set of sequence numbers to a local client's remote peer. The local sequence number is incremented by 1 on each message sent. When reading messages, the remote peer asserts that the stream of sequence numbers contained in the data messages is strictly increasing by one. The critical assertion is that if a sequence number read by a Peer is not incrementing by one, we either have received a message that looks like it was sent in the past if it is less than the expected value or looks like it was sent in the future if it was greater than the expected value. In our case, we were able to reproduce the "received a message sent in the past" with Setup 2 (below).

This scenario occurred after the test was running for about 9 hours. At that time the sequence number log indicates we received a duplicate message with a sequence number from approximately 6 hours ago.

Note that we only notice duplicate delivery with Setup 2 (two instances) and not with Setup 1, where both media drivers are on the same machine but still communicate using UDP.

This is demonstrated by ConnectionSample.cpp: https://gist.github.com/bedding/08a0609a880595e89df63d994cc06f03

Test Platform

Version of Aeron we used: aeron/master at 84f757c8e0ee638c5448c8c39f650f4714e0a842

Setup 1 :

1 EC2 instance: c4-8xl

2 media driver processes, one test client per media driver.

Both media drivers and both client processes are on the same machine, communicating via UDP.

Setup 2 :

2 EC2 instances

C4-8XL

C3-8XL

Each instance has a media driver process and test client process.

What the test does

Peer: a single Aeron client process

Connection

Control channel: consists of a stream to receive control messages (such as join request, join response and join acknowledgement) and a stream to publish control messages to the remote peer.

Data channel: consists of a pair of streams to send and receive incremented sequence numbers.

Steps

Start peer 1 with endpoint A, which waits for peer 2 to start.

Start peer 2 with endpoint B, which initiates the connection by sending a join request to peer 1 via the control channel connected to endpoint A.

Peer 1 will increment its sequence number (Sn1) and send it with a data message to Peer 2.

Peer 2 reads the data message and checks if the sequence number is always increasing by 1.

Peer 2 also sends its incrementing sequence number (Sn2) to Peer 1, which does the same check for the sequence number Sn2.

Notes

Based on a previous issue I opened, I integrated with the newer version of Aeron and also used a guard based on the isConnected() function on both publication and subscription.

We also implemented a peer connection acknowledgement to ensure that connections are acknowledged by both peers, meaning both channels are connected on both sides.

We also send data messages on control channel to make sure that message drop or duplicate message does not occur on any stream.
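The check each peer performs on incoming sequence numbers can be sketched as follows. This is not the code from ConnectionSample.cpp; the class `SequenceChecker` and its result names are hypothetical and only illustrate the "sent in the past" versus "sent in the future" classification described above.

```java
final class SequenceChecker
{
    enum Result { IN_ORDER, FROM_PAST, FROM_FUTURE }

    private long expected;

    SequenceChecker(final long firstExpected)
    {
        this.expected = firstExpected;
    }

    /** Classifies a received sequence number against the next expected value. */
    Result onMessage(final long sequence)
    {
        if (sequence == expected)
        {
            expected++;
            return Result.IN_ORDER;
        }

        // Lower than expected looks like a duplicate from the past, higher looks like a gap.
        return sequence < expected ? Result.FROM_PAST : Result.FROM_FUTURE;
    }

    public static void main(final String[] args)
    {
        final SequenceChecker checker = new SequenceChecker(1);
        for (final long sequence : new long[] { 1, 2, 3, 2, 5 })
        {
            System.out.println(sequence + " -> " + checker.onMessage(sequence));
        }
        // Prints IN_ORDER for 1, 2, 3, then FROM_PAST for the repeated 2 and FROM_FUTURE for 5.
    }
}
```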
opened by bedding 38
Multicast: first subscriber gets full message replay, following subscribers not ?
(Setup: your BasicXX example classes using multicast addresses [JDK 1.8_11 Ubuntu 14.04, AMD Opteron]).

Observation:

Start publisher sending "Hello world". Let it run for a while

Start 1st BasicSubscriber => Subscriber receives all messages starting at sequence 0

Start 2nd Subscriber => Subscriber receives a sequence near the current sequence of sender

How large will the replay for the 1st subscriber become ? In case of a high volume publisher e.g. started some hours before the first subscriber, 1st subscriber might get flooded with messages. Can I suppress replay upon join somehow ?
enhancement question
opened by RuedigerMoeller 38
Aeron Publisher offer succeeds but subscribers do not receive any data
Hi, I ran this test on the latest version 1.14.0, java version is: 1.8.0_161-b12, I am on a macOS. Please note the sequence of steps below:

I took the BasicPublisher class, changed the sleep to 300 ms instead of 1 second

Second, I set aeron.client.liveness.timeout=300000000 (300 ms instead of 10 seconds)

Now start the MediaDriver (I used the low-latency-media-driver) and BasicPublisher, then start two BasicSubscriber instances

What I noticed is that the BasicPublisher publishes and both subscribers receive data, but after some time both stop receiving data. Surprisingly, the publisher does not notice this and keeps offering successfully, so the subscribers receive no data while the publisher keeps publishing.

This only happens when I reduce the client liveness timeout. This is exactly what we found in our test environment as well

Also the following error was printed in the error log file:

*** 7 observations from 2019-01-12 20:48:36.354+0530 to 2019-01-12 20:49:14.939+0530 for:
io.aeron.driver.exceptions.ControlProtocolException: Unknown Subscription: 3
    at io.aeron.driver.DriverConductor.onRemoveSubscription(DriverConductor.java:741)
    at io.aeron.driver.ClientCommandAdapter.onMessage(ClientCommandAdapter.java:136)
    at org.agrona.concurrent.ringbuffer.ManyToOneRingBuffer.read(ManyToOneRingBuffer.java:157)
    at io.aeron.driver.ClientCommandAdapter.receive(ClientCommandAdapter.java:64)
    at io.aeron.driver.DriverConductor.doWork(DriverConductor.java:154)
    at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:268)
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:161)
    at java.lang.Thread.run(Thread.java:748)

Please let me know if this is a bug and, in the meantime, what the best way to work around this problem is. Let me know if you need any more details.

Thanks!
opened by harnitbakshi 37

Driver gets wedged with two subscribers, one publisher

Using the latest versions of PublisherTool and SubscriberTool, if I start one subscriber with an embedded driver, start another one on the same channel:stream ID without, and start a publisher...

$ java -cp aeron-tools/build/libs/tools.jar uk.co.real_logic.aeron.tools.SubscriberTool --driver=embedded
$ java -cp aeron-tools/build/libs/tools.jar uk.co.real_logic.aeron.tools.SubscriberTool
$ java -cp aeron-tools/build/libs/tools.jar uk.co.real_logic.aeron.tools.PublisherTool

the two subscribers get messages for a few seconds and then wedge, getting nothing, and the publisher is blocked from sending. If I kill the publisher and start it again, I get an exception on the PublisherTool app:

java.lang.IndexOutOfBoundsException: index=16777208, length=24, capacity=16777216
    at uk.co.real_logic.agrona.concurrent.UnsafeBuffer.boundsCheck(UnsafeBuffer.java:795)
    at uk.co.real_logic.agrona.concurrent.UnsafeBuffer.putBytes(UnsafeBuffer.java:724)
    at uk.co.real_logic.aeron.common.concurrent.logbuffer.LogAppender.appendPaddingFrame(LogAppender.java:293)
    at uk.co.real_logic.aeron.common.concurrent.logbuffer.LogAppender.appendUnfragmentedMessage(LogAppender.java:206)
    at uk.co.real_logic.aeron.common.concurrent.logbuffer.LogAppender.append(LogAppender.java:137)
    at uk.co.real_logic.aeron.Publication.offer(Publication.java:165)
    at uk.co.real_logic.aeron.tools.PublisherTool$PublisherThread.onNext(PublisherTool.java:416)
    at uk.co.real_logic.aeron.tools.RateController$SecondsAtBitsPerSecondInternal.sendNext(RateController.java:479)
    at uk.co.real_logic.aeron.tools.RateController.next(RateController.java:53)
    at uk.co.real_logic.aeron.tools.PublisherTool$PublisherThread.run(PublisherTool.java:357)
    at java.lang.Thread.run(Thread.java:745)

If I kill and restart the publisher again, I don't get another exception, but it thinks it's sending for a few more messages and then wedges again. Meanwhile, the subscribers haven't gotten any new messages since the very first run of the Publisher.

This could be some sort of bug in the tools themselves, but the fact that I get an exception from the LogAppender on the Publisher upon restarting the tool makes me think that perhaps it's not.

bug

opened by strangelydim 28

Performance issue with Subscription::poll()

I'm using Aeron to send and receive a large number of objects, so I did a simple perf test on Subscription::poll and found that polling more fragments per Subscription::poll() call doesn't improve performance very much. I'm sending 500k objects every 30ms, and each object is about 20 bytes, so each object fits in exactly one fragment. The objects are sent over IPC.

I measured the time taken by my fragment handler and by polling from the Aeron subscription object. It turns out that polling 100k per poll() call (5 polls in total) takes about 380ms to finish receiving all 500k objects, while polling 1 per call (500k polls in total) takes 500ms. The time taken by running my own fragment handler is about 300us in total, which is only about 0.1%. I expected batching poll() to improve performance by much more than ~30%. The following code snippet is from my test; can anyone shed some light on this? I'm wondering if I'm doing something wrong here.

void FragmentHandler(aeron::concurrent::AtomicBuffer &buffer, aeron::util::index_t offset,
                     aeron::util::index_t length, aeron::Header &header, 
                     std::int64_t& handlerTime)
{
    Timer timer;
    timer.Start();

    // do stuff

    handlerTime += timer.Stop();
}

void Receive(std::shared_ptr<aeron::Subscription> sub, int numberOfObjects)
{
    Timer timer;
    timer.Start();

    std::int64_t handlerTime = 0;

    aeron::fragment_handler_t handler = std::bind(FragmentHandler,
                                                  std::placeholders::_1, std::placeholders::_2,
                                                  std::placeholders::_3, std::placeholders::_4,
                                                  std::ref(handlerTime));
    
    for (int i = 0; i < numberOfObjects; )
    {
        // changing the fragment limit here doesn't help performance much
        int objectRead = sub->poll(handler, 1);
        i += objectRead;
    }

    auto duration = timer.Stop();
    
    std::cout << handlerTime << ", " << duration << ", " 
       << double(handlerTime) / duration << std::endl;
}

Thanks!

opened by bedding 25

Stalled Publication/Subscription

Version 1.0.1

I'm debugging a case where sub.poll suddenly doesn't receive any more messages, and it stays that way. On the sending side pub.offer is successful (positive return value and connected=true closed=false).

Only two nodes.

node1: 172-31-10-77
node2: 172-31-8-204

node1 tries to send to node2, but node2 is not started yet. ~30 seconds later node2 is started and they successfully exchange a few messages (I have application level logs for that), and then node2 stops receiving more messages in sub.poll. onFragment in the FragmentAssembler is not invoked even though I repeatedly call poll. Those messages are below mtu size.

The systems are rather loaded in this scenario but not overloaded, and the load is stopped after a while.

AeronStat from node1 172-31-10-77:

 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 24:        1,627,656,896 - snd-pos: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 25:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 26:        1,636,045,504 - pub-lmt: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 27:              170,432 - sub-pos: 1 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 28:              170,432 - rcv-hwm: 7 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:           19,750,176 - sub-pos: 2 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 30:           19,750,176 - rcv-hwm: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 31:           19,750,176 - rcv-pos: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 32:           28,953,216 - pub-lmt: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 33:           20,564,608 - snd-pos: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

AeronStat from node2 172-31-8-204:

 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 24:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 25:            8,559,040 - pub-lmt: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 26:              170,432 - snd-pos: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 27:           28,138,784 - pub-lmt: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 28:           19,750,176 - snd-pos: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:                2,176 - sub-pos: 1 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 30:        1,627,656,896 - rcv-hwm: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 31:                2,176 - rcv-pos: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 32:           20,500,992 - sub-pos: 2 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 33:           20,500,992 - rcv-hwm: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 34:           20,500,992 - rcv-pos: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

The problematic session is 1287205435. The other streams seem to progress. I kept it running for minutes after the stall.

Stream 1 is our control stream and it's low traffic, a few messages per second on this stream.

I have all files, if you need more information.

opened by patriknw 22

After recording for about 5 minutes, the Aeron Client check timeout exception

After recording for about 5 minutes, the Aeron client throws a timeout exception. The exception is below:

io.aeron.exceptions.ConductorServiceTimeoutException: Timeout between service calls over 5000000000ns
    at io.aeron.ClientConductor.onCheckTimeouts(ClientConductor.java:508)
    at io.aeron.ClientConductor.doWork(ClientConductor.java:431)
    at io.aeron.ClientConductor.doWork(ClientConductor.java:143)
    at org.agrona.concurrent.AgentInvoker.invoke(AgentInvoker.java:88)
    at io.aeron.archiver.ArchiveConductor.doWork(ArchiveConductor.java:116)
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:140)
    at java.lang.Thread.run(Thread.java:745)

opened by jordanxlj 21
Possible memory visibility issue in PublicationImage
We believe we found some misbehavior with respect to the memory visibility of the buffer that Aeron returns while polling a subscription. We are also linking a barebones setup that reproduces the misbehavior; it can be found here:

https://github.com/dkonik/AeronTesting

The Behavior:

We observe that the bytes inside the buffer that is passed to the FragmentHandler while polling seem to (on occasion) change inside the handler. More specifically, at the beginning of the FragmentHandler we call buffer.getLong(offset), then we do some work, then at the end of the FragmentHandler we call buffer.getLong() again and the value that is returned is not the same as the first time getLong was called. For example, with the following code in the FragmentHandler:

    final long receivedIndex = buffer.getLong(offset);
    if (receivedIndex != previous + 1)
    {
        System.out.println("MISSED MESSAGE AT INDEX: " + previous + "->" + receivedIndex);
        System.out.printf("%x %x %x %x %x %x %x %x\n",
            buffer.getByte(offset), buffer.getByte(offset + 1),
            buffer.getByte(offset + 2), buffer.getByte(offset + 3),
            buffer.getByte(offset + 4), buffer.getByte(offset + 5),
            buffer.getByte(offset + 6), buffer.getByte(offset + 7),
            buffer.getByte(offset + 8));
        System.out.println("Offset(again): " + buffer.getLong(offset) + "\n");
    }
    ++messageNum;
    previous = receivedIndex;

We get output of:

    MISSED MESSAGE AT INDEX: 188610->96091
    c3 e0 2 0 0 0 0 0
    Offset(again): 188611

    MISSED MESSAGE AT INDEX: 96091->188612
    c4 e0 2 0 0 0 0 0
    Offset(again): 188612

    MISSED MESSAGE AT INDEX: 371101->278582
    9e a9 5 0 0 0 0 0
    Offset(again): 371102

    MISSED MESSAGE AT INDEX: 278582->371103
    9f a9 5 0 0 0 0 0
    Offset(again): 371103

As you can see, for the first one, you would expect the returned value to be 188610 + 1, but the first time getLong is called a different number is returned (96091). Afterwards, when getLong is called again, the correct value is returned, implying that Aeron is giving us a buffer before the memory contained by that buffer is fully visible in all threads.
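The byte dumps in the output can be checked by hand. Assuming the buffer is read in little-endian order (the native order on the x86 hardware described in this report), the bytes `c3 e0 2 0 0 0 0 0` decode to 0x02e0c3 = 188611, which is the expected value rather than the 96091 returned by the first read. The sketch below does this decoding with the JDK's `ByteBuffer`; the class `DumpDecode` is a hypothetical helper and does not use Aeron's buffer classes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class DumpDecode
{
    /** Decodes up to eight dumped byte values as a little-endian long. */
    static long decodeLittleEndian(final int... bytes)
    {
        final ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (final int value : bytes)
        {
            buffer.put((byte)value);
        }

        // Absolute read from index 0; any bytes not supplied remain zero.
        return buffer.getLong(0);
    }

    public static void main(final String[] args)
    {
        System.out.println(decodeLittleEndian(0xc3, 0xe0, 0x02, 0, 0, 0, 0, 0)); // prints 188611
        System.out.println(decodeLittleEndian(0x9e, 0xa9, 0x05, 0, 0, 0, 0, 0)); // prints 371102
    }
}
```

Both decoded values match the "Offset(again)" lines, so by the time the bytes were printed the buffer already held the correct data.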

Additionally, we have only observed this while sending 5-10 packets back to back as fast as possible (as opposed to say, a one packet at a time ping-pong style).

System Configurations:

Both boxes are Haswells with a direct connection between them (no switch), and rhel 7 version 3.10.0-327.22.2.el7.x86_64
opened by mjpt777 20
[java] leader transfer manually for aeron cluster

When releasing a new version of application code, we need to shut down the leader node. It then takes 10s to elect a new leader, which means the service stops for at least 10s.

If we have leader transfer functionality, this will reduce to 200ms.

The code has been tested, please have a look, thanks very much.

opened by spiritlcx 0
[cluster]The cluster receives a message about a connection that has been closed

When the cluster restarts at the same time, clients will not receive the close event and will not be able to recover automatically. It is normal for the client to send heartbeats and commands.

opened by WorkingChen 0
MediaDriver keepalive timeout due to driver being slow on munmaps

In the case where we have an application subscribing to a large number of streams, if we then shut that application down we have a massive influx of freeing the publication images. Due to munmap being potentially slow (in the milliseconds) and the large number of streams, we can see a single driver_conductor work loop take over 1 second. This then triggers media driver keepalive timeouts for clients.

Would it be possible to queue up the munmaps so that other driver_conductor work isn't entirely blocked by a large number of munmaps? i.e. only free N publication images per driver_conductor work loop?
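The idea of freeing only N images per work loop can be sketched as a bounded queue of deferred free actions. This is a minimal illustration in Java rather than the C driver's code. `BoundedFreeQueue`, `enqueue`, `doWork`, and `maxFreesPerCycle` are hypothetical names, and each `Runnable` stands in for whatever releases one publication image (for example, the munmap call).

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class BoundedFreeQueue
{
    private final Queue<Runnable> pendingFrees = new ArrayDeque<>();
    private final int maxFreesPerCycle;

    BoundedFreeQueue(final int maxFreesPerCycle)
    {
        this.maxFreesPerCycle = maxFreesPerCycle;
    }

    // Record a resource as ready for release without freeing it yet.
    void enqueue(final Runnable freeAction)
    {
        pendingFrees.add(freeAction);
    }

    // Called once per work loop: runs at most maxFreesPerCycle free actions
    // and returns how many were run, so other work is not starved.
    int doWork()
    {
        int workCount = 0;
        while (workCount < maxFreesPerCycle)
        {
            final Runnable freeAction = pendingFrees.poll();
            if (null == freeAction)
            {
                break;
            }

            freeAction.run();
            workCount++;
        }

        return workCount;
    }

    int pending()
    {
        return pendingFrees.size();
    }

    public static void main(final String[] args)
    {
        final BoundedFreeQueue queue = new BoundedFreeQueue(4);
        for (int i = 0; i < 10; i++)
        {
            queue.enqueue(() -> { }); // placeholder for a slow unmap
        }

        System.out.println(queue.doWork()); // prints 4
        System.out.println(queue.doWork()); // prints 4
        System.out.println(queue.doWork()); // prints 2
    }
}
```

With a limit like this, the cost of a single work loop is bounded by N slow frees, at the price of the resources being held a little longer.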

opened by reissGRVS 3
Add CodeQL workflow for GitHub code scanning
Hi real-logic/aeron!

This is a one-off automatically generated pull request from LGTM.com. You might have heard that we’ve integrated LGTM’s underlying CodeQL analysis engine natively into GitHub. The result is GitHub code scanning!

With LGTM fully integrated into code scanning, we are focused on improving CodeQL within the native GitHub code scanning experience. In order to take advantage of current and future improvements to our analysis capabilities, we suggest you enable code scanning on your repository. Please take a look at our blog post for more information.

This pull request enables code scanning by adding an auto-generated codeql.yml workflow file for GitHub Actions to your repository. Take a look! We tested it before opening this pull request, so all should be working. In fact, you might already have seen some alerts appear on this pull request!

Where needed and if possible, we’ve adjusted the configuration to the needs of your particular repository. But of course, you should feel free to tweak it further! Check this page for detailed documentation.

Questions? Check out the FAQ below!

FAQ

How often will the code scanning analysis run?

By default, code scanning will trigger a scan with the CodeQL engine on the following events:

On every pull request — to flag up potential security problems for you to investigate before merging a PR.

On every push to your default branch and other protected branches — this keeps the analysis results on your repository’s Security tab up to date.

Once a week at a fixed time — to make sure you benefit from the latest updated security analysis even when no code was committed or PRs were opened.

What will this cost?

Nothing! The CodeQL engine will run inside GitHub Actions, making use of your unlimited free compute minutes for public repositories.

What types of problems does CodeQL find?

The CodeQL engine that powers GitHub code scanning is the exact same engine that powers LGTM.com. The exact set of rules has been tweaked slightly, but you should see almost exactly the same types of alerts as you were used to on LGTM.com: we’ve enabled the security-and-quality query suite for you.

How do I upgrade my CodeQL engine?

No need! New versions of the CodeQL analysis are constantly deployed on GitHub.com; your repository will automatically benefit from the most recently released version.

The analysis doesn’t seem to be working

If you get an error in GitHub Actions that indicates that CodeQL wasn’t able to analyze your code, please follow the instructions here to debug the analysis.

How do I disable LGTM.com?

If you have LGTM’s automatic pull request analysis enabled, then you can follow these steps to disable the LGTM pull request analysis. You don’t actually need to remove your repository from LGTM.com; it will automatically be removed in the next few months as part of the deprecation of LGTM.com (more info here).

Which source code hosting platforms does code scanning support?

GitHub code scanning is deeply integrated within GitHub itself. If you’d like to scan source code that is hosted elsewhere, we suggest that you create a mirror of that code on GitHub.

How do I know this PR is legitimate?

This PR is filed by the official LGTM.com GitHub App, in line with the deprecation timeline that was announced on the official GitHub Blog. The proposed GitHub Action workflow uses the official open source GitHub CodeQL Action. If you have any other questions or concerns, please join the discussion here in the official GitHub community!

I have another question / how do I get in touch?

Please join the discussion here to ask further questions and send us suggestions!
opened by lgtm-com[bot] 0
Erroneous next fragment term offset computation
This is mostly a nitpick, and I'm not sure my analysis below is correct, so please correct me if I'm wrong.

In FragmentAssembler#handleFragment, the term offset of the next fragment is computed as

BitUtil.align(offset + length + HEADER_LENGTH, FRAME_ALIGNMENT)

offset and length are (most likely) provided by TermReader#read so their values are frameOffset + HEADER_LENGTH and frameLength - HEADER_LENGTH respectively, and therefore offset + length is equal to frameOffset + frameLength. Aligning that value to the FRAME_ALIGNMENT to obtain the offset of the next frame makes sense. But adding + HEADER_LENGTH before doing the alignment doesn't.

It just so happens that HEADER_LENGTH is equal to FRAME_ALIGNMENT and so the operations x -> BitUtil.align(x, FRAME_ALIGNMENT) and x -> x + HEADER_LENGTH commute. But if HEADER_LENGTH was not a multiple of FRAME_ALIGNMENT then this code would break. So for cleanliness and to avoid a bad surprise if the transport protocol ever changes (unlikely I guess), it would be best to instead write BitUtil.align(offset + length, FRAME_ALIGNMENT) + HEADER_LENGTH.
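The commuting argument can be checked with a small standalone sketch. The `align` helper below is a local stand-in that rounds up to a power-of-two alignment (not a call into Agrona), and the values 32 and 24 are illustrative: 32 for both constants models the case where they are equal, and 24 is a hypothetical header length that is not a multiple of the alignment.

```java
final class AlignCommute
{
    // Round value up to the next multiple of alignment (alignment must be a power of two).
    static int align(final int value, final int alignment)
    {
        return (value + (alignment - 1)) & -alignment;
    }

    // True when align(end + header) == align(end) + header for every frame end in [0, limit).
    static boolean formsAgree(final int headerLength, final int frameAlignment, final int limit)
    {
        for (int end = 0; end < limit; end++)
        {
            if (align(end + headerLength, frameAlignment) != align(end, frameAlignment) + headerLength)
            {
                return false;
            }
        }

        return true;
    }

    public static void main(final String[] args)
    {
        System.out.println(formsAgree(32, 32, 4096)); // prints true: header is a multiple of the alignment
        System.out.println(formsAgree(24, 32, 4096)); // prints false: the two forms diverge
    }
}
```

For example, with a header of 24 and an alignment of 32, a frame end of 0 gives align(0 + 24) = 32 but align(0) + 24 = 24.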
opened by jrsala-auguration 0

Releases(1.40.0)

1.40.0(Oct 21, 2022)
Memory align allocated buffers in PublicationTest so it works on Apple M1 processors.

Check that NoOpLock is only allowed to be used when using Aeron client in invoker mode.

Handle case of a delayed concurrent offer to a publication in which other threads have raced terms ahead without throwing an exception.

Collapse term appenders into publications to reduce memory footprint and avoid data dependent loads.

Short circuit Image polling operation when bound limit is less than current position to prevent term overrun.

Add different aliases for consensus module/service container subscriptions. PR #1366.

Stop an active cluster log replay when ClusterBackup is closed rather than waiting for timeout.

Send unavailable counter events to Aeron clients when a client closes or times out.

Allow Consensus Module Agent to be run via an Invoker in addition to having its own thread.

Apply liveness checks to Archive and Cluster mark files so that multiple instances cannot be run in the same directory and corrupt files.

[Java] Use fixed format for timestamps in agent debug logs.

Allow Archive replicate to overwrite all metadata for an empty recording.

[C] Handle log buffer files with term_length == AERON_LOGBUFFER_TERM_MAX_LENGTH on Windows. PR #1360.

[C] Fix inclusion of symbols for debug builds on Windows.

Remove localhost defaults for Archive and Cluster to help avoid mis-configuration in production. PR #1356.

Await 'REPLICATE_END' when catching up as a follower across multiple leadership terms to avoid clashing session-id.

Allow setting of receive socket buffer and window on cluster log channel subscribers. PR #1345.

Fix application of send socket buffer lengths as configured when using MDC.

Fix ArchiveTool.dump when fragment length is set <= 0.

Capture closing sessions into snapshot so session close event is not lost on cluster shutdown.

Remove brackets from counter labels to make it easier to export to Prometheus.

Send cluster client session open acknowledgement before appending to the log to avoid race with service sending egress on open event. Issue #1351.

[C] Fix off-by-one error when copying the local socket address into the channel indicator counter.

Add protocol version support to cluster consensus protocol.

Add more context to error messages on Archive ReplaySession. PR #1349.

Apply strict validation of consensus module snapshot state when messages are offered from clustered services. A number of customers have not been strict with all cluster nodes being deterministic and doing exactly the same thing which can result in corrupted and diverged snapshots.

Consensus module state snapshot can be inspected with the describe-latest-cm-snapshot option to ClusterTool.

If a consensus module snapshot is shown to be corrupt, it may be fixed by running ConsensusModuleSnapshotPendingServiceMessagesPatch. Non-support customers who wish to have help can contact [email protected]. The patch can fix the leader; the fixed snapshot then needs to be replicated to the followers, which can be done with AeronArchive.replicate using the correct recording ids.

Add a tool to replicate a specific recording between archives. PR #1363.

[C++] use getAsString calls for pollers for record descriptors for channel fields. Add test from PR #1348.

Add ClusteredService.doBackgroundWork which can be used for maintaining external connections beyond ingress and egress.

Increase default message timeout from 5 to 10 seconds for Archive clients.

Add EOS flag to status messages (SMs) once a stream is totally received so the sender can take clean up action.

When an EOS status message is received by a sender, allow the publication linger on unicast to be cut short so resources are reclaimed sooner.

When EOS status message is received by a sender then remove the receiver from flow control for multicast and MDC with tagged and min FC.

Fix the closing of session specific subscriptions to prevent resource leak.

Add scripts for testing raw network performance on Windows.

Close egress from cluster on change of leader so clients can detect it before a new leader is elected.

Don't timeout and close cluster client session if quorum cannot be temporarily reached.

Add logging support for ClusterBackup state changes.

Close cluster clients when complete cluster is restarted.

Support automatic reconnect from cluster client when the same leader is re-elected after a net split or temporarily losing quorum.

Add authentication for ClusterBackup to a cluster.

Validate Archive mark file length before reading when mapped read-only to avoid access violations.

Preserve iteration order for cluster client session based on session id so snapshots can have binary compatibility.

Capture leadership term id for cluster backup queries.

Account for padding when sweeping pending services messages to avoid out of bounds exception.

Prevent -1 leadership term ids appearing in the RecordingLog.

Allow Archive replication and replay request to specify session level file IO max buffer length for throttling a stream.

Add support for custom app version validation to clustered services with AppVersionValidator.

Add false sharing protection to DutyCycleTracker.

Update doc on ReplayMerge to indicate the AeronArchive client should not be shared. Issue #1340.

Upgrade to Versions 0.43.0.

Upgrade to Mockito 4.8.1.

Upgrade to Google Test 1.12.1.

Upgrade to JUnit 5.9.1.

Upgrade to ByteBuddy 1.12.18.

Upgrade to Gradle 7.5.1.

Upgrade to SBE 1.27.0.

Upgrade to Agrona 1.17.1.

Java binaries can be found here.
Source code(tar.gz)
Source code(zip)
1.39.0(Jul 14, 2022)
[Java] Fix IllegalStateException that could exist for an MDS subscription on the rapid recycling of ReplayMerge operations.

[C] Align ring buffer implementations and feature set with Java.

[Java] Make sure that C and Java are aligned on resend window. Re-instate the max message length being accounted in the bottom of the resend window for Java.

Add duty cycle duration tracking to all agents across all modules.

[C++] Improve efficiency by reducing the number of copy operations for fragment assembly when a stream has many fragmented messages.

[C] Default to CLOCK_REALTIME for send/receive timestamps.

[Java] Add setters for send/receive timestamp clocks to the MediaDriver.Context.

Fix handling of fragment assemble when reliable=false is set for a channel and loss occurs.

Improve handling of short sends on MDC publication to backoff from overloading a socket.

Add round-robin facility to MDC publication for increased fairness.

[Java] Publish aeron-test-support package as a JAR.

[Java] Downgrade "unknown replay" errors to warnings for cluster catchup.

[Java] Add appVersion to event logging for consensus module and check for correct app version when replaying log.

[Java] Prevent timeout warnings with cluster dynamic nodes and log replication.

[Java] Add cluster dynamic join state change logging events.

Add counters for the number of receivers in min and tagged flow control strategies.

[Java] Avoid race unmapping buffers on concurrent close of media drivers.

Modify flow control strategies to have new method for when elicited setups are sent and add counters manager to init methods. Modify Min and Tagged flow control to use setup snd-lmt as min position until timeout or receiver added on SM.

[Java] Account for possible padding in log buffer when checking for bottom resend window for retransmits.

[C] Flush output when printing configuration.

[C] Raise warning on failure to setup media timestamping.

[Java] Update recordingId on any signal with a valid recording id when handling signals for snapshot replication.

[Java] When attempting ClientSession.tryClaim, ensure that there is enough buffer space when returning a mocked offer for a follower.

[C] Ensure publication image is released before it is freed.

[C] Fix scanf that could result in buffer overflow when parsing HTTP for configuration.

[Java] Change default cluster session timeout from 5 to 10 seconds.

Prevent receiver joining min/tagged flow control if they are more than a window behind.

[C] Add sample for working with large messages.

[Java] Add logging event for appending a cluster session close.

Upgrade to BND 6.3.1.

Upgrade to Mockito 4.6.1.

Upgrade to ByteBuddy 1.12.10.

Upgrade to SBE 1.26.0.

Upgrade to Agrona 1.16.0.

Java binaries can be found here.
Source code(tar.gz)
Source code(zip)
1.38.2(Apr 29, 2022)
C Driver/Client Release Only

[C] Driver - Ensure the correct control address is used when adding multicast destinations with MDS.

[C] Driver - Allow thread affinity on CPU 0.

[C] API - Check handler parameter before polls. Check images for NULL before polling images.

No Java binaries for this release.
Source code(tar.gz)
Source code(zip)
1.38.1(Apr 14, 2022)
Upgrade to SBE 1.25.3.

Upgrade to Agrona 1.15.1.

Java binaries can be found here.
Source code(tar.gz)
Source code(zip)
1.38.0(Apr 14, 2022)
[Java/C/C++] Ensure driver is in ready state when requesting termination from client.

[Java] Reduce allocation when listing archive directories to find segment files.

[Java] Add flag to ClusterTerminationException to indicate if the termination was expected.

[Java] Expand agent logging for consensus module operations, be careful if using all for cluster events as volume may now be greatly expanded.

[C] Use connect and send to improve latency in C driver when sending data at lower volumes.

[Java] Improve reliability of transferring snapshots to ClusterBackup via archive replication with improved re-try semantics.

[Java] Support adding an IPC ingress destination to cluster leader for ingress optimisation.

[Java] Create replay publication asynchronously to reduce latency pauses in Archive.

[Java/C++] Add new RecordingSignal.REPLICATE_END recording signal to indicate end of a replication operation.

[Java/C++] Make delivery of RecordingSignals to archive client sessions reliable and ordered.

[Java] Support specifying interface with endpoints in cluster config for multi-home members. PR #1290.

[C] Add thread affinity support to C media driver. PR #1298.

[C/C++] Update CMake build to use FetchContent instead of ExternalProject.

[C/C++] Fix build on ARM with clang. PR #1291.

[Java] Improve progress tracking and retry semantics for cluster members catching up in elections.

[C/C++] Enable support for parallel build on Windows.

[Java] Add ability to async remove/close a publication by registration id.

[Java] Fix publication leak in ClusterBackup when backup response times out.

[C] Improve agent logging in C media driver to be more consistent with Java driver.

[C] Allow for configurable IO vector for sendmmsg and recvmmsg in the C media driver. PR #1285.

[C] Support static linking of the C media driver. PR #1261.

[Java/C] Support ability to extend concurrent publications by setting initial values to be equivalent to exclusive publications.

[Java] Fixed bug in PriorityHeapTimerService.cancelTimerByCorrelationId. PR #1281.

[C++] Improve error reporting in Archive client when a response is not received.

[Java/C++] Additional user specified delegating Invoker for Archive client to be used for progressing actions when awaiting responses.

[Java] Rename Archive segment files before delete to avoid races with streams being extended.

[C++] Fixes for ChannelUriStringBuilder. PR #1268.

[Java] Add admin command so that cluster snapshot can be triggered remotely via an authorised session.

[Java] Support authorisation of service actions with a new API AuthorisationService. The hooks for this have been added to Archive requests and Cluster Snapshot requests.

[Java/C] Support adding spy and IPC destinations to MDS subscriptions so destinations can be all channel types.

[Java] Ensure Cluster will start on a consistent initial term id when racing to create first term.

[Java] Prevent unnecessary creation of RecordingLog files when using ClusterTool.

[Java] Add cluster session timeout to the set of timeouts adjusted when debugging.

[C] Fixes to prevent message duplication and unnecessary sending of messages in MDS.

Minimum CMake version was raised to 3.14.

Upgrade to HdrHistogram_c 1.11.4.

Upgrade to BND 6.2.0.

Upgrade to Versions 0.42.0.

Upgrade to Mockito 4.4.0.

Upgrade to ByteBuddy 1.12.9.

Upgrade to Shadow 7.1.2.

Upgrade to Gradle 7.4.2.

Upgrade to JUnit 5.8.2.

Upgrade to Checkstyle 9.3.

Upgrade to SBE 1.25.2.

Upgrade to Agrona 1.15.0.

Java binaries can be found here.
Source code(tar.gz)
Source code(zip)
1.37.0(Nov 26, 2021)
[Java] Improve error messages on channel conflicts.

[C] Remove replicated command prefix in debug agent logging.

[Java] Use async publication add for async connect to an Archive to minimise the impact of name resolution pauses.

[Java] Make ClusterConfig.calculatePort public.

[C] Correct channel length on metadata for stream counters.

[Java] Extract channel value from counter label when longer than what will fit in metadata for StreamStat.

[Java] Relocate HdrHistogram and ByteBuddy in aeron-all JAR.

Upgrade to BND 6.1.0.

Upgrade to ByteBuddy 1.12.2.

Upgrade to Mockito 4.1.0.

Upgrade to SBE 1.25.1.

Upgrade to Agrona 1.14.0.

Java binaries can be found here.
Source code(tar.gz)
Source code(zip)
1.36.0(Nov 19, 2021)
[C/C++] Handle SIGINT in code samples.

[Java] Retry adding cluster member publication in election canvass to address late name registration in containers such as Kubernetes.

[Java] Log resolution failures in Cluster as warning event rather than exception.

[Java] Fix timestamp when publishing new leadership terms. PR #1254.

[C] Use separate transport bindings for the conductor doing name resolution. PR #1253.

[Java/C++] Allow the setting of a RecordingSignalConsumer in the archive client context which is delegated to when processing control channel responses.

[C] Improve error handling and logging on Windows when dealing with network system calls.

[Java] Verify cluster log is always contiguous when joining a new image in a service.

[Java] Fix race condition when sending RecordingSignal.SYNC during archive replication. PR #1252.

[Java/C] Improve choice of subscription for choosing channel URI when labelling receiver counters.

[Java] Sort counters displayed with StreamStat so they are logically grouped.

[Java] Improve error messages so they are more contextual.

[Java] Extend debugging logging for archive and cluster operations.

[Java] Check for errors when cluster snapshots are replayed.

[Java] Improve tracking of cluster commit position when replicating during an election.

[Java] Allow replication to skip over empty leadership terms due to failed elections when initially starting cluster.

[C] Better handling of finding user for default aeron.dir when USER is not set in environment.

[Java/C++] Reduce cache invalidations when using pollers for archive and cluster response streams.

[Java] Add support for changing cluster log params by truncating to the latest snapshot and resetting configuration. PR #1233.

[Java] Don't catch subclasses of Throwable and instead catch Exception so that the JVM can handle subclasses of Error.

[Java/C] Improve validation of ports used in channel URIs.

[C] Support building on Apple ARM.

[Java] Add priority heap backing implementation for cluster timers as an alternative to the default timer wheel implementation

Upgrade to Mockito 4.0.0.

Upgrade to Shadow 7.1.0.

Upgrade to BND 6.0.0.

Upgrade to Gradle 7.2.

Upgrade to ByteBuddy 1.12.1.

Upgrade to Checkstyle 9.1.

Upgrade to SBE 1.25.0.

Upgrade to Agrona 1.13.0.

Java binaries can be found here.
Source code(tar.gz)
Source code(zip)
1.35.1(Sep 6, 2021)
[Java] Fix selection of channel based on add publication registration id rather than original registration id. Issue #1218.

Java binaries can be found here.
Source code(tar.gz)
Source code(zip)
1.35.0(Aug 9, 2021)
[Java] Fix truncation of linger timeout in ChannelUriStringBuilder which led to a short linger of Archive replays.

[Java] Remove incorrect publication linger validation.

[C] Add sanitize build for MSVC and fix issues found.

[C] Add missing free of counters associated with Cubic congestion control.

[C++] Fix missing use of FragmentAssembler in Archive response and clean up type warnings.

[Java] Fix packaging declaration in POM file.

[Java] Separate thread factories for replay and recording agents in Archive for when setting thread affinity is required.

[Java] Javadoc improvements.

[C] Agent logging fixes. PR #1198.

[Java/C] Support a list of bootstrap neighbours for fault tolerance in gossip protocol for driver naming.

[C] Handle connection reset without error when polling a socket on Windows.

[C++] Don't progress with archive connect until response subscription is available. PR #1196.

[Java] Use async publication adding for response channels from the Archive and response channels for egress and backup queries from the Cluster to reduce latency pauses for existing operations.

[Java] Ability to add publications asynchronously to Aeron client.

[C/Java] Support timestamping of packets for channel send and receive plus media/hardware receive timestamping if supported. PR #1195.

[Java] Ensure termination hook is run on unexpected interrupt during cluster election.

[Java] Reset cluster election state if in election and an exception happens outside the election work cycle.

[Java] Finish deleting pending archive recording for deletion on shutdown.

[Java] Ensure cluster log recording has stopped before restarting the election process to avoid spurious election failure from past recording stopping.

Upgrade to Google Test 1.11.0.

Upgrade to Mockito 3.11.2.

Upgrade to ByteBuddy 1.11.9.

Upgrade to Gradle 7.1.1.

Upgrade to SBE 1.24.0.

Upgrade to Agrona 1.12.0.

Java binaries can be found here.
Source code(tar.gz)
Source code(zip)
1.34.0(Jun 16, 2021)
[Java]: added nanoClock to AeronArchive.Context to control time more directly. PR #1188.

[Java]: added ClusterBackup.Context.toString.

Various changes for Cubic congestion control, Status Message generation, and overrun determination to handle high loss scenarios with congestion control better.

[Java]: use separate archive contexts for local and remote archive clients in cluster backup. Local archive must be configured to use IPC.

[Java]: relocated ByteBuddy in agent jar.

[Java]: support constructing a ChannelUriStringBuilder from an existing URI. PR #1186.

Several improvements to handling initial name resolution failures for cluster and cluster clients when using name resolution from containers.

[Java]: improve tag usage for IndexedReplicatedRecording example.

[Java]: more information included in extendRecording failures.

Added name resolution logging to agents.

Append cycle time threshold to counter label.

[Java]: support connecting to a cluster when a minority of the members are not active in a name service.

[C]: retain entropy in large collections for hashing and include full range of possible masks for UINT32.

[Java]: timeout Archive replication if recording subscription endpoint fails to resolve.

[Java]: added AeronEvent exception type that does not generate a stack trace.

MDC manual destinations can now be initially unresolved.

[Java]: Fix NPE on cluster client after multiple redirects. PR #1179.

[C]: improve common hash functions. PR #1178.

Various fixes for re-resolution of endpoints and adding more tests to re-resolution scenarios.

[C]: add interface URI param to MDC publication channels.

[Java]: MDS will now use the base subscription URI for congestion control, receive window, and socket buffer URI params.

Upgrade to SBE 1.23.0

Upgrade to Agrona 1.11.0

Upgrade to Versions 0.39.0

Upgrade to JUnit 5.7.2

Upgrade to Gradle 7.0.2

Upgrade to Shadow 7.0.0

Upgrade to Mockito 3.10.0, then to 3.11.1

Upgrade to ByteBuddy 1.11.0, then to 1.11.2

Java binaries can be found here.
Source code(tar.gz)
Source code(zip)
1.33.1(May 14, 2021)
[C] Fix clean up in CSV name resolver on error.

Improve error messages for channel URI configuration and clash errors.

[Java/C++] Add missing arguments for full replicate and tagged replication API to Archive.

[Java] Avoid channel leak on error configuring send and receive channel.

[Java] Avoid double suffix of exception category to message for RegistrationException.

[Java] Allow setting of socket and receiver buffer lengths in ChannelUriStringBuilder from ChannelUri with short form human friendly names.

Java binaries can be found here...
Source code(tar.gz)
Source code(zip)
1.33.0(May 10, 2021)
The focus for this release has been a significant rework of cluster to make consensus more robust especially in recovery scenarios. We consider this the penultimate release to cluster being GA. As of this release we plan to stabilise the API and only make breaking changes if a significant issue is raised by a customer on commercial support.

Expand the range of channel URI params supported by the archive on a per stream basis.

Add support for dynamically switching debug logging on and off. PR #1155.

Add debug logging support for flow control.

[C] Fix memory leak and reassembly of fragmented message greater than 8K in client.

Fix short send of recording start event when tracking recording progress. PR #1155.

Improve clean up of subscriptions and control sessions in the Archive when failures occur.

[Java] Fix bug with flow control gtag being carried over erroneously which can cause issues with ReplayMerge and other features dependent on group flow control semantics.

Reduce the number of memory fences used with min and tagged flow control.

Set initial window segments to 10 for Cubic congestion control and fix issue measuring RTT in the presence of loss.

Add the ability to configure archive replication channel on a per operation basis. This enables the setting of congestion control and socket buffer lengths which are important for cluster backup.

Use Archive replication for cluster replication, dynamic join, and cluster backup. This requires the cluster and archive config to be correct as configuration errors will not be evident until used - be careful of using localhost for endpoints.

Check tag for match when reusing send channel endpoint. PR #1147.

Add the ability to configure socket buffer and receive window on a per channel basis. PR #1143.

Rework Cluster backup and dynamic join to use archive replication.

Add support for using a 0 port for cluster catchup endpoints.

[Java] Better clean up of allocated resources in the driver when failures occur so it can continue without leaks.

[Java] Reduce linger on explicitly closed resources in the client.

[C/C++] Improve the performance of pre-faulting memory mapped file on Linux and Windows. PR #1127.

[C/C++] Clean up warnings in Windows build.

Improve Javadoc.

Provide sender and receiver with their own cached clocks to be more responsive and isolated from conductor stalls.

Add new counters to detect work cycle stalls which track max work cycle latency and count of threshold exceeded observations.

Continue to send status messages and heartbeats when running in DEDICATED or SHARED_NETWORK thread modes to keep connections alive if the driver stalls due to DNS lookups or file IO.

Reduce the number of commands from client from 10 to 2 per work cycle to help prevent timeouts and reduce latency pauses.

Improve validation of adding destinations to publications.

Better handling of race conditions when clients and driver are started/restarted at the same time.

Extend debug logging events.

Improve diagnostics collection on failed cluster tests.

Add disable event codes for debug logging so all can be enabled and merged with a disabled set.

Add error stacks to C driver to aid debugging of issues.

Add storage space warnings and specific exception codes on errors returned to archive client. Archive has new config for low storage thresholds.

Detect archive failures in cluster so appropriate action can be taken.

Propagate recording errors from the archive back to the archive client that initiated the failed operation.

Add specialised ClusterTerminationException for expected cluster termination.

Reduce network syscalls with Java 11+ for higher numbers of active streams.

Respond to cluster client with session open event only after the open session is successfully appended to the cluster log.

Upgrade to Versions 0.38.0.

Upgrade to BND 5.3.0.

Upgrade to Mockito 3.9.0.

Upgrade to ByteBuddy 1.10.22.

Upgrade to JUnit 5.7.1.

Upgrade to Gradle 6.8.3.

Upgrade to SBE 1.22.0.

Upgrade to Agrona 1.10.0.

Java binaries can be found here...
Source code(tar.gz)
Source code(zip)
1.31.2(Feb 14, 2021)
Respond to cluster client with session open event only after the open session is successfully appended to the cluster log.

Java binaries can be found here...
Source code(tar.gz)
Source code(zip)
1.32.0(Jan 26, 2021)
[C/Java] Fix unexpected image unavailable when a rush of connections comes in for MDC or multicast publication. Issue #1115.

Increase default flow control receiver timeout from 2 to 5 seconds.

[Java] Cluster performance improvements.

[Java] Improve liveness tracking for followers catching up with a cluster leader when service logic is running slow.

[Java] Configuration option for cluster log consumption fragment limit.

[Java] Improve coordination of Cluster services during an election for log catchup and state changes.

[Java] Rework Cluster elections to better handle edge conditions in resource limited environments.

[Java] Add multicast support for cluster log channel.

[C++] Add missing methods to ExclusivePublication so it is compatible with Publication.

[C] Support compatible command line options for the C driver when running on Windows.

[C] Fix the deletion of directories on driver shutdown when running on Windows.

[C] Fix the transposed observation times in the loss report.

[C/C++] Migrate the C++ client tools to wrap the C tools for AeronStat, DriverTool, LossStat, and ErrorStat.

[C] Reduce memory footprint and copying in client when sending driver commands.

[Java] Delete Archive segments asynchronously when purge, truncate, or delete operations are carried out so that deleting a large number of segments does not block the Archive conductor and the Archive stays responsive. A new RecordingSignal has been added for tracking the completion of the delete.

[C/Java] Run Cluster system tests against both the Java and C Media Drivers.

[C/Java] Complete logging and alignment of feature set with the same configuration that can be applied to the Java or C media drivers. PR #1091.

[C] Support URIs larger than the label length on publications and subscriptions in the C media driver to be compatible with the Java media driver.

[Java] Add Java 16-ea to the test matrix.

[Java] Improve tracking of connection activity to more accurately detect the need for address re-resolution.

[C++] Improve samples for better usage illustration and error reporting.

[C] Complete the feature set for the C client so the C++ wrapper client is a pure wrapper, e.g. provide access to a late bound port for a Subscription.

[Java] Allow the setting of a different error handler when polling a Subscription, e.g. use a RethrowingErrorHandler to propagate the error out to the caller and stop progress.

[C] Fix throughput issue with C Media Driver debugging logging.

[Java] Support variable length entries in Archive Catalog and allow for complete purging of old entries. Requires migration. PR #1069.

[Java] Reduce memory footprint and copying in client for sending driver commands.

[Java] Improved Javadoc.

Upgrade to Checkstyle 8.38.

Upgrade to ByteBuddy 1.10.19.

Upgrade to Mockito 3.7.7.

Upgrade to Versions 0.36.0.

Upgrade to Gradle 6.7.1.

Upgrade to SBE 1.21.0.

Upgrade to Agrona 1.9.0.

Java binaries can be found here...
1.31.1(Nov 2, 2020)
Fix bug in C++ client managing images under a subscription due to a bug with GCC 7.3.1 failing to emit an acquire fence.

Fix bug with cleaning up log buffers which could result in segfault in native driver.

Fix bug in C++ client with putValueVolatile.

Add AeronException.Category name to the beginning of error message to indicate the severity in the DistinctErrorLog.

Improved Javadoc.

Schedule Status Messages with more relaxed memory ordering for a ~3% throughput improvement in the Java driver.

Memory order fix for scheduling NAKs and Status Messages in native C driver.

Enable higher-resolution timers on Windows for native driver so sleep periods can be less than 16ms.

Upgrade to Mockito 3.5.15.

Java binaries can be found here...
1.31.0(Oct 14, 2020)
Handle failed log buffer delete in C media driver on Windows. This can happen when a client holds a mapped file open and the driver tries to delete it. PR #1073.

Increase default client liveness timeout from 5s to 10s and publication unblock timeout from 10s to 15s to be softer on clients that experience bad GC pauses or run in resource starved environments.

Add C++ ChannelUriStringBuilder#initialPosition method to set the initial position of a publication.

Add ownerId to publication limit counters for being able to track which client created a publication.

Improve javadoc and reduce the scope of some methods that should not have been public.

Fix C++ AtomicCounter::getAndSet.

Fix timer cancellation when scheduling in cluster. Issue #1071.

ReplayMerge now substitutes the endpoint from the replayDestination into the replayChannel to simplify configuration.

Support using a port of 0 on the replay destination for ReplayMerge so that it is assigned by the OS from the ephemeral range.

Support using a port of 0 on the replication channel between archives so that it is assigned by the OS from the ephemeral range.

Fix the ability to add and remove a destination with port 0 to an MDS Subscription.

New subscriptions now late join a stream at the min of existing subscriptions rather than max.

Fix implementation of ExclusivePublication::tryClaim in C++ wrapper client.

Add Cubic congestion control support to the C media driver. PR #1065.

Default to building the C++ archive client as part of the native build.

Improve the native Windows build for CLion.

Remove the need for having 7-Zip installed for native build on Windows.

Improve error handling for archive errors in the consensus module so warnings can be issued and retried.

Set media driver heartbeat to -1 on clean shutdown so it can be immediately restarted without waiting for driver timeout.

Add Clang 11 to build mix.

Add Java 15 to build mix.

Change stop replay failures in the cluster from errors to warnings.

Improve ExtendRecordingTest to be a better example.

Fix cluster tutorial scripts.

Improve samples code.

Upgrade to Checkstyle 8.36.2.

Upgrade to Shadow 6.1.0.

Upgrade to ByteBuddy 1.10.17.

Upgrade to HdrHistogram_c 0.11.2.

Upgrade to SBE 1.20.3.

Upgrade to Agrona 1.8.0.

Java binaries can be found here...
1.30.0(Sep 20, 2020)
Add hooks so ATS (Aeron Transport Security) can be loaded as a premium feature on the C media driver. Issue #203.

Numerous improvements for the native driver on Windows.

Further refinement and additions to the C client which is currently at preview status.

Remove a number of data dependent loads caused by indirection to reduce latency outliers.

Improve logic for expansion of BufferBuilder for fragmented messages to be correct at extremes and to be more efficient.

Set ANY ADDR to correct protocol family for endpoint based on control when IPv6.

Scope Cluster Backup counters by cluster id.

Improve Archive client connect error messages.

Add deadline checking to C++ Archive client connect.

Improve the efficiency of counter searching.

Add extra validation for the relationships between timeouts.

Change tracking of untethered subscriptions so the bottom 1/4 rather than 1/8 of the window is used to make for easier eviction.

Add registration and owner id to counters to help avoid ABA issues and to aid monitoring.

Avoid updating the commit position counter when the consensus module is closed.

Improve active transport tracking to be more timely and accurate.

Make use of cached clocks when referencing counters to reduce system call overhead.

Improve ReplayMerge tests to show a better example of usage.

Add driver and hostname to re-resolution counter for Java and C media drivers.

Fix memory corruption with driver naming resolution events in C media driver.

Fix dynamic agent dissector logging for C media driver.

Improve liveness tracking for channels to reduce overhead and false sharing in the Java Driver.

Add a ChannelUriStringBuilder.toString() method.

Provide a registration id on the add and remove handler methods in clients so they can be removed by the registration id and not rely on the pointer or reference to the callback.

Allow the setting of port 0 on Archive and Cluster control response channels for clients so they are automatically allocated from the ephemeral range.

Improve native code use of atomics across all platforms and especially on Windows.

Improve error messages in the native driver to help indicate which is the offending command and URI.

Auto resize the Archive Catalog when full so the Archive does not need to be shut down and manually extended.

Improve startup code for all clients finding a running media driver which is racing to start at the same time.

Support the C++ Archive client on Windows.

Set CMake 3.6.1 as the min required version.

Upgrade to JUnit 5.7.0.

Upgrade to HdrHistogram_c 0.11.1.

Upgrade to Versions 0.33.0.

Upgrade to Checkstyle 8.36.

Upgrade to Gradle 6.6.1.

Upgrade to Mockito 3.5.10.

Upgrade to ByteBuddy 1.10.14.

Upgrade to BND 5.1.2.

Upgrade to SBE 1.20.2.

Upgrade to Agrona 1.7.2.

Java binaries can be found here...
1.29.0(Jul 21, 2020)
Further refinement and additions to the C client which is currently at experimental status.

Improve error messages when parsing URI params.

Fix application of sparse terms in Java Media driver when not used on a per channel basis.

Add support for session based subscriptions on IPC and spies to the C media driver.

Use ssc (Spies Simulate Connection) only in cluster when membership size is 1. This avoids the leader racing ahead of followers which are catching up and a number of cases where the start of a recording can be missed.

Add the ability to have spies simulate connection (ssc) configured on a per stream basis for both Java and C media drivers.

Fix some false sharing issues introduced for channel re-resolution checking to give a tighter latency distribution.

Add state checks to Cluster operations so services do not use features at inappropriate times.

Rework build script to help IDEA recognise generated classes and not give false compilation errors.

Significantly improve throughput of C media driver when used with the Solarflare ef_vi premium extension to provide the best latency and throughput possible.

Fix short send counting in C media driver.

Change Archive session workers to behave more like normal Agents so that stack traces are more informative when debugging.

Improve error handling and cluster elections when dynamic membership is being used and increase test coverage.

Improve session checks when re-adding a publication with the same session id.

Refinements to Cluster Backup.

Change defaults for throughput tests to use 8k rather than 16k MTUs to better fit with jumbograms.

Close Archive recording subscriptions with autoStop = true that have an error on first image.

Detect Archive errors in Cluster so waiting operations can abort and be retried.

Fix aeron_ftruncate on Windows for native driver so it behaves more like Linux. This addresses races with client and driver starting at the same time which can result in a corrupt CnC file.

Avoid int overflow with Cluster snapshots greater than 2GB in length. PR #959.
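
The class of bug behind the snapshot length fix can be shown with a minimal sketch. This is not the Cluster code from PR #959; the class, method names, and fragment sizes are hypothetical and only illustrate how a length computed in 32-bit arithmetic wraps once it passes 2GB, and how widening to `long` first avoids it.

```java
// Illustrative only: hypothetical names, not the actual Aeron Cluster code.
final class SnapshotLengthExample
{
    // Buggy: the multiplication happens in 32-bit int arithmetic and wraps before widening.
    static long lengthWithIntMath(final int fragmentCount, final int fragmentLength)
    {
        return fragmentCount * fragmentLength;
    }

    // Safe: widen one operand to long first so the multiplication is 64-bit.
    static long lengthWithLongMath(final int fragmentCount, final int fragmentLength)
    {
        return (long)fragmentCount * fragmentLength;
    }

    public static void main(final String[] args)
    {
        final int fragmentCount = 3 * 1024 * 1024; // assumed number of fragments
        final int fragmentLength = 1024;           // assumed bytes per fragment, 3 GiB in total

        System.out.println("int math:  " + lengthWithIntMath(fragmentCount, fragmentLength));
        System.out.println("long math: " + lengthWithLongMath(fragmentCount, fragmentLength));
    }
}
```

With these assumed sizes the int version yields a negative length while the long version yields 3221225472.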

Fix C++ client compile for CentOS 7 with GCC 4.8.5.

Add flow control (fc) and group tag (gtag) URI params to Archive stripped channels.

Configurable buffer length for Archive record and replay file operations to control batch size via aeron.archive.file.io.max.length. New default shows a marked increase in throughput and reduced latency in all our tests.

Capture logs from failed Cluster tests to aid debugging.

Agent logging for untethered subscription state changes in Java and C media driver.

Expanded agent logging for archive activities to aid debugging.

Fix segfault in C media driver if transport cannot bind.

Add Java 14 to CI.

Add native sanitize builds to CI.

Upgrade to Versions 0.29.0.

Upgrade to Checkstyle 8.34.

Upgrade to Mockito 3.4.4.

Upgrade to BND 5.1.1.

Upgrade to ByteBuddy 1.10.13.

Upgrade to HdrHistogram 0.11.0 for C.

Upgrade to Gradle 6.5.1.

Upgrade to SBE 1.19.0.

Upgrade to Agrona 1.6.0.

Java binaries can be found here...
1.28.2(May 28, 2020)
Fix issue with replaying cluster log when a snapshot is invalidated after a clean termination.

Correct arguments to onReplayNewLeadershipTerm which got transposed in 1.27.0 release.

Validate lower bound of MTU in config so payload must have some contents.

Java binaries can be found here...
1.28.1(May 27, 2020)
Fix race condition when calling size on C queues.

Remove clashing non const ExclusivePublication::channelStatus() method. Issue #946.

Upgrade to SBE 1.18.2.

Upgrade to Agrona 1.5.1.

Java binaries can be found here...
1.28.0(May 25, 2020)
An experimental C API client is now available. We are happy to take feedback but be aware the API is subject to change as it gets refined.

Cluster has changed status from experimental to being a preview feature. Many refinements and bug fixes have been made to cluster in the last few months as a result of significant destructive testing. The API is now stable as of this release and will only change before going GA if a significant issue is found. Support is commercially available.

Correct the Cubic congestion control implementation to align with spec.

Add support to the C media driver for session-specific and multi-destination subscriptions (MDS), plus complete the functionality so the C media driver can support Archive.

Support using 0 for port on endpoint or control so OS assigns the port without conflict and then make it available on Publication or Subscription via each getting a new localSocketAddresses() method. Local socket addresses also get their own counters.
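
The underlying OS behaviour that port 0 relies on can be seen with plain `java.net` sockets. The sketch below does not use the Aeron API (the `localSocketAddresses()` method mentioned above is not shown); the class and method names are hypothetical. It binds a UDP socket to port 0 on the loopback address and reads back the port the OS actually assigned.

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketException;

// Illustrative only: plain JDK sockets, not the Aeron client or driver.
final class EphemeralPortExample
{
    // Bind to port 0 so the OS picks a free port, then report which port was chosen.
    static int bindToEphemeralPort()
    {
        final InetSocketAddress anyPort = new InetSocketAddress(InetAddress.getLoopbackAddress(), 0);
        try (DatagramSocket socket = new DatagramSocket(anyPort))
        {
            return socket.getLocalPort();
        }
        catch (final SocketException ex)
        {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(final String[] args)
    {
        System.out.println("OS assigned port: " + bindToEphemeralPort());
    }
}
```

Because the port is only known after the bind, it has to be read back from the socket, which is why the resolved local address needs to be exposed to the application.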

Reduced CPU time spent scanning for loss in Java and C drivers so they can scale to a larger number of connections.

Apply consistent approach to merge window for ReplayMerge, Archive replication, and Cluster catchup.

Add the ability to stop a recording by recording identity when the recording id is known.

Use CRC if configured and any possible data to help recover last fragments in a recording that may straddle an OS page after an unclean Archive shutdown.

Support common short name alias for idle strategies in config for both Java and C media driver such as noop, spin, yield, and backoff.

Update false sharing protection to support Java 15 class layout and add it to ExclusivePublication.

Improve Java and C++ samples so they are up to date and give more consistent performance numbers.

Java client close operations for publications, subscriptions, and counters now happen asynchronously so the client does not wait for acknowledgement. This allows for more rapid close of resources.

Add notifications for client heartbeat counters becoming available and unavailable so Aeron clients can be tracked.

Allow for race in creating a new recording in catalog and first segment being written which can happen when a replay is set up right after a recording starts.

Upgrade to javadoc-links 5.1.0.

Upgrade to ByteBuddy 10.10.1.

Upgrade to JUnit 5.6.2.

Upgrade to Gradle 6.4.1.

Upgrade to SBE 1.18.1.

Upgrade to Agrona 1.5.0.

Java binaries can be found here...
1.27.0(Apr 1, 2020)
Drivers can be named and names are gossiped between drivers so that they can be used to simplify configuration for endpoints. Driver Name Resolution.

Fix header file dependencies for C++ archive client.

Spy subscriptions can now match on channel tag for publications.

Multicast flow control is selected when using manual or dynamic MDC (Multi-Destination-Cast).

Add tryStopRecording methods to the archive clients so they can be called without raising an exception if no recording is active.

Add a counter for the number of active control sessions on the archive.

Add autoStop overload when starting a recording in the archive so it is automatically cleaned up when the first matching recording stops.

Resend recording progress events after back pressure to detect tail progress.

Improve URI channel parsing validation. Issue #887.

Reduce allocation when churning publications.

Add CentOS 7 build to CI.

Upgrade to BND 5.0.1.

Upgrade to JUnit 5.6.1.

Upgrade to Gradle 6.3.

Upgrade to SBE 1.17.0.

Upgrade to Agrona 1.4.1.

Java binaries can be found here...
1.26.0(Mar 4, 2020)
Add correlation-id to ArchiveException and provide the ability to get the last used correlation-id in AeronArchive client.

Add re-resolution of endpoints to the Java driver when they time out and become unconnected, which can happen when machines migrate in a cloud environment.

Add TaggedMulticastFlowControl and ability to configure flow control via URI params for Java and C media drivers.

Deprecate PreferredMulticastFlowControl.

Fix mutexes for the C media driver on Windows. PR #867.

Fix handling of sockets in the C media driver on Windows. PR #866.

Fix thread handling for the C media driver on Windows. PR #864.

Fix mmap on Windows for the C media driver. PR #865.

SetWaitableTimer expects a duration in 100-nanosecond intervals on Windows in C media driver. PR #868.
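
The unit conversion involved is small but easy to get wrong. The sketch below shows the arithmetic in Java for consistency with the other examples; it is not the C driver code from PR #868, and the helper name is hypothetical. It relies on the documented Win32 convention that the due time is expressed in 100-nanosecond intervals and that a negative value means a time relative to now.

```java
// Illustrative only: shows the unit conversion, not the actual C media driver code.
final class WaitableTimerDueTime
{
    private static final long NANOS_PER_INTERVAL = 100;

    // Convert a sleep duration in nanoseconds to a relative due time:
    // a count of 100ns intervals, negated to mean "relative to now".
    static long toRelativeDueTime(final long durationNs)
    {
        return -(durationNs / NANOS_PER_INTERVAL);
    }

    public static void main(final String[] args)
    {
        final long oneMillisecondNs = 1_000_000L;
        System.out.println("due time for 1ms: " + toRelativeDueTime(oneMillisecondNs));
    }
}
```

Passing nanoseconds directly would request a sleep 100 times longer than intended.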

Fix NPE when -checksum flag is not used, and validate the Checksum classname if it is used with ArchiveTool.

Deal with asynchronous errors from the archive when replicating or Replay Merge.

Fixes for Windows C driver. PR #861.

Warnings clean up in native code.

Fix socket close on Windows for C driver. PR #857.

Fix getting a random value in C driver on Windows. PR #854.

Reduce allocation of direct buffers in the archive to minimum of what is required depending on configuration.

Improve archive behaviour from unexpected outcomes of file read operations.

Migrate to Gradle maven-publish plugin.

Improve closing of resources in aborted or interrupted operation for Java client and modules.

Fix unexpected unavailable image which could occur with mixed use of wildcard and session specific subscriptions on the same channel.

Fix deadlock which could occur in C++ client if destroyed too quickly after creation. Issue #844.

Improve performance of Archive replay. Gains are 25%-50% depending on message length and platform.

Add client shared library support to C++ client. PR #836.

Only use MDS for archive replicate when joining a live stream or using a tagged subscription. This allows for multiple concurrent replication streams of recordings which are not joining live or being tagged.

Make receiver id channel endpoint specific so multi-destination subscriptions get flow controlled independently as they use different sockets. This results in less loss when using Replay Merge.

Improve performance of logging agent to file by batching event writes.

Upgrade to Gradle 6.2.1.

Upgrade to Versions 0.28.0.

Upgrade to Mockito 3.3.0.

Upgrade to HdrHistogram_c 0.9.13.

Upgrade to BND 5.0.0.

Upgrade to SBE 1.16.3.

Upgrade to Agrona 1.4.0.

Java binaries can be found here...
1.25.1(Jan 21, 2020)
Log to ring buffer with zero copy semantics for improved logging performance. PR #831.

Retain file handle after establishing mapping in Windows C++ client. Issue #826.

Improve encoding performance of logging to file.

Log all events in a consistent manner with standard header.

Be consistent with the use of positional reads and writes in the archive for supported OS synchronisation and slightly improved performance.

Configure Java DistinctErrorLog to be US-ASCII rather than UTF-8 for compatibility with native driver.

Run slow tests daily in CI.

Add GNU_SOURCE to clock for native builds on CentOS.

Upgrade to Agrona 1.3.0.

Upgrade to SBE 1.16.1.

Upgrade to JUnit 5.6.0.

Java binaries can be found here...
1.25.0(Jan 12, 2020)
Where possible only weave in logging hooks when enabled in the Java driver. This can help performance for those who are only logging a few events.

Add ability to log the control channel responses from the Archive.

Fix issue with truncating recordings when truncate position equals stop position and start of segment to ensure file is deleted.

Fix issue with unaligned access to fields in LossReport.

Introduce interceptor bind framework to C driver for supporting loss testing, logging, and media layers other than BSD sockets.

Apply system tests to C driver when running in CI. While applying these, a number of bugs were fixed in the C media driver.

Move CI from Travis to GitHub Actions and test on Windows, Linux, and OSX.

Support for agent logging in the C driver to file to match Java with the aeron.event.log.filename.

Support for adding checksums to archive recordings as CRCs which can be verified to detect file corruption.

Add support for applying and verifying checksums to recordings via ArchiveTool.
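
The idea of storing a CRC with recorded data and verifying it later can be sketched with the JDK's `java.util.zip.CRC32`. This is only an illustration of the technique: how the Archive frames data, where it stores the checksum, and which Checksum implementation it uses are not shown here, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: JDK CRC32 over a byte range, not the Archive's checksum code.
final class ChecksumExample
{
    // Compute a CRC-32 over the given range of the buffer.
    static int checksum(final byte[] data, final int offset, final int length)
    {
        final CRC32 crc = new CRC32();
        crc.update(data, offset, length);
        return (int)crc.getValue();
    }

    // Recompute the CRC and compare it with the value stored when the data was written.
    static boolean verify(final byte[] data, final int offset, final int length, final int expected)
    {
        return checksum(data, offset, length) == expected;
    }

    public static void main(final String[] args)
    {
        final byte[] payload = "recorded fragment".getBytes(StandardCharsets.US_ASCII);
        final int stored = checksum(payload, 0, payload.length);

        System.out.println("intact:    " + verify(payload, 0, payload.length, stored));
        payload[3] ^= 0x01; // simulate a single flipped bit on disk
        System.out.println("corrupted: " + verify(payload, 0, payload.length, stored));
    }
}
```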

Add support for fixing recordings after a system crash running an Archive.

Improve crash recovery for the archive when restarting.

Add cached clocks to C media driver to reduce the overhead of clock calls and improve performance, especially in cloud environments. Issue #606.

Fix thread local storage for Windows C media driver. PR #795.

Fixes for Windows C media driver. PR #794.

Improve EOS reporting in Image.toString() method. PR #792.

Fix recovery of stop position in crashed archive when start position was non-zero.

Provide API for features that existed in CatalogTool in new ArchiveTool.

Don't linger replay publications in ReplayMerge so resources can be reclaimed sooner.

Default warning of Aeron directory existing on media driver start to false.

Add poll support to C media driver on Windows. PR #784.

Name log buffers based on correlation id.

Provide timestamp with stacktraces in default client error logger. PR #774.

Reject concurrent publications that specify init-term-id, term-id, and term-offset. PR #773.

Add sample illustrating how to build an index and basic time series on a recording that is also replicated in IndexedReplicatedRecording.

Improve performance for getting Header.position() in Java fragment handler.

Add BasicAuthenticator to C++ archive client samples.

Fix issue with configuring threading mode in C media driver. Issue #785.

Improve validation when extending recordings in the archive.

Add taggedReplicate operation to the archive for replicating a stream with provided tags so an external subscription can follow along.

Don't update the recording position in the archive if an exception occurs during a write. Previous behaviour could have erroneously reported progress when disk was full or underlying storage failure.

Fix issue in C media driver when a subscription could have gone away yet the publication considered it was still connected.

Fix issue with incremental build dependencies. PR #762.

Fix recording events enabled property name.

Add authentication support to C++ archive client.

Upgrade to Agrona 1.2.0.

Upgrade to SBE 1.16.0.

Upgrade to JUnit 5.6.0-RC1.

Upgrade to Checkstyle 8.28.

Upgrade to HdrHistogram 2.1.12.

Upgrade to ByteBuddy 1.10.5.

Upgrade to Gradle 6.0.1.

Upgrade to javadoc-links 4.1.6.

Upgrade to Mockito 3.2.0.

Upgrade to gtest 1.10.0.

Upgrade to HdrHistogram_c 0.9.12.

Java binaries can be found here...
1.24.0(Nov 24, 2019)
Add bi-directional version identification to the archive network protocol.

Add support for authenticated sessions to the archive.

Support setting of session-id on publications in the C media driver. Issue #623.

Fix setting of initial position on an exclusive publication in the C driver when the initial position is beyond the first term. Issue #750.
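
For context on what "beyond the first term" means, a stream position can be derived from the initial term id, the current term id, and the offset within that term. The sketch below reflects the usual description of the log buffer layout (term length is a power of two) and is an assumption rather than the driver code touched by this fix; the class and method names are hypothetical.

```java
// Illustrative only: assumed position arithmetic, not the C media driver implementation.
final class PositionExample
{
    // position = (number of completed terms * term length) + offset within the current term
    static long computePosition(
        final int initialTermId, final int termId, final int termOffset, final int termLength)
    {
        if (Integer.bitCount(termLength) != 1)
        {
            throw new IllegalArgumentException("term length must be a power of two: " + termLength);
        }

        final int positionBitsToShift = Integer.numberOfTrailingZeros(termLength);
        final long termCount = termId - initialTermId; // int subtraction tolerates term id wrap

        return (termCount << positionBitsToShift) + termOffset;
    }

    public static void main(final String[] args)
    {
        final int termLength = 64 * 1024;
        // Two whole terms plus 128 bytes into the third, so beyond the first term.
        System.out.println(computePosition(10, 12, 128, termLength));
    }
}
```

An initial position inside the first term has a term count of zero, so only positions in later terms exercise the shifted part of the calculation.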

Allow for archive error log to be stored in archive mark file when running out of process from a media driver.

Trim down unneeded dependencies in agent and all shadow JARs.

Clean up allocated resources in C++ and Java clients when URI errors occur.

Add boundedPoll to Image for C++ and Java. Issue #744.

Only include what is used in C++ publication headers. Issue #743.

Provide unique type ids to error counters. Issue #741.

Add new archive control messages to agent logging and improve overall agent performance.

Fix pointcut for Archive control message logging. Issue #740.

Close files in Windows C++ client to prevent memory leak. Issue #737.

Improve the performance for MDC dynamic mode in the Java driver.

Set javadoc encoding to UTF-8.

Improve validation of channel URIs for endpoint, control, tags, and distinguishing characteristics in both C and Java drivers.

Fix calculation for archive truncate when offset is beyond first term in a segment.

Check for reentrant calls when in Archive callbacks and throw an exception if detected.

Change sample scripts to use the aeron-all JAR as a better example.

Upgrade to javadoc-links 4.1.4.

Upgrade to Build Scan 3.0.0.

Upgrade to Shadow 5.2.0.

Upgrade to ByteBuddy 1.10.2.

Upgrade to SBE 1.15.0.

Upgrade to Agrona 1.1.0.

Java binaries can be found here...
1.23.1(Nov 6, 2019)
Correct bug when setting MediaDriver.Context.rejoinStream which set reliableStream property by mistake and update configuration output dump.

Add bind address and port to channel endpoint counter label to help with debugging connections.

Fix narrowing type conversion in C++ client for subscription images. PR #726.

Add progress checks to ReplayMerge and a new terminal state of FAILED which is entered on exception or lost connection to the archive.

Track close following connections with MDS without timing them out which can help with ReplayMerge.

Support manual control on MDC not requiring the control address:port to be specified so it can be automatically assigned.

Add ability to disable the recording events publication in the archive to save resources when it is not required.

Add protocol version of the server to the connect response for archive clients.

Upgrade to SBE 1.14.1.

Upgrade to Agrona 1.0.11.

Java binaries can be found here...
1.23.0(Oct 27, 2019)
Support the separate configuration of idle strategies for the replay and recording agent in the archive when running dedicated threading mode.

Improve ownership tracking for subscriptions and images in C++ client.

Improve matching of tagged channels.

Increase archive storage version to 2.0.0 which requires the use of migration tool for existing archives.

Add operations to purge and restore the history of a recording in the archive.

Add the ability to query start position for a recording.

Add Image specific fragment assemblers for C++ client.

Reduce cacheline padding to save on memory footprint.

Fix double delete in Aeron destructor. Issue #717.

C++ client refinements. PR #716.

Upgrade to javadoc-links 4.1.3.

Upgrade to Gradle 5.6.3.

Upgrade to Checkstyle 8.25.

Upgrade to SBE 1.14.0.

Upgrade to Agrona 1.0.9.

Java binaries can be found here...
1.22.1(Oct 11, 2019)
Fix command message validation which failed to take account of message offset. Issue #690.

Address some false sharing issues in the Java and C++ clients which can add 50ns of latency to RTT.

Provide original channel URI in error message when parsing fails to port for an endpoint address. PR #714.

Rewrite messages from older clients to the archive to allow for gradual upgrade of clients to the new archive. This support will last for only one minor version.

Separate versioning schema for network protocol from file formats for the archive to allow them to evolve independently.

Only check concurrent recording limits upfront in the archive to avoid later asynchronous errors.

Reclaim mapped memory for IPC publications as soon as ref count is 0 and drained by subscriptions without going into 10 second linger.

Java binaries can be found here...
1.22.0(Oct 9, 2019)
This release increases the major version on the archive wire protocol and file format. To upgrade it is necessary to update all archive clients and the archive at the same time. An archive migration is also required by running the CatalogTool with the migrate option. Be sure to back up the archive before doing a migrate.

Add recording signal reporting on the control stream for an archive. The RecordingSignalAdapter can be used to track signals of operations happening to recordings such as START, STOP, EXTEND, REPLICATE, MERGE, etc.

Improved Javadoc for archive configuration.

Improved checking for clashing session-ids for manually configured publications.

Reduce heartbeat updates to mark files to once per second to reduce IO traffic.

Reclaim mapped memory for images by not lingering when the last subscription is closed. This can reclaim the mapped memory 10 seconds sooner by default.

Fix ref counting to send channel endpoints which could cause a stream to get stopped early when multiple publications use the same channel.

Add Archive replication feature which replicates a recording from one archive to another with the option of merging with a live multicast stream and continuing to support multiple redundant recordings.

Reduce Java memory footprint of Archive client.

Reduce default max concurrent recordings and replay in the archive from 50 to 20.

Improve consistency of error codes and command validation to both Java and C Media Drivers.

Add Image.activeTransportCount() to track active transports when using MDS which can be used to make ReplayMerge more reliable.

Add correlation id to RegistrationException to help with debugging.

Allocate non-sparse files in Java media driver at safepoint to help avoid Time-To-SafePoint (TTSP) issues.

Add the ability to configure congestion control as a channel URI param with the cc=static or cc=cubic options.
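
As a minimal sketch of what such a channel looks like, the helper below appends the `cc` param to a channel URI string. It only manipulates strings and does not use the Aeron client; the class name, method name, and the example endpoint are hypothetical, and it assumes the usual channel syntax where params follow `?` and are separated by `|`.

```java
// Illustrative only: plain string handling, not the Aeron ChannelUri classes.
final class CongestionControlUri
{
    // Append cc=static or cc=cubic to a channel URI, choosing the right separator.
    static String withCongestionControl(final String channel, final String cc)
    {
        if (!"static".equals(cc) && !"cubic".equals(cc))
        {
            throw new IllegalArgumentException("cc must be static or cubic: " + cc);
        }

        final char separator = channel.indexOf('?') < 0 ? '?' : '|';

        return channel + separator + "cc=" + cc;
    }

    public static void main(final String[] args)
    {
        // Assumed example endpoint.
        System.out.println(withCongestionControl("aeron:udp?endpoint=localhost:40123", "cubic"));
    }
}
```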

Handle channel endpoint errors in the C++ client.

Add support to the Java client for adding and removing destinations to publications and subscriptions asynchronously.

Catch errors when opening receive destinations and report them to the client.

Clean up bound ports on Windows when destinations are removed from MDS Subscriptions.

Improve error messages on channel conflicts.

Add rejoin URI param to channels to configure whether a stream should be rejoined after an image gets timed out.

Don't try to send archive client close messages when publication is not connected to avoid exceptions.

Improve reliability of counter active and reuse checks.

Clean up pending setup messages for a channel when endpoints are closed.

Use heartbeat timestamp counters to indicate client liveness rather than command messages. This gives more stable behaviour on configurations with multiple clients sending many commands.
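
The general pattern of heartbeat-based liveness can be sketched as follows. This is not the driver's counter implementation; the class and method names are hypothetical and an `AtomicLong` stands in for the shared counter. One side periodically stores the current time and the other side considers the peer alive while the stored timestamp is within the timeout.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: an AtomicLong stands in for a shared heartbeat timestamp counter.
final class HeartbeatLiveness
{
    private final AtomicLong heartbeatTimestampMs = new AtomicLong();
    private final long timeoutMs;

    HeartbeatLiveness(final long timeoutMs)
    {
        this.timeoutMs = timeoutMs;
    }

    // Called by the client on a regular interval, independent of any commands it sends.
    void onHeartbeat(final long nowMs)
    {
        heartbeatTimestampMs.set(nowMs);
    }

    // Called by the observer: alive if the last heartbeat is no older than the timeout.
    boolean isAlive(final long nowMs)
    {
        return nowMs - heartbeatTimestampMs.get() <= timeoutMs;
    }

    public static void main(final String[] args)
    {
        final HeartbeatLiveness liveness = new HeartbeatLiveness(10_000);
        liveness.onHeartbeat(1_000);

        System.out.println("at 5s:  " + liveness.isAlive(5_000));
        System.out.println("at 12s: " + liveness.isAlive(12_000));
    }
}
```

Because the timestamp is updated on its own schedule, liveness no longer depends on how many commands a client happens to send.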

Reworking of C Media Driver internals to more easily accommodate other media APIs such as ef_vi and DPDK.

Add option to delete the aeron.dir on shutdown of the media drivers.

Make MediaDriver.close() idempotent.

Abort further reading of archive control stream once listed descriptors have been read so further messages are not missed.

Improve reliability and precision of ReplayMerge.

Update session-id in catalog entries when an archive recording is extended.

Add 'group' URI param to indicate if receiver group semantics, e.g. multicast NAK semantics, can be applied to Multi-Destination-Cast.

More efficient and less allocating IP address dissection in logging agent.

Change Java RecordingReader and CatalogTool so they can read active recordings.

Improve handling of thread interrupt in Java client and archive client.

Add INVOKER option and config check to C media driver.

Add Java client Aeron.Context.awaitingIdleStrategy() configuration option for what to use when making a synchronous call to the driver.

Add log started event with timestamp when logging is enabled.

Add cncVersion to configuration print on driver start.

Fix potential out of bounds access for bytes received update in C media driver.

Upgrade to Checkstyle 8.24.

Upgrade to Mockito 3.1.0.

Upgrade to javadoc-links 4.1.2.

Upgrade to Gradle 5.6.2.

Upgrade to build-scan 2.4.2.

Upgrade to SBE 1.13.3.

Upgrade to Agrona 1.0.8.

Java binaries can be found here...