Last week Amazon EventBridge announced support for event Archive and Replay. Its primary goal is to help users with disaster recovery and to guarantee that producers and consumers can reconcile their events and, through that, their state.
EventBridge itself should, by design, not drop events. If delivery fails, it is retried by default for up to 24 hours, and only after that is the event finally dropped. This behaviour is now also configurable through the RetryPolicy and can be adjusted per configured target.
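As a rough illustration of a per-target retry policy, the sketch below builds the parameters for the PutTargets API. The rule name, target id, ARN and the chosen limits are placeholders, not values from the example application.

```javascript
// Sketch only: builds PutTargets parameters with a custom RetryPolicy.
// All names, ARNs and limits passed in are placeholders.
function buildTargetWithRetryPolicy({ rule, targetId, targetArn, maxAttempts = 10, maxAgeSeconds = 3600 }) {
  return {
    Rule: rule,
    Targets: [
      {
        Id: targetId,
        Arn: targetArn,
        RetryPolicy: {
          MaximumRetryAttempts: maxAttempts,      // stop after this many retries...
          MaximumEventAgeInSeconds: maxAgeSeconds // ...or once the event is this old
        }
      }
    ]
  };
}

const params = buildTargetWithRetryPolicy({
  rule: 'my-rule',                                                  // placeholder
  targetId: 'my-consumer',                                          // placeholder
  targetArn: 'arn:aws:lambda:us-east-1:123456789012:function:demo'  // placeholder
});
console.log(JSON.stringify(params, null, 2));

// With the AWS SDK for JavaScript v2 (an assumption) this could be sent as:
//   await new AWS.EventBridge().putTargets(params).promise();
```

Keeping the parameter construction separate from the SDK call makes it easy to review the policy before applying it to a target.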
Even so, there are still cases in which an event that has been delivered by the service ends up dropped on the floor. Some typical cases:
- A bug in the destination Lambda function, which has no DLQ configured.
- Deleting or disabling the rule.
- Misconfigured Rule input transformer.
The above are only the scenarios specific to the use of AWS. More generally, there are use cases in which reconciling events is part of the business requirements, such as guaranteeing that the warehouse inventory is up to date or that all invoices up to a given point in time have been processed.
Arguably, one of the problems whenever an incident happens is detecting it and then recovering every event that hasn’t been processed. For detection, CloudWatch alarms on failed invocations, Lambda function errors or missing data points can typically help. For recovery, the archive created on the EventBridge event bus can be used to replay all of the events from the start of the incident and restore the consumer state.
This works well provided that the consumer can process events idempotently, because replaying events may also mean delivering duplicates of previously delivered events.
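To make the idempotency requirement concrete, here is a minimal sketch of a consumer that skips events it has already applied. The `detail.invoiceId` business key is a hypothetical event field, and the in-memory `Set` only stands in for the durable store a real consumer would need.

```javascript
// Sketch only: an in-memory idempotency guard. A real consumer would need a
// durable store shared by all invocations, and would have to handle two
// copies of the same event arriving concurrently.
function createIdempotentHandler(keyOf, handle) {
  const processed = new Set();
  return async function (event) {
    const key = keyOf(event);
    if (processed.has(key)) {
      return false; // duplicate, for example delivered again by a replay
    }
    await handle(event);
    processed.add(key); // mark as done only after successful processing
    return true;
  };
}

// Hypothetical event shape: `detail.invoiceId` is an assumed business key.
const applied = [];
const handler = createIdempotentHandler(
  (event) => event.detail.invoiceId,
  async (event) => { applied.push(event.detail.invoiceId); }
);
```

Calling `handler` twice with the same invoice applies it once; the second call returns `false` without touching the state.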
Interestingly enough, replaying events doesn’t have to be reserved for failures. The service itself can be designed so that event reconciliation happens on a regular basis, for instance every day, week or month, without any need for manual work. In that way it is possible to build a system that is able to reconcile itself. This can be done today with surprisingly little effort.
EventBridge allows you to configure scheduled events that run based on a cron expression. Such an event can then be wired to a Lambda function that is responsible for starting the replay. An example application that does exactly that can be found in the GitHub repo.
You can simply clone it and deploy it using SAM.
$ git clone https://github.com/jmnarloch/aws-eventbridge-replay-scheduler
$ cd aws-eventbridge-replay-scheduler
$ sam build
$ sam deploy --guided
The CloudFormation template defines the Lambda function and a cron expression that configures how often the automated replay should be triggered. The function uses the same cron expression to compute the tumbling window of events to replay.
The implementation is quite straightforward:
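// Assumed dependencies (not shown in the original snippet):
// AWS SDK for JavaScript v2, cron-parser and uuid.
const AWS = require('aws-sdk');
const parser = require('cron-parser');
const uuid = require('uuid');

const events = new AWS.EventBridge();

exports.lambdaHandler = async (event, context) => {
    const archive = process.env.AWS_EVENTBRIDGE_ARCHIVE_ARN;
    const eventBus = process.env.AWS_EVENTBRIDGE_EVENT_BUS_ARN;
    const schedule = process.env.AWS_EVENTBRIDGE_REPLAY_SCHEDULE;

    // The most recent scheduled time closes the window,
    // the scheduled time before it opens the window.
    const cron = parser.parseExpression(schedule);
    const replayEndTime = cron.prev();
    const replayStartTime = cron.prev();

    console.log("Replaying events from %s to %s with event time between [%s, %s]",
        archive, eventBus, replayStartTime, replayEndTime);

    // With SDK v2 the request is only sent once .promise() is called.
    await events.startReplay({
        ReplayName: uuid.v4(),
        EventSourceArn: archive,
        Destination: {
            Arn: eventBus
        },
        EventStartTime: replayStartTime.toDate(),
        EventEndTime: replayEndTime.toDate()
    }).promise();

    return {};
};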
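To see what the two `cron.prev()` calls amount to, the sketch below computes the same kind of tumbling window by hand for the simplest case, a schedule that fires once a day at a fixed UTC hour. The function name and the fixed daily period are assumptions made for illustration; the example application relies on the cron parser instead.

```javascript
// Sketch only: the replay window for a schedule that fires every day at a
// fixed UTC hour, computed without a cron library.
const DAY_MS = 24 * 60 * 60 * 1000;

function previousDailyWindow(now, hourUtc) {
  // Most recent scheduled time at or before `now`: the end of the window.
  const end = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hourUtc));
  if (end.getTime() > now.getTime()) {
    end.setTime(end.getTime() - DAY_MS); // today's run has not happened yet
  }
  // One period earlier: the start of the window.
  const start = new Date(end.getTime() - DAY_MS);
  return { start, end };
}

// The function is invoked a few seconds after the 1 AM schedule fired.
const w = previousDailyWindow(new Date('2020-11-15T01:00:05Z'), 1);
console.log(w.start.toISOString(), w.end.toISOString());
// prints: 2020-11-14T01:00:00.000Z 2020-11-15T01:00:00.000Z
```

Because each window ends exactly where the next one starts, consecutive runs cover the timeline without gaps or overlap.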
Some scheduling examples:
Running the reconciliation every day at 1 AM.
0 1 ? * * *
Running the reconciliation every week on Sunday at 1 AM.
0 1 ? * SUN *
Running the reconciliation on the last day of the month at 1 AM.
0 1 L * ? *
Interestingly enough, this also makes it possible to implement delayed event delivery, provided that the events are not processed when they are first published.
What are the tradeoffs of executing the reconciliation on a schedule? The biggest gain is that you get zero-touch operations by design: no manual step is required during disaster recovery to restore the state. There will, however, be cases in which the implied additional time needed for event recovery is unacceptable and an immediate means of replaying the events is necessary. The downside of continuously replaying events is the additional cost, though that can be kept in check because the replay uses a tumbling window and never replays the same events more than once.
One idea for future improvement is to have the Lambda function that triggers the replay maintain state, so that a replay is guaranteed never to be repeated for the same time window and, in case of failure, all of the missed time windows are still covered.
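A minimal sketch of that idea, assuming a fixed-length period and a hypothetical checkpoint (`lastReplayedUntil`) that the function would persist after each successful replay:

```javascript
// Sketch only: list every replay window that has not been covered yet.
// `lastReplayedUntil` is a hypothetical checkpoint the function would store
// durably; a fixed period does not fit calendar schedules such as monthly runs.
function pendingWindows(lastReplayedUntil, now, periodMs) {
  const windows = [];
  let start = lastReplayedUntil.getTime();
  // Only emit windows that have fully elapsed.
  while (start + periodMs <= now.getTime()) {
    windows.push({ start: new Date(start), end: new Date(start + periodMs) });
    start += periodMs;
  }
  return windows;
}

const DAY = 24 * 60 * 60 * 1000;
// The last successful replay ended on the 12th, and two runs were missed since.
const missed = pendingWindows(
  new Date('2020-11-12T01:00:00Z'),
  new Date('2020-11-15T01:00:05Z'),
  DAY
);
console.log(missed.length); // prints: 3
```

The function would start one replay per returned window and advance the checkpoint only after each replay was accepted, so a failed run is picked up by the next one.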
On the EventBridge side, an interesting idea would be to allow configuring a Replay as a Target, so that it could be triggered by a scheduled event without the need to write any code.