You know the feeling: one service times out, retries, and suddenly your “simple” checkout flow becomes a crime scene. Half the requests succeeded, half are stuck, and your logs read like a mystery novel written by four different authors.
Event-driven workflows are the antidote to that kind of synchronous fragility. Put plainly: instead of calling the next service and waiting, you publish an event (“OrderPlaced”) or enqueue a command (“ChargeCard”), and downstream workers react when they can. The workflow becomes a chain of durable handoffs, not a line of blocking API calls.
But here’s the catch. Message queues do not magically give you “exactly once” or “no lost work.” They mostly give you retries, which means duplicates, reordering, and backpressure become your problem. That is not bad news; it’s just the real job.
Right after this, I’ll show you how to design a workflow that survives that reality, with concrete patterns you can ship this week.
What “event-driven” actually means once you leave the whiteboard
A useful mental model is this: the producer emits an event about something that already happened, and then it lets go. The sender does not wait for downstream services or even know who they are. That decoupling is the entire point.
Once you operate systems like this in production, another truth shows up fast: delivery guarantees are weaker than your intuition. Most mainstream brokers default to at-least-once delivery, which means the same message can be delivered multiple times. If you try to fight that at the broker layer, you usually lose.
Experienced distributed systems engineers converge on the same conclusion: correctness lives in the consumer. If your workflow produces the right outcome even when messages arrive twice, out of order, or late, the system behaves predictably under stress. If not, retries turn into chaos.
So the honest definition becomes: event-driven workflows are durable, asynchronous handoffs where you assume duplication and build consumers that are correct anyway.
Pick your broker based on failure modes, not vibes
Different queues shine under different constraints. The mistake is choosing based on popularity instead of behavior under failure.
If you want rich routing, explicit retry flows, and dead-lettering that feels tangible, traditional message brokers excel. If you want a replayable log of everything that ever happened so you can rebuild state or audit behavior, log-based systems are designed for that job. If you want to stop thinking about broker operations almost entirely, managed queues trade power for simplicity.
The key question to ask is not “what’s fastest,” but “what breaks first when consumers are slow, crash mid-task, or deploy at the wrong time.” Your workflow design will reflect that answer whether you think about it upfront or not.
Design events that won’t betray you at 2 a.m.
Most workflow pain comes from sloppy event design, not from the broker.
Start with the contract. Events should represent facts, not intentions. “OrderPlaced” is a fact. “PlaceOrderRequest” is a wish.
Each event should carry:
- A globally unique event ID.
- A business identifier, such as order_id.
- An explicit event type and schema version.
If multiple services react to the same event, keep the payload small and stable. When in doubt, include identifiers and let consumers fetch additional data themselves. This reduces coupling and makes schema evolution survivable.
One more rule that saves a lot of grief: assume duplicates and out-of-order delivery from day one. That assumption is not pessimism; it aligns your design with reality.
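As a sketch of that contract in Python (the class and field names mirror the list above and are illustrative, not a standard):

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class OrderPlaced:
    """A fact about something that already happened, not a request."""
    order_id: str                      # business identifier
    event_type: str = "OrderPlaced"    # explicit event type
    schema_version: int = 1            # lets consumers survive evolution
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # globally unique

    def to_json(self) -> str:
        # Keep the payload small and stable: identifiers only;
        # consumers fetch additional data themselves.
        return json.dumps(asdict(self))

event = OrderPlaced(order_id="ord-123")
```

A consumer that receives `event.to_json()` has everything it needs to deduplicate (event_id), route (event_type, schema_version), and look up the rest (order_id).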
Build the workflow in 4 steps you can actually ship
Let’s use a concrete example:
OrderPlaced → ChargeCard → ReserveInventory → SendReceipt
Step 1: Publish reliably
The classic bug looks like this: you write the order to the database, then publish an OrderPlaced event. The process crashes in between. Now your database and your message stream disagree.
The pragmatic fix is an outbox style approach. Write the order and an “event to be published” record in the same database transaction. A separate process reads those records and publishes them to the queue. This keeps your source of truth and your events aligned without pretending distributed transactions are fun.
Step 2: Consume with explicit acknowledgment
Only acknowledge messages after your side effects are committed.
With queue-based systems, that means deleting or acknowledging the message only after processing completes. If your worker crashes before that, the message comes back and another worker will retry it. With log-based systems, it means advancing your progress marker only after your handler finishes successfully.
This is where many teams accidentally create data loss by acknowledging too early.
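A toy illustration of ack-after-commit, using Python's in-memory queue.Queue to stand in for a broker; putting a failed message back simulates a visibility timeout expiring after a worker crash:

```python
import queue

def consume(q, handler, acked):
    # Pull messages; acknowledge only after the handler's side effects
    # have completed. Acking first would silently drop the message on a crash.
    while True:
        try:
            msg = q.get_nowait()
        except queue.Empty:
            break
        try:
            handler(msg)
            acked.append(msg)   # ack last, never first
        except Exception:
            q.put(msg)          # message becomes visible again for a retry

q = queue.Queue()
q.put("ChargeCard:ord-1")
attempts = []

def flaky(msg):
    attempts.append(msg)
    if len(attempts) == 1:
        raise RuntimeError("worker crashed mid-task")

acked = []
consume(q, flaky, acked)
```

The first attempt fails, the message is redelivered, and the second attempt succeeds; the ack happens exactly once, after the work.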
Step 3: Make consumers idempotent, every single one
This is the whole game.
An idempotent consumer produces the same end state whether it processes a message once or five times. You can achieve this by storing processed event IDs, enforcing unique constraints on business actions, or checking state transitions before applying them.
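A minimal dedupe sketch along those lines, using an in-memory set where production code would use a database table with a unique constraint, written in the same transaction as the side effect:

```python
processed_event_ids = set()  # in production: a table with a unique constraint

def charge_card(event_id, order_id, charges):
    # Dedupe on the globally unique event ID: a redelivered message
    # becomes a no-op instead of a double charge.
    if event_id in processed_event_ids:
        return False
    charges.append(order_id)          # the side effect (the actual charge)
    processed_event_ids.add(event_id)
    return True

charges = []
charge_card("evt-1", "ord-1", charges)
charge_card("evt-1", "ord-1", charges)  # duplicate delivery
```

Whether the message arrives once or five times, the end state is one charge.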
A quick numbers example makes this concrete:
- Your ChargeCard worker takes 12 seconds at p95.
- Your message visibility timeout is 30 seconds.
- A deployment causes cold starts and a temporary spike to 45 seconds.
Messages that take 45 seconds get picked up twice. That is expected behavior. The fix is not panic. Increase the timeout, and still make the charge operation idempotent so a second attempt becomes a no-op instead of a double charge.
Step 4: Handle failure on purpose
Retries are useful until they become a denial-of-service attack on yourself.
A sane default policy looks like this:
- Retry transient failures with exponential backoff.
- Cap retries by count or total time.
- Route poison messages to a dead-letter queue for inspection or reprocessing.
If you do not design the failure path, it will design itself under load.
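One way to sketch that default policy in Python (the delays are computed but not slept, to keep the example fast; the helper name is ours, not a library API):

```python
def process_with_retries(msg, handler, dead_letters,
                         max_attempts=5, base=0.5, cap=30.0):
    # Retry transient failures with exponential backoff, cap the attempt
    # count, and route poison messages to a dead-letter queue.
    for attempt in range(max_attempts):
        delay = min(cap, base * 2 ** attempt)   # 0.5s, 1s, 2s, 4s, 8s
        try:
            handler(msg)
            return True
        except Exception:
            # In production: sleep for `delay` plus jitter before retrying.
            continue
    dead_letters.append(msg)   # poison message: park it for inspection
    return False
```

A transient failure clears within a couple of attempts; a message that fails every attempt lands in the dead-letter queue instead of retrying forever.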
Reliability tactics that separate “works in staging” from “works in prod”
Ordering: If ordering matters, encode that constraint explicitly. Partition by a business key when using log-based systems, or design state machines that tolerate out-of-order transitions when using queues.
Backpressure: Do not let consumers pull unlimited messages. Limit in-flight work so slow dependencies do not cause memory blowups or cascading retries.
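A sketch of that in-flight cap using a counting semaphore (a thread-per-message model chosen for brevity; real consumers typically use a worker pool, but the blocking behavior is the point):

```python
import threading

class BoundedConsumer:
    # Cap in-flight work: once max_in_flight handlers are running,
    # submit() blocks instead of buffering unbounded messages in memory.
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def submit(self, handler, msg):
        self._slots.acquire()              # blocks at capacity: backpressure
        t = threading.Thread(target=self._run, args=(handler, msg))
        t.start()
        return t

    def _run(self, handler, msg):
        try:
            handler(msg)
        finally:
            self._slots.release()          # free the slot, even on failure
```

Because the slot is acquired before the pull and released only after the handler finishes, a slow dependency slows the intake rather than inflating memory.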
Observability: Treat the workflow as a product. Track queue depth, age of oldest message, retry counts, and handler latency. If you cannot answer “how many orders are currently stuck,” you do not really control the system.
FAQ
Do you publish events or commands?
Use events when many downstream reactions are valid and the producer should not care which ones happen. Use commands when exactly one action must occur and ownership is clear.
Can I get exactly-once processing?
You can get effectively-once outcomes. That usually comes from idempotent consumers plus careful coordination of state and progress tracking. The broker alone will not save you.
What’s the minimum I should do for safety?
Unique event IDs, idempotent handlers, a dead-letter queue, and basic metrics on lag and retries. If you skip idempotency, the rest is mostly theater.
How long should I retain messages?
Long enough to recover from realistic outages. Retention is a risk management decision, not a performance tweak.
Honest Takeaway
If you want event-driven workflows to feel boring, design for retries. Treat at-least-once delivery as the default truth and make duplicates harmless through idempotent consumers.
Do the unglamorous work: acknowledgments, dedupe keys, dead-letter queues, and metrics. That effort buys you workflows that scale, evolve, and fail gracefully, which is the entire reason to go event-driven in the first place.