You can ship a system that looks clean in diagrams and still fails six months later in the least interesting way possible: a queue backs up, retries explode, a dependency flakes, and suddenly your “highly available” architecture becomes a thundering herd generator. The gap is rarely a missing technology. It is usually missing questions. Architecture review questions that only validate components and patterns miss the deeper failure contours: operational reality, incentives, load shape, data gravity, and the uncomfortable edges where humans meet automation.
The fastest way to surface those contours is to ask questions that force the design to “run” in your head under stress. Not hypotheticals. Specifics. What happens at 10x traffic, at 2 a.m., during a partial outage, with stale data, with skewed latency, with a bad deploy, with a single noisy tenant, with one team rotating credentials early. The questions below are the ones that consistently expose future system failures while you still have time to change the design.
## 1. “What is the steady state, and how do we know we are in it?”
A surprising number of systems cannot define “normal” beyond “p95 looks fine.” Force the team to name the steady state in measurable terms: inbound rate, queue depth, error budget burn, storage growth, cache hit rate, and dependency latency. If you cannot describe normal, you will not detect abnormal until customers do. This question also surfaces whether your metrics are causal or decorative. A dashboard full of CPU and memory tells you little about user impact if the real bottleneck is lock contention or a downstream quota. The failure mode here is slow drift: a backlog growing by 2 percent per hour looks harmless until it hits a wall.
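One way to force the conversation past “p95 looks fine” is to make the team write the steady state down as explicit bounds. Here is a minimal sketch of that idea; every metric name and threshold below is an illustrative assumption, not a prescription:

```python
# Hypothetical sketch: encode "normal" as explicit, named thresholds
# instead of leaving it implicit in dashboards. All metric names and
# bounds here are illustrative assumptions.

STEADY_STATE = {
    "inbound_rps":         (800, 1500),   # requests per second
    "queue_depth":         (0, 5_000),    # messages waiting
    "cache_hit_rate":      (0.90, 1.00),  # fraction of hits
    "dep_p99_latency_ms":  (0, 250),      # downstream latency
    "storage_growth_gb_h": (0, 2.0),      # per-hour growth
}

def outside_steady_state(metrics: dict) -> list[str]:
    """Return the names of metrics that fall outside 'normal'."""
    violations = []
    for name, (lo, hi) in STEADY_STATE.items():
        value = metrics.get(name)
        if value is None or not (lo <= value <= hi):
            violations.append(name)
    return violations

# A backlog growing 2 percent per hour stays "normal" for a while, then is not:
print(outside_steady_state({
    "inbound_rps": 1200, "queue_depth": 6200, "cache_hit_rate": 0.94,
    "dep_p99_latency_ms": 180, "storage_growth_gb_h": 1.1,
}))  # queue_depth is out of bounds
```

If the team cannot fill in a table like `STEADY_STATE` with real numbers, that is itself the review finding.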
## 2. “What is the load shape, and what happens when it gets spiky or adversarial?”
Average load is a comfort blanket. Real systems see bursts, diurnal patterns, coordinated client retries, and occasionally malicious traffic. Ask for a concrete load model: concurrency, payload sizes, fanout, and burst factor. Then ask what breaks first. If the design depends on autoscaling, talk about scaling latency versus burst duration. I have seen a Kubernetes HPA that scaled in 60 to 120 seconds while traffic spiked in 10 seconds, which meant the first minute was handled by retries and luck. That is not capacity planning, that is gambling. This question exposes whether the architecture is resilient to shape, not just volume.
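The scaling-latency-versus-burst-duration question reduces to back-of-envelope arithmetic worth doing in the review itself. A hedged sketch, with all the rates and durations as assumptions you would replace with your own measurements:

```python
# Hypothetical back-of-envelope check: can autoscaling outrun the burst,
# or do retries and luck carry the first minute? All numbers are
# illustrative assumptions.

def unprotected_requests(burst_rps: float, baseline_capacity_rps: float,
                         scale_up_seconds: float, burst_seconds: float) -> float:
    """Requests above baseline capacity that arrive before new capacity exists."""
    window = min(scale_up_seconds, burst_seconds)
    excess_rps = max(0.0, burst_rps - baseline_capacity_rps)
    return excess_rps * window

# An HPA that scales in ~90s, facing a 10s burst at 5x baseline:
print(unprotected_requests(burst_rps=5000, baseline_capacity_rps=1000,
                           scale_up_seconds=90, burst_seconds=10))
```

If that number is large, the design needs headroom, shedding, or a buffer, because autoscaling alone will arrive after the burst is over.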
## 3. “Where do we apply backpressure, and what do we do when we cannot?”
Backpressure is not a feature you add later. It is a system behavior you must choose. If your ingestion endpoint cannot slow clients down, you need a buffer, and buffers have limits. Make the team show the backpressure chain end to end: client timeouts, admission control, queue policies, and shed strategies. The common failure is infinite buffering disguised as “we have a queue,” followed by disk exhaustion or runaway costs. Another failure is backpressure that only exists in one layer, so upstream keeps sending while downstream collapses. If the answer is “we will rely on retries,” you are already designing for amplification.
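The difference between “we have a queue” and real backpressure is a hard limit plus a visible shed path. A minimal sketch, assuming the limits and class names are illustrative:

```python
# Hypothetical admission-control sketch: a bounded buffer that sheds
# rather than buffering forever. The depth limit is an illustrative
# assumption; real systems size it against memory, disk, and latency SLOs.
from collections import deque

class BoundedIngest:
    def __init__(self, max_depth: int):
        self.queue = deque()
        self.max_depth = max_depth
        self.shed = 0

    def offer(self, item) -> bool:
        """Admit if there is room; otherwise shed and tell the client."""
        if len(self.queue) >= self.max_depth:
            self.shed += 1  # surface this as a 429/503 and a metric, not silence
            return False
        self.queue.append(item)
        return True

ingest = BoundedIngest(max_depth=3)
results = [ingest.offer(i) for i in range(5)]
print(results, ingest.shed)  # last two offers rejected
```

The point of the sketch is the `return False`: a client that is told “no” can slow down, while a client fed into an unbounded queue cannot.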
## 4. “What happens during a partial dependency failure, not a full outage?”
Most real incidents are partial: one AZ degraded, a single shard slow, one region with packet loss, or a third party API returning 200 with garbage. Ask what the system does when dependency latency doubles, when error rate hits 1 percent, and when responses become slow and inconsistent. This is where circuit breakers, bulkheads, and timeouts stop being buzzwords and start being architecture. Reference the posture of Google SRE style error budgets: you need to decide when you stop pushing new risk and start buying reliability. If nobody can say what triggers a breaker or what the fallback returns, the future failure is a cascading outage.
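Asking “what triggers the breaker and what does the fallback return” can be answered in a few lines of code. This is a deliberately minimal sketch (real breakers also track half-open probes and rolling time windows); the threshold and fallback value are assumptions:

```python
# Hypothetical minimal circuit breaker: after `threshold` consecutive
# failures it opens and returns a reviewed fallback instead of calling
# the dependency. Production breakers add half-open probes and windows.

class CircuitBreaker:
    def __init__(self, threshold: int, fallback):
        self.threshold = threshold
        self.fallback = fallback
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, fn):
        if self.open:
            return self.fallback      # the fallback is a design decision
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            return self.fallback

def flaky():
    raise TimeoutError("dependency slow")

breaker = CircuitBreaker(threshold=3, fallback="cached-default")
for _ in range(4):
    breaker.call(flaky)
print(breaker.open)  # True: further calls are short-circuited
```

Notice that the review question is really about the two constructor arguments: who chose `threshold`, and is `fallback` actually correct for the caller.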
## 5. “What is our retry policy, and what prevents retry storms?”
Retries are a distributed denial of service you inflict on yourself. Make the team state retry budgets, jitter strategy, idempotency guarantees, and whether retries propagate across service boundaries. A good architecture review forces a simple question: “If a downstream slows down by 3x, do we send it more traffic or less?” If the answer is “more,” you have designed a storm. In one system I reviewed, a client retried at the HTTP layer, the service retried at the RPC layer, and the worker retried at the queue layer. Under a 2 percent error rate, effective load tripled in minutes. The fix was not “be careful.” The fix was enforcing budgets and making retries visible as first class traffic.
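Budgets and jitter are cheap to state precisely. A hedged sketch of both pieces; the base delay, cap, and budget size are illustrative assumptions, and real budgets refill as a fraction of first-try traffic:

```python
# Hypothetical retry policy sketch: capped attempts with full jitter,
# plus a shared budget so retries cannot exceed a bounded amount of
# traffic. All constants are illustrative assumptions.
import random

RETRY_BUDGET = {"tokens": 100.0}  # refilled elsewhere as a fraction of requests

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full-jitter exponential backoff delay, in seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def try_consume_retry_token() -> bool:
    """Each retry spends a token; when the budget is gone, fail fast."""
    if RETRY_BUDGET["tokens"] >= 1.0:
        RETRY_BUDGET["tokens"] -= 1.0
        return True
    return False  # budget exhausted: failing fast beats storming downstream

delay = backoff_with_jitter(attempt=3)
print(0 <= delay <= 0.8, try_consume_retry_token())
```

The budget answers the “3x slowdown” question mechanically: slower downstream means more failed attempts, more tokens burned, and therefore less retry traffic, not more.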
## 6. “What are the idempotency boundaries, and what is the deduplication strategy?”
If you cannot draw the idempotency boundary, you will eventually charge someone twice, create duplicate records, or fork state. Architecture reviews should force clarity: which operations are safe to repeat, where request IDs live, how long dedupe keys persist, and what happens when the dedupe store is unavailable. This question exposes a classic failure mode in event driven systems: “at least once” delivery paired with “exactly once assumptions.” If you are using Kafka or similar, ask how consumer offsets relate to side effects, and whether you can tolerate replays without manual cleanup.
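The “where do request IDs live, how long do dedupe keys persist, and what if the store is down” questions fit into one small sketch. The TTL and the fail-open/fail-closed flag are illustrative assumptions you would decide per operation:

```python
# Hypothetical dedupe sketch: request IDs are remembered for a TTL, and
# the behavior when the dedupe store is unavailable is an explicit
# fail-open / fail-closed choice rather than an accident.
import time

class DedupeStore:
    def __init__(self, ttl_seconds: float, fail_open: bool = False):
        self.ttl = ttl_seconds
        self.fail_open = fail_open
        self.seen = {}           # request_id -> expiry timestamp
        self.available = True

    def first_time(self, request_id: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if not self.available:
            # fail open: process anyway (risk duplicates)
            # fail closed: reject (risk dropped work)
            return self.fail_open
        expiry = self.seen.get(request_id)
        if expiry is not None and expiry > now:
            return False                      # duplicate within TTL
        self.seen[request_id] = now + self.ttl
        return True

store = DedupeStore(ttl_seconds=60)
print(store.first_time("req-1", now=0.0),    # True: first delivery
      store.first_time("req-1", now=10.0),   # False: replay inside TTL
      store.first_time("req-1", now=120.0))  # True: key expired, replay admitted
```

The third call is the uncomfortable part: once the key expires, a late replay looks new, which is why the TTL must be chosen against the delivery system's actual redelivery window.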
## 7. “How do we handle data growth, and what is the compaction or retention plan?”
Data tends to grow in three ways teams forget: cardinality (new tenants, new dimensions), history (audit and analytics), and duplication (derived and denormalized views). Ask for explicit retention and compaction plans for every store, queue, and log. In a production ingestion pipeline I saw, Kafka retention was set to 7 days “for safety,” but consumer lag during a holiday spike reached 5 days, and the team increased retention to 21 days. Costs jumped, brokers ran hot, and the real issue was a consumer that could not keep up due to a single expensive enrichment call. This question exposes whether you are treating storage as infinite and time as optional.
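The incident above has a simple arithmetic shape worth running in the review: given the current lag and the producer/consumer rate gap, when does lag cross retention and data silently disappear? A sketch, with the rates as assumptions:

```python
# Hypothetical back-of-envelope check: will consumer lag ever cross
# retention and silently drop data? Rates are illustrative assumptions.

def days_until_data_loss(lag_days: float, retention_days: float,
                         produce_rate: float, consume_rate: float) -> float:
    """Days until lag exceeds retention at current rates.

    Returns infinity if the consumer is keeping up or catching up.
    """
    lag_growth_per_day = (produce_rate - consume_rate) / consume_rate
    if lag_growth_per_day <= 0:
        return float("inf")
    return (retention_days - lag_days) / lag_growth_per_day

# Lag of 5 days, retention of 7, producer running 20 percent faster
# than the consumer:
print(days_until_data_loss(5, 7, produce_rate=1.2, consume_rate=1.0))
```

If the answer is finite, raising retention only buys time and broker load; the real fix is the consumer, as it was in the enrichment-call case above.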
## 8. “What is the consistency model, and where do we accept staleness?”
“Eventually consistent” is not a design. It is a promise you must quantify. Ask where strong consistency is required, where staleness is acceptable, and how you communicate that to downstream consumers and product. Also ask how you detect and heal divergence. The failure mode is silent correctness bugs: the system looks healthy while users see wrong state. If you rely on caches, ask about cache invalidation strategies and failure behavior. If invalidation fails, do you serve stale forever, or fail closed? There is no free answer. There is only an answer you have chosen or an answer the incident will choose for you.
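Quantifying the promise can be as direct as attaching a staleness bound to every cached read. A sketch of that contract, with the bound and key names as illustrative assumptions:

```python
# Hypothetical staleness contract: a read either serves a value younger
# than `max_staleness` or fails closed, so "eventually consistent" has a
# number attached. Key names and bounds are illustrative assumptions.

class StaleReadError(Exception):
    pass

def read_with_staleness_bound(cache: dict, key: str, max_staleness: float,
                              now: float) -> str:
    """Serve cached values only while their age is within the contract."""
    entry = cache.get(key)
    if entry is None:
        raise StaleReadError(f"{key}: no value available")
    value, written_at = entry
    if now - written_at > max_staleness:
        raise StaleReadError(f"{key}: value older than {max_staleness}s")
    return value

cache = {"user:42": ("alice", 100.0)}  # (value, write timestamp)
print(read_with_staleness_bound(cache, "user:42", max_staleness=30.0, now=120.0))
```

Whether the exception path fails closed, falls back to the source of truth, or serves stale with a flag is exactly the decision the review should extract.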
## 9. “How do we prove we can recover, and what is the worst case restore time?”
Backups are not recovery. Architecture review questions should ask for RPO and RTO targets that map to real operational steps: who does what, with what tooling, in what order, under what access model. Then ask about the hard cases: restoring a multi terabyte database, rehydrating caches, replaying an event log, and reconciling side effects. This is also where you surface whether “immutable infrastructure” is real or aspirational. If the design depends on replays, ask how you prevent replaying toxic events that triggered the outage in the first place.
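Worst-case restore time is rarely the backup download alone; it is download plus replay plus reconciliation. A sketch of the arithmetic, with every rate an assumption you would replace with measured numbers from a restore drill:

```python
# Hypothetical worst-case RTO arithmetic: restore = download + replay +
# reconciliation, not "we have backups". All rates are illustrative
# assumptions; a real drill produces the real numbers.

def worst_case_rto_hours(db_terabytes: float, restore_tb_per_hour: float,
                         replay_events: int, replay_events_per_hour: int,
                         reconcile_hours: float) -> float:
    restore = db_terabytes / restore_tb_per_hour
    replay = replay_events / replay_events_per_hour
    return restore + replay + reconcile_hours

# 4 TB database, 0.5 TB/h restore throughput, 36M events replayed at
# 6M/h, plus 2 hours of side-effect reconciliation:
print(worst_case_rto_hours(4, 0.5, 36_000_000, 6_000_000, 2))
```

If the sum exceeds the stated RTO target, the review has surfaced a gap no diagram would show.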
## 10. “What is the blast radius model for tenants, features, and deploys?”
If one noisy tenant can take down everyone, you have a business risk disguised as technical debt. Ask how you isolate CPU, memory, connections, and queue partitions across tenants. Ask how feature flags are scoped and how quickly you can disable a bad path. Ask whether deploys are progressive and whether rollback is actually safe with schema and data changes. The failure mode is large radius incidents: one customer import job saturates the database, or one feature rollout increases payload size and breaks mobile clients. Isolation is not only infra. It is API contracts, quotas, and deliberate degrade paths.
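Tenant isolation at the request layer is often just a per-tenant token bucket. A minimal sketch, with the rate and burst limits as illustrative assumptions:

```python
# Hypothetical per-tenant quota sketch: a token bucket per tenant keeps
# one noisy import job from saturating shared capacity. Rate and burst
# limits are illustrative assumptions.

class TenantQuota:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst   # tokens/sec, max tokens
        self.tokens = {}                      # tenant -> remaining tokens
        self.last = {}                        # tenant -> last refill time

    def allow(self, tenant: str, now: float) -> bool:
        elapsed = now - self.last.get(tenant, now)
        tokens = min(self.burst,
                     self.tokens.get(tenant, self.burst) + elapsed * self.rate)
        self.last[tenant] = now
        if tokens >= 1.0:
            self.tokens[tenant] = tokens - 1.0
            return True
        self.tokens[tenant] = tokens
        return False

quota = TenantQuota(rate=1.0, burst=2.0)
print([quota.allow("noisy", now=0.0) for _ in range(4)],  # burst exhausted
      quota.allow("quiet", now=0.0))                      # others unaffected
```

The sketch also makes the review question concrete: the noisy tenant hits `False` and gets a 429, while the quiet tenant's bucket is untouched, which is the whole point of per-tenant state.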
## 11. “How will we observe this system when it is failing, not when it is healthy?”
Healthy systems are easy to observe. Failing systems drop logs, time out traces, and overload metrics pipelines. Ask what telemetry survives overload: sampling policies, metrics cardinality limits, and whether you can still slice by tenant and endpoint at the worst moment. Architecture review questions should force the team to name the top five debugging questions and map them to signals. Keep it short and practical:
- What changed recently, and where did latency shift?
- Which dependency is slow, and is it localized or systemic?
- Is error budget burn accelerating or stable?
- Which tenants or routes dominate load and errors?
- Are retries and queue depth increasing together?
If the answer is “we will add more logging,” you are designing for the postmortem, not the incident.
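One survivable-telemetry policy that answers those five questions under load: keep every error trace and downsample successes harder as the system degrades. A sketch, with the sampling percentages as illustrative assumptions:

```python
# Hypothetical survivable-telemetry sketch: under overload, keep every
# error trace and downsample successes, so the signals that answer the
# debugging questions still exist. Percentages are illustrative.
import random

def should_sample(is_error: bool, overloaded: bool,
                  success_rate_normal: float = 0.10,
                  success_rate_overload: float = 0.01) -> bool:
    """Always keep errors; sample successes harder when overloaded."""
    if is_error:
        return True   # error traces must survive the worst moment
    rate = success_rate_overload if overloaded else success_rate_normal
    return random.random() < rate

random.seed(1)
kept = sum(should_sample(is_error=False, overloaded=True) for _ in range(10_000))
print(should_sample(is_error=True, overloaded=True), kept)
```

The same shape applies to metrics cardinality: decide in advance which labels (tenant, route) survive a cardinality limiter, instead of letting the pipeline drop them arbitrarily mid-incident.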
## 12. “What chaos or failure injection will we run, and what do we expect to learn?”
You do not need a full Netflix Chaos Engineering program to benefit from controlled failure. You need a plan to test assumptions that architecture diagrams cannot validate: timeouts, fallback correctness, scaling behavior, and operator runbooks. Ask what failures you will inject in staging and, carefully, in production: killing instances, adding latency to a dependency, dropping a percentage of requests, expiring credentials early. The value is not theatrics. It is learning where your system has hidden coupling. The failure mode is confidence based on untested assumptions, which tends to last right up until the first real outage.
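Controlled failure injection does not require a platform to start; a wrapper around a dependency call is enough to test timeouts and fallback correctness in staging. A sketch whose parameter names are illustrative, not any real chaos tool's API:

```python
# Hypothetical failure-injection sketch: wrap a dependency call so a
# staged experiment can add latency or drop a fraction of requests.
# Parameter names are illustrative, not a real chaos tool's API.
import random
import time

def with_fault_injection(fn, added_latency_s: float = 0.0,
                         drop_fraction: float = 0.0, rng=random.random):
    """Return a wrapped callable that injects latency and failures."""
    def wrapped(*args, **kwargs):
        if rng() < drop_fraction:
            raise ConnectionError("injected fault: request dropped")
        if added_latency_s:
            time.sleep(added_latency_s)
        return fn(*args, **kwargs)
    return wrapped

# Drop everything, to verify callers actually exercise their fallback path:
lookup = with_fault_injection(lambda key: f"value:{key}", drop_fraction=1.0)
try:
    lookup("user:1")
except ConnectionError as exc:
    print(exc)
```

Run the experiment with a hypothesis written down first ("the breaker opens within 30s and the fallback serves cached data"); the learning comes from comparing the hypothesis to what actually happened.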
## A small mapping table you can use in reviews
| Review question focus | Failure you are trying to catch early | What “good” often looks like |
|---|---|---|
| Backpressure and retries | Cascading overload | Budgets, jitter, load shedding, visibility |
| Consistency and staleness | Silent correctness bugs | Explicit contracts, repair loops, invariants |
| Recovery and restore | Long outages with no path out | Practiced restore, measured RTO/RPO, runbooks |
| Blast radius | One change takes down everything | Isolation, quotas, progressive delivery |
| Observability under stress | Flying blind during incidents | Survivable telemetry, clear SLOs, good traces |
Architecture review questions fail when they validate intent instead of interrogating behavior. The questions above work because they force specificity: numbers, thresholds, failure modes, and operator actions. You do not need perfect answers to all of them on day one, but you do need to know which ones are unanswered, and what you are betting on. Treat each question as a design pressure test. If the architecture cannot explain how it behaves under stress, the incident will. And it will be less polite about it.
Kirstie is a technology news reporter at DevX. She reports on emerging technologies and startups waiting to skyrocket.