Troubleshooting at the System Design Level

2026-04-12

Message Queues

Message queues are infrastructure components that decouple services. They lend themselves naturally to an event-driven architecture, allowing the traditional request-response cycle to be remodelled as event producers and consumers.

By introducing a buffer between services, message queues enable:

Scalability through asynchronous processing
Reliability via retries and persistence
Idempotent event handling, provided consumers deduplicate redelivered events
Observability across distributed workflows

They are a fundamental building block in modern distributed systems, particularly where workloads are unpredictable or need to be processed independently.
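
As a rough illustration of that buffering, here is a minimal in-process sketch. It uses Python's standard `queue.Queue` as a stand-in for a real broker; the function name `run_pipeline` and the event names are hypothetical and not taken from the system described in this post.

```python
import queue
import threading


def run_pipeline(events):
    """Push events through a buffer and return what the consumer processed."""
    buffer = queue.Queue()   # stands in for a message queue
    processed = []

    def consumer():
        while True:
            event = buffer.get()
            if event is None:        # sentinel: the producer has finished
                break
            processed.append(event.upper())

    worker = threading.Thread(target=consumer)
    worker.start()

    # The producer only enqueues; it never calls the consumer directly.
    for event in events:
        buffer.put(event)
    buffer.put(None)

    worker.join()
    return processed


handled = run_pipeline(["scan.requested", "scan.completed"])
```

The producer and consumer share nothing except the buffer, which is the property a real queue provides across process and network boundaries.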

Troubleshooting at the System Design Level

During a recent change to my secops system, I needed to reconfigure part of the infrastructure. This meant adding a second message queue to return results, alongside the queue, exchange, and bindings that were already set up.

On the surface, everything appeared correct. The infrastructure deployed successfully, and services were running as expected. However, when running end-to-end tests, something was clearly wrong.

Events were being produced, but nothing was being consumed.

No errors. No meaningful logs. Just silence.

This is where debugging moves beyond code and into system design.

The issue wasn’t within a single service, but in the interaction between them:

The producer was publishing to the expected topic
The consumer was running, but not receiving events
The queue itself existed, but was effectively disconnected

The root cause came down to configuration drift: subtle differences introduced during the teardown and rebuild process, in things like topic bindings, permissions, and subscription configurations. These are easy to overlook but critical to system behaviour.
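
One way to catch this kind of drift is to compare the bindings you intended to deploy with the bindings that actually exist. The sketch below assumes both can be expressed as `(exchange, routing_key, queue)` tuples; how you obtain the actual list depends on your broker, and every name here is hypothetical.

```python
def diff_bindings(expected, actual):
    """Compare two collections of (exchange, routing_key, queue) tuples."""
    expected, actual = set(expected), set(actual)
    return {
        "missing": sorted(expected - actual),      # declared but not deployed
        "unexpected": sorted(actual - expected),   # deployed but not declared
    }


# Hypothetical desired state: one binding for requests, one for results.
expected = [
    ("scans", "scan.request", "requests-queue"),
    ("scans", "scan.result", "results-queue"),
]
# Hypothetical state after the rebuild: the results binding was never recreated.
actual = [
    ("scans", "scan.request", "requests-queue"),
]

drift = diff_bindings(expected, actual)
```

Running a check like this after every deployment turns a silent misconfiguration into an explicit, reviewable difference.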

What made this particularly challenging was that, despite the apparent visibility and with the naming conventions correct on both sides, the system failed quietly.

This experience reinforced an important idea: distributed systems don’t fail loudly — they fail ambiguously.
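
The quiet failure described above can be reproduced with a toy model. This is a sketch only, assuming an AMQP-style broker in which an exchange routes messages to queues through bindings; the `Exchange` class and the routing key are hypothetical and do not represent any real client library.

```python
class Exchange:
    """Toy direct exchange: delivers only to queues bound to the exact routing key."""

    def __init__(self):
        self.bindings = {}   # routing key -> list of bound queues
        self.unrouted = 0    # count of messages that matched no binding

    def bind(self, queue, routing_key):
        self.bindings.setdefault(routing_key, []).append(queue)

    def publish(self, routing_key, message):
        targets = self.bindings.get(routing_key, [])
        if not targets:
            self.unrouted += 1   # dropped: no exception, no log line
        for queue in targets:
            queue.append(message)
        return len(targets)


results_queue = []       # the queue exists...
exchange = Exchange()    # ...but the rebuild never recreated its binding

before = exchange.publish("scan.result", {"id": 1})   # delivered to 0 queues

exchange.bind(results_queue, "scan.result")           # restore the missing binding
after = exchange.publish("scan.result", {"id": 2})    # delivered to 1 queue
```

The first publish succeeds from the producer's point of view, yet the message reaches no queue. RabbitMQ, for example, discards unroutable messages by default unless the publisher sets the mandatory flag or an alternate exchange is configured, which is why nothing shows up in the logs.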

Thinking in Systems

Troubleshooting at this level requires a shift in mindset.

Instead of asking “what is this service doing wrong?”, the question becomes:

“how are these components interacting, and where is that interaction breaking down?”

This involves:

Tracing the lifecycle of an event end-to-end
Verifying contracts between producers and consumers
Understanding infrastructure as part of the application, not separate from it
Designing for visibility, not just functionality

It’s less about debugging lines of code, and more about reasoning through flows, dependencies, and failure modes.
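
Tracing an event end-to-end can be sketched as follows. The example assumes each component records a `(correlation_id, stage)` pair as an event passes through it; the stage names and the `last_stage_reached` helper are hypothetical.

```python
# Hypothetical lifecycle stages, in the order an event should pass through them.
STAGES = ["published", "routed", "consumed", "acknowledged"]


def last_stage_reached(trace_log, correlation_id):
    """Return the furthest consecutive stage recorded for one event, or None."""
    seen = {stage for cid, stage in trace_log if cid == correlation_id}
    reached = None
    for stage in STAGES:
        if stage not in seen:
            break
        reached = stage
    return reached


trace_log = [
    ("evt-1", "published"),
    ("evt-1", "routed"),
    ("evt-1", "consumed"),
    ("evt-1", "acknowledged"),
    ("evt-2", "published"),   # never routed: the break is between producer and queue
]

healthy = last_stage_reached(trace_log, "evt-1")
stuck = last_stage_reached(trace_log, "evt-2")
```

An event that stops at `published` points at routing or bindings rather than at the consumer's code, which narrows the search before any service logs are opened.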

Why It Matters

As systems become more distributed, these kinds of issues become more common.

A perfectly functioning service is still useless if it’s disconnected from the system around it.

Being able to:

Identify where communication breaks down
Understand the guarantees (or lack of them) provided by messaging systems
Recognise silent failure modes

…is what separates building systems from simply writing code.
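
As one example of reasoning about guarantees: many brokers offer at-least-once delivery, so a consumer may see the same message twice. A minimal sketch of an idempotent consumer, assuming every message carries a unique `id` field (the class and field names are hypothetical):

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed on its id."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, message):
        if message["id"] in self.seen_ids:
            return False                 # duplicate delivery: skip the side effect
        self.seen_ids.add(message["id"])
        self.total += message["amount"]  # the side effect we must not repeat
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"id": "a", "amount": 5},
    {"id": "b", "amount": 3},
    {"id": "a", "amount": 5},   # redelivered after a lost acknowledgement
]
outcomes = [consumer.handle(message) for message in deliveries]
```

In a real service the set of seen ids would need to live in durable storage shared by all consumer instances; the in-memory set here only illustrates the idea.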

Key Takeaway

Strong engineers don’t just understand code — they understand systems. The ability to reason about interactions, failure modes, and infrastructure is essential when working with distributed architectures.