How I design systems

There are a million and one ways to design distributed systems. This post attempts to distill the knowledge I've gained over the years into some logical, coherent chunks. While it's probably not comprehensive, I intend to keep this post updated as I evolve my thinking.

I don't have particularly strong opinions about microservices versus monoliths; the internet, of course, has many wide-ranging opinions that feel rather binary. To me, the binary nature of the discourse around architecture also feels strange given that monoliths and microservices are about service architecture rather than systems architecture. I mention this distinction because I typically find myself designing systems, not individual services.

This post dives into how I conceptualize systems by breaking them down into functional areas. Each area has its own considerations, but they all work together to form a cohesive whole.

Architecture

A quintessential bar debate in software engineering is whether to build a monolith or microservices. All sorts of people and personalities have weighed in on this debate over the years, and frankly I think most of the ink spilled on the topic is wasted and lacks nuance.

In defense of monoliths

Before I really get into this, I want to draw a distinction between systems architecture and application architecture. Sometimes they're the same, but sometimes they're not, and it's a distinction worth understanding. Consider a system with a web API and a background job processor—two separate applications by design. You could deploy them as separate services on different machines, but you might choose to run both on the same virtual machine to reduce costs and operational complexity.

[Diagram: a single system boundary containing an API and a job controller.]

In the diagram above we see a monolithic system with an API and a job controller running together. The API can trigger jobs through the job controller, and those jobs appear and complete within the same system boundary. From the outside, this style of system is pretty easy to run and understand. You could easily model and deploy this with systemd and leverage the fast communication paths of a single system to your advantage. As we'll see, it creates scaling challenges when different parts of the system have different resource needs.
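For illustration, here's a minimal Go sketch of that single-system fast path, with entirely hypothetical names: the API hands work to an in-process job controller over a channel, so "triggering a job" is a function call rather than a network hop.

```go
// Minimal sketch (hypothetical names): an API and a job controller sharing
// one process, so triggering a job is just a channel send.
package main

import (
	"log"
	"net/http"
)

type Job struct{ Name string }

type JobController struct{ queue chan Job }

func NewJobController() *JobController {
	jc := &JobController{queue: make(chan Job, 64)}
	go jc.run() // background worker inside the same system boundary
	return jc
}

func (jc *JobController) run() {
	for job := range jc.queue {
		log.Println("running job:", job.Name) // the actual work happens here
	}
}

func (jc *JobController) Enqueue(j Job) { jc.queue <- j }

func main() {
	jobs := NewJobController()
	http.HandleFunc("/reports", func(w http.ResponseWriter, r *http.Request) {
		jobs.Enqueue(Job{Name: "build-report"}) // API triggers a job in-process
		w.WriteHeader(http.StatusAccepted)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```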

When traffic grows, you can still scale horizontally by adding more machines, each running both applications. The application architecture (separate services) doesn't dictate the systems architecture (how you deploy and scale them). This flexibility allows you to optimize for cost, operational simplicity, or performance independently of how the code is structured. Running something like a jobs controller on every virtual machine would then require that jobs be idempotent and that you avoid scenarios where a job can or should only run once. This again adds to operational complexity, which is a major factor in moving away from monoliths.
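One common guard for that "run on every machine" problem is an idempotency key claimed in shared storage before the work runs. A sketch under that assumption, with hypothetical interface and function names:

```go
// Sketch of an idempotency guard (hypothetical names): before running a job,
// atomically claim its key in shared storage; if the claim fails, another
// replica already ran (or is running) it.
package jobs

import "context"

// Claimer is any shared store that can record "this key was claimed" exactly
// once, e.g. a row with a unique constraint or a Redis SETNX-style call.
type Claimer interface {
	Claim(ctx context.Context, key string) (bool, error)
}

func RunOnce(ctx context.Context, store Claimer, key string, work func(context.Context) error) error {
	claimed, err := store.Claim(ctx, key)
	if err != nil {
		return err
	}
	if !claimed {
		return nil // another replica already handled this job
	}
	return work(ctx)
}
```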

One of the common taglines I hear is that monoliths don't scale, which is why you should just start with SOA or microservices. I think this is a bit of a false dichotomy. Monoliths can scale; they just require more discipline and care to do so.

Onward to SOA and microservices

There are three main architectural patterns to consider with respect to systems design: monolithic, service-oriented, and microservices. Personally, I like to start any project with a well-organized monolith. When building a monolith it's important to realize that different parts of the system may one day need to be scaled independently for one reason or another. For this reason, when I write monoliths I like to predivide the system into services; the delineation of services could be as simple as APIs and data access objects (DAOs).
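As a sketch of what that predivision can look like in Go (all names hypothetical): each internal "service" exposes a small API type and hides its storage behind a DAO interface, even though everything still compiles into one binary.

```go
// Sketch of predividing a monolith (hypothetical names): an internal users
// service with its data access hidden behind a DAO interface.
package users

import "context"

type User struct {
	ID    string
	Email string
}

// UserDAO is the data access boundary; swapping it for a remote
// implementation later shouldn't change callers.
type UserDAO interface {
	GetUser(ctx context.Context, id string) (User, error)
}

// Service is the only surface other parts of the monolith should touch.
type Service struct {
	dao UserDAO
}

func NewService(dao UserDAO) *Service { return &Service{dao: dao} }

func (s *Service) GetUser(ctx context.Context, id string) (User, error) {
	return s.dao.GetUser(ctx, id)
}
```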

In general, if you are disciplined when building a monolith, then architecture can be thought of as a progression. When a monolith begins to grow operationally burdensome or resource constrained, you have the option to split some of those monolith services into distinct services; some might recognize this pattern as service-oriented architecture, or SOA. While SOA definitely opens up scaling opportunities, its primary benefit is a separation of concerns. That separation is, organizationally, a big win: it facilitates a more manageable operating model and lets developers reason about the interfaces between systems rather than the implementation details of each service's components.

Eventually, success can put enough strain on your distinct services that you start to recognize independent components that could logically scale separately.

At this point there's a confluence of factors to consider that will help you decide whether to go the microservices route:

In general, I try to use microservices sparingly. The access patterns I mentioned above are somewhat understated: that kind of database separation, for instance, means losing the relational aspects of a relational database. Those losses are largely traded for higher scalability and performance.

Communication

Communicating with your system and between components of the system is a fundamental part of system design. In general, you can think about communication in terms of external and internal communication.

Communication can be broken down into two aspects, transport and serialization, and it comes in two complementary styles, synchronous and asynchronous. Before we go any further, let's briefly reflect on the OSI model:

The OSI model (7 layers), from highest to lowest:

  7. Application: user interfaces, APIs
  6. Presentation: data encryption, compression
  5. Session: session management
  4. Transport: TCP, UDP
  3. Network: IP, routing
  2. Data Link: Ethernet, MAC addresses
  1. Physical: cables, signals

Transports

Transports can be broken out into two layers: transport (layer 4) and application (layer 7). As far as service communication goes, I bias towards layer 7 transport for a few reasons:

This can be distilled down to the idea that in modern software I do my best to avoid developing novel frameworks that don't directly benefit my business goals. Some might frame this with a platitude like "don't reinvent the wheel," which is often expressed so vaguely that it dismisses potentially beneficial ideas. Weighing the value of novel frameworks, and micro-optimizations in general, is an exercise left to those with skin in the game.
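As a trivial illustration of the layer 7 bias (route and payload names hypothetical): with HTTP, the standard library already provides connection handling, framing, routing, and status codes, so the service only has to express its own semantics.

```go
// Trivial sketch of leaning on layer 7: net/http gives us framing, routing,
// and status codes for free; the service only adds its own behavior.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	})
	// The layer 4 alternative starts from net.Listen("tcp", ...) and a
	// hand-rolled wire format, which rarely benefits the business directly.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```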

Serialization

Serialization technologies come in many different flavors, but there are two that I use predominantly: JSON and Protobuf.

JSON's benefits are that it's human readable, flexible, and relatively easy to work with. Its downside is that its flexibility comes at the cost of performance and complexity in the libraries that process it. Protobuf's benefits are that it's binary, efficient, and type-safe. Its downside is that it's not human readable and requires a bit more effort to work with.
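To make the JSON side of that trade-off concrete, here's a sketch with a hypothetical type: struct tags plus encoding/json get you a readable, flexible wire format with very little ceremony. The Protobuf equivalent would be a .proto message compiled to a generated type and serialized in binary.

```go
// Sketch of JSON serialization in Go (hypothetical Order type).
package main

import (
	"encoding/json"
	"fmt"
)

type Order struct {
	ID    string  `json:"id"`
	Total float64 `json:"total"`
	Note  string  `json:"note,omitempty"` // flexibility: optional fields are trivial
}

func main() {
	payload, err := json.Marshal(Order{ID: "o-123", Total: 19.99})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload)) // {"id":"o-123","total":19.99}
}
```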

tldr;

This is a long way of saying I have two preferences:

There are some minor caveats here; for instance, I've used components of each of these for streaming data that doesn't fit the mold of a traditional RESTful API.

We didn't get into synchronous and asynchronous communication, but I'll touch on it briefly in Events.

Ingress

Ingresses have some overlap with communication; however, they are a distinct topic in their own right.

Ingresses are where external traffic enters your system. I like to tier and chain ingresses together. Referring back to the OSI model, at this stage we're focusing on layer 4 and layer 7 again.

In my ideal setup, a layer 4 load balancer sits at the edge so we can negotiate proxy protocol and use layer 4 capabilities when we need them. At the boundary closest to a service I like to implement TLS termination, authentication, and other layer 7 capabilities. Whether you're thinking in terms of virtual machines or Kubernetes containers, this approach is easy for developers to reason about, provides a clear chain of separation and observability, and creates a clear path for future evolution.

[Diagram: ingress patterns. VM pattern: load balancer → nginx on the virtual machine → application. Kubernetes pattern: load balancer → Kubernetes cluster Gateway → Service → application.]
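For a rough sense of what that layer 7 hop closest to the service is doing, here's a standard-library sketch standing in for nginx or a Kubernetes Gateway: terminate TLS, then proxy plain HTTP to the application behind it. Addresses and certificate paths are hypothetical.

```go
// Sketch of the layer 7 boundary (hypothetical addresses and cert paths):
// terminate TLS here, then forward plain HTTP to the application behind it.
// In practice this role is usually played by nginx or a Gateway, not Go code.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	app, err := url.Parse("http://127.0.0.1:8080") // the application upstream
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(app)
	// TLS terminates at this hop; auth or rate limiting would wrap proxy here.
	log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", proxy))
}
```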

Events

Events are a somewhat tricky topic. Most systems start out entirely synchronous and, as they grow, demonstrate some need for asynchrony. How and when to make the transition from synchronous to asynchronous is a decision that needs to be made case by case and often requires careful consideration of the trade-offs involved. The biggest things to avoid are a user waiting a long time for a response, receiving an inconsistent view of important information, or receiving no response at all. Many times there's a minimum of information that can be relayed immediately without breaking the consistency guarantees of an API.

There are, generally speaking, two complementary pairs of styles in event-driven architecture: push versus pull, and ordered versus unordered. Push is where the event producer pushes events to the event consumer. Pull is where the event consumer pulls events from the event producer. Ordered is where events are processed in the order they are received. Unordered is where events may be processed in any order.

One really common use case for event-driven systems is completing work that is not sensitive to order or consistency: sending emails, for instance, or updating a search index.
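A sketch of the pull style for exactly that kind of order-insensitive work, with a hypothetical queue interface and event shape: the producer enqueues and moves on, while a consumer loop pulls events at its own pace and retries freely because ordering doesn't matter.

```go
// Sketch of pull-style consumption for order-insensitive work (hypothetical
// Queue interface): a consumer loop pulls events and, say, updates a search
// index or sends an email per event.
package events

import "context"

type Event struct {
	Kind string // e.g. "user.updated"
	Key  string
}

// Queue is whatever broker sits behind this: SQS, a Kafka consumer group, a
// database table used as a queue, and so on.
type Queue interface {
	Pull(ctx context.Context) (Event, error)
	Ack(ctx context.Context, e Event) error
}

func Consume(ctx context.Context, q Queue, handle func(context.Context, Event) error) error {
	for {
		ev, err := q.Pull(ctx) // blocks until an event is available
		if err != nil {
			return err
		}
		if err := handle(ctx, ev); err != nil {
			continue // leave un-acked so it can be retried; order doesn't matter here
		}
		if err := q.Ack(ctx, ev); err != nil {
			return err
		}
	}
}
```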

Databases

Databases are where state lives. I always conceptualize a system as a single system first, and that includes how data flows through it. Even within a monolith there is often a language feature that facilitates breaking a large application down into components; it seems silly not to use it as the programming gods intended.

I think about databases in terms of access patterns and consistency requirements. Not everything needs ACID guarantees, and not everything needs to be immediately consistent. By thinking about these requirements upfront, I can make better decisions about what kind of database or storage solution fits each use case.

Most of my experience has been with relational databases, specifically MySQL and Postgres. Both are excellent choices and I don't have a strong preference between them. Postgres tends to have better support for complex data types and more advanced features out of the box, while MySQL has broader adoption and ecosystem support. For most applications either one will serve you well.

The key is maintaining clear contracts around data access. If I have one module that calls another module's data, I ensure it uses the same interface that would be exposed if that module were externalized. This keeps the door open for future evolution without requiring a complete rewrite.
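A sketch of what that contract can look like, with hypothetical module and method names: a billing module consumes user data through the same narrow interface the users module would expose if it were split out as a service, never by reaching into the users tables directly.

```go
// Sketch (hypothetical names): billing depends on a narrow interface exported
// by the users module, not on its tables or internals.
package billing

import "context"

// UserDirectory is the slice of the users module that billing needs. Today it
// is satisfied in-process; later it could wrap an HTTP or gRPC client.
type UserDirectory interface {
	EmailFor(ctx context.Context, userID string) (string, error)
}

type Invoicer struct {
	users UserDirectory
}

func (i *Invoicer) SendInvoice(ctx context.Context, userID string) error {
	email, err := i.users.EmailFor(ctx, userID)
	if err != nil {
		return err
	}
	_ = email // render and send the invoice to this address
	return nil
}
```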

Caching

Caching is about reducing latency and load. I think about caching at multiple levels: in-memory caches for hot data, distributed caches for shared state, and CDN-level caching for static or near-static content.

The trick with caching is knowing what to cache and when to invalidate. I tend to cache at the boundaries—between services, between layers, and between the system and the outside world. This strategy helps me scale horizontally without creating bottlenecks.

I really like Redis for distributed caching. It's fast, reliable, and has a good feature set for most use cases. But caching as a subject is pretty universal regardless of the technology you choose. It's all in the strategy of knowing your data lifecycles. Understanding when data is created, how it changes, and when it becomes stale is what makes caching effective. Without that understanding you'll either cache too aggressively and serve stale data, or not cache enough and leave performance on the table.
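As an illustration of the strategy side, here's a cache-aside sketch with hypothetical names, using an in-memory TTL map for brevity; with Redis the get and set would simply go over the network instead, and the TTL would come from how quickly the underlying data goes stale.

```go
// Cache-aside sketch (hypothetical names): check the cache, fall back to the
// source of truth on a miss, and store the result with a TTL chosen from the
// data's lifecycle. An in-memory map stands in for Redis here.
package cache

import (
	"context"
	"sync"
	"time"
)

type entry struct {
	value   string
	expires time.Time
}

type TTLCache struct {
	mu    sync.Mutex
	items map[string]entry
}

func New() *TTLCache { return &TTLCache{items: make(map[string]entry)} }

func (c *TTLCache) GetOrLoad(ctx context.Context, key string, ttl time.Duration,
	load func(context.Context) (string, error)) (string, error) {

	c.mu.Lock()
	e, ok := c.items[key]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.value, nil // cache hit
	}

	v, err := load(ctx) // miss: go to the database or downstream service
	if err != nil {
		return "", err
	}

	c.mu.Lock()
	c.items[key] = entry{value: v, expires: time.Now().Add(ttl)}
	c.mu.Unlock()
	return v, nil
}
```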

I also think about cache consistency. Sometimes eventual consistency is fine; other times you need stronger guarantees. The decision depends on what you're caching and how critical freshness is for that particular piece of data.

Testing

I prefer a pyramid approach to testing. The bottom of the pyramid is unit testing because it has the widest benefit and is the fastest to run. These tests validate individual functions and components in isolation, catching bugs early when they're cheapest to fix. Next are integration tests, which verify that components work together correctly. Last are E2E tests, which are very expensive to run, catch fewer (but typically more critical) bugs, can be less reliable because they depend on conditions outside the tests' control, and surface bugs at a point where they're more expensive to fix.

[Diagram: testing pyramid. Unit tests at the base (most benefit, lowest cost, caught before merge), integration tests in the middle, E2E tests at the top (least benefit, highest cost to fix, caught after merge and restarting the SDLC).]

When I test, I test from two perspectives:

  1. Typical unit test scope
  2. Interface scope

Number two is where I establish contracts (and examples) for how components of my system communicate. Without this kind of testing it's easy for programmers who come after me to drift outside of the design considerations of the system, and the software degrades prematurely.

Of course, if my system maintains some kind of transport layer then I test those contracts as well. For instance, in Go and Python web services I try to write as many tests as I can directly against the REST API I've built. By testing in this fashion I'm exercising every contract I have between components as well as the contracts I have with my users.
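In Go that usually looks like spinning the real handler up with net/http/httptest and asserting on the HTTP contract itself; the handler and route below are hypothetical.

```go
// Sketch of testing directly against the REST contract (hypothetical handler
// and route): the test exercises the same HTTP surface that users see.
package api_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestHealthEndpoint(t *testing.T) {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	srv := httptest.NewServer(mux) // real HTTP server on a random local port
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/healthz")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200, got %d", resp.StatusCode)
	}
}
```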

Testing contracts between components helps me think about how my code will be used. By making myself a user I inherently build better experiences for the users of my code.

Instrumentation

Instrumentation is always OpenTelemetry (OTEL) for me. I like metrics and logging. I haven't seen huge benefits from tracing over a good logging and metrics strategy. When you have solid logging with structured data and comprehensive metrics, you can usually piece together what's happening without the added complexity and overhead of distributed tracing.
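For illustration, a sketch of what that combination can look like in a Go HTTP service: structured logs via the standard library's log/slog plus an OTEL counter. The metric and attribute names are hypothetical, and this assumes a MeterProvider and exporter are wired up at startup elsewhere; without one the counter is a no-op.

```go
// Sketch of metrics plus structured logging (hypothetical names). Assumes an
// OTEL MeterProvider/exporter is configured at startup; otherwise the counter
// is a no-op.
package main

import (
	"log/slog"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func instrumented(next http.Handler) http.Handler {
	meter := otel.Meter("myservice")
	requests, _ := meter.Int64Counter("http.server.requests")

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
		// Structured log line with the fields I'd want to query on later.
		slog.Info("request handled", "method", r.Method, "path", r.URL.Path)
		// Matching counter so dashboards and alerts don't depend on log parsing.
		requests.Add(r.Context(), 1, metric.WithAttributes(attribute.String("path", r.URL.Path)))
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) })
	http.ListenAndServe(":8080", instrumented(mux))
}
```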

Regardless of the final product orientation, whether it's a monolith, all microservices, or a mixture thereof, I monitor the system as a whole first and instrument individual components secondarily. The reason is that I frontload the work of my projects with testable contracts, and monitoring gives me a glimpse into how those contracts perform, especially at the component level.

When you introduce eventual consistency (like jobs) or network calls, there are all kinds of retries and exception handling that can occur where a contract is not necessarily violated but may be degrading (or worse, degrading further over time). Good instrumentation helps me catch these issues before they become problems.

Running observability infrastructure continues to be expensive. Many companies estimate a proper observability spend at something like 15% of their cloud costs, which is pretty incredible! This is why I focus on metrics and logging—they give me the most bang for my buck without the overhead that tracing introduces.

Conclusion

This is the way I've been building systems for the last few years and it's worked out quite well in terms of stability and maintainability. Possibly most importantly it also leaves room for the natural evolution of a system over time without too much strife. By thinking about systems in terms of these functional areas, I can make better decisions about how to structure code, where to place boundaries, and how to scale as needs grow.
