Dec 15, 2024·8 min read

Software architecture notes that still help months later

Software architecture notes work best when they capture invariants, banned shortcuts, and outside dependencies instead of every detail.

Why big docs stop helping

Most architecture docs start with honest effort. Then a few releases go out, a service gets renamed, a vendor changes, and one emergency fix skips the plan. The document stays still while the system moves. After that, some of it is true, some of it is old, and nobody feels safe trusting any of it.

Drift happens fast. Teams rarely have time to redraw every diagram or rewrite every reference page after each change. Long documents need constant care, and busy teams usually spend that time shipping, fixing bugs, or answering customers.

People also stop reading long pages for a simple reason. They open docs when they need an answer now. If the answer is buried under twelve sections, a stale chart, and a wall of background, most readers skim for thirty seconds and give up. Then they ask a teammate, guess from the code, or copy the last pattern they saw. That is how old mistakes come back.

The useful part of architecture documentation is smaller than most teams think. You do not need a perfect mirror of the whole system. You need the few decisions that protect the system when memory fades. Which rules must stay true after refactors? Which shortcuts are off limits, even under deadline? Which outside systems can break you if they change?

That is why architecture notes work better when they stay narrow. A short note can survive for months because people will reopen it before a risky change. A giant page usually dies quietly. Nobody deletes it. They just stop trusting it.

A simple standard helps. If someone can read a note in five minutes and avoid one expensive mistake, the note did its job. If a page tries to describe everything, it usually ages badly. If it records invariants, forbidden shortcuts, and external dependencies, it stays useful long after the release that created it.

What belongs in a note

A useful note keeps the parts that stay true even when tickets, tools, and team size change. It is not a mini wiki. It keeps the rules that still matter six months later.

Start with invariants in plain language. These are facts the system depends on, not nice ideas. Write lines like "all writes go through this service," "jobs can run twice without causing damage," or "customer data stays in one region." If someone joins the team tomorrow, they should learn the safe boundaries from those lines alone.
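An invariant like "jobs can run twice without causing damage" is easier to trust when someone can see what it means in code. Here is a minimal sketch, assuming a hypothetical receipt job and an in-memory set standing in for a durable store of processed job IDs. The class and field names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReceiptJobRunner:
    """Hypothetical job runner that tolerates duplicate deliveries."""

    # A real system would use a durable store here; a set keeps the
    # sketch self-contained.
    processed_job_ids: set = field(default_factory=set)
    sent_receipts: list = field(default_factory=list)

    def run(self, job_id: str, order_id: str) -> bool:
        """Send a receipt once per job_id. Returns True if work was done."""
        if job_id in self.processed_job_ids:
            # Second delivery of the same job: do nothing, cause no damage.
            return False
        self.sent_receipts.append(order_id)
        self.processed_job_ids.add(job_id)
        return True


runner = ReceiptJobRunner()
first = runner.run("job-1", "order-42")
second = runner.run("job-1", "order-42")  # retry of the same job
```

The retry returns without sending a second receipt, which is the behavior the one-line invariant promises.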

Then add the shortcuts your team will not allow. This often saves more time than a polished diagram. Under pressure, people bypass the API, patch production data by hand, tuck secrets into source code, or let one service reach into another service's database. If the team agrees those moves cause bigger problems later, write that down clearly.

Outside systems belong in the note too. Name the vendors, internal platforms, and third-party tools that can break the plan. If a product depends on PostgreSQL, Cloudflare, GitLab CI, or an OpenAI pipeline, say so. Then add the failure that matters: checkout keeps working if email is down, deploys stop if CI is down, or rate limits force requests into a queue.

Skip the details that change every sprint. Exact endpoint names, class layouts, current server sizes, and temporary flags go stale fast. Put those in code, tests, or runbooks instead. The note should keep the boundaries and assumptions, not every moving part.

A short note often fits on one screen:

rules the system must keep
shortcuts the team refuses to take
outside systems that can fail
what the team should revisit when one of those changes

That is enough. Months later, someone should be able to open the note and see what must not break, where the risk sits, and which quick fix is off limits.

Start with invariants

Refactors change file names, service boundaries, and team ownership. Some rules should stay true anyway. Put those rules at the top of the note, because people need them most when the system gets messy.

An invariant is a rule that survives rewrites, migrations, and rushed fixes. If a change breaks it, the team should stop and ask whether the design changed or the change is wrong. That makes invariants more useful than a long description of how the code looks today.

Keep each invariant short. One sentence is enough in most cases. If someone cannot scan it in five seconds during an incident, it is too long.

Good invariants usually cover a few plain areas:

Data: each customer record has one canonical home, and imports never overwrite confirmed billing data.
Security: only one service can issue auth tokens, and every admin action leaves an audit record.
Ownership: one team approves schema changes for a shared database.
Incident behavior: the system may slow down or go read-only, but it must not lose accepted orders or charge twice.

Write the rule and its limit. "Payment events are append-only" says more than a paragraph about billing. "User deletion stays soft-delete for 30 days" is clearer than "users can be deleted under policy rules."
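A rule like "payment events are append-only" can also be made visible in code. The sketch below is one possible shape, not a prescribed design: the `PaymentEvent` and `PaymentLedger` names, the event kinds, and the in-memory list are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PaymentEvent:
    """Immutable record; frozen=True blocks attribute changes after creation."""
    event_id: str
    kind: str          # hypothetical values: "charged", "refunded"
    amount_cents: int


class PaymentLedger:
    """Hypothetical ledger that only exposes append and read operations."""

    def __init__(self) -> None:
        self._events: list = []

    def append(self, event: PaymentEvent) -> None:
        self._events.append(event)

    def events(self) -> Tuple[PaymentEvent, ...]:
        # Return a tuple so callers cannot mutate the internal list.
        return tuple(self._events)

    def balance_cents(self) -> int:
        # A correction is a new event, never an edit of an old one.
        return sum(
            e.amount_cents if e.kind == "charged" else -e.amount_cents
            for e in self._events
        )


ledger = PaymentLedger()
ledger.append(PaymentEvent("evt-1", "charged", 5000))
ledger.append(PaymentEvent("evt-2", "refunded", 5000))  # undo by appending
```

The ledger has no update or delete method, so the only way to fix a mistake is to append another event.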

This also helps when people improvise under pressure. During an outage, engineers often take shortcuts that look harmless for ten minutes and create weeks of cleanup later. A short invariant like "we never edit ledger rows by hand in production" prevents that kind of damage.

If you feel tempted to explain everything, stop and trim. Notes age badly when they try to mirror the full system. Invariants age better because they describe the lines you do not want future work to cross.

A good test is simple: if a new engineer reads the note, can they tell what must remain true after the next big refactor? If yes, the note will still help months later.

Write down shortcuts you will not allow

A note gets much more useful when it names the bad fixes people reach for under pressure. Teams rarely get in trouble because they forgot a big design diagram. They get in trouble because someone took a fast shortcut, nobody wrote it down, and six months later that shortcut became normal.

Be blunt. If a shortcut is off limits, say so in plain words and add one sentence on why. New team members should not have to guess whether a quick fix is clever or a mess waiting to grow.

For a small SaaS team, the usual trouble spots look like this:

No direct database edits in production unless the team writes a migration or a reviewed emergency script. Manual edits drift from application logic and are hard to trace later.
No hidden cron jobs on random servers. If a job matters, the team puts it in the repo, names the owner, and logs failures.
No bypassing auth or rate limits for "internal use." Those exceptions tend to leak into real user paths.
No manual server changes that live only in shell history. If setup matters, the team puts it in code or a written runbook.

These rules work because they stop repeat damage. A direct database fix can solve today's ticket and quietly break tomorrow's deploy. A hidden cron job can keep billing alive for months, then fail when one machine gets replaced. A manual nginx tweak can save ten minutes and cost a day during incident recovery because nobody knows it exists.

Write each rule so a new hire can act on it without asking for translation. "Avoid manual changes" is weak. "Do not change production config by hand. Update Terraform or the deploy script instead" is clear.

If you want notes that age well, include the shortcut, the reason, and the safe path. That gives the team a default move when stress is high, which is when bad habits usually start.

Track outside systems and vendors

A system rarely fails in isolation. It depends on payment providers, email services, DNS, cloud storage, queues, maps, auth tools, and vendor APIs. If your note skips those, people waste time guessing where the real risk sits.

Good notes name every outside dependency that can stop work or break user-facing features. That includes obvious vendors like Stripe or AWS, but also less visible parts such as a message queue, a hosted database backup service, Cloudflare, or an LLM API used in one flow.

For each dependency, write down four things: why you use it, who owns it on the team, what limits apply, and what happens when it fails. Ownership matters more than teams expect. When alerts fire at 2 a.m., someone should know whether to call the backend lead, the DevOps person, or the founder.

A short list often works better than a long paragraph:

Note any rate limits, quota caps, and billing thresholds.
Record contract details that affect architecture, such as data retention, region rules, or shutdown notice periods.
Write the first failure point, not every possible one. Timeouts, webhook delays, expired tokens, and queue backlog cover a lot.
Say which features degrade and which features stop completely.
Add the team owner for each dependency.

This saves real time during incidents. If email delivery fails, users may miss password resets but still use the app. If your auth provider goes down, nobody logs in. If the queue stalls, orders may pile up even though the frontend still looks fine. Those are very different failures, and the note should make that obvious.
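If the team keeps notes in the repo, the same dependency list can live as a small data structure that a script can check. This is a rough sketch under assumptions: the field names, the `Impact` values, and the sample entries are placeholders, not a standard format.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    DEGRADES = "degrades"  # feature gets worse but the app stays usable
    STOPS = "stops"        # feature is unavailable until the dependency recovers


@dataclass(frozen=True)
class Dependency:
    name: str
    purpose: str        # why you use it
    owner: str          # who owns it on the team
    limits: str         # rate limits, quotas, contract constraints
    first_failure: str  # the failure you expect to see first
    impact: Impact


# Illustrative entries only; owners and limits are placeholders.
DEPENDENCIES = [
    Dependency("email provider", "password resets and receipts", "backend lead",
               "placeholder: daily send quota", "delivery delays", Impact.DEGRADES),
    Dependency("auth provider", "user login", "DevOps",
               "placeholder: requests per minute", "expired tokens", Impact.STOPS),
]


def blocking_dependencies(deps):
    """Return the names of dependencies whose failure stops a feature outright."""
    return [d.name for d in deps if d.impact is Impact.STOPS]


def missing_owner(deps):
    """Return the names of dependencies with a blank owner field."""
    return [d.name for d in deps if not d.owner.strip()]
```

A check like `missing_owner(DEPENDENCIES)` could run in CI so a new dependency cannot land without a named owner.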

One small habit helps: label each entry as either an "external dependency" or an "internal service that feels external," and list both. A self-hosted GitLab runner or your own Redis queue is still worth listing, because other parts of the system depend on it just as much as they depend on a vendor.

Months later, this section often helps more than diagrams. Vendors change prices, limits, and uptime. Your note should make the blast radius clear before that change hurts production.

Create a note in 20 minutes

A good note starts small. Pick one system or one workflow that people touch often, such as login, billing, deployments, or customer data import. If you try to describe the whole company, you will stall, argue over format, and finish nothing.

Most good notes fit on one screen. That is enough if the note captures the few facts that must still be true six months from now.

Use the first 10 minutes to write three to five invariants. These are the rules nobody should break, even when they are in a hurry. Keep them plain and testable. "Payments only go through one service" is useful. "The payment layer should stay clean" is too vague.

Then spend about 5 minutes on shortcuts you will not allow. Past incidents help here. Maybe someone once edited production data by hand, skipped a queue, or stored secrets in app code. Write those down as forbidden moves. This saves more time than a long description of the ideal design.

Finish with outside dependencies and owners. Another 5 minutes is enough.

List the systems you rely on, such as a payment provider, cloud storage, email service, or identity provider.
Name the owner for each one, even if that owner is a person, not a team.
Add the failure you expect first, like rate limits, expired tokens, bad webhooks, or vendor downtime.
Note where alerts or logs live if people need to check them fast.

Save the note where the team already works. If engineers live in the repo, put it near the code. If the team works out of GitLab issues or an internal wiki, store it there instead. A perfect note in the wrong place is easy to forget.

If you already keep technical decision records, this note can sit next to them. It does a different job. A decision record explains why you chose something. This note tells the next person what they must not break.

A simple example from a small SaaS team

Picture a five-person SaaS team with one web app, monthly billing, email login, and routine emails like receipts, invites, and password resets. They do not need a giant wiki. They need a note that tells the next person what must stay true when the code changes.

A short note for that team might look like this:

Product: TeamFlow

Invariant
- A customer account can only read or change its own data.
- Every query that touches projects, invoices, or audit logs must include account_id from the authenticated session.
- Background jobs and admin tools follow the same rule. No bypass path.

Forbidden shortcut
- During urgent releases, do not use a support-only script to edit billing state directly in the database.
- Change billing state only through the billing service so webhooks, audit logs, and emails stay in sync.

External dependency
- Stripe is the source of truth for payment status.
- We store Stripe customer_id, subscription_id, last webhook event id, and local status.
- If Stripe and our database disagree, we trust the latest verified webhook and replay events before changing access.

That one invariant does a lot of work. It tells every new developer that customer boundaries are not a suggestion or a controller-level check. The rule applies in queries, jobs, scripts, and admin screens. If someone later adds a bulk export job and forgets account_id, the note gives the reviewer a clear reason to block it.
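One way to make the account_id rule hard to skip is a query helper that cannot be called without it. The sketch below uses Python's built-in sqlite3 module with a made-up `projects` table. In a real app the account_id would come from the authenticated session, and the schema here is an assumption.

```python
import sqlite3


def fetch_projects(conn: sqlite3.Connection, account_id: int) -> list:
    """Return project names for one account only.

    account_id is a required argument, so there is no way to call this
    helper without a tenant filter.
    """
    if account_id is None:
        raise ValueError("account_id is required for every projects query")
    rows = conn.execute(
        "SELECT name FROM projects WHERE account_id = ? ORDER BY name",
        (account_id,),
    ).fetchall()
    return [name for (name,) in rows]


# In-memory database with a hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects ("
    "id INTEGER PRIMARY KEY, account_id INTEGER NOT NULL, name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO projects (account_id, name) VALUES (?, ?)",
    [(1, "Website"), (1, "Mobile app"), (2, "Internal tools")],
)

account_one_projects = fetch_projects(conn, account_id=1)
account_two_projects = fetch_projects(conn, account_id=2)
```

Each call returns only the rows for the account it was given, which is the behavior a reviewer can ask for in any new job or admin tool.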

The forbidden shortcut matters most when the team is tired and trying to ship on Friday night. Direct database edits feel faster, but they leave gaps. A customer may get access without a receipt, or lose access while Stripe still shows an active subscription. Writing down one shortcut you will not allow saves hours of cleanup later.
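The safe path behind that rule can be sketched as a single entry point for billing changes. This is an illustration only: `BillingService`, its fields, and the in-memory audit log and email outbox are hypothetical stand-ins for whatever the real service uses.

```python
from dataclasses import dataclass, field


@dataclass
class BillingService:
    """Hypothetical single entry point for billing state changes."""

    status_by_account: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)
    outbox: list = field(default_factory=list)  # emails waiting to be sent

    def set_status(self, account_id: int, new_status: str, actor: str) -> None:
        """Change billing status and record every side effect together."""
        old_status = self.status_by_account.get(account_id, "none")
        self.status_by_account[account_id] = new_status
        # The audit entry and the customer email come from the same call,
        # so they cannot be skipped the way a manual database edit skips them.
        self.audit_log.append((account_id, old_status, new_status, actor))
        self.outbox.append((account_id, f"Your plan is now {new_status}"))


billing = BillingService()
billing.set_status(account_id=7, new_status="active", actor="support:anna")
```

A support script that calls `set_status` keeps the audit log and emails in step with the status. A raw UPDATE statement does not.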

The dependency note helps during onboarding because it answers the question new hires usually ask after the first billing bug: "Which system do we trust?" Instead of tracing code for half a day, they can see the rule, the stored fields, and the recovery path in a minute. That is what good architecture documentation does. It cuts guesswork.
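The recovery path in that dependency rule, trusting the latest verified webhook and replaying events, might look roughly like this. The sketch does not call Stripe or use its event schema. `WebhookEvent` is a simplified stand-in for the events the team stores locally, and the status values are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WebhookEvent:
    """Simplified stand-in for a stored webhook event (not Stripe's schema)."""
    event_id: str
    created: int      # Unix timestamp
    verified: bool    # True if the signature check passed when it arrived
    status: str       # hypothetical local status, e.g. "active", "past_due"


def status_from_events(events: Iterable[WebhookEvent]) -> Optional[str]:
    """Replay verified events in time order and return the resulting status."""
    status = None
    for event in sorted(events, key=lambda e: e.created):
        if not event.verified:
            continue  # never let an unverified event change access
        status = event.status
    return status


def reconcile(local_status: str, events: Iterable[WebhookEvent]) -> str:
    """Prefer the replayed status; keep the local one if nothing is verified."""
    replayed = status_from_events(events)
    return replayed if replayed is not None else local_status


events = [
    WebhookEvent("evt_2", created=200, verified=True, status="past_due"),
    WebhookEvent("evt_1", created=100, verified=True, status="active"),
    WebhookEvent("evt_3", created=300, verified=False, status="canceled"),
]
resolved = reconcile(local_status="active", events=events)
```

Here the unverified event is ignored and the latest verified one wins, so access follows "past_due" even though the local row still says "active".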

Mistakes that waste time

Most notes fail for a simple reason: they read like a diary. Teams write what happened last month, who said what in a meeting, and why a choice felt right at the time. Months later, nobody wants the story. They want the rule that still applies today.

A short note with current constraints beats a long timeline every time. If a rule changed, update the rule. If the old choice still matters, keep one line on why it matters now.

Another common miss is copying diagrams into notes and then forgetting them. A stale diagram does more harm than no diagram because people trust it. If the queue, service boundary, or data flow changed three times, the old picture sends people in the wrong direction.

Keep diagrams only if someone will update them as part of normal work. Otherwise, write the few facts that must stay true in plain language. A sentence like "billing calls the payment provider directly" ages better than a picture nobody touched for six months.

Vague warnings waste time too. "Do not break this" tells nobody what matters. Say what must stay true instead. For example: "invoice IDs must stay stable after export because accounting imports them by ID." That gives a developer something they can check before shipping.

Dependencies often disappear into meeting notes, chat threads, or a ticket comment. Then a vendor limit, a cron job, or a manual finance step surprises the next person. Notes should name outside systems clearly, including who owns them and what can fail.

Length is its own problem. Once a note gets so long that people have to scroll, skim, and guess, they stop using it. Most teams do better with notes that fit on one screen, or close to it.

A good filter is simple:

Does this note tell me the current rule?
Can I see the outside systems fast?
Would a new engineer understand what must not change?
Can I scan it in two minutes?

If any answer is no, cut harder. A note people read is better than a perfect note nobody opens.

Quick checks before you save it

A note only earns its keep if a new engineer can scan it fast and still make a safe change. If that takes twenty minutes, the note is already too heavy. Good notes are short enough to read in one sitting and clear enough that a newcomer can repeat the main rules back to the team.

Read each rule and check the wording. A strong rule says what must stay true, even if the code, vendor, or team changes later. "All billing writes go through one service" is useful. "We currently use service X for billing" is just a snapshot.

Teams also need to name the shortcuts people really try when deadlines get tight. Be plain about them. If engineers keep reaching into the database from random jobs, say that this is not allowed. If people keep bypassing the queue for urgent work, write that down too. Notes age better when they talk about real temptations, not imaginary best practice.

Outside systems need their own small block in the note. Name the service, why you depend on it, who owns the relationship, and what breaks if it goes down. This matters more than a long diagram in most small teams. When a payment provider, auth service, or cloud secret manager fails, people need fast answers, not a history lesson.

A quick review usually catches weak spots:

Someone new can read the note in about five minutes and explain the system limits.
Every rule describes an invariant, not a preference.
The note names the risky shortcuts your team actually attempts.
External services, vendors, and internal owners are easy to find.
After the next release, one engineer can update the note in a few minutes.

That last check matters more than it seems. If updating the note feels annoying, nobody will do it. Keep the format simple, keep the rules concrete, and leave out anything that turns stale by next month.
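Part of this review can be automated if notes live in the repo. The sketch below is one possible check, assuming notes use the three section headings from the TeamFlow example and that five minutes of reading is roughly 1000 words. Both are assumptions to adjust, not fixed rules.

```python
REQUIRED_SECTIONS = ("Invariant", "Forbidden shortcut", "External dependency")
# Assumption: about 200 words per minute, so five minutes is roughly 1000 words.
MAX_WORDS = 1000


def check_note(text: str) -> list:
    """Return a list of readable problems; an empty list means the note passes."""
    problems = []
    lines = [line.strip() for line in text.splitlines()]
    for section in REQUIRED_SECTIONS:
        if section not in lines:
            problems.append(f"missing section: {section}")
    word_count = len(text.split())
    if word_count > MAX_WORDS:
        problems.append(f"too long: {word_count} words (limit {MAX_WORDS})")
    return problems


note = """Product: TeamFlow

Invariant
- A customer account can only read or change its own data.

Forbidden shortcut
- Change billing state only through the billing service.

External dependency
- Stripe is the source of truth for payment status.
"""

problems = check_note(note)
incomplete = check_note("Invariant\n- Payment events are append-only.\n")
```

A script like this cannot judge whether a rule is a real invariant, but it catches notes that lost a section or grew past the point where people stop reading.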

What to do next

Pick one system this week. Choose the one that wakes people up, slows releases, or keeps the same questions alive in chat. Write one note for that system and stop there. A single useful page beats a folder full of half-finished documentation.

Keep the first version small. Add the rules that should stay true, the shortcuts your team will not accept, and the outside services that can break your day. If you already have notes scattered across tickets, pull the best parts into one place and delete the rest later.

A simple first pass can fit on one page:

what must stay true in production
what engineers must not do, even under pressure
which vendors, APIs, queues, or data sources the system depends on
which recent decision still shapes the design

Then use the next real event as a test. After your next release, outage, or ugly bug, open the note and fix what felt missing. Teams usually learn more from one rough incident review than from a month of neat writing. If the note did not help someone answer a question faster, trim it or rewrite it.

Once one note proves useful, make it a small team habit. You do not need a big process. Ask for a short update when someone changes an invariant, adds a new external dependency, or makes a decision that will still matter in six months. Ten clean minutes after a release can save hours later.

It also helps when one person owns the format. That person does not need to write every note. They just keep the notes short, readable, and consistent enough that people trust them.

If your team wants an outside review, Oleg Sotnikov at oleg.is works with startups and smaller companies as a Fractional CTO and advisor. That kind of review can help when the real problem is not missing documentation, but too much stale documentation and too few clear rules.

Frequently Asked Questions

What is an architecture note, really?

An architecture note is a short page that tells your team what must stay true, which shortcuts you refuse to take, and which outside systems can break the workflow. It does not try to describe the whole system.

How long should an architecture note be?

Keep it short enough to read in about five minutes. If people need to scroll through a long history lesson, they will stop using it when they need help fast.

What counts as an invariant?

An invariant is a rule that should survive refactors and rushed fixes. Good examples sound like "all billing writes go through one service" or "accepted orders must never disappear," because a reviewer can check those rules against real changes.

Should I include diagrams?

Only keep a diagram if someone will update it as part of normal work. If your team never maintains diagrams, write the rule in plain language instead, because a stale picture sends people the wrong way.

Which forbidden shortcuts should we write down?

Name the shortcuts that create repeat damage under pressure. Direct database edits, hidden cron jobs, manual production config changes, and auth bypasses make good candidates because teams reach for them when they feel rushed.

How do we document external dependencies?

List every vendor or internal platform that can stop work or break user-facing features. Add why you use it, who owns it, what limits apply, and what fails first, so the team knows where to look during an incident.

Where should we store these notes?

Put the note where engineers already work. For most teams, that means the repo, a nearby docs folder, or the same place you keep technical decisions, because people trust notes they can find in seconds.

When should we update an architecture note?

Update the note when an invariant changes, when you add or remove a dependency, or when an incident exposes a missing rule. You do not need a big review cycle; one engineer can refresh a short note right after the change.

How is this different from an ADR or a runbook?

A decision record explains why you chose something at a certain moment. A runbook explains how to do a task. An architecture note tells the next engineer what they must not break.

We already have messy docs. Where should we start?

Start with one workflow that causes pain, like login, billing, deploys, or data import. Write three to five invariants, one or two forbidden shortcuts, and the outside systems that affect it, then trim anything that reads like history.