Jul 26, 2024·8 min read

Tiny engineering team: running global software with less toil

A tiny engineering team can support global software when the stack stays simple, ownership stays clear, and daily toil stays low.

Why more people do not always fix the work

Adding people to a messy system often spreads the mess. New engineers inherit the same unclear setup, the same odd deploy steps, and the same half-remembered workarounds. Small teams feel this first, but bigger teams pay for it too.

Salary is only part of the cost. Every extra service adds alerts, dashboards, release steps, and new ways to fail at 2 a.m. If one customer action touches five systems, a small bug now has five places to hide.

Handoffs make it worse. When one engineer owns the API, another owns the queue, and a third owns billing logic, even a simple fix turns into messages, meetings, and waiting. Nobody sees the full path, so each person checks their piece and hopes the problem sits somewhere else.

That is why busy teams often feel overloaded even when they are shipping very little. They spend hours retrying failed deploys, chasing noisy alerts, explaining old decisions, and patching the same weak spots again. It is real work. It just is not product work.

There is another way to run things. Oleg Sotnikov has shown it in practice at AppMaster, where a full-size team became a much smaller AI-first operation while uptime stayed near perfect. That result did not come from asking fewer people to work harder. It came from simpler architecture, fewer overlapping tools, and clear ownership.

Hiring helps when demand truly exceeds capacity. It helps far less when the stack is noisy, split into too many parts, or built around unclear boundaries. If the system creates confusion every day, more headcount often just gives the confusion more people to slow down.

Where small teams lose time

Most small teams do not run out of effort first. They run out of attention.

The work may fit the team, but constant switching does not. When one product uses three backend languages, two front-end stacks, and a pile of one-off services, every change gets slower. A bug in billing pulls someone into Python, the next task needs TypeScript, and the deploy script still lives in Bash. Nobody stays sharp across all of it. Small gaps turn into long pauses while people reload context.

Old automation creates a quieter mess. Teams keep scripts that "mostly work" until the day they fail. Then everyone waits for the one person who remembers why a strange flag was added two years ago. If that person is asleep, away, or buried in customer work, the queue stalls.

Alerts make this worse when they fire for short spikes, harmless retries, or jobs that fix themselves. After a while, people stop trusting alerts. That is a bad trade. The team loses focus during the day, and real incidents get missed at night.
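One common way to quiet alerts that fire on short spikes is to page only when a check keeps failing. The sketch below is a minimal illustration of that idea, not a replacement for the alerting tool a team already runs. The class name, the threshold of three consecutive failures, and the sample check results are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SustainedAlert:
    """Pages only after `required_failures` consecutive failed checks."""

    required_failures: int = 3  # hypothetical threshold; tune per check
    consecutive_failures: int = 0

    def observe(self, check_failed: bool) -> bool:
        """Record one check result; return True when a page should be sent."""
        if check_failed:
            self.consecutive_failures += 1
        else:
            # A healthy check resets the streak, so brief spikes never page.
            self.consecutive_failures = 0
        # Page exactly once, at the moment the streak reaches the threshold.
        return self.consecutive_failures == self.required_failures


alert = SustainedAlert(required_failures=3)

# A short spike that recovers on its own: no page.
spike = [alert.observe(failed) for failed in [True, False, True, True, False]]

# A sustained failure: one page, on the third consecutive failed check.
outage = [alert.observe(failed) for failed in [True, True, True, True]]

print(spike)
print(outage)
```

The same rule can usually be expressed as a "fire only if the condition holds for N minutes" setting in an existing monitoring tool, which is preferable to custom code when that option is available.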

The pattern is familiar. One engineer handles support questions, deployment issues, and urgent product edits in the same afternoon. Monitoring sends so many warnings that real problems blend into background noise. Build and deploy steps depend on fragile scripts with little explanation. Different parts of the product need different tools, so even small fixes start with setup work.

This pressure compounds fast. A customer report interrupts planned work. The interruption delays a release. The delayed release creates more support messages. By Friday, the team has been busy all week and has still shipped very little.

Teams that stay calm remove recurring friction before they add people. Less tool sprawl, fewer mystery scripts, and quieter alerts often save more time than one extra hire.

Choose an architecture your team can actually run

Small teams pay for every extra moving part twice: once when they build it, and again at 2 a.m. when they have to debug it. A simple system with clear edges usually beats a clever one with five services, three queues, and two caches.

Most teams do not need microservices on day one. One app, one database, and one worker process can carry far more traffic than people expect. Split a service only when you can point to a real pain. Maybe one part scales very differently. Maybe it needs a separate deploy schedule. Maybe it fails in a way that should not drag down the rest of the app. If you cannot name the pain clearly, keep it together.

Use storage and messaging your team already knows. If the team can run PostgreSQL well, that is often enough for the main database, many background jobs, and even simple event flows. Adding a new database or queue should take a hard reason, not curiosity. Every new tool brings backups, alerts, odd failure modes, and someone who has to own them.

Keep the request path short. If a user action hits a gateway, then two APIs, then a queue, then a worker, then a cache, incidents get slow and messy. Short paths are easier to trace, test, and explain to the next engineer who joins.

Watch for quiet duplication. Teams add a second cache because one endpoint is slow, a second job runner because the first one feels messy, or a side service for reporting that slowly becomes required. That is how weekly toil sneaks in. If two tools do nearly the same job, pick one and remove the other.

A good test is simple. If your team cannot draw the full production path on a whiteboard in a few minutes, the design is probably too wide for the team you have.

Give every part a clear owner

A small team cannot afford mystery ownership. When a service slows down at 2 a.m., one person should know it first, open the dashboard first, and decide the first fix. If three people half-own it, nobody moves fast.

Put one name on every service, scheduled job, queue, database, and dashboard. That does not mean one person does all the work forever. It means one person keeps the map in their head, notices drift early, and makes the final call when tradeoffs appear.

Ownership works best when it includes routine mess, not just new features. The owner should decide logging format, alert thresholds, retry rules, cleanup tasks, and when old code needs to go. Teams get buried in repeat work when everybody can add more of it but nobody feels responsible for removing it.

Write ownership next to the thing itself. Keep it in the repo, the runbook, or the service catalog. It should be easy to answer a few basic questions: who handles incidents first, who approves deploys, who reviews schema changes, who can change alerts, and who covers time off.
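As a rough illustration, the ownership questions above can live in a small record kept beside the service code, with a check that flags any question nobody has answered yet. This is only a sketch: the field names, the `unanswered` helper, and the sample service and person are hypothetical, and a plain text file in the repo works just as well.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Ownership:
    """One record per service; every field holds a person's name."""

    incident_first_responder: str
    deploy_approver: str
    schema_reviewer: str
    alert_editor: str
    time_off_cover: str


def unanswered(record: Ownership) -> list[str]:
    """Return the names of the fields that are still blank."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Hypothetical service and names, for illustration only.
billing = Ownership(
    incident_first_responder="Dana",
    deploy_approver="Dana",
    schema_reviewer="Dana",
    alert_editor="Dana",
    time_off_cover="",
)

print(unanswered(billing))  # ['time_off_cover']
```

A check like this could run in CI so a new service cannot ship without answers, though that wiring is a design choice and not something the approach requires.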

Shared ownership sounds friendly, but it often creates delay. Use it only where it makes real sense, such as a common auth service or the CI pipeline, and still name one person as the decider.

This lowers stress too. People stop getting random pings for systems they barely know. New engineers ramp up faster because they can see who to ask. Cleanup work stops feeling optional.

Teams running a lot of infrastructure with very few people need these lines to stay simple. If one engineer owns deploy rules for a service and another owns its database changes, mistakes spread fast. Keep it boring: one part, one owner, one backup.

Keep the toolchain narrow

A narrow toolchain often cuts more work than one extra hire. Small teams do better when everyone builds, tests, and deploys software the same way. If one service uses GitLab pipelines, another uses shell scripts, and a third depends on a manual checklist, small problems turn into long nights.

Pick one main path and keep using it. One build flow, one test flow, one deploy flow. Familiar steps reduce mistakes, and new engineers get productive faster because they do not have to learn five local customs.

Language sprawl causes the same drag. Every language adds package tools, lint rules, testing style, runtime quirks, and hiring overhead. That trade can make sense in a large company with specialist teams. It is usually a bad deal for a small group supporting software around the clock.

If your core services can live in one or two languages, keep them there. Save exceptions for cases that are clearly worth it, not personal taste. A service written in a third or fourth language often looks harmless at first. Six months later it has its own build hacks, dependency problems, and nobody wants to touch it.

Shared templates help more than teams expect. A good starter for an API, a background job, or an internal admin tool removes dozens of tiny decisions. Logging works the same way. Health checks work the same way. Deployment settings look familiar. The code is not identical, but the shape is.

This is the same discipline Oleg Sotnikov uses in AI-first operations: a small set of tools for CI/CD, observability, and deployment instead of a pile of overlapping services. It keeps daily work calmer because the team fixes problems inside one known setup.

A tool should stay only if it earns its place. It should remove repeat manual work, solve a problem nothing else solves, and be supportable by more than one engineer a year from now. Otherwise it is just another thing to babysit.

Retire duplicates quickly. Two monitoring tools, two job runners, or two ways to manage secrets usually mean double maintenance and fuzzy ownership. Standardization is not exciting, but it works. Boring systems fail less often, and when they do fail, someone knows where to look first.

A simple way to cut toil in 30 days

Most teams do not need more headcount first. They need a clear map of what is running, who owns it, and what keeps breaking.

Start with a simple inventory. List every service, scheduled job, dashboard, queue, and deploy script. If it touches production, put it on the sheet. Teams often get an early surprise here. They find old jobs nobody trusts, dashboards nobody checks, and alerts that fire but never lead to action.

Then put one owner next to each line. Use a person's name, not a team label. Ownership does not mean they do every task. It means they decide whether that part stays, changes, or gets removed.

A 30-day cleanup cycle can stay very small:

  1. Week 1: build the inventory and assign owners.
  2. Week 2: count pages, failed deploys, and repeat tickets.
  3. Week 3: write runbooks for the three problems that come back most often.
  4. Week 4: remove one tool or process that adds work and gives little back.

Keep the counts simple. Do not waste time arguing about perfect severity levels or tagging rules. Just count how often people get paged, how often deploys fail, and which tickets keep coming back. If one alert wakes someone up four times in ten days, that is enough proof.
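The counting really can be this small. The sketch below tallies pages by alert name and lists the ones that cross a repeat threshold. The alert names and the page log are made up for illustration, and the threshold of four is an assumption borrowed from the "four times in ten days" example above.

```python
from collections import Counter

# Hypothetical page log for a ten-day window: one entry per page, by alert name.
pages = [
    "disk-usage-high",
    "queue-lag",
    "disk-usage-high",
    "disk-usage-high",
    "cert-expiry",
    "disk-usage-high",
]

REPEAT_THRESHOLD = 4  # assumption: four pages in the window is enough proof

counts = Counter(pages)

# Alerts that paged at least REPEAT_THRESHOLD times, most frequent first.
repeat_offenders = [
    name for name, n in counts.most_common() if n >= REPEAT_THRESHOLD
]

print(repeat_offenders)  # ['disk-usage-high']
```

The same tally works for failed deploys and repeat tickets: swap the list of alert names for a list of deploy targets or ticket subjects.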

Runbooks should stay short. One screen is usually enough: what broke, how to check it, how to recover, and when to escalate. If a tired engineer cannot follow it quickly, the runbook is too long.

This works because it cuts repeat work at the source. Clear ownership and fewer moving parts let a much smaller team support a lot more than people expect.

At the end of the month, compare the numbers. If pages dropped, deploys got calmer, and fewer tickets came back, keep going. If nothing changed, your inventory is missing parts or the owners are not making decisions yet.

Example: a product used across time zones

A small SaaS company had customers in Europe, Asia, and the US. With only three engineers, someone was always close to being on call. One person watched the customer app, one handled billing, and one kept the admin tools alive, but in practice all three kept getting dragged into the same incidents.

The main problem was not traffic. It was the number of moving parts. The team had two separate background workers, each with its own retry logic, logs, and failure modes. They also pushed updates through different deploy paths, and people still made manual server changes when something looked urgent.

That setup gets expensive fast. A failed payment retry could touch billing, queue processing, and support at the same time. If a deploy drifted from one server to another, the next engineer spent half the night comparing configs instead of fixing the bug.

They changed less than most teams expect. They merged the two background workers into one simple job system, moved all deploys into one pipeline, stopped changing servers by hand, and gave each area one clear owner even though everyone could still help in an emergency.

The payoff came within weeks. Fewer queues meant fewer edge cases. One deploy path meant production matched what they tested. When a job failed, logs came from one place, and rollback took minutes instead of a late-night guessing session.

Night alerts dropped because fewer parts could fail at once. The billing engineer no longer had to inspect app worker logs. The app engineer stopped babysitting server drift. The admin tools became boring, which is exactly what internal tools should be.

This kind of change does not make a team look bigger. It makes the work smaller. A three-person team can support users across time zones when the system stays predictable, ownership stays clear, and nobody needs heroics to ship a fix.

Mistakes that add work every week

Most weekly pain is self-made. Small teams can support a lot more than people expect when they stop creating repeat work.

One common mistake is splitting a product into microservices before traffic, reliability needs, or team size actually demand it. That turns one deploy into several, one log search into several places, and one bug into a long guessing game across service boundaries. If the app still fits in one codebase, keep it there.

Framework drift causes a quieter mess. If each engineer picks a different framework, the team pays for that freedom every day. Reviews slow down, bug fixes take longer, and handoffs turn into relearning sessions. A narrow stack is easier to hire for, easier to document, and much easier to run at 2 a.m.

Manual deploy steps are another tax teams keep paying because they feel temporary. A shell command on one laptop, a dashboard click, a config edit, a forgotten restart: any one of these is enough to make releases stressful. Then the one person who remembers the sequence goes offline, and everyone else starts guessing. If a step happens more than once, automate it.

Tool sprawl adds hidden drag. Teams add one more alert tool, one more CI check, one more place for docs, but rarely remove the old ones. Soon the same incident shows up in three systems, and nobody knows which one tells the truth. The rule is simple: fewer moving parts usually mean fewer weekly surprises.

The worst habit is treating on-call pain as normal. Repeated pages are not part of the job description. They usually point to noisy alerts, weak ownership, missing tests, or brittle deploys. If the same problem wakes someone up every week, fix the source, tune the alert, or delete it.

Before you add another engineer

Hiring helps when the workload is truly bigger than the team. It does not help much when people spend half the week chasing unclear systems, noisy alerts, and strange deploy steps.

Run a blunt audit before you open a role. Ask one engineer to trace a customer request from the first click to the database. If they cannot explain that path without guessing, the problem is not headcount. The system is harder to run than it should be.

Then test how fast the team can ship a small change. Give a newer teammate a minor fix with a normal review and deploy. If they still cannot ship it safely in one day, the setup is too fragile. Teams slow down when environments differ, deploy steps live in somebody's head, or rollback is unclear.

A short checklist catches most of the waste:

  • One engineer can explain the request path through the app, jobs, and database.
  • A newer teammate can ship a low-risk change in a day without asking three people for help.
  • Alerts tell the on-call engineer what to do next instead of dumping raw noise into chat.
  • Each service has one owner and one deploy path.
  • The team can name the chores that repeat every week.

That last point matters a lot. Repeated work is where hours disappear. If the same two or three tasks come back every week, write them down and count the time. Ten small chores can easily consume a full engineer day.
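To see how quickly small chores add up, a worked tally helps. The chore names and minutes below are invented for illustration, and the eight-hour working day is an assumption. The point is only that ten tasks of 30 to 60 minutes each come close to a full day.

```python
# Hypothetical weekly chores and the minutes each one takes.
chore_minutes = {
    "restart stuck worker": 60,
    "rotate expired token": 45,
    "re-run failed deploy": 30,
    "clean up temp files": 45,
    "answer 'what changed?'": 60,
    "grant repo access": 30,
    "fix local setup drift": 45,
    "chase a noisy alert": 60,
    "hand-edit a config": 45,
    "re-send a failed report": 30,
}

WORKDAY_MINUTES = 8 * 60  # assumption: an eight-hour engineer day

total_minutes = sum(chore_minutes.values())
total_hours = total_minutes / 60
share_of_day = total_minutes / WORKDAY_MINUTES

print(f"{len(chore_minutes)} chores, {total_hours} hours, {share_of_day:.0%} of a day")
```

With these sample numbers the ten chores take 7.5 hours, which is most of one engineer day every week before any product work starts.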

If these checks fail, fix the system first. Your next hire should add capacity, not inherit confusion.

What to do next

Start by making the work easier before you make the team bigger. Small teams usually break under scattered tools, unclear ownership, and noisy deploys, not a lack of effort.

Pick one place to check deployments, one place to read logs, and one place to see who owns each service. If people jump between five dashboards and a chat thread just to answer "what changed?", you already know where the wasted hours go.

Spend the next week removing friction before you write a hiring plan. Watch where work stalls. Count how often someone asks for access, waits for a deploy, hunts for logs, or fixes the same setup issue twice. Those are not small annoyances. They quietly eat whole afternoons.

A good first pass is simple:

  • Write down the owner for every service, job, and database.
  • Choose one deploy path and stop making exceptions.
  • Send logs and alerts to one place the whole team uses.
  • Delete tools nobody wants to maintain.
  • Fix one of the top repeat problems this week.

As the product grows, keep headcount and complexity in balance. If one new feature needs a new service, a new database, a new queue, and a new dashboard, stop and ask whether the team can still run that setup six months from now. Growth is fine. Hidden upkeep is what hurts.

This is where small teams often make the wrong trade. They add people to carry messy systems instead of making the systems lighter. One careful architecture change can save more time than one extra hire.

If you want an outside review, Oleg Sotnikov does this kind of Fractional CTO and startup advisory work through oleg.is. He helps startups and small businesses simplify architecture, infrastructure, and AI-driven development so teams can move faster with less overhead.

Do the cleanup first. Then decide whether you still need another engineer.