Broken engineering decisions: why another hire won't help
Broken engineering decisions often hide in recent incidents and stalled work. Learn how to spot rule gaps before you hire another engineer.

What this looks like day to day
Broken engineering decisions rarely arrive as one dramatic failure. They show up as the same friction over and over.
A payment bug appears in two products built by two different teams. The same approval step slows every release. Engineers reopen an old argument every sprint because nobody wrote down the default answer, the owner, or the condition that would change the decision.
After a while, the pattern gets familiar. Similar bugs appear in different parts of the product. Projects pause at the same handoff. New hires ask basic process questions and get different answers from different people. Urgent work jumps the queue because nobody trusts the normal path.
Picture a small startup that ships quickly but keeps breaking account permissions. Backend engineers fix one case. Frontend engineers patch another. Support adds a manual workaround. Then a new senior engineer joins and expects a clean access model. Instead, they find five exceptions, no clear owner, and no written rule for who can bypass the normal flow.
That is what broken decisions look like in daily work. People stay busy. Tickets move. Output looks fine for a while. But the team keeps paying for the same missing rule again and again.
Why another engineer won't fix it
A new engineer usually lands inside the same system that slowed everyone else down. The code may be messy, but the deeper issue is often simpler: nobody has clear rules for how decisions get made.
When that happens, even a strong hire starts guessing. They copy the loudest voice, follow old habits, or patch around the problem. They might write cleaner code, but they still work inside the same unclear system.
More headcount can even make things worse. If a team already makes weak decisions, extra hands help it make those decisions faster. One engineer adds a shortcut. Another builds on top of it. A third ships it because nobody owns the final call. The team looks productive. The risk grows anyway.
Hiring can also hide ownership gaps for a few weeks. Tickets move again, so everyone feels relief. Then the same pattern comes back with a new label. Last quarter it was "deploy instability." This quarter it is "edge cases" or "integration issues." The wording changes. The missing rule does not.
Take a team that keeps shipping features that break reporting. They hire another backend engineer. That person adds checks, clears the backlog, and patches the immediate issues. Two months later, reporting breaks again because nobody agreed on one rule: schema changes need downstream impact review before they ship. The extra engineer helped with volume, not judgment.
If the same failure keeps returning in new words, do not assume you need another seat filled. You may need a rule, a clear owner, and a team that follows both.
Start with your last three incidents
Your last three incidents usually tell you more than a stack of interviews. Recent incidents still have context around them: rushed messages, half-clear assumptions, and the moment someone chose speed over clarity.
Start with the raw material. Pull the incident notes, team chat, and a plain timeline of what happened. If nobody wrote a formal postmortem, build one from messages and alerts.
Then look for the decision that made recovery slower. The outage itself is often less interesting than the call that turned a small problem into a long one.
Maybe someone delayed a rollback because they wanted one more log line. Maybe two engineers changed the same service because no one owned the incident. Maybe a manager kept asking for updates while the team still had no clear lead on the fix.
A short review is enough:
- Write down the first decision that added delay, confusion, or extra risk.
- Name who made it.
- Note what they believed at the time.
- Check whether the team had a rule for that moment.
That last step matters most. Teams often stop at "bad judgment." But if the same shortcut shows up twice, judgment is not the full story. The team either lacks a rule or has one that nobody trusts.
Say your last three incidents all include some version of "wait a bit longer before rollback." That is not three separate mistakes. It is one repeated gap in decision-making. The team may not know who can approve a rollback, when customer impact should beat root cause hunting, or how much evidence is enough before acting.
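If you want to keep the review in a form you can compare across incidents, a few lines of Python are enough. This is a minimal sketch, not a prescribed tool: the record fields (`first_delay_decision`, `had_rule`, and so on) and the three sample incidents are hypothetical, made up to mirror the four review questions above. It counts how often the same first delaying decision appears and returns the ones that repeat.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class IncidentReview:
    name: str
    first_delay_decision: str  # short label for the first decision that added delay
    decided_by: str            # who made the call
    belief_at_the_time: str    # what they believed when they made it
    had_rule: bool             # did the team have a rule for that moment?


def repeated_decisions(reviews, threshold=2):
    """Return decision labels that appear in at least `threshold` reviews."""
    counts = Counter(r.first_delay_decision for r in reviews)
    return {label: n for label, n in counts.items() if n >= threshold}


# Hypothetical sample data, for illustration only.
reviews = [
    IncidentReview("checkout errors", "delay rollback", "on-call engineer",
                   "one more log line will show the cause", False),
    IncidentReview("login outage", "delay rollback", "backend lead",
                   "the root cause is almost found", False),
    IncidentReview("slow reports", "two people changed the same service", "nobody",
                   "someone else owns the incident", False),
]

print(repeated_decisions(reviews))
```

Any label that comes back from `repeated_decisions` is a candidate rule gap rather than an individual mistake. The labels only work if you write them consistently, so keep them short and reuse the same wording across incidents.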
Keep the review factual. Do not turn it into blame. Ask who made the call, what pressure they felt, what signals they saw, and what options they thought they had. People usually follow the habits and incentives around them.
If the same shortcut appears in more than one incident, fix that before you open a new role. A new engineer will still inherit the same fog during an outage.
Trace one stalled project step by step
Now pick one project that slipped even though nobody can point to a single blocker. No outage, no vendor failure, no missing person. These cases are useful because the team usually stayed busy while the work stopped getting closer to release.
Build a plain timeline from idea to launch. Write each handoff in order: request, approval, scope, design, build, review, testing, release. Put a name next to every step. If you cannot say who owned one handoff, you have probably found part of the problem.
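The same timeline can be checked mechanically. The sketch below assumes you record one owner name per handoff in a plain dictionary. The handoff names come from the list above, but the function name and the sample owners are hypothetical. It returns every handoff that has no named owner, in the order the work flows.

```python
# Handoffs in the order the work flows, taken from the timeline described above.
HANDOFFS = ["request", "approval", "scope", "design",
            "build", "review", "testing", "release"]


def unowned_handoffs(owners):
    """Return handoffs, in order, that have no named owner."""
    # A missing key, None, or a blank string all count as "no owner".
    return [step for step in HANDOFFS if not (owners.get(step) or "").strip()]


# Hypothetical project, for illustration only.
owners = {
    "request": "Sales lead",
    "approval": "",          # nobody could say who approved it
    "scope": "Product manager",
    "design": "Designer",
    "build": "Backend engineer",
    "review": None,          # "any senior engineer" is not a name
    "testing": "QA engineer",
    # "release" was never assigned
}

print(unowned_handoffs(owners))
```

Each step in the result is a place to ask who actually decided, and on what basis, before looking at effort or capacity.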
Imagine a customer portal update that missed its date by ten weeks. Sales asked for it. Product wrote a short brief. Design waited for backend answers. Backend waited for product to define odd cases. Frontend started anyway, then paused when the API changed. QA never got a stable build. Everyone worked. Nothing shipped.
When you map a delayed project like that, look for the moment where progress turned into motion without output. It often happens when approval moves ahead without one owner, design starts before scope settles, review reopens decisions that should have stayed closed, or testing waits until the entire feature is "done."
After you find the stall, write down the rule people followed at that moment. Keep it blunt. Maybe the rule was "product must answer every edge case before backend starts" or "any senior engineer can change the plan during review." Teams act on rules like these for years without ever saying them out loud.
Then ask whether anyone challenged that rule when the delay became obvious. If nobody did, the problem was not effort. It was a rule gap. Another engineer would have entered the same flow, waited at the same handoff, and lost the same time.
One stalled project can tell you more than a dozen status updates. It shows the exact rule your team obeys when work slows down.
Name the rule gap
When the same mess shows up twice, stop calling it a people problem. Most broken engineering decisions come from one of two things: a rule that does not exist, or a rule that exists and keeps pushing people toward the wrong choice.
A missing rule leaves people guessing. A bad rule gives them clear steps that still create delay, rework, or incidents. Those are different problems, so treat them differently.
If nobody owns rollback approval, that is a missing rule. If every small release has to wait for a weekly meeting, that is a bad rule.
Silent rules do a lot of damage because nobody writes them down. Teams say things like "we usually check with ops first" or "frontend can ship copy changes without review," but those habits live in chat history and memory. A new engineer does not see them. They learn them by breaking something.
You should also look for places where two teams follow different rules without realizing it. Product may think a feature is ready once the demo works. Engineering may think it is ready only after monitoring, rollback steps, and support notes are in place. Both sides sound reasonable. The conflict still creates missed dates and repeated handoffs.
Focus first on the decisions that create repeat work. If people keep reopening the same ticket, rebuilding the same integration, or arguing over who approves a release, the gap is probably bigger than it looks.
A fuzzy habit gets easier to fix when you turn it into one plain sentence:
- Schema changes need a rollback note before review starts.
- Billing changes need one product owner and one engineer to approve scope.
- An incident gets one owner within 15 minutes.
- Teams do not merge work with open questions about data ownership.
That level of detail works. You do not need a long policy. You need a rule people can actually use under pressure.
A simple team example
A small startup keeps missing release dates, so the founders do the obvious thing and hire another backend engineer. Three months later, the calendar still slips, support tickets still spike after launches, and everyone still feels overloaded.
On paper, the team is bigger. In practice, the same two problems keep dragging work down.
First, product sign-off arrives too late. Engineers finish the feature on Tuesday. QA checks it on Wednesday. Then the release waits because nobody from product gives a final yes or no until Friday evening. That delay pushes the launch into the next week. The team reads that as an engineering capacity problem even though the code was ready days earlier.
Second, incidents stay open longer than they should. A release goes out, error rates jump, and the team starts arguing in chat. One engineer wants to patch forward. Another wants to roll back. Product asks whether the issue affects paying users. Nobody owns the rollback call, so the debate burns 40 minutes while customers sit in the middle of it.
The fix is smaller than another hire. No release gets scheduled unless one person owns the ship decision from start to finish. That person must get product sign-off by a set time, and that same person can trigger rollback without group approval.
The team did not need another backend engineer to solve late approvals. It needed a deadline and a named decider. It did not need a separate operations hire to cut incident noise. It needed one owner with authority during the release window.
After that change, launches got simpler. Product had to decide earlier. Engineers stopped waiting for last-minute answers. When a release failed, one person made the rollback call in minutes instead of after a long thread.
That is why decision rules often matter more than headcount.
Mistakes teams make here
The first mistake is turning a system problem into a people problem. One engineer becomes "the reason" a release slipped or a migration stalled. That feels neat. It is usually wrong.
If the same kind of failure shows up across different people, you do not have one weak performer. You have a weak rule. Someone made a bad call, yes, but the larger issue is that nobody had a clear rule for when to escalate, when to stop, or who could say no.
The second mistake is treating every delay as a staffing gap. A team misses dates, so leadership opens a new role. The new engineer then walks into the same unclear scope, the same weak handoffs, and the same approval mess. Headcount helps with volume. It does not fix broken decisions.
The third mistake is writing rules nobody can use. "Communicate earlier" is not a rule. "Test more" is not a rule either. A usable rule sounds like this: if a task changes API behavior, one named person reviews it before work starts.
Another common miss is focusing only on the loudest incident. A big outage gets the meeting, the notes, and the new checklist. Meanwhile, two quieter projects stalled for the same reason: nobody knew who could change scope after work had started. The outage was just the noisiest version of the same pattern.
Teams also change one template, rename one meeting, and assume the issue is fixed. Then the same delay shows up two weeks later because nobody checked whether the new rule changed behavior.
Before you ask for another hire, ask four plain questions. Did the same failure happen with different people? Did anyone map the decision path before opening a role? Is the new rule specific enough to use under pressure? Did the team check whether the first fix actually changed anything?
Weak answers usually mean the team is still patching symptoms.
Quick checks before you open a new role
A hiring plan makes sense only after you decide whether the team has a people problem or a rules problem. Many stalled projects look like understaffing from the outside. Inside, the same bad call gets repeated, nobody owns it, and every engineer works around it in a different way.
Use the last few painful weeks as evidence. If one release slipped because nobody knew who could approve a database change, and another slipped for the same reason, that is not a capacity gap. It is a rule gap.
Work through a short check:
- Write the failed rule in one plain sentence. "Any API change needs approval from product and backend" is a rule. "Communication broke down" is not.
- Check whether the same rule hurt more than one project or incident. If it showed up twice, treat it as a pattern.
- Put one name on the decision. If three people sort of own it, nobody owns it when time gets tight.
- Ask whether a new engineer would hit the same approval maze in week one. If the answer is yes, the role will add salary without removing the delay.
- Try a small rule change for two weeks. Shorten one approval path, assign one owner, or set a default action when nobody responds within a day.
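The last check, a default action when nobody responds within a day, is easy to state precisely. Here is a minimal sketch, assuming a one-day timeout and a default of "proceed". The function name, the status strings, and the sample times are hypothetical, and your team may prefer a different default.

```python
from datetime import datetime, timedelta


def resolve_approval(requested_at, now, response=None,
                     default_action="proceed", timeout=timedelta(days=1)):
    """Decide what happens to a pending approval request.

    An explicit response always wins. If there is none and the timeout
    has passed, the default action applies. Otherwise the request waits.
    """
    if response is not None:
        return response
    if now - requested_at >= timeout:
        return default_action
    return "waiting"


# Hypothetical request, for illustration only.
asked = datetime(2024, 3, 4, 9, 0)

print(resolve_approval(asked, asked + timedelta(hours=3)))
print(resolve_approval(asked, asked + timedelta(days=1)))
print(resolve_approval(asked, asked + timedelta(days=2), response="rejected"))
```

The point of the default is that silence stops being a blocker. Whoever owns the decision can still say no, but they have to say it within the window.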
A simple thought experiment helps. Imagine you hire a strong senior engineer tomorrow. She joins, finds a stuck migration, asks for approval, then waits on product, legal, and a founder who is traveling. Two weeks pass. You did not hire the wrong person. You kept the wrong rule.
That is where broken decisions stay hidden. Hiring feels concrete. Rule repair feels slower, even when it saves more time.
What to do next
You do not need a month-long audit to spot this. Pick one recent incident and one project that has dragged on far longer than it should. That is usually enough to find the rule your team keeps tripping over.
Do it this week, while the details are still fresh. If you wait for a calmer moment, people forget what happened and fill the gaps with opinion.
Start with four steps:
- Choose one incident that caused real pain, such as downtime, a bad release, or a rollback.
- Choose one project that has been stuck in review, redesign, or approval.
- Write down three rules your team already follows, even if nobody has written them before.
- Rewrite one fuzzy rule so one person owns the decision and knows when to make it.
Keep the rules plain. "Backend changes need a written rollback plan" is clear. "Engineers should be more careful" is useless.
If an incident started with an urgent database change, and your stalled project is blocked because nobody will approve an API change, the same gap may sit under both problems: your team has no clear owner for changes that affect multiple parts of the product.
Run the new rule for two to four weeks before you approve headcount. Watch what changes. Do decisions move faster? Do reviews get shorter? Do fewer tasks bounce between people? If yes, you found a management problem before you paid for a staffing fix.
If nothing improves, the case for hiring gets stronger because you removed one source of noise first. That also makes the next role easier to define and much easier to onboard.
If you want a second opinion before adding headcount, Oleg Sotnikov at oleg.is works as a Fractional CTO and Startup Advisor. A short review of recent incidents, delivery rules, and ownership gaps can tell you whether you need another engineer, a better decision model, or both.
Frequently Asked Questions
How do I know if I need another engineer or a better rule?
Look for repeat patterns. If different people hit the same approval delay, rollback confusion, or scope argument, your team has a decision problem first.
If you fix one rule and work still piles up, then hiring starts to make more sense.
What is a rule gap?
A rule gap means your team either has no clear rule for a decision or follows a rule that leads to delays and rework. People then guess, wait, or argue in the middle of urgent work.
Typical examples include unclear rollback ownership, late sign-off, or review steps that reopen settled scope.
Why do the same engineering problems keep coming back?
Because the team keeps solving symptoms instead of fixing the decision behind them. The label changes from outage to edge case to integration issue, but the same missing owner or approval rule stays in place.
When you see the same shortcut twice, stop treating it like two separate mistakes.
Which incidents should I review first?
Start with the last three painful incidents. They still have fresh context, and they usually show who made the call, what pressure they felt, and where the team lacked a clear default.
Pull chat, notes, and a simple timeline. Then find the first decision that added delay or risk.
How do I analyze a project that slipped without one obvious blocker?
Map the work from request to release. Put every handoff in order and add one name to each step.
Then find the moment when people stayed busy but progress stopped. That stall usually points to the hidden rule your team follows without saying it out loud.
What makes a decision rule actually useful?
Keep it short and blunt. A good rule tells one person what to do in a specific moment under pressure.
For example, a release owner can roll back without group approval, or schema changes need a rollback note before review. If people cannot use the rule in real work, rewrite it.
Should I treat this as a performance problem?
No. Start with the system around that person. If the same type of failure happened with different people, blame will only hide the real problem.
Ask what signals they saw, what choices they thought they had, and whether your team gave them a clear rule to follow.
Can a strong senior engineer fix this on their own?
Not on their own. A strong engineer helps most after you fix the obvious rule gaps. They can clean up code and move faster, yet they will still get stuck in the same approval maze if ownership stays fuzzy.
Use hiring to add capacity or deeper skill, not to cover up weak decisions.
How long should I test a rule change before opening a new role?
Give the new rule two to four weeks. That is usually enough to see whether approvals move faster, reviews shrink, and fewer tasks bounce between people.
If nothing changes after a fair trial, you have a stronger case for headcount and a clearer role to hire for.
When should I ask a Fractional CTO for help?
Bring one in when your team keeps repeating the same delays and nobody can name the rule or owner behind them. An outside review helps when founders feel the pain but cannot see the pattern.
A Fractional CTO can look at recent incidents, stalled projects, and handoffs, then tell you whether you need a new hire, a rule change, or both.