Security logging that helps investigations, not storage bills
Security logging works best when it captures who changed what and when, keeps useful evidence, and cuts noisy events that waste storage.

Why logs fail when you need them
Most teams do not have too few logs. They have too many of the wrong kind.
A system can record every login refresh, health check, cache miss, and background job tick, then still fail to answer the one question that matters during an incident: who changed what, and when? The useful events get buried under routine noise, so people waste time digging through records nobody reads.
Missing context causes the bigger failure. A line like settings_updated or record_changed sounds useful until you need details. Which setting changed? What was the old value? What is the new value? Which user made the change, from which account, and through which tool? If those fields are missing, the timeline falls apart.
Names also drift across tools. One app logs user_deleted, another says account_removed, and a third writes a vague admin_action with a blob of text. They might describe the same thing, but they do not match cleanly. Search gets slow and messy, especially when a team has to compare app logs, cloud logs, and database history.
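One way to tame name drift is a small lookup that maps each tool's event name to a shared one before the events reach search. The sketch below is only an illustration: the canonical name `user.deleted` and the `unmapped.` prefix are assumptions, not names from any particular tool.

```python
# Hypothetical mapping from tool-specific event names to one shared name.
CANONICAL_EVENT_NAMES = {
    "user_deleted": "user.deleted",
    "account_removed": "user.deleted",
}


def normalize_event_name(raw_name: str) -> str:
    """Return the shared name, or mark the event as unmapped for manual review."""
    return CANONICAL_EVENT_NAMES.get(raw_name, f"unmapped.{raw_name}")
```

A vague name such as admin_action comes back as unmapped.admin_action, which makes the gap visible instead of hiding it.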
Keeping everything rarely fixes this. More storage gives you more lines, not better answers. A team can keep months of data and still struggle with basic questions:
- Did a person make this change, or did automation do it?
- Did the usual approval process cover it, or did it happen outside it?
- Did the same account touch other records around the same time?
A startup can hit this problem fast. One engineer updates a production setting, a script rotates credentials, and a support teammate edits a customer record. If each system logs those actions differently, or leaves out the actor and before-and-after values, nobody can rebuild the story with confidence.
Good security logging is about evidence, not volume. When logs fail, the records are usually noisy, vague, and inconsistent all at once.
Start with the investigation questions
Start with the investigation, not the logging pipeline. When something breaks or a permission changes, nobody wants 40 million lines of noise. They want a short path to the facts.
A useful security logging plan answers five plain questions every time a risky change happens:
- Which person or service account made it?
- Which setting, record, permission, or file changed?
- What exact time did the system accept the change?
- Which app, environment, server, tenant, or database got the new value?
- Which record proves the change happened?
That last question gets missed a lot. A log entry that says "updated successfully" is weak evidence. A better one includes something you can verify later, such as the old value, the new value, a version number, a request ID, or the ticket tied to the action.
Small teams feel this first. A startup might push ten product changes before lunch, rotate secrets in the afternoon, and change cloud permissions before the day ends. If the logs only say "config updated," the team still does not know who touched production, what changed, or whether the change landed in staging by mistake.
Keep the questions specific. "Who changed billing settings in production at 14:03 UTC?" is a real investigation question. "What happened in the system?" is too broad and usually leads nowhere.
What to decide before you log
Pick the actions that can hurt security, uptime, revenue, or customer trust. For each one, decide which actor, object, timestamp, destination, and proof you need to store. If you cannot name those fields, the event is not ready for logging.
This keeps audit trail design honest. Teams often log every click but skip the one thing that matters: the actual state change. Record the moment the system commits the change, not the moment someone opens the settings page.
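A minimal sketch of logging at commit time might look like the following. The in-memory `settings` dict and `audit_log` list are stand-ins for a real settings store and audit sink, and the field names are assumptions chosen for illustration.

```python
from datetime import datetime, timezone

settings = {"payment_retry_hours": 24}  # hypothetical in-memory settings store
audit_log = []                          # hypothetical audit sink


def update_setting(actor: str, name: str, new_value: object) -> dict:
    """Apply a setting change and record the audit event at commit time."""
    old_value = settings.get(name)
    settings[name] = new_value  # the state change itself
    event = {
        "actor": actor,
        "action": "setting.updated",
        "target": name,
        "old_value": old_value,
        "new_value": new_value,
        "time": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(event)  # written only after the change is applied
    return event
```

Nothing is logged when someone merely opens the settings page, because the event is built inside the function that applies the change.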
If a log entry cannot help you confirm who changed what and when, it probably does not deserve long retention.
The events worth keeping
Most teams log too much routine activity and too few meaningful changes. Reads, health checks, and endless background calls fill storage fast, but they rarely help after a real problem. The records worth keeping are the ones that change access, change state, or remove something.
Start with sign-in and sign-out activity. Keep successful and failed login attempts, account lockouts, MFA failures, password resets, and checks triggered by a new device or an unusual location. A burst of failures followed by one success often matters more than thousands of normal requests.
Keep access changes too. Record when someone gets a new role, loses one, joins an admin group, or receives direct access to a system. Permission drift is easy to miss until it causes damage.
Settings and secret changes matter just as much. Log edits to API keys, environment variables, security settings, firewall rules, integrations, and production config. Do not store the secret itself. Log that it changed, who changed it, and where.
Admin actions in dashboards and cloud tools usually deserve retention. User creation, token creation, policy edits, project deletion, billing changes, and deployment changes often explain an incident faster than app logs do.
Then focus on create, update, and delete actions for sensitive records. Customer data, financial records, contracts, exports, and anything that could cause fraud, privacy issues, or legal trouble belong here.
A quick filter helps: if an event changes who can do something, what the system does, or whether data still exists, keep it. If it only shows that a page loaded or a service checked its own health, you can usually drop it.
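That filter can be written down as a few lines of code. In this sketch the category names are hypothetical, and keeping unknown categories by default is a design choice of the example, not a rule from any logging tool.

```python
# Hypothetical categories; adjust to the event names your systems emit.
KEEP_CATEGORIES = {"access_change", "state_change", "data_removal"}
DROP_CATEGORIES = {"page_load", "health_check"}


def should_keep(event: dict) -> bool:
    """Keep events that change access, system behavior, or whether data exists."""
    category = event.get("category")
    if category in KEEP_CATEGORIES:
        return True
    if category in DROP_CATEGORIES:
        return False
    # Unknown categories are kept so nothing is lost before someone reviews them.
    return True
```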
For a small team, one deleted deploy key in GitLab or one changed cloud role can matter more than 50,000 routine status messages. That is what a useful audit trail looks like: fewer events, better answers.
What each useful event should record
A log entry should answer one plain question fast: who changed what, and when did it happen? If you need three dashboards and a guess, the event is too thin.
You do not need twenty fields. You need the right ones, written the same way every time:
- the user name or service account that made the change
- the exact time, with timezone
- the system, project, or tenant where it happened
- the old value and new value, when that is safe to store
- a request ID or session ID so you can follow the action across systems
The actor matters more than many teams think. People do not make every change. Scripts, bots, deploy jobs, and API clients change things too, so record the exact account that acted.
Time needs more than a date and hour. Store a full timestamp in one format across every system. UTC is usually the least painful choice because it avoids timezone guesswork during an investigation.
The system, project, or tenant matters most in shared environments. A role change in the wrong tenant can look harmless until you realize it affected the wrong customer or the wrong internal team.
Old and new values turn a vague event into proof. "Role updated" is weak. "Role changed from viewer to admin" tells you what changed right away. If the field contains a secret, do not log the secret itself.
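As a rough sketch, a helper can store before-and-after values for ordinary fields and only a "changed" flag for secrets. The `SECRET_FIELDS` set and the output keys are assumptions made for this example.

```python
# Hypothetical list of field names whose values must never reach the log.
SECRET_FIELDS = {"api_key", "password", "token"}


def describe_change(field: str, old_value: object, new_value: object) -> dict:
    """Build the before-and-after part of an audit event, redacting secrets."""
    if field in SECRET_FIELDS:
        # Record that the value changed without storing either value.
        return {"field": field, "changed": old_value != new_value, "redacted": True}
    return {"field": field, "old_value": old_value, "new_value": new_value}
```

Calling describe_change("role", "viewer", "admin") keeps both values, while a change to api_key records only that it changed.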
Request IDs and session IDs help you stitch together the full story. One admin click can trigger an API call, a database update, and a background job. Without a shared ID, those records look unrelated.
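Inside a single Python process, one way to share an ID is the standard contextvars module. This is a sketch under that assumption: the function names are hypothetical, and across separate services the ID would still have to travel in a request header or job payload.

```python
import contextvars
import uuid

# Holds the ID for the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default=None)


def start_request() -> str:
    """Generate one ID at the entry point and store it for later log calls."""
    request_id = uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id


def log_event(system: str, message: str) -> dict:
    """Attach the current request ID to every record, whichever layer writes it."""
    return {"system": system, "message": message, "request_id": request_id_var.get()}
```

Records written by the API layer and the database layer after one start_request() call carry the same request_id, so a search on that ID returns the whole chain.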
A small startup can do this without a giant logging system. If an engineer changes billing permissions for one tenant, the event should show the account, the tenant, the time, the old role, the new role, and the request ID. That is enough to confirm intent, spot mistakes, and retrace the path in minutes.
Set retention by value, not habit
Many teams keep logs for 30 or 90 days because that is what the hosting panel suggests. That is a weak rule. Retention should match how useful each log type is when something goes wrong.
Audit events usually deserve the longest life. If you need to answer who changed a billing rule, who granted admin access, or when a token was rotated, last week's debug output will not help much. A clean audit trail still matters months later, especially for access changes, permission updates, settings edits, deployments, and data exports.
Debug logs are different. They help during active troubleshooting, then lose value fast. For many apps, keeping verbose request traces for 3 to 14 days is enough. After that, they mostly burn storage and slow searches down.
Health checks are often the first place to cut noise. If a service writes "OK" every 10 seconds, you do not need every success line for months. Keep failures longer than successes, and drop repetitive success events early. That one change can shrink storage a lot without hurting incident work.
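One possible way to thin out health checks is to keep every failure and only a periodic success. The sketch below assumes each event is a dict with a `status` field; the default of one success in 360 (about one per hour at a 10-second interval) is an illustrative choice.

```python
def filter_health_checks(events: list[dict], keep_every: int = 360) -> list[dict]:
    """Keep every failure, plus one success out of each `keep_every` successes."""
    kept = []
    successes_seen = 0
    for event in events:
        if event["status"] != "OK":
            kept.append(event)  # failures are always kept
            continue
        if successes_seen % keep_every == 0:
            kept.append(event)  # periodic proof that the check was running
        successes_seen += 1
    return kept
```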
A practical split looks like this:
- Keep audit and change-tracking events for months, sometimes longer if the business handles sensitive actions.
- Keep debug and trace logs for a short window.
- Keep failed health checks and unusual system errors longer than routine "service healthy" messages.
- Move older audit logs to cheaper storage instead of deleting them right away.
Cheaper storage matters. Fast search storage is useful for recent work, but older audit records can sit in archive storage and still do their job. Retrieval may take longer, but the lower cost is often worth it.
The practical rule is simple: keep the records that answer human questions, and shorten the life of machine chatter. One startup team may only need seven days of debug logs, but it may need a year of admin change history because a customer dispute can show up months later.
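Written as configuration, a retention split could look like this sketch. The seven-day and one-year figures echo the example above; the health-check windows and the category names are assumptions to adjust for your own risk and budget.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows in days; the numbers are illustrative.
RETENTION_DAYS = {"audit": 365, "debug": 7, "health_ok": 3, "health_fail": 90}


def is_expired(log_type: str, written_at: datetime, now: datetime) -> bool:
    """Return True when a record is older than its category's window."""
    return now - written_at > timedelta(days=RETENTION_DAYS[log_type])
```

A 30-day-old debug record is expired under these numbers, while a 30-day-old audit record is not.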
Review retention after real incidents. If the team keeps asking for records that already expired, extend that category. If nobody has opened a noisy log type in six months, cut it back. Actual investigations give better answers than habit.
Cut noise without losing evidence
Most teams do not really have a logging problem. They have a sorting problem. One chatty service can bury the few events that answer who changed what and when.
Start with the loudest sources. API gateways, background jobs, auth middleware, and verbose app frameworks usually create the biggest pile. Pull a week of volume numbers and rank sources by storage used, not by guesswork.
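Ranking by volume takes only a few lines once you have per-record sizes. This sketch assumes the input is a list of (source, size in bytes) pairs exported from whatever log store you use; the source names in the usage note are made up.

```python
from collections import Counter


def rank_sources_by_bytes(records):
    """Sum bytes per source and return (source, total_bytes) pairs, largest first."""
    totals = Counter()
    for source, size_bytes in records:
        totals[source] += size_bytes
    return totals.most_common()
```

For example, feeding it [("api_gateway", 500), ("auth", 100), ("api_gateway", 700)] puts api_gateway first with 1200 bytes.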
Then give each source a clear label:
- Audit logs record actions you may need to prove later, such as role changes, deleted records, or billing edits.
- Ops logs help teams run systems, such as failed jobs, restarts, and latency spikes.
- Debug logs help developers inspect code paths and edge cases.
This split makes security logging easier to control. Audit logs usually stay. Ops logs often need less detail or shorter retention. Debug logs should rarely sit in production for long.
Before you cut anything, keep a sample of noisy events for two weeks and note whether anyone opens it. That sample acts as a safety net. The notes show whether support, engineering, or security ever reads those events during reviews, incident follow-up, or routine checks.
If nobody used a log type in that period, ask why it exists. Sometimes the answer is old habit. Sometimes a framework turned on verbose output by default and nobody noticed. Remove events that never helped anyone answer a real question.
A small startup team can do this in one afternoon. It might find that request logs produce gigabytes per day, while admin permission changes produce only a few lines but matter far more during an investigation.
Do one last test before you call the cleanup done. Run a short incident drill, such as a fake account takeover or an accidental config change, and see if the remaining logs can tell a full story. If the team can still answer who acted, what changed, and when it happened, you cut noise the right way. If not, add back the missing evidence before it turns into a blind spot.
A simple example from a startup team
A small SaaS team pushes a billing update late on Friday. One engineer opens the admin panel and changes a payment retry setting. The old value is 24 hours. The new value is 2 hours.
Nobody sees a problem that evening. The deploy finishes, people log off, and the weekend starts.
On Saturday, support starts getting tickets from customers whose charges fail and subscriptions do not renew the way the team expected. Now the team needs one answer fast: who changed the payment setting, what changed, and when did it happen?
A good audit trail gives that answer in seconds. The team opens the change log for billing settings and finds one clear record tied to the admin action from Friday. It includes the user account that made the change, the exact field name, the old value, the new value, and the timestamp.
That is enough to stop guessing. The engineer confirms the change, sees that the retry window was too short, and rolls it back. The issue is contained in minutes because the team looked at one useful event instead of digging through a pile of noise.
Now imagine the opposite. The team has no clean change log, only routine API logs. They see thousands of entries from checkout requests, webhook calls, token refreshes, background jobs, and health checks. Every line has a timestamp, but none of them say that a human changed a billing setting in the admin panel.
That kind of logging fills storage and slows people down. During an incident, nobody wants 40,000 normal requests. They want the one record that explains the problem.
That is why security logging should focus on actions with real business impact. In a startup, that usually means settings changes, permission changes, billing changes, and other admin actions that can affect customers right away. One clean entry can save a weekend. Ten gigabytes of routine traffic logs usually cannot.
Mistakes that bury useful evidence
A lot of teams collect plenty of logs and still cannot answer a basic question after an incident: who changed it, what changed, and when? The problem is usually not a lack of data. The wrong data survives while the useful trail disappears.
One common mistake is logging every request body by default. That fills storage fast, creates privacy risk, and makes searches slow. Most investigators do not need the full payload. They need the action, the actor, the target, the result, and the timestamp. If someone changed a permission rule, a clean record of that change beats a huge blob of raw JSON.
Another mistake is saving error text but not the actor name. "Access denied" or "validation failed" helps a developer fix a bug, but it does not tell an investigator who triggered the event. User ID, service account, API token, and source IP often matter more than the stack trace. Without them, the log explains the symptom and hides the cause.
Teams also make life harder when they mix audit logs with debug streams. Audit events should not sit next to cache misses, retry noise, SQL timings, and test messages. During a real incident, that mix wastes time. Keep audit data separate, structured, and easy to filter. If people cannot scan it fast, they stop trusting it.
Time causes quieter damage. If each server writes local time only, a multi-server incident turns into guesswork. One machine says 09:14, another says 08:14, and nobody knows which action came first. Use UTC everywhere, and record precise timestamps when event order matters.
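The ordering problem is easy to demonstrate. In this sketch the two servers and their UTC offsets are hypothetical: server A runs at UTC+2 and writes 09:14, server B runs at UTC and writes 08:14.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_time: str, utc_offset_hours: int) -> datetime:
    """Attach a known UTC offset to a naive local timestamp and convert to UTC."""
    naive = datetime.fromisoformat(local_time)
    aware = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return aware.astimezone(timezone.utc)


# Hypothetical servers: A runs at UTC+2, B runs at UTC.
event_a = to_utc("2024-05-03T09:14:00", 2)  # 07:14 UTC
event_b = to_utc("2024-05-03T08:14:00", 0)  # 08:14 UTC
```

Read as local times, B looks earlier. After conversion, A's event came first by an hour, which is the kind of mistake UTC-only logging avoids.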
Retention settings cause the last common failure. Many teams let default tool rotation delete audit data too soon. Seven days sounds fine until someone asks about a change from last month. Debug logs can expire quickly. Audit trails usually need a longer life. If your logging tool treats both the same way, it will throw away the one record you needed.
Quick checks before you call it done
A logging setup is only useful if a tired person can answer a real question in a few minutes. If they have to open six tools, guess at event names, or scroll through pages of junk, the logs are not ready.
Run a few plain tests on your current setup:
- Change one setting in a test account. Then check whether the log shows who made the change, what changed, and the exact time.
- Follow one action across systems, such as a role change that touches the app, identity provider, and database. If the trail breaks, investigations will stall.
- Ask finance for a rough monthly storage estimate based on current volume and retention. If nobody can answer without digging for days, costs will creep up.
- Give support a simple task, such as finding the last admin action on a customer account. If it takes more than a few minutes, the event names or filters need work.
- Show the event list to a new team member. If names like cfg_mutation_v2 only make sense to the person who wrote them, rename them now.
Good logs read like plain records, not puzzle pieces. "User role changed" beats a vague internal label every time. Small naming fixes save real time during an incident.
Tracing across systems matters more than many teams expect. A single request often jumps between auth, app logic, background jobs, and storage. If each system uses a different timestamp format or drops the request ID, you lose the story.
Cost checks matter too. Log retention should be a choice, not an accident. If finance can estimate growth, you can decide what stays hot, what moves to cheaper storage, and what you can drop.
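A rough estimate needs only three numbers. The sketch below assumes steady daily volume and a flat price per gigabyte per month; both the function name and any prices you plug in are placeholders, not figures from a real provider.

```python
def monthly_storage_cost(gb_per_day: float, retention_days: int, price_per_gb_month: float) -> float:
    """Steady-state cost: data held at once is daily volume times retention."""
    stored_gb = gb_per_day * retention_days
    return stored_gb * price_per_gb_month
```

With 10 GB per day, 30 days of retention holds about 300 GB at any time. Cutting retention for that source to 7 days drops it to about 70 GB at the same price.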
You should be able to prove who changed what and when, trace the action end to end, and find it again without a specialist sitting beside you.
What to do next
Start with one workflow that can hurt you if someone changes it without a clear record. User role changes, billing plan edits, production config updates, and access to customer data are good places to start. One mapped workflow teaches more than a month of vague logging work.
Write one event schema and keep it boring. Every audit record should use the same fields for actor, action, target, time, source, and result. If teams invent a new format for each service, security logging turns into cleanup work instead of evidence.
- Choose one workflow and list each change that matters.
- Create one audit event format and use it across the app, admin tools, and scripts.
- Run a 20-minute incident drill with one question: who changed this, what changed, and when?
- Fix the first gaps you find before you add more events.
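A shared format is easier to keep when something checks it. This sketch assumes audit events are plain dicts using the six field names listed above; the helper name is hypothetical.

```python
REQUIRED_FIELDS = ("actor", "action", "target", "time", "source", "result")


def missing_fields(event: dict) -> list[str]:
    """Return required fields that are absent or empty in an audit event."""
    return [name for name in REQUIRED_FIELDS if event.get(name) in (None, "")]
```

Running it in a test, or at the point where scripts and admin tools emit events, shows which services still drop fields before an incident exposes the gap.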
A short drill exposes missing details fast. Ask one person to change a setting in a test environment, then ask another person to investigate using only the logs. If they cannot identify the exact account, before-and-after values, or request source in a few minutes, your audit trail still has holes.
Do not try to log everything next week. That usually creates storage costs, noisy dashboards, and a pile of events nobody trusts. One clean workflow, one shared format, and one honest drill will move you further than a giant logging project.
If you want a second opinion, Oleg Sotnikov at oleg.is works with startups and small businesses on technical strategy, infrastructure, and Fractional CTO support. A focused review can show whether your audit trail will hold up in a real incident or just fill storage.
Frequently Asked Questions
What should security logs help me answer?
Use logs to answer a simple incident question fast: who made the change, what changed, when the system accepted it, where it landed, and what record proves it. If a log cannot help you confirm that in a few minutes, it adds more noise than value.
Which events should I keep first if storage is tight?
Keep events that change access, system behavior, or data. Start with sign-ins and failures, role and permission changes, password resets, secret rotations, config edits, admin actions, deployments, and create, update, or delete actions for sensitive records.
Should I log every request body?
No. Full request bodies fill storage, slow searches, and can expose private data. Log the action, actor, target, result, timestamp, and before-and-after values when they are safe to store.
What fields should every audit event include?
Record the actor, exact UTC timestamp, system or tenant, target object, result, and a request or session ID. When it is safe, add the old value and new value so the event shows proof instead of a vague status message.
How long should I keep audit logs compared with debug logs?
Keep audit and change history for months because teams often need it long after an incident. Keep debug logs for a short window, keep failed health checks longer than successful ones, and move older audit records to cheaper storage when search speed matters less.
How do I reduce log noise without losing evidence?
Pull one week of volume data and find the loudest sources first. Then separate audit logs from ops and debug logs, keep a short sample before you cut anything, and run a small incident drill to make sure you can still trace the story.
Why do request IDs and UTC timestamps matter so much?
Shared IDs connect one action across the app, database, jobs, and cloud tools. UTC timestamps let you sort events in the right order without timezone confusion, which saves a lot of time during a real incident.
What should I avoid storing in logs?
Do not store secrets, tokens, passwords, or full sensitive payloads unless you have a very clear reason and strong controls. Log that the value changed, who changed it, and where it changed instead of dumping the secret itself.
How can a small team check whether its logging actually works?
Pick one risky workflow in a test environment, make a change, and ask someone else to investigate using only the logs. If they cannot find the exact account, before-and-after values, source, and time in a few minutes, fix the gaps before you add more events.
When should I get outside help with audit logging?
Bring in help when your team cannot trace admin changes across systems, cannot agree on event names, or keeps losing the records it needs. A focused review from an experienced CTO or advisor can spot weak retention rules, missing fields, and noisy defaults before they cost you time during an incident.