Problem

What Is a Problem?

A problem is the underlying cause of one or more incidents: the root condition that, if left unresolved, will generate repeated service disruptions. Unlike an incident, which represents a single unplanned interruption or degradation of service, a problem represents the structural or systemic defect that produces those interruptions. Problem management exists to identify, document, and eliminate these root causes so that incidents stop recurring. In ITIL terminology, a problem is a cause, or potential cause, of one or more incidents, so a problem record can be opened while the root cause of an incident (or cluster of related incidents) is still unknown. Once the cause is understood but a permanent fix has not yet been implemented, the problem is tracked as a "known error".

Why Problems Matter

Problems directly determine whether your organization reacts to the same failures repeatedly or builds resilience over time. When problems go unaddressed, incidents recur, often at the worst possible moments, driving up MTTR, eroding customer trust, and consuming engineering capacity that could be spent on feature development or strategic improvements. Over 80% of incidents are repeats of previously resolved issues, a pattern that reflects inadequate problem management discipline. Effective problem management reduces incident volume, stabilizes service delivery, and creates a feedback loop where operational intelligence informs design and development decisions. For teams operating under SLAs, audited standards such as ISO/IEC 20000 or SOC 2, or ITIL-aligned processes, documented problem records and root cause analyses are often mandatory audit artifacts. For DevOps and SRE teams, problems surface in postmortems and are tracked as action items, but without formal accountability and integration into change workflows, those action items frequently stall, and the same outages repeat months later.

How Problem Management Works

Problem management operates in two modes: reactive and proactive.

Reactive problem management begins after an incident has been resolved. The incident response team or service desk escalates the incident to problem management if the root cause is unknown or if the same issue has occurred before. A problem manager or designated analyst investigates the incident timeline, logs, configuration changes, and related tickets to identify the underlying cause. Once identified, the problem is logged with a root cause statement, and a known error record may be created if a workaround exists but no permanent fix is available.

Proactive problem management analyzes trends in incident data, monitoring alerts, and service performance metrics to identify patterns that indicate latent problems before they cause major disruptions.

Both modes feed into change management: once a root cause is confirmed, corrective actions (code fixes, configuration changes, infrastructure upgrades) are documented as change requests, assigned to responsible teams, and tracked through to implementation. The problem record remains open until the fix is deployed, validated, and confirmed to prevent recurrence. In integrated ITSM and IMR platforms, problem records automatically link to related incidents, postmortems, and change tasks, ensuring visibility and accountability across service desk, operations, and engineering teams.
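The lifecycle described above can be sketched as a small state model. This is an illustrative sketch only, not the data model of any particular ITSM platform: the `ProblemRecord` class, the `Status` names, and the ticket IDs are all assumptions made for the example, and a real tool would likely carry more states and fields.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    OPEN = "open"                  # logged; root cause still under investigation
    KNOWN_ERROR = "known_error"    # root cause confirmed; permanent fix pending
    FIX_DEPLOYED = "fix_deployed"  # change implemented; awaiting validation
    CLOSED = "closed"              # fix validated as preventing recurrence


@dataclass
class ProblemRecord:
    """Hypothetical problem record; field names are assumptions for this sketch."""
    title: str
    incident_ids: List[str] = field(default_factory=list)
    change_ids: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None
    workaround: Optional[str] = None
    status: Status = Status.OPEN

    def confirm_root_cause(self, root_cause: str, workaround: Optional[str] = None) -> None:
        if self.status is not Status.OPEN:
            raise ValueError("root cause can only be confirmed on an open problem")
        self.root_cause = root_cause
        self.workaround = workaround
        self.status = Status.KNOWN_ERROR

    def link_change(self, change_id: str) -> None:
        # Corrective actions are tracked as change requests linked to the record.
        self.change_ids.append(change_id)

    def mark_fix_deployed(self) -> None:
        if self.status is not Status.KNOWN_ERROR or not self.change_ids:
            raise ValueError("need a confirmed root cause and a linked change request")
        self.status = Status.FIX_DEPLOYED

    def close(self, recurrence_prevented: bool) -> None:
        # The record stays open until the fix is deployed and validated.
        if self.status is not Status.FIX_DEPLOYED or not recurrence_prevented:
            raise ValueError("fix must be deployed and validated before closing")
        self.status = Status.CLOSED


# Usage with made-up IDs: three related incidents lead to one problem record.
problem = ProblemRecord("Intermittent checkout failures",
                        incident_ids=["INC-101", "INC-117", "INC-130"])
problem.confirm_root_cause("memory leak in payment gateway microservice")
problem.link_change("CHG-42")
problem.mark_fix_deployed()
problem.close(recurrence_prevented=True)
print(problem.status)  # Status.CLOSED
```

The guard clauses are the point of the sketch: a record cannot reach "fix deployed" without a linked change request, and cannot be closed until someone asserts that recurrence has been prevented.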

Examples of Problems

- E-commerce platform experiencing intermittent checkout failures: After resolving three separate incidents over two weeks where users could not complete purchases, the SRE team identifies a memory leak in the payment gateway microservice. The problem is logged with the root cause, a code fix is scheduled as a change request, and the problem record is closed once the patched service runs for 30 days without incident.

- Healthcare IT service desk receiving repeated password reset requests from a single department: Investigation reveals that a recent Active Directory policy change inadvertently set password expiration to 7 days for one user group instead of 90. The problem is documented, the policy is corrected via change management, and the service desk stops receiving the flood of related tickets.

- SaaS provider seeing elevated API error rates every Monday morning: Proactive problem analysis of monitoring data shows that a scheduled batch job runs at the same time as peak user login traffic, saturating database connections. The problem is logged, the batch job is rescheduled to off-peak hours, and API error rates normalize, preventing future incidents and customer complaints.
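The Monday-morning pattern in the last example is the kind of signal proactive analysis looks for. As a minimal sketch, assuming incident start times are available as `datetime` values, incidents can be bucketed by weekday and hour to surface recurring time slots. The function name `recurring_time_slots`, the `min_count` threshold, and the sample timestamps are assumptions for illustration, not part of any monitoring product.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def recurring_time_slots(incident_times, min_count=3):
    """Return ((weekday, hour), count) pairs seen at least min_count times,
    most frequent first."""
    slots = Counter((WEEKDAYS[t.weekday()], t.hour) for t in incident_times)
    return [(slot, count) for slot, count in slots.most_common()
            if count >= min_count]


# Made-up sample data: three Monday 09:00-hour incidents and one outlier.
incident_times = [
    datetime(2024, 1, 1, 9, 5),    # Monday
    datetime(2024, 1, 3, 14, 30),  # Wednesday
    datetime(2024, 1, 8, 9, 12),   # Monday
    datetime(2024, 1, 15, 9, 40),  # Monday
]
print(recurring_time_slots(incident_times))  # [(('Monday', 9), 3)]
```

A recurring slot is only a lead, not a root cause. In the example above, the analyst still had to connect the Monday spike to the overlapping batch job before a problem could be logged.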

Related Terms

- Incident
- Problem Management
- Change
- Known Error
- Root Cause Analysis

---

Frequently Asked Questions

Who should own a problem record — the service desk, the SRE team, or engineering?
Assign problem ownership to whoever controls the resources needed to implement the permanent fix, not whoever first identified the issue. In practice, this means service desk teams log and triage problems, but ownership transfers to an SRE or engineering lead once root cause investigation requires code-level or infrastructure-level access. Leaving ownership with the service desk when the fix lives in an engineering backlog is one of the most common reasons problem records stall indefinitely.
How do we avoid problem records piling up as a graveyard of unresolved issues nobody acts on?
Tie every open problem record to a change request with an assigned owner and a target resolution date. A problem record without a linked change task has no enforcement mechanism and will drift. Review open problem records in your weekly change advisory board or sprint planning cycle so they compete for prioritization alongside feature work rather than sitting in a separate, ignored queue. Platforms that surface problem aging and link records directly to change workflows make this discipline significantly easier to maintain at scale.
When does it make sense to skip formal problem management and just fix the issue directly?
Skip the formal problem record only when the incident is genuinely isolated (a one-time hardware failure with no pattern, no SLA exposure, and no recurrence risk), and document that decision explicitly so future analysts don't re-investigate the same event. For anything touching a shared service, a customer-facing SLA, or a configuration that other teams depend on, the overhead of a problem record is justified because the cost of a repeat incident almost always exceeds the cost of the documentation. The threshold should be defined in your problem management policy, not left to individual judgment in the moment.
What's the practical difference between a problem and a known error, and does the distinction actually matter operationally?
A problem becomes a known error the moment your team confirms the root cause but cannot yet deploy a permanent fix — the distinction matters because known errors should carry documented workarounds that the service desk can execute immediately to reduce incident impact while the fix is in progress. Treating every open problem as equivalent regardless of diagnostic status means frontline staff waste time re-investigating causes that are already understood. Tracking the transition from problem to known error in your ITSM platform gives you a clear signal for when to update your service desk runbooks and customer-facing status communications.
How should problem management integrate with postmortem processes in a DevOps or SRE environment?
Postmortems generate action items, but without a formal problem record those action items lack the ownership tracking, change linkage, and audit trail that enterprise governance requires — map each postmortem action item that addresses a root cause directly to a problem or known error record in your ITSM platform. This integration ensures that SRE-driven findings feed back into change management workflows rather than living only in a shared document that nobody revisits. It also gives compliance and audit teams a single traceable thread from incident through root cause through corrective change, which standalone postmortem tools cannot provide on their own.
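The advice above about tying every open problem record to a change request, an owner, and a target date can be checked mechanically before each review meeting. The following is a rough sketch under stated assumptions: the `OpenProblem` fields, the `needs_attention` helper, and the sample records are hypothetical, and a real report would pull these values from your ITSM platform's own export or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class OpenProblem:
    """Hypothetical view of an open problem record."""
    problem_id: str
    opened: date
    change_id: Optional[str] = None
    owner: Optional[str] = None
    target_date: Optional[date] = None


def needs_attention(problems, today):
    """Return (problem_id, age_in_days, reasons) for records missing a linked
    change, owner, or target date, or past their target; oldest first."""
    flagged = []
    for p in problems:
        reasons = []
        if p.change_id is None:
            reasons.append("no linked change request")
        if p.owner is None:
            reasons.append("no owner")
        if p.target_date is None:
            reasons.append("no target date")
        elif p.target_date < today:
            reasons.append("target date passed")
        if reasons:
            flagged.append((p.problem_id, (today - p.opened).days, reasons))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up sample records.
problems = [
    OpenProblem("PRB-1", date(2024, 1, 10), "CHG-7", "sre-lead", date(2024, 3, 1)),
    OpenProblem("PRB-2", date(2023, 11, 1)),
    OpenProblem("PRB-3", date(2024, 1, 20), "CHG-9", "db-team", date(2024, 2, 1)),
]
for problem_id, age_days, reasons in needs_attention(problems, today=date(2024, 2, 15)):
    print(problem_id, age_days, reasons)
```

Sorting by age puts the longest-stalled records at the top of the list, which matches the goal of having them compete for prioritization rather than sit in an ignored queue.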
