How the Troubleshoot Engine Turns Status into Diagnosis
Kubernetes tells you what’s wrong — CrashLoopBackOff, OOMKilled, Pending — but not why. The status field says the container exited with code 137, but it doesn’t mention that someone updated the Deployment’s memory limits twenty minutes ago. Clusterfudge’s troubleshoot engine bridges that gap by correlating resource status with a timeline of recent cluster changes, producing a diagnosis rather than a status report.
The Problem with Raw Status
A pod status of CrashLoopBackOff is a symptom, not a diagnosis. The container might be crashing because of a bad config change, a missing secret, a broken image tag, or an actual application bug. Kubernetes dutifully reports the symptom, but connecting it to the cause requires you to mentally reconstruct what changed recently — checking events, diffing manifests, scrolling through kubectl describe output.
This is the kind of work that feels like it should be automated. The information is all there; it just needs to be correlated.
Two Data Sources, One Investigation
The engine runs two things in parallel when you investigate a resource: a rule-based diagnostic against the current status, and a timeline query against recent cluster changes.
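A minimal sketch of that fan-out, assuming an Investigate entry point and a plain sync.WaitGroup (illustrative only; the real engine may structure this differently):

func (e *Engine) Investigate(kind, namespace, name string, status map[string]any) *Investigation {
    inv := &Investigation{Since: time.Now().Add(-time.Hour)}

    var wg sync.WaitGroup
    wg.Add(2)

    // Rule-based diagnosis against the current status fields.
    go func() {
        defer wg.Done()
        e.diagnose(inv, status)
    }()

    // Timeline lookup for changes to this resource within the window.
    go func() {
        defer wg.Done()
        inv.RelatedChanges = e.timeline.Query(kind, namespace, name, inv.Since)
    }()

    wg.Wait()
    return inv
}

In this sketch the two goroutines write to different fields of the Investigation, so no lock is needed; the correlation step shown later in the post would run once both finish.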
The diagnostic side pattern-matches against known failure modes. It’s deliberately simple — a series of checks that look at the pod’s status fields:
func (e *Engine) diagnose(inv *Investigation, status map[string]any) {
    reason := strVal(status, "reason")
    phase := strVal(status, "phase")       // used by checks elided below
    exitCode := intVal(status, "exitCode") // used by checks elided below
    if reason == "CrashLoopBackOff" {
        inv.Problem = "Pod is crash-looping"
        inv.RootCause = "Application error — check container logs"
        inv.Suggestions = append(inv.Suggestions,
            Suggestion{Title: "View container logs", ActionType: "view_logs"})
    }
    if reason == "OOMKilled" {
        inv.Problem = "Container was OOM killed"
        inv.RootCause = "Memory limit exceeded — increase resources.limits.memory"
    }
    // ... more patterns (ImagePullBackOff, Pending, non-zero exit codes)
}
Each check produces a pass, fail, or warn result. Every failure includes a human-readable root cause and one or more actionable suggestions — not just “something is wrong” but “here’s what to look at next.” Suggestions carry an action type (view_logs, describe, restart) that the frontend can wire directly to a navigation action.
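The types implied by that snippet look roughly like this; the exact field set is an assumption, reconstructed from the calls above:

// Investigation is the result handed to the UI (and later to the AI context).
type Investigation struct {
    Problem        string         // short statement of what is wrong
    RootCause      string         // human-readable explanation of why
    Suggestions    []Suggestion   // concrete next steps
    RelatedChanges []ChangeRecord // timeline entries near the failure
    Since          time.Time      // how far back the timeline query looks
}

// Suggestion is one actionable next step.
type Suggestion struct {
    Title      string // e.g. "View container logs"
    ActionType string // "view_logs", "describe", "restart": wired to a frontend navigation action
}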
The Change Timeline
The second data source is the timeline — a ring buffer that records every resource change the application observes:
type Timeline struct {
    entries []ChangeRecord // buffered change records
    head    int            // ring position of the next write
    count   int            // number of records currently held
    maxSize int            // capacity (1,000 by default)
}
Each ChangeRecord captures the timestamp, resource kind, namespace, name, change type (created, updated, deleted), and — critically — the specific field diffs. When a Deployment’s image tag changes or a ConfigMap gets updated, the timeline records exactly which fields changed and what their old and new values were.
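A ChangeRecord along these lines carries that information; the exact field names and the FieldDiff shape are assumptions based on the description above:

// ChangeRecord is one observed change to a cluster resource.
type ChangeRecord struct {
    Timestamp  time.Time
    Kind       string      // "Deployment", "ConfigMap", ...
    Namespace  string
    Name       string
    ChangeType string      // "created", "updated", "deleted"
    Diffs      []FieldDiff // which fields changed, and how
}

// FieldDiff records a single field-level change.
type FieldDiff struct {
    Path     string // e.g. "spec.template.spec.containers[0].image"
    OldValue string
    NewValue string
}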
The ring buffer holds the last 1,000 changes by default. That’s enough to capture an hour or more of activity in most clusters, and it uses a fixed amount of memory regardless of cluster size. No external storage, no database — just a slice and a head pointer.
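A minimal sketch of the ring-buffer write path, assuming a constructor and a Record method (the names are illustrative):

// NewTimeline pre-allocates the buffer; 1,000 entries is the default capacity.
func NewTimeline(maxSize int) *Timeline {
    return &Timeline{entries: make([]ChangeRecord, maxSize), maxSize: maxSize}
}

// Record stores a change, overwriting the oldest entry once the buffer is full.
func (t *Timeline) Record(rec ChangeRecord) {
    t.entries[t.head] = rec
    t.head = (t.head + 1) % t.maxSize
    if t.count < t.maxSize {
        t.count++
    }
}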
Correlation
When the engine investigates a resource, it queries the timeline for changes related to that resource within the last hour:
inv.RelatedChanges = e.timeline.Query(kind, namespace, name, inv.Since)
// Fall back to a change-based diagnosis when no rule matched.
if inv.RootCause == "" && len(inv.RelatedChanges) > 0 {
    inv.Problem = "Resource has recent changes"
    inv.RootCause = "Recent modifications detected — review timeline for details"
}
This is where the two data sources connect. A CrashLoopBackOff diagnosis on its own tells you to check logs. A CrashLoopBackOff diagnosis with a timeline showing that the Deployment’s image tag changed five minutes ago tells you the new image is probably broken. The engine doesn’t need to understand the semantic relationship — just surfacing the temporal correlation is enough to point you in the right direction.
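The query side of that correlation can be a plain scan of the buffer. A sketch, assuming exact matching on kind, namespace, and name:

// Query returns changes to the given resource that happened after `since`,
// oldest first.
func (t *Timeline) Query(kind, namespace, name string, since time.Time) []ChangeRecord {
    var out []ChangeRecord
    for i := 0; i < t.count; i++ {
        // Walk the ring from the oldest held entry to the newest.
        idx := (t.head - t.count + i + t.maxSize) % t.maxSize
        rec := t.entries[idx]
        if rec.Kind != kind || rec.Namespace != namespace || rec.Name != name {
            continue
        }
        if rec.Timestamp.Before(since) {
            continue
        }
        out = append(out, rec)
    }
    return out
}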
Why Rules, Not ML
A rule-based engine is a deliberate choice. The most common Kubernetes failure modes are well-documented — CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending scheduling failures, non-zero exit codes. A handful of rules covering these patterns handles the majority of real-world issues, and a rule engine handles them reliably, runs in microseconds, requires no training data, and produces deterministic results.
The engine doesn’t try to be clever. It doesn’t guess at causes it isn’t sure about. If none of the rules match and there are no recent changes, the investigation says “No known issues detected” — which is itself useful information when you’re checking whether a resource is healthy.
For the cases where rules aren’t enough — the genuinely novel failures — that’s what the AI debugging integration is for. The troubleshoot engine’s output feeds directly into the AI context, giving the LLM a structured starting point instead of raw YAML.
The Takeaway
The gap between “what’s wrong” and “why it’s wrong” is often just a matter of correlating status with recent changes. A ring buffer of cluster activity and a handful of pattern-matching rules close that gap for most everyday Kubernetes debugging — no external dependencies, no network calls, no latency. The engine runs entirely in the desktop client’s Go process, produces results in milliseconds, and hands off to AI tools only when the rules run out of answers.