
#147 execute_prompt: kill previous agent instances on same member before starting new session

Status: closed
Owner: nobody
Labels: None
Updated: 2026-04-24
Created: 2026-04-16
Creator: Anonymous
Private: No

Originally created by: kumaakh
Originally owned by: kumaakh

Problem

When execute_prompt is called on a member, it can silently spawn multiple LLM processes that never terminate. Two sources combine to cause this:

Source 1 — Internal retries in executePrompt (src/tools/execute-prompt.ts lines 139–152)

Every execute_prompt call contains two internal retry paths:

// Retry 1: stale session
if (result.code !== 0 && input.resume && agent.sessionId) {
  result = await strategy.execCommand(retryCmd, timeoutMs); // spawns 2nd process
}

// Retry 2: server/overload error
if (result.code !== 0 && isRetryable(classifyError(stderr || stdout))) {
  result = await strategy.execCommand(retryCmd, timeoutMs); // spawns 3rd process
}

Each execute_prompt call can spawn up to 3 LLM processes. Neither retry kills the previous process before starting the next one.

Source 2 — SSH exit code is not the LLM exit code

result.code comes from strategy.execCommand which is an SSH exec — not from the LLM process itself. They only match when the SSH connection stays alive for the full run and the LLM exits cleanly. In all failure modes that matter:

  • SSH drops mid-run → non-zero, but LLM is still running on the remote machine
  • Process killed externally → SSH pipe closes, non-zero, LLM may be gone but SSH doesn't know why
  • SSH timeout → SSH exits, LLM keeps running
  • Network blip → non-zero SSH, LLM unaffected

The stale session retry fires on any non-zero SSH exit — including kills and network drops where the LLM is still running. The retry then spawns a second process on top of a live one.

Result observed: 5 LLM processes running simultaneously on odm-ssdev, all blocked inside long-running Bash tool calls (vstest with no timeout), all running conflicting work against the same cameras.


Root cause: no reliable way to know if LLM is running, and no stored PID to kill

The fleet has no record of what PID was spawned for a given member. It cannot:

  • Check whether the previous LLM process is still alive before retrying
  • Kill the specific process that belongs to this member's session

Checking for "any LLM process running" (e.g. Get-Process claude) is not safe — the member may have unrelated Claude sessions (user's own work, other projects). Only the specific PID spawned by this execute_prompt call is safe to kill.


Implementation approach

Step 1 — Capture PID immediately at launch via shell wrapper

execCommand currently returns only after the full process finishes — the PID is not available until it's too late. The shell wrapper must announce the PID to stdout before the LLM does any work, so the fleet server can record it immediately even if the connection later drops.

Modify buildAgentPromptCommand (via getOsCommands()) to wrap the LLM invocation:

Unix:

claude -p "..." --output-format json --max-turns 80 & echo "FLEET_PID:$!"; wait $!; exit $?

Windows (PowerShell):

$p = Start-Process claude.exe -ArgumentList ("-p","...","--output-format","json",...) -PassThru -NoNewWindow -Wait:$false
Write-Host "FLEET_PID:$($p.Id)"
$p.WaitForExit()
exit $p.ExitCode

This is provider-agnostic — the wrapper lives in the OS commands layer, not tied to any specific LLM binary.
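
For illustration, the Unix branch of that emission could be a one-line wrapper in the OS commands layer (wrapWithPidAnnounce is a hypothetical name introduced here; the real entry point is buildAgentPromptCommand via getOsCommands()):

// Hypothetical sketch of the Unix wrapper emission; not existing code.
function wrapWithPidAnnounce(llmCmd: string): string {
  // Background the LLM, announce its PID on stdout immediately,
  // then wait on it and forward its real exit code.
  return `${llmCmd} & echo "FLEET_PID:$!"; wait $!; exit $?`;
}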

Step 2 — Stream stdout in execCommand and parse the PID line early

execCommand must stream stdout rather than buffer it. On receiving a line matching /^FLEET_PID:(\d+)$/:

  1. Extract the PID
  2. Persist it to the member registry immediately (survives fleet server restart)
  3. Continue reading stdout for the normal response
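
A minimal sketch of that streaming parse using Node's readline (readWithPid and onPid are illustrative names, not existing code):

import * as readline from 'node:readline';
import type { Readable } from 'node:stream';

const PID_LINE = /^FLEET_PID:(\d+)$/;

// Consume streaming stdout, firing onPid as soon as the wrapper's
// announcement arrives; every other line is collected as the response.
async function readWithPid(
  stdout: Readable,
  onPid: (pid: number) => void,
): Promise<string> {
  const lines: string[] = [];
  const rl = readline.createInterface({ input: stdout });
  for await (const line of rl) {
    const m = PID_LINE.exec(line.trim()); // trim guards against stray \r on Windows
    if (m) {
      onPid(Number(m[1])); // persist to the member registry here
      continue;            // the marker line is not part of the LLM's output
    }
    lines.push(line);
  }
  return lines.join('\n');
}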

Step 3 — Kill stored PID before any retry

In executePrompt, before each retry:

const prevPid = getStoredPid(agent.id);
if (prevPid) {
  await tryKillPid(agent, prevPid); // non-blocking, handles "not found" gracefully
  clearStoredPid(agent.id);
}
// now safe to spawn retry
result = await strategy.execCommand(retryCmd, timeoutMs);

tryKillPid uses getOsCommands().killPid(pid):

  • Unix: kill -9 <pid>
  • Windows: taskkill /F /PID <pid>

Best-effort: if the process is already gone, the kill is a no-op; if it fails or times out, log and continue rather than block the new dispatch.
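
As a sketch, tryKillPid could look like this (getStrategyFor is a hypothetical accessor introduced here; the real code may already have the strategy in scope):

// Assumed shapes for the sketch — real signatures may differ.
type Agent = { id: string };
declare function getOsCommands(): { killPid(pid: number): string };
declare function getStrategyFor(agent: Agent): {
  execCommand(cmd: string, timeoutMs: number): Promise<unknown>;
};

// Best-effort kill of a previously spawned LLM process. Sketch only.
async function tryKillPid(agent: Agent, pid: number): Promise<void> {
  const killCmd = getOsCommands().killPid(pid); // kill -9 <pid> / taskkill /F /PID <pid>
  try {
    // Short timeout: a kill is near-instant; never block the new dispatch on it.
    await getStrategyFor(agent).execCommand(killCmd, 5_000);
  } catch {
    // Process already gone ("not found") or transport error: treat as a no-op.
  }
}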

Step 4 — Kill stored PID at the start of every new execute_prompt

At the top of executePrompt, before writing the prompt file:

const prevPid = getStoredPid(agent.id);
if (prevPid) {
  await tryKillPid(agent, prevPid);
  clearStoredPid(agent.id);
}

This handles the case where a previous execute_prompt call left a process running (SSH drop, external kill, whatever) and a new call arrives later.

Persistence

The stored PID lives in the member registry (same store as sessionId, tokenUsage, etc.) so it survives fleet server restarts. Field name: activePid?: number. Cleared when:

  • The process is successfully killed
  • The LLM process exits normally (end of executePrompt success path)
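
A minimal sketch of the helper trio (registry.get / registry.update stand in for whatever the member registry actually exposes):

// Assumed registry shape for the sketch.
declare const registry: {
  get(id: string): { activePid?: number } | undefined;
  update(id: string, patch: { activePid?: number }): void;
};

export function getStoredPid(agentId: string): number | undefined {
  return registry.get(agentId)?.activePid;
}

export function setStoredPid(agentId: string, pid: number): void {
  registry.update(agentId, { activePid: pid }); // persisted: survives server restarts
}

export function clearStoredPid(agentId: string): void {
  registry.update(agentId, { activePid: undefined });
}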

Files to change

  • src/os/os-commands.ts: Add killPid(pid) to the OS command interface + Unix/Windows implementations
  • src/providers/claude.ts (and others): buildAgentPromptCommand emits the PID-announcing shell wrapper
  • src/tools/execute-prompt.ts: Kill the stored PID before each retry and at the start of every new call; clear the PID on success
  • src/utils/agent-helpers.ts: Add getStoredPid / setStoredPid / clearStoredPid helpers
  • src/types.ts: Add activePid?: number to the Agent type
  • src/services/strategy.ts (SSH + local): Stream stdout; parse and callback on the FLEET_PID: line before returning
Related

  • [#71] — BUSY check before session reset (same root cause, superseded by this)
  • [#148] — Local background agent cancellation (separate but related problem)

Discussion

  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    Root cause identified — internal retries in execute-prompt.ts

    Found in src/tools/execute-prompt.ts lines 139–152. Every execute_prompt call has two built-in retry paths, each spawning a new Claude process without killing the previous one:

    1. Stale session retry (line 140): If the first attempt fails and resume=true with a stored session ID, immediately retries without the session ID — new process, old one still running.
    2. Server/overload retry (line 147): If that also fails with a retryable error, waits 5s then retries again — third process, previous two still running.

    So each execute_prompt call can spawn up to 3 Claude processes. With 4 external execute_prompt calls (from a misbehaving dispatch agent), that's up to 12 processes — we observed 5.

    The fix needs to happen at the retry points: before starting the retry process, kill the previous one. This requires tracking the PID of the running process (from strategy.execCommand) and terminating it before the retry execCommand call.

    The stale-session retry is the most impactful one to fix first — it fires immediately on any failure, producing a second process before the first has had a chance to fully exit.

     
  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    Exact trigger conditions

    2nd process (stale session retry):

    if (result.code !== 0 && input.resume && agent.sessionId) {
    

    Fires on any non-zero exit code when resume=true and a sessionId exists — including process kill, auth failure, or timeout. The intent is to handle stale session IDs, but the condition is too broad. A killed process returns non-zero and triggers this immediately.

    3rd process (server/overload retry):

    if (result.code !== 0 && isRetryable(classifyError(stderr || stdout))) {
    

    Narrower — only fires if output contains 500/502/503/429/529/overloaded/rate-limit patterns. Less likely to fire on kills.

    Fix: Narrow the stale session retry condition to only fire when the error is actually a stale/invalid session, not on any non-zero exit. Requires adding a stale_session error category with patterns matching claude's actual "session not found" / "invalid session" output:

    if (result.code !== 0 && input.resume && agent.sessionId 
        && classifyError(output) === 'stale_session') {
    

    This prevents kills, auth errors, and unknown failures from spawning a second process.
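
    For illustration, the new category might gate like this (the pattern strings are placeholders until verified against claude's real error output):

    // Sketch only: patterns must be checked against actual "session not
    // found" / "invalid session" messages before shipping.
    const STALE_SESSION_PATTERNS: RegExp[] = [
      /session.*not found/i,
      /invalid session/i,
    ];

    function isStaleSession(output: string): boolean {
      return STALE_SESSION_PATTERNS.some((p) => p.test(output));
    }

    // classifyError() returns 'stale_session' when isStaleSession(output) is
    // true, so the retry condition can gate on that category alone.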

     
  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    Deeper issue: SSH exit code is not Claude's exit code

    The retry logic checks result.code !== 0 where result is from strategy.execCommand(claudeCmd) — this is the SSH exec result, not Claude's exit code. They only match when the SSH connection stays up for the full Claude run and Claude exits cleanly.

    In all the failure modes that actually matter:

    • SSH drops mid-run → non-zero, but Claude is still running on the remote machine
    • Process killed externally (Stop-Process) → SSH pipe closes, non-zero, Claude may be gone but SSH doesn't know why
    • SSH timeout → SSH exits non-zero, Claude keeps running
    • Network blip → non-zero SSH, Claude unaffected

    So the condition result.code !== 0 is an unreliable trigger for "Claude failed" — it fires on SSH-level problems where Claude may still be alive. The retry then spawns a second Claude process on top of a running one.

    Right fix: before any retry, check whether a Claude process is already running on the member (via ps/Get-Process). If one exists, skip the retry entirely — don't rely on SSH exit codes as a proxy for Claude's state. This subsumes the error-pattern approach and handles all failure modes correctly.

    Relates to [#71] (BUSY check before session reset) — same root cause.

     


  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    Precise fix: track the specific PID, not "any claude process"

    Checking for any running claude process is too broad — the member may have other Claude sessions running (user's own work, other projects). Only the PID of the process spawned by this execute_prompt call is safe to kill.

    Required changes

    1. execCommand must expose the child PID
    The SSH exec (or local spawn) needs to return the PID of the launched process alongside stdout/stderr/code, so the caller can act on it.

    2. executePrompt stores the PID in the member registry (persisted)
    After strategy.execCommand(claudeCmd) starts, record the PID against the member ID. Must survive fleet server restarts — so it goes in the persistent agent registry, not a local variable.

    3. Before the stale-session retry: kill that PID
    Instead of blindly spawning a second process:

    // before retry
    if (storedPid) { killPid(storedPid); clearStoredPid(agent.id); }
    const retryCmd = ...
    result = await strategy.execCommand(retryCmd, timeoutMs);
    // store new PID
    

    4. Before any new execute_prompt on the same member: kill stored PID if still alive
    At the top of executePrompt, before writing the prompt file:

    const prev = getStoredPid(agent.id);
    if (prev) { tryKillPid(prev); clearStoredPid(agent.id); } // best-effort, non-blocking
    

    Failsafe properties

    • If the process already exited, kill is a no-op (handle ESRCH gracefully)
    • If kill times out, log and continue — don't block the new dispatch
    • On Windows: taskkill /F /PID <pid>; on Unix: kill -9 <pid>
    • PID is cleared from registry once killed or confirmed dead
     
  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    Note: the PID tracking and kill logic belongs in the provider-agnostic core (execute-prompt.ts), not in the Claude provider. Each provider already exposes a process name (e.g. 'claude', 'gemini', etc.), but the PID itself is what must be tracked, not the process name, so the kill is always targeted at the exact spawned process regardless of provider.

     
  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    How to capture the PID safely: print from the shell wrapper at launch

    We cannot get the PID from execCommand's return value — it only returns after the full process finishes. The PID must be captured immediately at launch, before any failure can occur.

    Approach: shell wrapper that announces PID to stdout

    Wrap the LLM invocation so the shell starts it as a child, immediately prints FLEET_PID:<pid>, then waits. The fleet server reads that line from streaming stdout as soon as it arrives.

    Unix:

    claude -p "..." --output-format json --max-turns 80 & echo "FLEET_PID:$!"; wait $!; exit $?
    

    Windows (PowerShell):

    $p = Start-Process claude.exe -ArgumentList ("-p", "...", "--output-format", "json", ...) -PassThru -NoNewWindow -Wait:$false
    Write-Host "FLEET_PID:$($p.Id)"
    $p.WaitForExit()
    exit $p.ExitCode
    

    This is provider-agnostic — the wrapper goes in getOsCommands() (already abstracts Unix vs Windows). buildAgentPromptCommand emits the wrapped form. No changes needed to any LLM binary.

    Required changes to execCommand

    execCommand currently buffers all stdout and returns it at the end. It needs to stream stdout and parse for FLEET_PID:<pid> as soon as that line arrives:

    1. Start the SSH exec with streaming stdout
    2. On first line matching /^FLEET_PID:(\d+)$/: store PID in member registry immediately, continue reading
    3. Collect remaining stdout for the normal response
    4. Return as today

    If the SSH connection drops after the PID line is received, the PID is already persisted and can be used to kill the process.

    Kill on retry / new dispatch

    Before the stale-session retry in executePrompt:

    const prevPid = getStoredPid(agent.id);
    if (prevPid) { await tryKill(agent, prevPid); clearStoredPid(agent.id); }
    

    tryKill uses getOsCommands().killPid(pid) — on Unix kill -9 <pid>, on Windows taskkill /F /PID <pid>. Non-blocking, handles "process not found" gracefully.

    Same check at the top of executePrompt before a new dispatch to clean up any leftover from a previous call.

     
  • Anonymous

    Anonymous - 2026-04-23

    Originally posted by: kumaakh

    Technical direction: The issue body already contains a precise implementation plan. To summarize the recommended execution order:

    1. Narrow the stale-session retry condition in src/tools/execute-prompt.ts (line ~140) — add a stale_session error category to classifyError() with patterns matching Claude's actual invalid-session output. This is the quickest win and prevents most phantom processes.
    2. Shell wrapper + PID streaming: modify buildAgentPromptCommand in each provider (src/providers/claude.ts, src/providers/gemini.ts) to wrap the LLM invocation so the PID is printed to stdout as FLEET_PID:<pid> before the LLM does any work.
    3. Stream execCommand stdout in src/services/strategy.ts (SSH + local): parse the FLEET_PID: line early, persist the PID to the member registry (activePid field in src/types.ts).
    4. Kill the stored PID before any retry and at the start of every new execute_prompt call. Add killPid(pid) to src/os/os-commands.ts (Unix: kill -9, Windows: taskkill /F /PID); a sketch of this interface addition follows the file list below.

    Files: execute-prompt.ts, os-commands.ts, providers/claude.ts, providers/gemini.ts, services/strategy.ts, utils/agent-helpers.ts, types.ts.
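
    A sketch of the killPid addition (the interface shape and object names are assumptions about the existing os-commands layer):

    interface OsCommands {
      // ...existing commands
      killPid(pid: number): string; // returns the shell command to run on the member
    }

    const unixCommands: Pick<OsCommands, 'killPid'> = {
      killPid: (pid) => `kill -9 ${pid}`,
    };

    const windowsCommands: Pick<OsCommands, 'killPid'> = {
      killPid: (pid) => `taskkill /F /PID ${pid}`,
    };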

     
  • Anonymous

    Anonymous - 2026-04-24

    Ticket changed by: kumaakh

    • status: open --> closed
     
