Originally created by: kumaakh
Originally owned by: kumaakh
When execute_prompt is called on a member, it can silently spawn multiple LLM processes that never terminate. Two sources combine to cause this:
Source 1 — Internal retries in executePrompt (src/tools/execute-prompt.ts lines 139–152)
Every execute_prompt call contains two internal retry paths:
```ts
// Retry 1: stale session
if (result.code !== 0 && input.resume && agent.sessionId) {
  result = await strategy.execCommand(retryCmd, timeoutMs); // spawns 2nd process
}

// Retry 2: server/overload error
if (result.code !== 0 && isRetryable(classifyError(stderr || stdout))) {
  result = await strategy.execCommand(retryCmd, timeoutMs); // spawns 3rd process
}
```
Each execute_prompt call can spawn up to 3 LLM processes. Neither retry kills the previous process before starting the next one.
Source 2 — SSH exit code is not the LLM exit code
result.code comes from strategy.execCommand which is an SSH exec — not from the LLM process itself. They only match when the SSH connection stays alive for the full run and the LLM exits cleanly. In all failure modes that matter:
The stale session retry fires on any non-zero SSH exit — including kills and network drops where the LLM is still running. The retry then spawns a second process on top of a live one.
Result observed: 5 LLM processes running simultaneously on odm-ssdev, all blocked inside long-running Bash tool calls (vstest with no timeout), all running conflicting work against the same cameras.
The fleet has no record of what PID was spawned for a given member, so it cannot kill a leftover process — or even tell whether one is still running.
Checking for "any LLM process running" (e.g. Get-Process claude) is not safe — the member may have unrelated Claude sessions (user's own work, other projects). Only the specific PID spawned by this execute_prompt call is safe to kill.
execCommand currently returns only after the full process finishes — the PID is not available until it's too late. The shell wrapper must announce the PID to stdout before the LLM does any work, so the fleet server can record it immediately even if the connection later drops.
Modify buildAgentPromptCommand (via getOsCommands()) to wrap the LLM invocation:
Unix:
```sh
claude -p "..." --output-format json --max-turns 80 & echo "FLEET_PID:$!"; wait $!; exit $?
```
Windows (PowerShell):
```powershell
$p = Start-Process claude.exe -ArgumentList ("-p","...","--output-format","json",...) -PassThru -NoNewWindow -Wait:$false
Write-Host "FLEET_PID:$($p.Id)"
$p.WaitForExit()
exit $p.ExitCode
```
This is provider-agnostic — the wrapper lives in the OS commands layer, not tied to any specific LLM binary.
Stream stdout in execCommand and parse the PID line early

execCommand must stream stdout rather than buffer it. On receiving a line matching /^FLEET_PID:(\d+)$/, it stores the PID in the member registry immediately and continues reading.
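A minimal sketch of the early-parse step — `parsePidLine` is a hypothetical helper name, and the stream-loop usage in the trailing comment assumes a Node.js readline interface over the SSH stdout stream:

```typescript
// Hypothetical helper: extract the PID from a single stdout line.
// Returns null for every non-matching line so the stream loop can
// simply ignore ordinary LLM output.
const PID_LINE = /^FLEET_PID:(\d+)$/;

function parsePidLine(line: string): number | null {
  const m = line.trim().match(PID_LINE); // trim() tolerates CRLF over SSH
  return m ? Number(m[1]) : null;
}

// In execCommand's stream loop (sketch):
//   rl.on("line", (line) => {
//     const pid = parsePidLine(line);
//     if (pid !== null) setStoredPid(agent.id, pid); // persist immediately
//   });
```

Because the match is anchored at start and end of the trimmed line, ordinary LLM output that happens to contain the string `FLEET_PID:` mid-line is not mistaken for the announcement.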
In executePrompt, before each retry:
```ts
const prevPid = getStoredPid(agent.id);
if (prevPid) {
  await tryKillPid(agent, prevPid); // non-blocking, handles "not found" gracefully
  clearStoredPid(agent.id);
}
// now safe to spawn retry
result = await strategy.execCommand(retryCmd, timeoutMs);
```
tryKillPid uses getOsCommands().killPid(pid):
- Unix: kill -9 <pid>
- Windows: taskkill /F /PID <pid>

Non-blocking — if the process is already gone, it is a no-op. Does not block on failure.
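A sketch of the two `killPid` implementations in the OS commands layer — the interface shape is an assumption; the suffixes make each command a harmless no-op when the process is already gone:

```typescript
interface OsCommands {
  killPid(pid: number): string; // returns the shell command to run over SSH
}

const unixCommands: OsCommands = {
  // `|| true` keeps the exit code zero if the PID no longer exists
  killPid: (pid) => `kill -9 ${pid} 2>/dev/null || true`,
};

const windowsCommands: OsCommands = {
  // taskkill fails on a missing PID; swallow the error so the caller never blocks
  killPid: (pid) => `taskkill /F /PID ${pid} 2>$null; exit 0`,
};
```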
Kill leftover PID at the start of every execute_prompt

At the top of executePrompt, before writing the prompt file:
```ts
const prevPid = getStoredPid(agent.id);
if (prevPid) {
  await tryKillPid(agent, prevPid);
  clearStoredPid(agent.id);
}
```
This handles the case where a previous execute_prompt call left a process running (SSH drop, external kill, whatever) and a new call arrives later.
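The `getStoredPid` / `setStoredPid` / `clearStoredPid` helpers could be sketched as below — the in-memory Map stands in for the persistent member registry, whose real shape is not shown in the issue:

```typescript
interface Agent {
  id: string;
  sessionId?: string;
  activePid?: number; // PID of the LLM process spawned by execute_prompt
}

// Stand-in for the persistent member registry (the real one survives restarts).
const registry = new Map<string, Agent>();

function setStoredPid(agentId: string, pid: number): void {
  const agent = registry.get(agentId);
  if (agent) agent.activePid = pid;
}

function getStoredPid(agentId: string): number | undefined {
  return registry.get(agentId)?.activePid;
}

function clearStoredPid(agentId: string): void {
  const agent = registry.get(agentId);
  if (agent) delete agent.activePid;
}
```

Keeping the field on the same record as `sessionId` means the existing persistence path covers it with no new storage machinery.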
The stored PID lives in the member registry (same store as sessionId, tokenUsage, etc.) so it survives fleet server restarts. Field name: activePid?: number. Cleared when:

- the run completes (executePrompt success path)
- the process is killed via tryKillPid before a retry or a new call

| File | Change |
|---|---|
| src/os/os-commands.ts | Add killPid(pid) to OS command interface + Unix/Windows implementations |
| src/providers/claude.ts (and others) | buildPromptCommand emits PID-announcing shell wrapper |
| src/tools/execute-prompt.ts | Kill stored PID before each retry and at start of new call; clear PID on success |
| src/utils/agent-helpers.ts | Add getStoredPid / setStoredPid / clearStoredPid helpers |
| src/types.ts | Add activePid?: number to Agent type |
| src/services/strategy.ts (SSH + local) | Stream stdout; parse and callback on FLEET_PID: line before returning |
Originally posted by: kumaakh
Root cause identified — internal retries in execute-prompt.ts
Found in `src/tools/execute-prompt.ts` lines 139–152. Every `execute_prompt` call has two built-in retry paths, each spawning a new Claude process without killing the previous one:

- Stale session retry: on any non-zero exit with `resume=true` and a stored session ID, it immediately retries without the session ID — new process, old one still running.
- Server/overload retry: on retryable error output (500/502/503/429/529, overloaded, rate-limit), it retries once more — a third process.

So each `execute_prompt` call can spawn up to 3 Claude processes. With 4 external `execute_prompt` calls (from a misbehaving dispatch agent), that's up to 12 processes — we observed 5.

The fix needs to happen at the retry points: before starting the retry process, kill the previous one. This requires tracking the PID of the running process (from `strategy.execCommand`) and terminating it before the retry `execCommand` call.

The stale-session retry is the most impactful one to fix first — it fires immediately on any failure, producing a second process before the first has had a chance to fully exit.
Originally posted by: kumaakh
Exact trigger conditions
2nd process (stale session retry):
Fires on any non-zero exit code when resume=true and a sessionId exists — including process kill, auth failure, or timeout. The intent is to handle stale session IDs, but the condition is too broad. A killed process returns non-zero and triggers this immediately.
3rd process (server/overload retry):
Narrower — only fires if output contains 500/502/503/429/529/overloaded/rate-limit patterns. Less likely to fire on kills.
Fix: Narrow the stale session retry condition to only fire when the error is actually a stale/invalid session, not on any non-zero exit. Requires adding a `stale_session` error category with patterns matching Claude's actual "session not found" / "invalid session" output. This prevents kills, auth errors, and unknown failures from spawning a second process.
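A sketch of the narrowed check — the two patterns are the ones named above; Claude's exact error wording should be confirmed against real output before relying on this:

```typescript
// Patterns for the proposed stale_session error category.
const STALE_SESSION_PATTERNS: RegExp[] = [
  /session not found/i,
  /invalid session/i,
];

function isStaleSession(output: string): boolean {
  return STALE_SESSION_PATTERNS.some((p) => p.test(output));
}

// Narrowed retry condition (replaces the bare `result.code !== 0` check):
//   if (result.code !== 0 && input.resume && agent.sessionId
//       && isStaleSession(stderr || stdout)) { /* retry without session id */ }
```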
Originally posted by: kumaakh
Deeper issue: SSH exit code is not Claude's exit code
The retry logic checks `result.code !== 0`, where `result` is from `strategy.execCommand(claudeCmd)` — this is the SSH exec result, not Claude's exit code. They only match when the SSH connection stays up for the full Claude run and Claude exits cleanly. In the failure modes that actually matter — for example an external kill (`Stop-Process`) — the SSH pipe closes with a non-zero code; Claude may be gone, but SSH doesn't know why.

So the condition `result.code !== 0` is an unreliable trigger for "Claude failed" — it fires on SSH-level problems where Claude may still be alive. The retry then spawns a second Claude process on top of a running one.

Right fix: before any retry, check whether a Claude process is already running on the member (via `ps`/`Get-Process`). If one exists, skip the retry entirely — don't rely on SSH exit codes as a proxy for Claude's state. This subsumes the error-pattern approach and handles all failure modes correctly.

Relates to [#71] (BUSY check before session reset) — same root cause.
Related
Tickets: #71
Originally posted by: kumaakh
Precise fix: track the specific PID, not "any claude process"
Checking for any running claude process is too broad — the member may have other Claude sessions running (user's own work, other projects). Only the PID of the process spawned by this `execute_prompt` call is safe to kill.

Required changes
1. `execCommand` must expose the child PID

The SSH exec (or local spawn) needs to return the PID of the launched process alongside stdout/stderr/code, so the caller can act on it.

2. `executePrompt` stores the PID in the member registry (persisted)

After `strategy.execCommand(claudeCmd)` starts, record the PID against the member ID. It must survive fleet server restarts — so it goes in the persistent agent registry, not a local variable.

3. Before the stale-session retry: kill that PID

Instead of blindly spawning a second process, kill the stored PID first, then retry.

4. Before any new `execute_prompt` on the same member: kill stored PID if still alive

At the top of `executePrompt`, before writing the prompt file, run the same kill-and-clear check.

Failsafe properties
- Kill is forced: on Windows `taskkill /F /PID <pid>`; on Unix `kill -9 <pid>`

Originally posted by: kumaakh
Note: the PID tracking and kill logic belongs in the provider-agnostic core (execute-prompt.ts), not in the Claude provider. Each provider already exposes its process name (e.g. 'claude', 'gemini', etc.) — but the PID itself is what must be tracked, not the process name, so the kill is always targeted at the exact spawned process regardless of provider.
Originally posted by: kumaakh
How to capture the PID safely: print from the shell wrapper at launch
We cannot get the PID from `execCommand`'s return value — it only returns after the full process finishes. The PID must be captured immediately at launch, before any failure can occur.

Approach: shell wrapper that announces PID to stdout

Wrap the LLM invocation so the shell starts it as a child, immediately prints `FLEET_PID:<pid>`, then waits. The fleet server reads that line from streaming stdout as soon as it arrives. The Unix and Windows (PowerShell) wrapper forms are shown in the issue body above.
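Restating the Unix wrapper from the issue body as it might be emitted from the OS commands layer — `wrapWithPidAnnounce` is a hypothetical name for the piece `buildAgentPromptCommand` would call; the Windows form would analogously emit the `Start-Process -PassThru` block:

```typescript
// Wrap any command so the shell announces the child PID before waiting.
function wrapWithPidAnnounce(cmd: string): string {
  // `$!` is the PID of the backgrounded child; `wait $!` blocks until it
  // finishes, and `exit $?` forwards the child's real exit code to SSH.
  return `${cmd} & echo "FLEET_PID:$!"; wait $!; exit $?`;
}
```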
This is provider-agnostic — the wrapper goes in `getOsCommands()` (which already abstracts Unix vs Windows). `buildAgentPromptCommand` emits the wrapped form. No changes needed to any LLM binary.

Required changes to `execCommand`

`execCommand` currently buffers all stdout and returns it at the end. It needs to stream stdout and parse for `FLEET_PID:<pid>` as soon as that line arrives: on a line matching `/^FLEET_PID:(\d+)$/`, store the PID in the member registry immediately and continue reading. If the SSH connection drops after the PID line is received, the PID is already persisted and can be used to kill the process.
Kill on retry / new dispatch
Before the stale-session retry in `executePrompt`, kill the stored PID first. `tryKillPid` uses `getOsCommands().killPid(pid)` — on Unix `kill -9 <pid>`, on Windows `taskkill /F /PID <pid>`. Non-blocking, handles "process not found" gracefully.

The same check runs at the top of `executePrompt` before a new dispatch, to clean up any leftover from a previous call.

Originally posted by: kumaakh
Technical direction: The issue body already contains a precise implementation plan. To summarize the recommended execution order:
Files: execute-prompt.ts, os-commands.ts, providers/claude.ts, providers/gemini.ts, services/strategy.ts, utils/agent-helpers.ts, types.ts.
Ticket changed by: kumaakh