
#147 execute_prompt: kill previous agent instances on same member before starting new session

Status: closed
Owner: nobody
Labels: None
Updated: 2026-04-24
Created: 2026-04-16
Creator: Anonymous
Private: No

Originally created by: kumaakh
Originally owned by: kumaakh

Problem

When execute_prompt is called on a member, it can silently spawn multiple LLM processes that never terminate. Two sources combine to cause this:

Source 1 — Internal retries in executePrompt (src/tools/execute-prompt.ts lines 139–152)

Every execute_prompt call contains two internal retry paths:

// Retry 1: stale session
if (result.code !== 0 && input.resume && agent.sessionId) {
  result = await strategy.execCommand(retryCmd, timeoutMs); // spawns 2nd process
}

// Retry 2: server/overload error
if (result.code !== 0 && isRetryable(classifyError(stderr || stdout))) {
  result = await strategy.execCommand(retryCmd, timeoutMs); // spawns 3rd process
}

Each execute_prompt call can spawn up to 3 LLM processes. Neither retry kills the previous process before starting the next one.

Source 2 — SSH exit code is not the LLM exit code

result.code comes from strategy.execCommand which is an SSH exec — not from the LLM process itself. They only match when the SSH connection stays alive for the full run and the LLM exits cleanly. In all failure modes that matter:

  • SSH drops mid-run → non-zero, but LLM is still running on the remote machine
  • Process killed externally → SSH pipe closes, non-zero, LLM may be gone but SSH doesn't know why
  • SSH timeout → SSH exits, LLM keeps running
  • Network blip → non-zero SSH, LLM unaffected

The stale session retry fires on any non-zero SSH exit — including kills and network drops where the LLM is still running. The retry then spawns a second process on top of a live one.

Result observed: 5 LLM processes running simultaneously on odm-ssdev, all blocked inside long-running Bash tool calls (vstest with no timeout), all running conflicting work against the same cameras.


Root cause: no reliable way to know if LLM is running, and no stored PID to kill

The fleet has no record of what PID was spawned for a given member. It cannot:

  • Check whether the previous LLM process is still alive before retrying
  • Kill the specific process that belongs to this member's session

Checking for "any LLM process running" (e.g. Get-Process claude) is not safe — the member may have unrelated Claude sessions (user's own work, other projects). Only the specific PID spawned by this execute_prompt call is safe to kill.


Implementation approach

Step 1 — Capture PID immediately at launch via shell wrapper

execCommand currently returns only after the full process finishes — the PID is not available until it's too late. The shell wrapper must announce the PID to stdout before the LLM does any work, so the fleet server can record it immediately even if the connection later drops.

Modify buildAgentPromptCommand (via getOsCommands()) to wrap the LLM invocation:

Unix:

claude -p "..." --output-format json --max-turns 80 & echo "FLEET_PID:$!"; wait $!; exit $?

Windows (PowerShell):

$p = Start-Process claude.exe -ArgumentList ("-p","...","--output-format","json",...) -PassThru -NoNewWindow -Wait:$false
Write-Host "FLEET_PID:$($p.Id)"
$p.WaitForExit()
exit $p.ExitCode

This is provider-agnostic — the wrapper lives in the OS commands layer, not tied to any specific LLM binary.
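
For illustration, the Unix branch of that emission could be a one-line wrapper in the OS commands layer (wrapWithPidAnnounce is a hypothetical name introduced here; the real entry point is buildAgentPromptCommand via getOsCommands()):

// Hypothetical sketch of the Unix wrapper emission; not existing code.
function wrapWithPidAnnounce(llmCmd: string): string {
  // Background the LLM, announce its PID on stdout immediately,
  // then wait on it and forward its real exit code.
  return `${llmCmd} & echo "FLEET_PID:$!"; wait $!; exit $?`;
}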

Step 2 — Stream stdout in execCommand and parse the PID line early

execCommand must stream stdout rather than buffer it. On receiving a line matching /^FLEET_PID:(\d+)$/:

  1. Extract the PID
  2. Persist it to the member registry immediately (survives fleet server restart)
  3. Continue reading stdout for the normal response
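
A minimal sketch of that streaming parse using Node's readline (readWithPid and onPid are illustrative names, not existing code):

import * as readline from 'node:readline';
import type { Readable } from 'node:stream';

const PID_LINE = /^FLEET_PID:(\d+)$/;

// Consume streaming stdout, firing onPid as soon as the wrapper's
// announcement arrives; every other line is collected as the response.
async function readWithPid(
  stdout: Readable,
  onPid: (pid: number) => void,
): Promise<string> {
  const lines: string[] = [];
  const rl = readline.createInterface({ input: stdout });
  for await (const line of rl) {
    const m = PID_LINE.exec(line.trim()); // trim guards against stray \r on Windows
    if (m) {
      onPid(Number(m[1])); // persist to the member registry here
      continue;            // the marker line is not part of the LLM's output
    }
    lines.push(line);
  }
  return lines.join('\n');
}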

Step 3 — Kill stored PID before any retry

In executePrompt, before each retry:

const prevPid = getStoredPid(agent.id);
if (prevPid) {
  await tryKillPid(agent, prevPid); // non-blocking, handles "not found" gracefully
  clearStoredPid(agent.id);
}
// now safe to spawn retry
result = await strategy.execCommand(retryCmd, timeoutMs);

tryKillPid uses getOsCommands().killPid(pid):

  • Unix: kill -9 <pid>
  • Windows: taskkill /F /PID <pid>

Best-effort: if the process is already gone, the kill is a no-op; if it fails or times out, log and continue rather than block the new dispatch.
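
As a sketch, tryKillPid could look like this (getStrategyFor is a hypothetical accessor introduced here; the real code may already have the strategy in scope):

// Assumed shapes for the sketch — real signatures may differ.
type Agent = { id: string };
declare function getOsCommands(): { killPid(pid: number): string };
declare function getStrategyFor(agent: Agent): {
  execCommand(cmd: string, timeoutMs: number): Promise<unknown>;
};

// Best-effort kill of a previously spawned LLM process. Sketch only.
async function tryKillPid(agent: Agent, pid: number): Promise<void> {
  const killCmd = getOsCommands().killPid(pid); // kill -9 <pid> / taskkill /F /PID <pid>
  try {
    // Short timeout: a kill is near-instant; never block the new dispatch on it.
    await getStrategyFor(agent).execCommand(killCmd, 5_000);
  } catch {
    // Process already gone ("not found") or transport error: treat as a no-op.
  }
}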

Step 4 — Kill stored PID at the start of every new execute_prompt

At the top of executePrompt, before writing the prompt file:

const prevPid = getStoredPid(agent.id);
if (prevPid) {
  await tryKillPid(agent, prevPid);
  clearStoredPid(agent.id);
}

This handles the case where a previous execute_prompt call left a process running (SSH drop, external kill, whatever) and a new call arrives later.

Persistence

The stored PID lives in the member registry (same store as sessionId, tokenUsage, etc.) so it survives fleet server restarts. Field name: activePid?: number. Cleared when:

  • The process is successfully killed
  • The LLM process exits normally (end of executePrompt success path)
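
A minimal sketch of the helper trio (registry.get / registry.update stand in for whatever the member registry actually exposes):

// Assumed registry shape for the sketch.
declare const registry: {
  get(id: string): { activePid?: number } | undefined;
  update(id: string, patch: { activePid?: number }): void;
};

export function getStoredPid(agentId: string): number | undefined {
  return registry.get(agentId)?.activePid;
}

export function setStoredPid(agentId: string, pid: number): void {
  registry.update(agentId, { activePid: pid }); // persisted: survives server restarts
}

export function clearStoredPid(agentId: string): void {
  registry.update(agentId, { activePid: undefined });
}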

Files to change

  • src/os/os-commands.ts: Add killPid(pid) to the OS command interface + Unix/Windows implementations
  • src/providers/claude.ts (and others): buildAgentPromptCommand emits the PID-announcing shell wrapper
  • src/tools/execute-prompt.ts: Kill the stored PID before each retry and at the start of every new call; clear the PID on success
  • src/utils/agent-helpers.ts: Add getStoredPid / setStoredPid / clearStoredPid helpers
  • src/types.ts: Add activePid?: number to the Agent type
  • src/services/strategy.ts (SSH + local): Stream stdout; parse and callback on the FLEET_PID: line before returning
Related

  • [#71] — BUSY check before session reset (same root cause, superseded by this)
  • [#148] — Local background agent cancellation (separate but related problem)

Discussion

  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    Root cause identified — internal retries in execute-prompt.ts

    Found in src/tools/execute-prompt.ts lines 139–152. Every execute_prompt call has two built-in retry paths, each spawning a new Claude process without killing the previous one:

    1. Stale session retry (line 140): If the first attempt fails and resume=true with a stored session ID, immediately retries without the session ID — new process, old one still running.
    2. Server/overload retry (line 147): If that also fails with a retryable error, waits 5s then retries again — third process, previous two still running.

    So each execute_prompt call can spawn up to 3 Claude processes. With 4 external execute_prompt calls (from a misbehaving dispatch agent), that's up to 12 processes — we observed 5.

    The fix needs to happen at the retry points: before starting the retry process, kill the previous one. This requires tracking the PID of the running process (from strategy.execCommand) and terminating it before the retry execCommand call.

    The stale-session retry is the most impactful one to fix first — it fires immediately on any failure, producing a second process before the first has had a chance to fully exit.

     
  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    Exact trigger conditions

    2nd process (stale session retry):

    if (result.code !== 0 && input.resume && agent.sessionId) {
    

    Fires on any non-zero exit code when resume=true and a sessionId exists — including process kill, auth failure, or timeout. The intent is to handle stale session IDs, but the condition is too broad. A killed process returns non-zero and triggers this immediately.

    3rd process (server/overload retry):

    if (result.code !== 0 && isRetryable(classifyError(stderr || stdout))) {
    

    Narrower — only fires if output contains 500/502/503/429/529/overloaded/rate-limit patterns. Less likely to fire on kills.

    Fix: Narrow the stale session retry condition to only fire when the error is actually a stale/invalid session, not on any non-zero exit. Requires adding a stale_session error category with patterns matching claude's actual "session not found" / "invalid session" output:

    if (result.code !== 0 && input.resume && agent.sessionId 
        && classifyError(output) === 'stale_session') {
    

    This prevents kills, auth errors, and unknown failures from spawning a second process.
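
    For illustration, the new category might gate like this (the pattern strings are placeholders until verified against claude's real error output):

    // Sketch only: patterns must be checked against actual "session not
    // found" / "invalid session" messages before shipping.
    const STALE_SESSION_PATTERNS: RegExp[] = [
      /session.*not found/i,
      /invalid session/i,
    ];

    function isStaleSession(output: string): boolean {
      return STALE_SESSION_PATTERNS.some((p) => p.test(output));
    }

    // classifyError() returns 'stale_session' when isStaleSession(output) is
    // true, so the retry condition can gate on that category alone.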

     
  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    Deeper issue: SSH exit code is not Claude's exit code

    The retry logic checks result.code !== 0 where result is from strategy.execCommand(claudeCmd) — this is the SSH exec result, not Claude's exit code. They only match when the SSH connection stays up for the full Claude run and Claude exits cleanly.

    In all the failure modes that actually matter:

    • SSH drops mid-run → non-zero, but Claude is still running on the remote machine
    • Process killed externally (Stop-Process) → SSH pipe closes, non-zero, Claude may be gone but SSH doesn't know why
    • SSH timeout → SSH exits non-zero, Claude keeps running
    • Network blip → non-zero SSH, Claude unaffected

    So the condition result.code !== 0 is an unreliable trigger for "Claude failed" — it fires on SSH-level problems where Claude may still be alive. The retry then spawns a second Claude process on top of a running one.

    Right fix: before any retry, check whether a Claude process is already running on the member (via ps/Get-Process). If one exists, skip the retry entirely — don't rely on SSH exit codes as a proxy for Claude's state. This subsumes the error-pattern approach and handles all failure modes correctly.

    Relates to [#71] (BUSY check before session reset) — same root cause.

     


  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    Precise fix: track the specific PID, not "any claude process"

    Checking for any running claude process is too broad — the member may have other Claude sessions running (user's own work, other projects). Only the PID of the process spawned by this execute_prompt call is safe to kill.

    Required changes

    1. execCommand must expose the child PID
    The SSH exec (or local spawn) needs to return the PID of the launched process alongside stdout/stderr/code, so the caller can act on it.

    2. executePrompt stores the PID in the member registry (persisted)
    After strategy.execCommand(claudeCmd) starts, record the PID against the member ID. Must survive fleet server restarts — so it goes in the persistent agent registry, not a local variable.

    3. Before the stale-session retry: kill that PID
    Instead of blindly spawning a second process:

    // before retry
    if (storedPid) { killPid(storedPid); clearStoredPid(agent.id); }
    const retryCmd = ...
    result = await strategy.execCommand(retryCmd, timeoutMs);
    // store new PID
    

    4. Before any new execute_prompt on the same member: kill stored PID if still alive
    At the top of executePrompt, before writing the prompt file:

    const prev = getStoredPid(agent.id);
    if (prev) { tryKillPid(prev); clearStoredPid(agent.id); } // best-effort, non-blocking
    

    Failsafe properties

    • If the process already exited, kill is a no-op (handle ESRCH gracefully)
    • If kill times out, log and continue — don't block the new dispatch
    • On Windows: taskkill /F /PID <pid>; on Unix: kill -9 <pid>
    • PID is cleared from registry once killed or confirmed dead
     
  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    Note: the PID tracking and kill logic belongs in the provider-agnostic core (execute-prompt.ts), not in the Claude provider. Each provider already exposes a process name (e.g. 'claude', 'gemini', etc.), but the PID itself is what must be tracked, not the process name, so the kill is always targeted at the exact spawned process regardless of provider.

     
  • Anonymous

    Anonymous - 2026-04-16

    Originally posted by: kumaakh

    How to capture the PID safely: print from the shell wrapper at launch

    We cannot get the PID from execCommand's return value — it only returns after the full process finishes. The PID must be captured immediately at launch, before any failure can occur.

    Approach: shell wrapper that announces PID to stdout

    Wrap the LLM invocation so the shell starts it as a child, immediately prints FLEET_PID:<pid>, then waits. The fleet server reads that line from streaming stdout as soon as it arrives.

    Unix:

    claude -p "..." --output-format json --max-turns 80 & echo "FLEET_PID:$!"; wait $!; exit $?
    

    Windows (PowerShell):

    $p = Start-Process claude.exe -ArgumentList ("-p", "...", "--output-format", "json", ...) -PassThru -NoNewWindow -Wait:$false
    Write-Host "FLEET_PID:$($p.Id)"
    $p.WaitForExit()
    exit $p.ExitCode
    

    This is provider-agnostic — the wrapper goes in getOsCommands() (already abstracts Unix vs Windows). buildAgentPromptCommand emits the wrapped form. No changes needed to any LLM binary.

    Required changes to execCommand

    execCommand currently buffers all stdout and returns it at the end. It needs to stream stdout and parse for FLEET_PID:<pid> as soon as that line arrives:

    1. Start the SSH exec with streaming stdout
    2. On first line matching /^FLEET_PID:(\d+)$/: store PID in member registry immediately, continue reading
    3. Collect remaining stdout for the normal response
    4. Return as today

    If the SSH connection drops after the PID line is received, the PID is already persisted and can be used to kill the process.

    Kill on retry / new dispatch

    Before the stale-session retry in executePrompt:

    const prevPid = getStoredPid(agent.id);
    if (prevPid) { await tryKill(agent, prevPid); clearStoredPid(agent.id); }
    

    tryKill uses getOsCommands().killPid(pid) — on Unix kill -9 <pid>, on Windows taskkill /F /PID <pid>. Non-blocking, handles "process not found" gracefully.

    Same check at the top of executePrompt before a new dispatch to clean up any leftover from a previous call.

     
  • Anonymous

    Anonymous - 2026-04-23

    Originally posted by: kumaakh

    Technical direction: The issue body already contains a precise implementation plan. To summarize the recommended execution order:

    1. Narrow the stale-session retry condition in src/tools/execute-prompt.ts (line ~140) — add a stale_session error category to classifyError() with patterns matching Claude's actual invalid-session output. This is the quickest win and prevents most phantom processes.
    2. Shell wrapper + PID streaming: modify buildAgentPromptCommand in each provider (src/providers/claude.ts, src/providers/gemini.ts) to wrap the LLM invocation so the PID is printed to stdout as FLEET_PID:<pid> before the LLM does any work.
    3. Stream execCommand stdout in src/services/strategy.ts (SSH + local): parse the FLEET_PID: line early, persist the PID to the member registry (activePid field in src/types.ts).
    4. Kill the stored PID before any retry and at the start of every new execute_prompt call. Add killPid(pid) to src/os/os-commands.ts (Unix: kill -9, Windows: taskkill /F /PID); a sketch of this interface addition follows the file list below.

    Files: execute-prompt.ts, os-commands.ts, providers/claude.ts, providers/gemini.ts, services/strategy.ts, utils/agent-helpers.ts, types.ts.
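
    A sketch of the killPid addition (the interface shape and object names are assumptions about the existing os-commands layer):

    interface OsCommands {
      // ...existing commands
      killPid(pid: number): string; // returns the shell command to run on the member
    }

    const unixCommands: Pick<OsCommands, 'killPid'> = {
      killPid: (pid) => `kill -9 ${pid}`,
    };

    const windowsCommands: Pick<OsCommands, 'killPid'> = {
      killPid: (pid) => `taskkill /F /PID ${pid}`,
    };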

     
  • Anonymous

    Anonymous - 2026-04-24

    Ticket changed by: kumaakh

    • status: open --> closed
     
