Add first-class cloud compute member support (AWS EC2)
Apra Fleet is an open-source MCP server
Originally created by: joiskash
- restart_command) and auto-retry
- cloud_control tool for manual start/stop/status
- monitor_task tool for structured task monitoring (PID, GPU, crash detection, log tail)
- fleet_status showing instance state, uptime, cost estimate
- docs/cloud-compute.md

Manual EC2 management with shell scripts (fleet-ec2.sh, watchdog) was unreliable:
- cloud config block (no new member type)
- aws ec2 ...), no SDK dependency
- ensureCloudReady() lifecycle wrapper before SSH strategy, transparent to tools
- nvidia-smi before idle stop
Originally posted by: kumaakh
PR [#1] Review — Add First-Class Cloud Compute Member Support (AWS EC2)
1. What This PR Does
Adds end-to-end AWS EC2 lifecycle management to apra-fleet. Cloud members are remote members with an additional cloud config block: no new member type, no new strategy class. The implementation covers:
- restart_command), activity markers, and auto-retry
- cloud_control (manual start/stop/status) and monitor_task (structured task monitoring)
- fleet_status and member_detail show instance state, uptime, cost estimate, GPU utilization

Motivation
Manual EC2 management with shell scripts (fleet-ec2.sh, watchdog) was unreliable:
2. Architecture
New files (services/cloud/)
- types.ts
- aws.ts: aws ec2 CLI commands
- lifecycle.ts: ensureCloudReady(), state machine for instance startup + SSH polling + auth re-provision
- activity.ts: checkMemberActivity(), GPU (nvidia-smi) + process checks → busy-gpu/busy-process/idle/unknown
- idle-manager.ts
- task-wrapper.ts
- cost.ts

Modified files
- types.ts: cloud?: CloudConfig to Agent interface
- register-member.ts
- update-member.ts
- execute-command.ts: ensureCloudReady + touchAgent + long_running task wrapper support
- execute-prompt.ts: ensureCloudReady + touchAgent
- send-files.ts: ensureCloudReady + touchAgent
- check-status.ts
- member-detail.ts
- index.ts
- agent-helpers.ts: setIdleTouchHook() + touchAgent() with session update
- os-commands.ts: gpuProcessCheck() + gpuUtilization() interface
- linux.ts
- windows.ts

3. Design Decisions: Assessment
Good decisions
- child_process.exec. Profile/region support for free.
- ensureCloudReady() as lifecycle wrapper
- setIdleTouchHook() avoids circular imports between idle-manager ↔ agent-helpers.

Questionable decisions
- python3 -c is used to parse JSON for timestamp preservation. This assumes Python3 is available on all cloud instances. Could use jq or pure bash.
- If aws ec2 describe-instances hangs, ensureCloudReady hangs. Should have a timeout wrapper.

4. Security Review
Command Injection
- validateInstanceId() (regex: i-[0-9a-f]{8,17}), validateRegion() (regex), escapeShellArg() for profile
- ~/.fleet-tasks/${taskId}/
- ^task-[a-z0-9]+$

Credential Handling
No concerns found
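As a rough illustration of the command-injection checks listed in this section, the validators might look something like the sketch below. The names isValidInstanceId, isValidTaskId, and taskDir are hypothetical and not taken from the PR. The ^...$ anchors on the instance ID pattern are an assumption; only the character classes come from the review. The task ID pattern is the stricter one proposed in the improvement suggestions.

```typescript
// Hypothetical helpers, not the PR's actual code.
// Patterns are the ones quoted in this review; anchoring is assumed.
const INSTANCE_ID_RE = /^i-[0-9a-f]{8,17}$/;
const TASK_ID_RE = /^task-[a-z0-9]{8,}$/;

function isValidInstanceId(id: string): boolean {
  return INSTANCE_ID_RE.test(id);
}

function isValidTaskId(id: string): boolean {
  return TASK_ID_RE.test(id);
}

// Build the per-task directory only after the ID has been validated,
// so characters such as "/", "..", ";" or "$" never reach a shell or a path.
function taskDir(taskId: string): string {
  if (!isValidTaskId(taskId)) {
    throw new Error(`Invalid task ID: ${JSON.stringify(taskId)}`);
  }
  return `~/.fleet-tasks/${taskId}/`;
}
```

Rejecting bad input before building the path keeps the check in one place, whichever tool receives the task ID.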
5. DRY Analysis
Duplicated patterns (should consolidate)
- cost.ts: ensure all callers use it (verify they do)
- parseInt(stdout.trim()))
- parseGpuUtilization() helper in activity.ts
- getCloudSummary(agent) function
- ensureCloudReady + try/catch pattern
- withStatusline(agentId, fn) wrapper

Acceptable repetition
- touchAgent() calls in each tool: intentional, each tool is responsible for its own activity signaling
- ensureCloudReady() guard in each tool: 3 lines, extracting adds indirection without benefit

6. Test Coverage
What's well tested (336 tests)
Test gaps (should add before merge)
Mock strategy assessment
All test files use appropriate mocking:
- makeExec() factory

7. Documentation Review
docs/cloud-compute.md is comprehensive and accurate. Minor gaps:
- ~/.fleet-tasks/<taskId>/activity)
- ~/.fleet-tasks/<taskId>/output.log)
- max_retries default value (3)

8. Requirements Traceability
9. Improvement Suggestions
Before merge
1. Add unit test for restart_command retry path: F1 is a headline feature with no unit test coverage. The task wrapper generates the retry logic but it's never tested in isolation.
2. Add unit test for F5 auth re-provisioning: reProvisionAuth() in lifecycle.ts is called after cold start but not tested. Verify it calls provisionAuth() and provisionVcsAuth(), and handles errors gracefully.
3. Validate task ID format: monitor_task and execute_command (long_running) accept user-provided task IDs used in file paths. Add regex validation: ^task-[a-z0-9]{8,}$.
4. Extract GPU utilization parsing: parseInt(stdout.trim()) appears in 3 tools. Create parseGpuUtilization(stdout: string): number | undefined in activity.ts.

After merge (nice-to-have)
5. Add timeout to AWS CLI calls: exec() calls in aws.ts have no timeout. A hung aws ec2 describe-instances blocks ensureCloudReady indefinitely. Add 15s timeout.
6. Replace Python3 dependency in task wrapper: python3 -c is used to parse JSON for timestamp preservation. Use jq (more commonly available on GPU instances) or pure bash string extraction.
7. Extract cloud summary builder: check-status.ts and member-detail.ts both construct cloud detail objects with the same fields. Extract to getCloudSummary(agent): CloudSummary.
8. Add exponential backoff to SSH polling: Current linear polling (2s fixed) could be smarter: 1s, 2s, 4s, 8s... caps at 16s. Reduces unnecessary TCP connections during slow starts.
9. Document activity marker path: ~/.fleet-tasks/<taskId>/activity is critical for idle manager integration but not documented in cloud-compute.md.
10. Budget alerts (R5): Marked as future/optional in requirements. Consider adding a warning when estimated cost exceeds a threshold.
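Two of the suggestions above, the shared GPU utilization parser and the SSH polling backoff, are small pure functions. A minimal sketch of what they could look like follows. The parseGpuUtilization signature is the one proposed in this review; sshPollDelayMs and its parameters are hypothetical names, and the input is assumed to be a bare integer percentage as a plain nvidia-smi utilization query would print.

```typescript
// Sketch only: assumes stdout holds a bare integer percentage such as "87\n".
// Returns undefined when the output cannot be parsed (empty, "[N/A]", etc.).
function parseGpuUtilization(stdout: string): number | undefined {
  const value = parseInt(stdout.trim(), 10);
  return Number.isNaN(value) ? undefined : value;
}

// Hypothetical helper: delay before SSH poll number `attempt` (0-based).
// Doubles from baseMs on each attempt and never exceeds capMs.
function sshPollDelayMs(attempt: number, baseMs = 1000, capMs = 16000): number {
  const exponent = Math.max(0, Math.floor(attempt));
  return Math.min(baseMs * Math.pow(2, exponent), capMs);
}

// Example: delays for the first six polls (1s, 2s, 4s, 8s, 16s, 16s).
const firstDelays = [0, 1, 2, 3, 4, 5].map((n) => sshPollDelayMs(n));
```

Returning undefined rather than NaN lets callers distinguish "no GPU reading" from a real 0% reading with a simple check.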
10. Verification Gates (all passed on real hardware)
[cloud:running g5.2xlarge 7m $0.15], cloud_control stop works

11. Overall Assessment
Quality: High. Clean architecture (cloud logic isolated in services/cloud/), thoughtful design (safe defaults, hook pattern, state machine), comprehensive testing (336 tests), real hardware validation. The PR is well-structured with 13 work commits following the plan exactly.
Main risks: Three high-value features (F1 restart, F3 activity marker, F5 auth re-provision) are verified on real hardware but lack unit tests. Adding these tests before merge would significantly increase confidence for future refactoring.
Recommendation: Merge after adding tests for items 1-3 from the improvement list. Items 4-10 can be addressed post-merge.
Related
Tickets:
#1
Originally posted by: joiskash
PR [#1] Fix Sprint — Code Review (🟩 apra-fleet-reviewer)
Verdict: APPROVED | 393/397 tests passing (1 pre-existing Windows env test, 3 skipped)
Changes (+1,334 / -451 across 24 files)
- d05f97e
- b797b73
- 0e42830
- 371e81d
- f06435e
- aea304a
- ab4ca24

Issues Found
- update_member schema: cloud_activity_command has min(1) but description says "pass empty string to clear", which contradicts validation
- String.slice, not actual lifecycle.ts code path

Non-blocking Follow-ups
- cloud_activity_command description/schema mismatch

Requirements Coverage
All 6 user-reported issues (U1-U6) and 10 PR review suggestions addressed. Security hardening solid at schema level.
ACTIVITY_TIMEOUT_MS correctly applied for custom workload detection.
Related
Tickets:
#1
Ticket changed by: kumaakh