# Technical Architecture: Concurrent vs. Parallel Execution

**Version:** 1.0.0
**Date:** 2025-10-31
**Audience:** Technical decision-makers, engineers

---

## Quick Definition

| Term | What It Is | Our Use |
|------|-----------|---------|
| **Parallel** | Multiple processes on different CPUs simultaneously | NOT what we do |
| **Concurrent** | Multiple requests submitted at once, processed in a queue | What we actually do |
| **Sequential** | One after another, waiting for each to complete | Single-agent mode |

---

## What the Task Tool Actually Does

### When You Call Task()

```
Your Code (Main Thread)
│
├─ Create Task 1 payload
├─ Create Task 2 payload
├─ Create Task 3 payload
└─ Create Task 4 payload
   │
   └─ Submit all 4 HTTP requests to the Anthropic API simultaneously
      (This is "concurrent submission")
```

### At Anthropic's API Level

```
HTTP Requests Arrive at API
│
└─ Rate Limit Check
   ├─ RPM (Requests Per Minute): X available
   ├─ TPM (Tokens Per Minute): Y available
   └─ Concurrent Request Count: Z allowed
      │
      └─ Queue Processing
         ├─ Request 1: Processing...
         ├─ Request 2: Waiting (might queue if limit hit)
         ├─ Request 3: Waiting (might queue if limit hit)
         └─ Request 4: Waiting (might queue if limit hit)
            │
            └─ Results Returned (in any order)
               ├─ Response 1: Ready
               ├─ Response 2: Ready
               ├─ Response 3: Ready
               └─ Response 4: Ready
                  │
                  └─ Your Code (Main Thread BLOCKS)
                     └─ Waits for all 4 responses before continuing
```

---

## Rate Limits and Concurrency

### Your API Account Limits

Anthropic enforces **per-minute limits** (example values):

```
Requests Per Minute (RPM): 500 max
Tokens Per Minute (TPM):   100,000 max
Concurrent Requests:       20 max
```

### What Happens When You Launch 4 Concurrent Agents

```
Scenario 1: Off-Peak, Plenty of Quota
├─ All 4 requests accepted immediately
├─ All process somewhat in parallel (within API limits)
├─ Combined result: ~20-30% time savings
└─ Token usage: Standard rate

Scenario 2: Near Rate Limit (498/500 RPM already used this minute)
├─ Request 1: Accepted (499/500 RPM used)
├─ Request 2: Accepted (500/500 RPM used)
├─ Request 3: Queued (RPM limit hit)
├─ Request 4: Queued (RPM limit hit)
├─ Requests 3-4 wait for the next minute window
└─ Result: Sequential execution, same speed as a single agent

Scenario 3: Token Limit Hit (~40K of the 100K TPM window already
consumed by other traffic)
├─ Request 1: ~25,000 tokens
├─ Request 2: ~25,000 tokens
├─ Request 3: REJECTED (would exceed TPM)
├─ Request 4: REJECTED (would exceed TPM)
└─ Result: Task fails, agents don't run
```

### Cost Implications

```
Running 4 concurrent agents always costs:
- Agent 1: ~15-18K tokens
- Agent 2: ~15-18K tokens
- Agent 3: ~15-18K tokens
- Agent 4: ~12-15K tokens
Total:     ~57-69K tokens

Regardless of whether they run in parallel or queue sequentially,
the TOKEN COST is the same (you pay for the analysis).
The TIME COST varies (might be slower if queued).
```

---

## The Illusion of Parallelism

### What Marketing Says

> "4 agents run in parallel"

### What Actually Happens

```
Timeline for 4 Concurrent Agents (Best Case - Off-Peak)

Time     Agent 1        Agent 2        Agent 3        Agent 4
────────────────────────────────────────────────────────────────
0ms      Start          Start          Start          Start
100ms    Processing...  Processing...  Processing...  Processing...
500ms    Processing...  Processing...  Processing...  Processing...
1000ms   Processing...  Processing...  Processing...  Processing...
1500ms   Processing...  Processing...  Processing...  Processing...
2000ms   Processing...  Processing...  Processing...  Processing...
2500ms   DONE ✓         DONE ✓         DONE ✓         DONE ✓

Result Time: ~2500ms (all done roughly together)
Total work done: 4 × 2500ms = 10,000ms
Running the same 4 agents sequentially: ~4 × 2500ms = 10,000ms
Speedup vs. a single agent: none (still ~2500ms wall time, but...
concurrent!)
Speedup vs. a sequential 4-agent run: ~4x wall clock, at identical
total compute and token cost
```

### Reality: API Queuing

```
Timeline for 4 Concurrent Agents (Realistic - Some Queuing)

Time     Agent 1        Agent 2        Agent 3        Agent 4
────────────────────────────────────────────────────────────────
0ms      Start          Start          Queue...       Queue...
100ms    Processing...  Processing...  Queue...       Queue...
500ms    Processing...  Processing...  Queue...       Queue...
1000ms   DONE ✓         Processing...  Queue...       Queue...
1500ms   (free)         Processing...  Start          Queue...
2000ms   (free)         DONE ✓         Processing...  Start
2500ms   (free)         (free)         Processing...  Processing...
3000ms   (free)         (free)         DONE ✓         Processing...
3500ms   (free)         (free)         (free)         DONE ✓

Result Time: ~3500ms (more like sequential)
Speedup: ~0% (can even be slower than a single sequential agent)
```

---

## Why This Matters for Your Design

### Token Budget Impact

```
Your Monthly Token Budget: 5,000,000 tokens

Single Agent Review:      35,000 tokens
Can do:                   142 reviews per month

Concurrent Agents Review: 68,000 tokens
Can do:                   73 reviews per month

Cost multiplier: ~2x
```

### Decision Matrix

| Situation | Use This | Use Single Agent | Why |
|-----------|----------|------------------|-----|
| Off-peak hours | ✓ | - | Concurrency works |
| Peak hours | - | ✓ | Queuing makes it slow |
| Cost sensitive | - | ✓ | 2x cost is significant |
| One file change | - | ✓ | Overkill |
| Release review | ✓ | - | Worth the cost |
| Multiple perspectives needed | ✓ | - | Value in specialization |
| Emergency fix | - | ✓ | Speed doesn't help |
| Enterprise quality | ✓ | - | Multi-expert review valuable |

---

## API Rate Limit Scenarios

### Scenario 1: Hitting RPM Limit

```
Your account: 500 RPM limit

4 concurrent agents @ ~100 requests each:
- Agent 1's requests: Success (100/500 used)
- Agent 2's requests: Success (200/500 used)
- Agent 3's requests: Success (300/500 used)
- Agent 4's requests: Success (400/500 used)

In the same minute, if you launch a fifth agent:
- Agent 5's requests: REJECTED (500/500 limit hit)
- Error: "Rate limit exceeded"
```

### Scenario 2: Hitting TPM Limit

```
Your account: 100,000 TPM limit

4 concurrent agents:
- Agent 1: ~25,000 tokens (25K/100K used)
- Agent 2: ~25,000 tokens (50K/100K used)
- Agent 3: ~25,000 tokens (75K/100K used)
- Agent 4: ~20,000 tokens (95K/100K used)

Agent 4 completes, you start another review:
- Next analysis needs ~25,000 tokens
- Available: 5,000 tokens
- REJECTED: Exceeds TPM limit
- Wait until: Next minute window
```

### Scenario 3: Concurrent Request Limit

```
Your account: 20 concurrent requests allowed

4 concurrent agents:
- Agents 1-4: OK (4/20 quota)

Someone else on your account launches 17 more agents:
- Agents 5-20: OK (20/20 quota)
- Agent 21: LIMIT EXCEEDED
- That agent gets: "Concurrency limit exceeded"
- Execution: Queued or failed
```

---

## Understanding "Concurrent Submission"

### What It Looks Like in Code

```python
# Master Orchestrator (pseudo-code: launch_agent and the Agent
# payload helpers are illustrative, not a real SDK)
from concurrent.futures import ThreadPoolExecutor

def run_concurrent_agents(context):
    payloads = [
        Agent.code_review(context),
        Agent.architecture(context),
        Agent.security(context),
        Agent.multi_perspective(context),
    ]
    # Submit all 4 agents at once (concurrent submission)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(launch_agent, p) for p in payloads]
        # Block until all 4 complete
        return [f.result() for f in futures]
```

### What Actually Happens at API Level

```
1. Prepare 4 HTTP requests
2. Send all 4 requests to the API at once (concurrent submission)
3. API receives all 4 requests
4. API checks rate limits (RPM, TPM, concurrent limit)
5. API queues them as capacity allows
6. Requests are processed from the queue (could be parallel, could
   be sequential)
7. Results are returned as they complete
8. Your code waits for all 4 results (blocking)
9. Continue when all 4 are done
```

### The Key Distinction

```
CONCURRENT SUBMISSION (What we do):
├─ 4 requests submitted at the same time
├─ But the API decides how to process them
└─ Could be parallel, could be sequential

TRUE PARALLEL (Not what we do):
├─ 4 requests execute on 4 different processors
├─ Guaranteed simultaneous execution
└─ No queueing, no waiting
```

---

## Why We're Not Parallel

### Hardware Reality

```
Your Computer:
├─ CPU: 1-16 cores (for you)
└─ But HTTP requests go to Anthropic's servers

Anthropic's Servers:
├─ Thousands of cores
├─ Processing requests from thousands of customers
├─ Your 4 requests share infrastructure with 10,000+ others
└─ They decide how to allocate resources
```

### Request Processing

```
Your Request ──HTTP──> Anthropic API ──> GPU Cluster
                                          │
                 (Thousands of queries being processed)
                                          │
                          Your request waits its turn
                                          │
                              When available: Process
                                          │
               Return response ──HTTP──> Your Code
```

---

## Actual Performance Gains

### Best Case (Off-Peak)

```
Stages 2-5 Duration:
- Sequential: 28-45 minutes
- Concurrent: 18-20 minutes
- Gain: ~40%

But this
requires: - No other users on API - No rate limiting - Sufficient TPM budget - Rare in production ``` ### Realistic Case (Normal Load) ``` Stages 2-5 Duration: - Sequential: 28-45 minutes - Concurrent: 24-35 minutes - Gain: ~20-30% With typical: - Some API load - No rate limiting hits - Normal usage patterns ``` ### Worst Case (Peak Load) ``` Stages 2-5 Duration: - Sequential: 28-45 minutes - Concurrent: 32-48 minutes - Gain: Negative (slower) When: - High API load - Rate limiting active - High token usage - Results in queueing ``` --- ## Calculating Your Expected Speedup ``` Formula: Expected Time = Base Time × (1 - Concurrency Efficiency) Concurrency Efficiency = Percentage of APIs that process parallel If 80% of the time agents run parallel: - Expected Time = 37 min × (1 - 0.8) = 37 min × 0.2 = 7.4 min faster - Total: 37 - 7.4 = 29.6 minutes If 20% of the time agents run parallel (high load): - Expected Time = 37 min × (1 - 0.2) = 37 min × 0.8 = 29.6 min savings - Total: 37 - 1 = 36 minutes (almost no speedup) ``` --- ## Recommendations ### When to Use Concurrent Agents 1. **Off-peak hours** (guaranteed better concurrency) 2. **Well below rate limits** (room for 4 simultaneous requests) 3. **Token budget permits** (2x cost is acceptable) 4. **Quality > Speed** (primary motivation is thorough review) 5. **Enterprise standards** (multiple expert perspectives required) ### When to Avoid 1. **Peak hours** (queueing dominates) 2. **Near rate limits** (risk of failures) 3. **Limited token budget** (2x cost is expensive) 4. **Speed is primary** (20-30% is not meaningful) 5. **Simple changes** (overkill) ### Monitoring Your API Health ```bash # Track your usage: 1. Monitor RPM: requests per minute 2. Monitor TPM: tokens per minute 3. Monitor Response times 4. 
   Track errors from rate limiting

# Good signs for concurrent agents:
- RPM usage < 50% of limit
- TPM usage < 50% of limit
- Response times stable
- No rate limit errors

# Bad signs:
- Frequent rate limit errors
- Response times > 2 seconds
- TPM usage > 70% of limit
- RPM usage > 60% of limit
```

---

## Summary

The Master Orchestrator **submits 4 requests concurrently**, but:

- ✗ NOT true parallel (depends on the API queue)
- ✓ Provides context isolation (each agent gets a clean context)
- ✓ Offers multi-perspective analysis (specialization benefits)
- ⚠ Costs 2x tokens (regardless of execution model)
- ⚠ Speedup is typically 20-30% (up to ~40% only in ideal off-peak
  conditions), not 40-50%
- ⚠ Can degrade to sequential during high load

**Use when**: Quality and multiple perspectives matter more than cost/speed.
**Avoid when**: Cost or speed is the primary concern.

See [REALITY.md](REALITY.md) for an honest assessment and [TOKEN-USAGE.md](TOKEN-USAGE.md) for detailed cost analysis.
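
---

## Appendix: Submit-Then-Block in Plain Python

The submit-then-block pattern described throughout this document can be sketched in standard-library Python. This is a minimal sketch, not a real SDK: `review_with_agent` is a hypothetical stand-in for one HTTP call to the model API, and `CONCURRENT_LIMIT` mirrors the example account cap above. The point it demonstrates is the key distinction from earlier sections: all four payloads are *submitted* at once, but completion order is up to the service, and the main thread blocks until every response is back.

```python
# Minimal sketch of "concurrent submission" (all names illustrative).
from concurrent.futures import ThreadPoolExecutor, as_completed

CONCURRENT_LIMIT = 20  # example account-level concurrent-request cap

def review_with_agent(agent_name: str, context: str) -> str:
    """Stand-in for one HTTP request to the model API."""
    # A real call would block on the network here; the API, not this
    # code, decides whether the four requests overlap or queue.
    return f"{agent_name}: reviewed {context}"

def run_concurrent_agents(context: str) -> dict:
    agents = ["code_review", "architecture", "security", "multi_perspective"]
    # Submit all requests at once, never exceeding the concurrency cap.
    workers = min(len(agents), CONCURRENT_LIMIT)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(review_with_agent, a, context): a
                   for a in agents}
        results = {}
        for fut in as_completed(futures):  # responses arrive in any order
            results[futures[fut]] = fut.result()
    # Reaching this line means all 4 responses are in (the blocking wait).
    return results

if __name__ == "__main__":
    print(run_concurrent_agents("release diff"))
```

Note the design choice: `as_completed` collects responses in completion order (matching "Results Returned (in any order)" above), while the function as a whole still blocks until all four agents finish, exactly like the orchestrator's `wait_for_all` step.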