This update corrects misleading performance and cost claims in the documentation.

**Corrected claims:**

- Performance: changed from "40-50% faster" to "20-30% faster" (honest observation)
- Token cost: changed from "60-70% savings" to "1.9-2.0x more expensive" (actual cost)
- Parallelism: clarified "concurrent requests" vs. "true parallel execution"
- Architecture: updated from "parallel" to "concurrent" throughout

**New documentation:**

- REALITY.md: honest assessment, reality vs. marketing
- ARCHITECTURE.md: technical details on concurrent vs. parallel execution
- TOKEN-USAGE.md: detailed token cost breakdown and optimization strategies

**Updated files:**

- master-orchestrator.md: accurate performance, cost, and when-to-use guidance
- README.md: updated architecture overview and trade-offs

**Key insights:**

- The concurrent agent architecture IS valuable, but for different reasons:
  - Main-thread context stays clean (20-30% of single-agent size)
  - 4 independent expert perspectives (genuine value)
  - API rate limiting caps actual speed gains (20-30% typical)
  - Cost is 1.9-2.0x the tokens of a single-agent analysis
- Best for enterprise quality-critical work, NOT cost-sensitive projects
- Includes a decision matrix and cost optimization strategies

This update maintains technical accuracy while preserving the genuine benefits of multi-perspective analysis and context isolation that make the system valuable.
# Technical Architecture: Concurrent vs. Parallel Execution

**Version:** 1.0.0

**Date:** 2025-10-31

**Audience:** Technical decision-makers, engineers

---

## Quick Definition

| Term | What It Is | Our Use |
|------|-----------|---------|
| **Parallel** | Multiple processes on different CPUs simultaneously | NOT what we do |
| **Concurrent** | Multiple requests submitted at once, processed in queue | What we actually do |
| **Sequential** | One after another, waiting for each to complete | Single-agent mode |

---

## What the Task Tool Actually Does

### When You Call Task()

```
Your Code (Main Thread)
  │
  ├─ Create Task 1 payload
  ├─ Create Task 2 payload
  ├─ Create Task 3 payload
  └─ Create Task 4 payload
      │
      └─ Submit all 4 HTTP requests to the Anthropic API simultaneously
         (this is "concurrent submission")
```

### At Anthropic's API Level

```
HTTP Requests Arrive at API
  │
  └─ Rate Limit Check
      ├─ RPM (Requests Per Minute): X available
      ├─ TPM (Tokens Per Minute): Y available
      └─ Concurrent Request Count: Z allowed
          │
          └─ Queue Processing
              ├─ Request 1: Processing...
              ├─ Request 2: Waiting (may queue if a limit is hit)
              ├─ Request 3: Waiting (may queue if a limit is hit)
              └─ Request 4: Waiting (may queue if a limit is hit)
                  │
                  └─ Results Returned (in any order)
                      ├─ Response 1: Ready
                      ├─ Response 2: Ready
                      ├─ Response 3: Ready
                      └─ Response 4: Ready
                          │
                          └─ Your Code (Main Thread BLOCKS)
                              └─ Waits for all 4 responses before continuing
```
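
This submit-then-block pattern can be sketched in plain Python with `concurrent.futures`; the `call_api` stub is a hypothetical stand-in for the real HTTP request:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def call_api(task_name: str) -> str:
    """Stand-in for an HTTP request to the API (illustrative only)."""
    time.sleep(0.05)  # simulate network + processing latency
    return f"{task_name}: done"

tasks = ["code-review", "architecture", "security", "multi-perspective"]

# Submit all 4 requests at once (concurrent submission)...
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(call_api, t) for t in tasks]
    # ...then the main thread BLOCKS until all 4 responses arrive.
    results = [f.result() for f in futures]

print(results)
```

The four submissions go out together, but nothing guarantees the server processes them simultaneously — that decision happens on the API side.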

---

## Rate Limits and Concurrency

### Your API Account Limits

Anthropic enforces **per-minute limits** (example values):

```
Requests Per Minute (RPM):  500 max
Tokens Per Minute (TPM):    100,000 max
Concurrent Requests:        20 max
```

### What Happens When You Launch 4 Concurrent Agents

```
Scenario 1: Off-Peak, Plenty of Quota
├─ All 4 requests accepted immediately
├─ All process roughly in parallel (within API limits)
├─ Combined result: ~20-30% time savings
└─ Token usage: standard rate

Scenario 2: Near Rate Limit
├─ Request 1: Accepted (499/500 RPM used)
├─ Request 2: Accepted (500/500 RPM used)
├─ Request 3: Queued (RPM limit hit)
├─ Request 4: Queued (RPM limit hit)
├─ Requests 3-4 wait for the next minute window
└─ Result: effectively sequential execution, same speed as a single agent

Scenario 3: Token Limit Hit
├─ Request 1: ~25,000 tokens
├─ Request 2: ~25,000 tokens
├─ Request 3: REJECTED (would exceed TPM)
├─ Request 4: REJECTED (would exceed TPM)
└─ Result: task fails, agents don't run
```
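
When requests do get rejected for rate limiting, the standard client-side answer is exponential backoff. A minimal sketch — the `RateLimitError` class and `flaky` stub are illustrative, not a real client:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response."""

def send_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a rate-limited request with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return send_request()
        except RateLimitError:
            # Back off 1x, 2x, 4x... the base delay, plus jitter.
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
    raise RuntimeError("still rate-limited after retries")

# Usage with a stub that is rejected twice, then succeeds
# (a tiny base_delay keeps the demo fast; use seconds in practice):
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

result = send_with_backoff(flaky, base_delay=0.01)
print(result)
```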

### Cost Implications

```
Running 4 concurrent agents always costs:
- Agent 1: ~15-18K tokens
- Agent 2: ~15-18K tokens
- Agent 3: ~15-18K tokens
- Agent 4: ~12-15K tokens
Total: ~57-69K tokens

Whether they run in parallel or queue sequentially,
the TOKEN COST is the same (you pay for the analysis).
The TIME COST varies (may be slower if queued).
```

---

## The Illusion of Parallelism

### What Marketing Says

> "4 agents run in parallel"

### What Actually Happens

```
Timeline for 4 Concurrent Agents (Best Case - Off-Peak)

Time     Agent 1        Agent 2        Agent 3        Agent 4
────────────────────────────────────────────────────────────────
0ms      Start          Start          Start          Start
100ms    Processing...  Processing...  Processing...  Processing...
500ms    Processing...  Processing...  Processing...  Processing...
1000ms   Processing...  Processing...  Processing...  Processing...
1500ms   Processing...  Processing...  Processing...  Processing...
2000ms   Processing...  Processing...  Processing...  Processing...
2500ms   DONE ✓         DONE ✓         DONE ✓         DONE ✓

Result Time: ~2500ms (all done roughly together)
Total work done: 4 × 2500ms = 10,000ms
Sequential would be: ~10,000ms wall time
Wall-clock speedup: ~4x for this stage — but total work and
token cost are unchanged, and this best case is rare
```

### Reality: API Queuing

```
Timeline for 4 Concurrent Agents (Realistic - Some Queuing)

Time     Agent 1        Agent 2        Agent 3        Agent 4
────────────────────────────────────────────────────────────────
0ms      Start          Start          Queue...       Queue...
100ms    Processing...  Processing...  Queue...       Queue...
500ms    Processing...  Processing...  Queue...       Queue...
1000ms   DONE ✓         Processing...  Queue...       Queue...
1500ms   (free)         Processing...  Start          Queue...
2000ms   (free)         DONE ✓         Processing...  Start
2500ms   (free)         (free)         Processing...  Processing...
3000ms   (free)         (free)         DONE ✓         Processing...
3500ms   (free)         (free)         (free)         DONE ✓

Result Time: ~3500ms (close to a sequential run)
Speedup: ~0% — can even be slower than a single-agent analysis
```

---

## Why This Matters for Your Design

### Token Budget Impact

```
Your Monthly Token Budget: 5,000,000 tokens

Single Agent Review:      ~35,000 tokens
  Can do: ~142 reviews per month

Concurrent Agents Review: ~68,000 tokens
  Can do: ~73 reviews per month

Cost multiplier: ~2x
```
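
The arithmetic above is easy to verify; a quick sketch using the document's token estimates:

```python
MONTHLY_BUDGET = 5_000_000  # tokens
SINGLE_AGENT = 35_000       # tokens per single-agent review (estimate)
CONCURRENT = 68_000         # tokens per 4-agent concurrent review (estimate)

single_reviews = MONTHLY_BUDGET // SINGLE_AGENT    # reviews/month, single agent
concurrent_reviews = MONTHLY_BUDGET // CONCURRENT  # reviews/month, concurrent
multiplier = CONCURRENT / SINGLE_AGENT             # cost multiplier

print(single_reviews, concurrent_reviews, round(multiplier, 2))  # → 142 73 1.94
```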

### Decision Matrix

| Situation | Use This | Use Single Agent | Why |
|-----------|----------|------------------|-----|
| Off-peak hours | ✓ | - | Concurrency works |
| Peak hours | - | ✓ | Queuing makes it slow |
| Cost sensitive | - | ✓ | 2x cost is significant |
| One file change | - | ✓ | Overkill |
| Release review | ✓ | - | Worth the cost |
| Multiple perspectives needed | ✓ | - | Value in specialization |
| Emergency fix | - | ✓ | Speed gain is unreliable |
| Enterprise quality | ✓ | - | Multi-expert review valuable |

---

## API Rate Limit Scenarios

### Scenario 1: Hitting RPM Limit

```
Your account: 500 RPM limit

4 concurrent agents @ ~100 requests each:
- Agent 1: Success (100/500 used)
- Agent 2: Success (200/500 used)
- Agent 3: Success (300/500 used)
- Agent 4: Success (400/500 used)

In the same minute, another ~100-request job:
- First 100 requests fill the quota (500/500 used)
- Anything beyond that: REJECTED
- Error: "Rate limit exceeded"
```

### Scenario 2: Hitting TPM Limit

```
Your account: 100,000 TPM limit

4 concurrent agents:
- Agent 1: ~25,000 tokens (25K/100K used)
- Agent 2: ~25,000 tokens (50K/100K used)
- Agent 3: ~25,000 tokens (75K/100K used)
- Agent 4: ~20,000 tokens (95K/100K used)

Agent 4 completes, you start another review:
- Next analysis needs ~25,000 tokens
- Available: 5,000 tokens
- REJECTED: exceeds TPM limit
- Wait until: next minute window
```

### Scenario 3: Concurrent Request Limit

```
Your account: 20 concurrent requests allowed

4 concurrent agents:
- Agents 1-4: OK (4/20 in flight)

Someone else on your account launches 17 more agents:
- Agents 5-20: OK (20/20 in flight)
- Agent 21: "Concurrency limit exceeded" ← LIMIT EXCEEDED
- Execution: queued or failed
```
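
To avoid tripping the concurrency limit at all, cap in-flight requests on the client side. A minimal sketch with a bounded semaphore — `call_api` is a stub, and the cap value is illustrative (pick one well under your account's limit):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4  # stay well under the account's concurrent-request limit
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def call_api(i: int) -> int:
    """Stand-in for an HTTP request (illustrative only)."""
    return i * 2

def guarded_call(i: int) -> int:
    # Block until a slot frees up, so at most MAX_IN_FLIGHT requests
    # are ever in flight, no matter how many workers exist.
    with slots:
        return call_api(i)

# Even with 21 workers, only 4 requests run at once.
with ThreadPoolExecutor(max_workers=21) as pool:
    results = list(pool.map(guarded_call, range(21)))

print(results[:5])
```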

---

## Understanding "Concurrent Submission"

### What It Looks Like in Code

```python
# Master Orchestrator (pseudo-code)
def run_concurrent_agents():
    # Submit all 4 agents at once (concurrent submission)
    results = launch_all_agents([
        Agent.code_review(context),
        Agent.architecture(context),
        Agent.security(context),
        Agent.multi_perspective(context),
    ])
    # Block until all 4 complete
    return wait_for_all(results)
```
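
A runnable equivalent of the pseudo-code, using `asyncio.gather`; the `agent` coroutine is a stub, not the real Task call:

```python
import asyncio

async def agent(name: str, context: str) -> str:
    """Stub agent: a real implementation would make the API call here."""
    await asyncio.sleep(0.05)  # simulate request latency
    return f"{name} analysis of {context}"

async def run_concurrent_agents(context: str) -> list[str]:
    # gather() submits all four coroutines at once, then blocks this
    # coroutine until every result is back — concurrent submission,
    # blocking wait.
    return await asyncio.gather(
        agent("code-review", context),
        agent("architecture", context),
        agent("security", context),
        agent("multi-perspective", context),
    )

results = asyncio.run(run_concurrent_agents("release-1.2"))
print(results)
```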

### What Actually Happens at API Level

```
1. Prepare 4 HTTP requests
2. Send all 4 requests to the API at once (concurrent submission)
3. API receives all 4 requests
4. API checks rate limits (RPM, TPM, concurrent limit)
5. API queues them as capacity allows
6. Requests are processed from the queue (could be parallel, could be sequential)
7. Results return as they complete
8. Your code waits for all 4 results (blocking)
9. Continue when all 4 are done
```

### The Key Distinction

```
CONCURRENT SUBMISSION (what we do):
├─ 4 requests submitted at the same time
├─ But the API decides how to process them
└─ Could be parallel, could be sequential

TRUE PARALLEL (not what we do):
├─ 4 requests execute on 4 different processors
├─ Guaranteed simultaneous execution
└─ No queueing, no waiting
```

---

## Why We're Not Parallel

### Hardware Reality

```
Your Computer:
├─ CPU: 1-16 cores (for you)
└─ But HTTP requests go to Anthropic's servers

Anthropic's Servers:
├─ Thousands of cores
├─ Processing requests from thousands of customers
├─ Your 4 requests share infrastructure with 10,000+ others
└─ They decide how to allocate resources
```

### Request Processing

```
Your Request ──HTTP──> Anthropic API ──> GPU Cluster
                                            │
                               (thousands of queries
                                being processed)
                                            │
                               Your request waits its turn
                                            │
                               When capacity is available: process
                                            │
Return response ──HTTP──> Your Code
```

---

## Actual Performance Gains

### Best Case (Off-Peak)

```
Stages 2-5 Duration:
- Sequential: 28-45 minutes
- Concurrent: 18-20 minutes
- Gain: ~40%

But this requires:
- Little other traffic against your API quota
- No rate limiting
- Sufficient TPM budget
- Rare in production
```

### Realistic Case (Normal Load)

```
Stages 2-5 Duration:
- Sequential: 28-45 minutes
- Concurrent: 24-35 minutes
- Gain: ~20-30%

With typical:
- Some API load
- No hard rate-limit hits
- Normal usage patterns
```

### Worst Case (Peak Load)

```
Stages 2-5 Duration:
- Sequential: 28-45 minutes
- Concurrent: 32-48 minutes
- Gain: negative (slower)

When:
- High API load
- Rate limiting active
- High token usage
- Results in queueing
```

---

## Calculating Your Expected Speedup

```
Model (assumption: best-case saving ≈ 40% of base time):

Expected Time = Base Time × (1 − Efficiency × 0.40)

where Efficiency = fraction of the run in which agents actually
process in parallel (rather than queue).

If agents run in parallel 80% of the time:
- Expected Time = 37 min × (1 − 0.8 × 0.40) = 37 × 0.68 ≈ 25 min
- Savings: ~12 minutes (~32% faster)

If agents run in parallel only 20% of the time (high load):
- Expected Time = 37 min × (1 − 0.2 × 0.40) = 37 × 0.92 ≈ 34 min
- Savings: ~3 minutes (almost no speedup)
```
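
As a quick sketch of this model — the 0.40 best-case saving is this document's estimate, not a measured constant:

```python
def expected_time(base_min: float, efficiency: float,
                  best_case_saving: float = 0.40) -> float:
    """Estimated wall-clock minutes for a concurrent run.

    efficiency: fraction of the run spent actually parallel (0..1)
    best_case_saving: fraction saved under perfect parallelism
    """
    return base_min * (1 - efficiency * best_case_saving)

print(round(expected_time(37, 0.8), 1))  # → 25.2 (high concurrency)
print(round(expected_time(37, 0.2), 1))  # → 34.0 (mostly queued)
```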

---

## Recommendations

### When to Use Concurrent Agents

1. **Off-peak hours** (guaranteed better concurrency)
2. **Well below rate limits** (room for 4 simultaneous requests)
3. **Token budget permits** (2x cost is acceptable)
4. **Quality > Speed** (primary motivation is thorough review)
5. **Enterprise standards** (multiple expert perspectives required)

### When to Avoid

1. **Peak hours** (queueing dominates)
2. **Near rate limits** (risk of failures)
3. **Limited token budget** (2x cost is expensive)
4. **Speed is primary** (a 20-30% gain is modest and not guaranteed)
5. **Simple changes** (overkill)

### Monitoring Your API Health

```
Track your usage:
1. RPM: requests per minute
2. TPM: tokens per minute
3. Response times
4. Errors from rate limiting

Good signs for concurrent agents:
- RPM usage < 50% of limit
- TPM usage < 50% of limit
- Response times stable
- No rate limit errors

Bad signs:
- Frequent rate limit errors
- Response times > 2 seconds
- TPM usage > 70% of limit
- RPM usage > 60% of limit
```
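
Rate-limit headroom can be read from the `anthropic-ratelimit-*` response headers Anthropic documents; verify the exact names against the current API reference before relying on them. A sketch applying the 50% rule above to a hypothetical headers dict:

```python
def headroom(headers: dict) -> dict:
    """Fraction of RPM/TPM quota still available, from response headers.

    Header names follow Anthropic's documented rate-limit headers;
    confirm against the current API reference (an assumption here).
    """
    out = {}
    for kind in ("requests", "tokens"):
        limit = int(headers[f"anthropic-ratelimit-{kind}-limit"])
        remaining = int(headers[f"anthropic-ratelimit-{kind}-remaining"])
        out[kind] = remaining / limit
    return out

# Example with headers from a (hypothetical) previous response:
h = headroom({
    "anthropic-ratelimit-requests-limit": "500",
    "anthropic-ratelimit-requests-remaining": "400",
    "anthropic-ratelimit-tokens-limit": "100000",
    "anthropic-ratelimit-tokens-remaining": "30000",
})
# Per the guidance above: launch concurrent agents only with >50% headroom.
safe = all(v > 0.5 for v in h.values())
print(h, safe)
```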

---

## Summary

The Master Orchestrator **submits 4 requests concurrently**, but:

- ✗ NOT true parallel (depends on API queue)
- ✓ Provides context isolation (each agent gets a clean context)
- ✓ Offers multi-perspective analysis (specialization benefits)
- ⚠ Costs ~2x tokens (regardless of execution model)
- ⚠ Speedup is typically 20-30%; the ~40% best case is rare
- ⚠ Can degrade to sequential during high load

**Use when**: Quality and multiple perspectives matter more than cost/speed.
**Avoid when**: Cost or speed is the primary concern.
See [REALITY.md](REALITY.md) for honest assessment and [TOKEN-USAGE.md](TOKEN-USAGE.md) for detailed cost analysis.