
Technical Architecture: Concurrent vs. Parallel Execution

Version: 1.0.0
Date: 2025-10-31
Audience: Technical decision-makers, engineers


Quick Definition

Term         What It Is                                                Our Use
────────────────────────────────────────────────────────────────────────────────────
Parallel     Multiple processes on different CPUs simultaneously       NOT what we do
Concurrent   Multiple requests submitted at once, processed in a queue What we actually do
Sequential   One after another, waiting for each to complete           Single-agent mode

What the Task Tool Actually Does

When You Call Task()

Your Code (Main Thread)
│
├─ Create Task 1 payload
├─ Create Task 2 payload
├─ Create Task 3 payload
└─ Create Task 4 payload
│
└─ Submit all 4 HTTP requests to Anthropic API simultaneously
   (This is "concurrent submission")

At Anthropic's API Level

HTTP Requests Arrive at API
│
└─ Rate Limit Check
   ├─ RPM (Requests Per Minute): X available
   ├─ TPM (Tokens Per Minute): Y available
   └─ Concurrent Request Count: Z allowed
│
└─ Queue Processing
   ├─ Request 1: Processing...
   ├─ Request 2: Waiting (might queue if limit hit)
   ├─ Request 3: Waiting (might queue if limit hit)
   └─ Request 4: Waiting (might queue if limit hit)
│
└─ Results Returned (in any order)
   ├─ Response 1: Ready
   ├─ Response 2: Ready
   ├─ Response 3: Ready
   └─ Response 4: Ready
│
└─ Your Code (Main Thread BLOCKS)
   └─ Waits for all 4 responses before continuing
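A minimal asyncio sketch of this submit-all-then-block pattern (call_agent is a hypothetical stand-in for one agent's API call; the sleep simulates request latency):

import asyncio

async def call_agent(name: str, context: str) -> str:
    # Hypothetical stand-in for one HTTP request to the API.
    await asyncio.sleep(0.1)  # simulated network/processing time
    return f"{name}: analysis complete"

async def review(context: str) -> list[str]:
    # Concurrent submission: all 4 requests go out at once, but the API
    # decides how much they actually overlap. gather() blocks until
    # every response has arrived, just like the main thread above.
    return await asyncio.gather(
        call_agent("code-review", context),
        call_agent("architecture", context),
        call_agent("security", context),
        call_agent("multi-perspective", context),
    )

results = asyncio.run(review("diff for release v1.2"))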

Rate Limits and Concurrency

Your API Account Limits

Anthropic enforces per-minute limits (example values):

Requests Per Minute (RPM):     500 max
Tokens Per Minute (TPM):    100,000 max
Concurrent Requests:            20 max
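A client-side guard can keep a burst of agents inside these limits. A minimal sketch, assuming the example numbers above (send_request is a hypothetical stand-in for the real HTTP call):

import asyncio

MAX_CONCURRENT = 20  # example value; your account tier may differ
request_slots = asyncio.Semaphore(MAX_CONCURRENT)

async def send_request(payload: dict) -> dict:
    # Stand-in for the real API call.
    await asyncio.sleep(0.1)
    return {"ok": True}

async def submit_with_guard(payload: dict) -> dict:
    # Cap in-flight requests so a burst of agents cannot exceed the
    # account's concurrent-request allowance on its own.
    async with request_slots:
        return await send_request(payload)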

What Happens When You Launch 4 Concurrent Agents

Scenario 1: Off-Peak, Plenty of Quota
├─ All 4 requests accepted immediately
├─ All process somewhat in parallel (within API limits)
├─ Combined result: ~20-30% time savings
└─ Token usage: Standard rate

Scenario 2: Near Rate Limit
├─ Request 1: Accepted (499/500 RPM used)
├─ Request 2: Accepted (500/500 RPM used)
├─ Request 3: Queued (RPM limit hit)
├─ Request 4: Queued (RPM limit hit)
├─ Requests 3-4 wait for the next minute window
└─ Result: Sequential execution, same speed as a single agent

Scenario 3: Token Limit Hit
├─ Request 1: ~25,000 tokens
├─ Request 2: ~25,000 tokens
├─ Request 3: REJECTED (combined with other traffic in the same minute, would exceed TPM)
├─ Request 4: REJECTED (would exceed TPM)
└─ Result: Task fails, those agents don't run

Cost Implications

Running 4 concurrent agents always costs:
- Agent 1: ~15-18K tokens
- Agent 2: ~15-18K tokens
- Agent 3: ~15-18K tokens
- Agent 4: ~12-15K tokens
Total: ~57-69K tokens

Whether they run in parallel or queue sequentially, the TOKEN COST
is the same (you pay for the analysis either way).
Only the TIME COST varies (it may be slower if queued).
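The arithmetic, as a quick sketch using the per-agent ranges above:

# Token cost of one concurrent review (ranges from the list above).
agents = [(15_000, 18_000), (15_000, 18_000), (15_000, 18_000), (12_000, 15_000)]
low = sum(lo for lo, _ in agents)    # 57,000
high = sum(hi for _, hi in agents)   # 69,000
print(f"~{low:,}-{high:,} tokens per review, queued or not")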

The Illusion of Parallelism

What Marketing Says

"4 agents run in parallel"

What Actually Happens

Timeline for 4 Concurrent Agents (Best Case - Off-Peak)

Time    Agent 1         Agent 2         Agent 3         Agent 4
────────────────────────────────────────────────────────────────
0ms     Start           Start           Start           Start
100ms   Processing...   Processing...   Processing...   Processing...
500ms   Processing...   Processing...   Processing...   Processing...
1000ms  Processing...   Processing...   Processing...   Processing...
1500ms  Processing...   Processing...   Processing...   Processing...
2000ms  Processing...   Processing...   Processing...   Processing...
2500ms  DONE ✓          DONE ✓          DONE ✓          DONE ✓

Wall time: ~2500ms (all finish roughly together)
Total work done: 4 × 2500ms = 10,000ms
Running the same 4 agents back-to-back: ~10,000ms wall time
In this idealized case wall time improves, but total compute (and
token cost) is identical: you still pay for all four analyses

Reality: API Queuing

Timeline for 4 Concurrent Agents (Realistic - Some Queuing)

Time    Agent 1         Agent 2         Agent 3         Agent 4
────────────────────────────────────────────────────────────────
0ms     Start           Start           Queue...        Queue...
100ms   Processing...   Processing...   Queue...        Queue...
500ms   Processing...   Processing...   Queue...        Queue...
1000ms  DONE ✓          Processing...   Queue...        Queue...
1500ms  (free)          Processing...   Start           Queue...
2000ms  (free)          DONE ✓          Processing...   Start
2500ms  (free)          (free)          Processing...   Processing...
3000ms  (free)          (free)          DONE ✓          Processing...
3500ms  (free)          (free)          (free)          DONE ✓

Result Time: ~3500ms (close to sequential behavior)
Speedup vs. running the agents back-to-back: marginal; and slower
than a single-agent review would have been

Why This Matters for Your Design

Token Budget Impact

Your Monthly Token Budget: 5,000,000 tokens

Single Agent Review: 35,000 tokens
Can do: 142 reviews per month

Concurrent Agents Review: 68,000 tokens
Can do: 73 reviews per month

Cost multiplier: 2x
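The same math as a small function you can adapt to your own budget:

def reviews_per_month(token_budget: int, tokens_per_review: int) -> int:
    # Integer division: how many full reviews fit in the budget.
    return token_budget // tokens_per_review

print(reviews_per_month(5_000_000, 35_000))  # 142 single-agent reviews
print(reviews_per_month(5_000_000, 68_000))  # 73 concurrent reviews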

Decision Matrix

Situation                      Concurrent  Single Agent  Why
────────────────────────────────────────────────────────────────────────────
Off-peak hours                 ✓           -             Concurrency works
Peak hours                     -           ✓             Queuing makes it slow
Cost sensitive                 -           ✓             2x cost is significant
One file change                -           ✓             Overkill
Release review                 ✓           -             Worth the cost
Multiple perspectives needed   ✓           -             Value in specialization
Emergency fix                  -           ✓             Speed doesn't help
Enterprise quality             ✓           -             Multi-expert review valuable

API Rate Limit Scenarios

Scenario 1: Hitting RPM Limit

Your account: 500 RPM limit

4 concurrent agents, ~100 requests each (multi-turn work):
- Agent 1: Success (100/500 used)
- Agent 2: Success (200/500 used)
- Agent 3: Success (300/500 used)
- Agent 4: Success (400/500 used)

In the same minute, another task of similar size:
- REJECTED once usage reaches 500/500
- Error: "Rate limit exceeded"

Scenario 2: Hitting TPM Limit

Your account: 100,000 TPM limit

4 concurrent agents:
- Agent 1: ~25,000 tokens (25K/100K used)
- Agent 2: ~25,000 tokens (50K/100K used)
- Agent 3: ~25,000 tokens (75K/100K used)
- Agent 4: ~20,000 tokens (95K/100K used)

Agent 4 completes, you do another review:
- Next analysis needs ~25,000 tokens
- Available: 5,000 tokens
- REJECTED: Exceeds TPM limit
- Wait until: Next minute window

Scenario 3: Concurrent Request Limit

Your account: 20 concurrent requests allowed

4 concurrent agents:
- Agents 1-4: OK (4/20 quota)

Someone else on your account launches 17 more agents:
- Agents 5-20: OK (20/20 quota)
- Agent 21: REJECTED ← LIMIT EXCEEDED
- That agent gets: "Concurrency limit exceeded"
- Execution: Queued or failed

Understanding "Concurrent Submission"

What It Looks Like in Code

# Master Orchestrator (illustrative sketch)
from concurrent.futures import ThreadPoolExecutor

AGENTS = ["code-review", "architecture", "security", "multi-perspective"]

def run_agent(name: str, context: str) -> str:
    # Stand-in for whatever actually invokes the Task tool / API
    # for one agent; replace with the real call.
    return f"{name}: analysis of {context!r}"

def run_concurrent_agents(context: str) -> list[str]:
    # Submit all 4 agents at once (concurrent submission)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_agent, name, context) for name in AGENTS]
        # Block until all 4 complete (main thread waits here)
        return [f.result() for f in futures]

What Actually Happens at API Level

1. Prepare 4 HTTP requests
2. Send all 4 requests to the API at once (concurrent submission)
3. API receives all 4 requests
4. API checks rate limits (RPM, TPM, concurrent-request limit)
5. API queues them as capacity becomes available
6. Requests are processed from the queue (could be parallel, could be sequential)
7. Results return as they complete
8. Your code waits for all 4 results (blocking)
9. Execution continues when all 4 are done

The Key Distinction

CONCURRENT SUBMISSION (What we do):
├─ 4 requests submitted at same time
├─ But API decides how to process them
└─ Could be parallel, could be sequential

TRUE PARALLEL (Not what we do):
├─ 4 requests execute on 4 different processors
├─ Guaranteed simultaneous execution
└─ No queueing, no waiting

Why We're Not Parallel

Hardware Reality

Your Computer:
├─ CPU: 1-16 cores (for you)
└─ But HTTP requests go to Anthropic's servers

Anthropic's Servers:
├─ Thousands of cores
├─ Processing requests from thousands of customers
├─ Your 4 requests share infrastructure with 10,000+ others
└─ They decide how to allocate resources

Request Processing

Your Request ──HTTP──> Anthropic API ──> GPU Cluster
                                            │
                                    (Thousands of queries
                                     being processed)
                                            │
                        Your request waits its turn
                                            │
                        When available: Process
                                            │
                        Return response ──HTTP──> Your Code

Actual Performance Gains

Best Case (Off-Peak)

Stages 2-5 Duration:
- Sequential:     28-45 minutes
- Concurrent:     18-20 minutes
- Gain:           ~40%

But this requires:
- No other users on API
- No rate limiting
- Sufficient TPM budget
- Rare in production

Realistic Case (Normal Load)

Stages 2-5 Duration:
- Sequential:     28-45 minutes
- Concurrent:     24-35 minutes
- Gain:           ~20-30%

With typical:
- Some API load
- No rate limiting hits
- Normal usage patterns

Worst Case (Peak Load)

Stages 2-5 Duration:
- Sequential:     28-45 minutes
- Concurrent:     32-48 minutes
- Gain:           Negative (slower)

When:
- High API load
- Rate limiting active
- High token usage
- Results in queueing

Calculating Your Expected Speedup

A simple model:

Time Saved = Base Time × S_max × E
Expected Time = Base Time − Time Saved

Where:
- S_max = maximum fractional savings if the agents ran fully in
  parallel (~25% for Stages 2-5 of this pipeline; illustrative)
- E = concurrency efficiency: the fraction of the run during which
  the API actually processes your requests in parallel

If agents run in parallel 80% of the time (E = 0.8):
- Time Saved = 37 min × 0.25 × 0.8 ≈ 7.4 min
- Expected Time: 37 − 7.4 ≈ 29.6 minutes (~20% faster)

If agents run in parallel 20% of the time (E = 0.2, high load):
- Time Saved = 37 min × 0.25 × 0.2 ≈ 1.9 min
- Expected Time: ~35 minutes (almost no speedup)
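The same model as a minimal Python sketch (the 25% ceiling is the illustrative figure above, not a measured constant):

def expected_time(base_min: float, efficiency: float, s_max: float = 0.25) -> float:
    # base_min:   sequential duration in minutes
    # efficiency: fraction of the run actually processed in parallel (0-1)
    # s_max:      maximum fractional savings under perfect concurrency
    return base_min * (1 - s_max * efficiency)

print(expected_time(37, 0.8))  # ~29.6 min (off-peak)
print(expected_time(37, 0.2))  # ~35.2 min (high load)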

Recommendations

When to Use Concurrent Agents

  1. Off-peak hours (guaranteed better concurrency)
  2. Well below rate limits (room for 4 simultaneous requests)
  3. Token budget permits (2x cost is acceptable)
  4. Quality > Speed (primary motivation is thorough review)
  5. Enterprise standards (multiple expert perspectives required)

When to Avoid

  1. Peak hours (queueing dominates)
  2. Near rate limits (risk of failures)
  3. Limited token budget (2x cost is expensive)
  4. Speed is primary (20-30% is not meaningful)
  5. Simple changes (overkill)

Monitoring Your API Health

# Track your usage:
1. Monitor RPM (requests per minute)
2. Monitor TPM (tokens per minute)
3. Monitor response times
4. Track rate-limit errors

# Good signs for concurrent agents:
- RPM usage < 50% of limit
- TPM usage < 50% of limit
- Response times stable
- No rate limit errors

# Bad signs:
- Frequent rate limit errors
- Response times > 2 seconds
- TPM usage > 70% of limit
- RPM usage > 60% of limit
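A sketch of that go/no-go check, assuming you already collect these metrics yourself (the thresholds are the ones listed above):

def concurrent_agents_ok(rpm_used: int, rpm_limit: int,
                         tpm_used: int, tpm_limit: int,
                         recent_rate_limit_errors: int) -> bool:
    # Mirrors the "good signs" checklist: ample RPM/TPM headroom
    # and no recent rate-limit errors.
    return (rpm_used < 0.5 * rpm_limit
            and tpm_used < 0.5 * tpm_limit
            and recent_rate_limit_errors == 0)

print(concurrent_agents_ok(120, 500, 30_000, 100_000, 0))  # True: safe to launch 4 agents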

Summary

The Master Orchestrator submits 4 requests concurrently, but:

  • ✗ NOT true parallel (depends on API queue)
  • ✓ Provides context isolation (each agent clean context)
  • ✓ Offers multi-perspective analysis (specialization benefits)
  • ⚠ Costs 2x tokens (regardless of execution model)
  • ⚠ Speedup is typically 20-30%, not 40-50%
  • ⚠ Can degrade to sequential during high load

Use when: Quality and multiple perspectives matter more than cost/speed. Avoid when: Cost or speed is the primary concern.

See REALITY.md for honest assessment and TOKEN-USAGE.md for detailed cost analysis.