
# Technical Architecture: Concurrent vs. Parallel Execution
**Version:** 1.0.0
**Date:** 2025-10-31
**Audience:** Technical decision-makers, engineers
---
## Quick Definition
| Term | What It Is | Our Use |
|------|-----------|---------|
| **Parallel** | Multiple processes on different CPUs simultaneously | NOT what we do |
| **Concurrent** | Multiple requests submitted at once, processed in queue | What we actually do |
| **Sequential** | One after another, waiting for each to complete | Single-agent mode |
---
## What the Task Tool Actually Does
### When You Call Task()
```
Your Code (Main Thread)
├─ Create Task 1 payload
├─ Create Task 2 payload
├─ Create Task 3 payload
├─ Create Task 4 payload
└─ Submit all 4 HTTP requests to the Anthropic API at once
   (this is "concurrent submission")
```
### At Anthropic's API Level
```
HTTP Requests Arrive at API
├─ Rate Limit Check
│  ├─ RPM (Requests Per Minute): X available
│  ├─ TPM (Tokens Per Minute): Y available
│  └─ Concurrent Request Count: Z allowed
├─ Queue Processing
│  ├─ Request 1: Processing...
│  ├─ Request 2: Waiting (might queue if limit hit)
│  ├─ Request 3: Waiting (might queue if limit hit)
│  └─ Request 4: Waiting (might queue if limit hit)
├─ Results Returned (in any order)
│  ├─ Response 1: Ready
│  ├─ Response 2: Ready
│  ├─ Response 3: Ready
│  └─ Response 4: Ready
└─ Your Code (Main Thread BLOCKS)
   └─ Waits for all 4 responses before continuing
```
---
## Rate Limits and Concurrency
### Your API Account Limits
Anthropic enforces **per-minute limits** (example values):
```
Requests Per Minute (RPM): 500 max
Tokens Per Minute (TPM): 100,000 max
Concurrent Requests: 20 max
```
### What Happens When You Launch 4 Concurrent Agents
```
Scenario 1: Off-Peak, Plenty of Quota
├─ All 4 requests accepted immediately
├─ All process somewhat in parallel (within API limits)
├─ Combined result: ~20-30% time savings
└─ Token usage: Standard rate

Scenario 2: Near Rate Limit
├─ Request 1: Accepted (480/500 RPM remaining)
├─ Request 2: Accepted (460/500 RPM remaining)
├─ Request 3: Queued (hit RPM limit)
├─ Request 4: Queued (hit RPM limit)
├─ Requests 3-4 wait for next minute window
└─ Result: Sequential execution, same speed as single agent

Scenario 3: Token Limit Hit
├─ Request 1: ~25,000 tokens
├─ Request 2: ~25,000 tokens
├─ Request 3: REJECTED (would exceed TPM)
├─ Request 4: REJECTED (would exceed TPM)
└─ Result: Task fails, agents don't run
```
### Cost Implications
```
Running 4 concurrent agents always costs:
- Agent 1: ~15-18K tokens
- Agent 2: ~15-18K tokens
- Agent 3: ~15-18K tokens
- Agent 4: ~12-15K tokens
Total: ~57-69K tokens
Regardless of whether they run parallel or queue sequentially,
the TOKEN COST is the same (you pay for the analysis)
The TIME COST varies (might be slower if queued)
```
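The total above is just the sum of the per-agent estimates. A quick sanity check in Python (the ranges are this document's estimates, not measured values):

```python
# Token-cost check for a 4-agent concurrent run.
# Per-agent (low, high) ranges are the estimates quoted above.
AGENT_TOKEN_RANGES = [
    (15_000, 18_000),  # Agent 1
    (15_000, 18_000),  # Agent 2
    (15_000, 18_000),  # Agent 3
    (12_000, 15_000),  # Agent 4
]

def total_token_range(ranges):
    """Sum per-agent (low, high) estimates into a total range."""
    low = sum(lo for lo, _ in ranges)
    high = sum(hi for _, hi in ranges)
    return low, high

low, high = total_token_range(AGENT_TOKEN_RANGES)
print(f"Total: {low // 1000}-{high // 1000}K tokens")  # Total: 57-69K tokens
```

The same total is billed whether the agents ran concurrently or queued sequentially.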
---
## The Illusion of Parallelism
### What Marketing Says
> "4 agents run in parallel"
### What Actually Happens
```
Timeline for 4 Concurrent Agents (Best Case - Off-Peak)

Time     Agent 1        Agent 2        Agent 3        Agent 4
────────────────────────────────────────────────────────────────
0ms      Start          Start          Start          Start
100ms    Processing...  Processing...  Processing...  Processing...
500ms    Processing...  Processing...  Processing...  Processing...
1000ms   Processing...  Processing...  Processing...  Processing...
1500ms   Processing...  Processing...  Processing...  Processing...
2000ms   Processing...  Processing...  Processing...  Processing...
2500ms   DONE ✓         DONE ✓         DONE ✓         DONE ✓

Result Time: ~2500ms wall time (all done roughly together)
Total work done: 4 × 2500ms = 10,000ms (same compute, same token cost)
Sequential would be: ~10,000ms wall time
Wall-time gain: large in this idealized case - but the total work,
and therefore the token bill, is unchanged
```
### Reality: API Queuing
```
Timeline for 4 Concurrent Agents (Realistic - Some Queuing)

Time     Agent 1        Agent 2        Agent 3        Agent 4
────────────────────────────────────────────────────────────────
0ms      Start          Start          Queue...       Queue...
100ms    Processing...  Processing...  Queue...       Queue...
500ms    Processing...  Processing...  Queue...       Queue...
1000ms   DONE ✓         Processing...  Queue...       Queue...
1500ms   (free)         Processing...  Start          Queue...
2000ms   (free)         DONE ✓         Processing...  Start
2500ms   (free)         (free)         Processing...  Processing...
3000ms   (free)         (free)         DONE ✓         Processing...
3500ms   (free)         (free)         (free)         DONE ✓

Result Time: ~3500ms wall time (close to sequential)
Speedup: ~0% - queueing overhead can even make this slower than
running the agents one after another
```
---
## Why This Matters for Your Design
### Token Budget Impact
```
Your Monthly Token Budget: 5,000,000 tokens
Single Agent Review: 35,000 tokens
Can do: 142 reviews per month
Concurrent Agents Review: 68,000 tokens
Can do: 73 reviews per month
Cost multiplier: 2x
```
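The budget math above, as a small helper (the budget and per-review figures are the illustrative numbers from this section):

```python
def reviews_per_month(budget_tokens, tokens_per_review):
    """How many reviews a monthly token budget covers (rounded down)."""
    return budget_tokens // tokens_per_review

BUDGET = 5_000_000
single = reviews_per_month(BUDGET, 35_000)      # single-agent review
concurrent = reviews_per_month(BUDGET, 68_000)  # 4-agent concurrent review
print(single, concurrent)  # 142 73
print(round(68_000 / 35_000, 1))  # 1.9 (cost multiplier)
```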
### Decision Matrix
| Situation | Use This | Use Single Agent | Why |
|-----------|----------|------------------|-----|
| Off-peak hours | ✓ | - | Concurrency works |
| Peak hours | - | ✓ | Queuing makes it slow |
| Cost sensitive | - | ✓ | 2x cost is significant |
| One file change | - | ✓ | Overkill |
| Release review | ✓ | - | Worth the cost |
| Multiple perspectives needed | ✓ | - | Value in specialization |
| Emergency fix | - | ✓ | Speed doesn't help |
| Enterprise quality | ✓ | - | Multi-expert review valuable |
---
## API Rate Limit Scenarios
### Scenario 1: Hitting RPM Limit
```
Your account: 500 RPM limit
4 concurrent agents, each consuming ~100 requests over the run:
- Request 1: Success (100/500)
- Request 2: Success (200/500)
- Request 3: Success (300/500)
- Request 4: Success (400/500)
In same minute, if user makes another request:
- Request 5: REJECTED (500/500 limit hit)
- Error: "Rate limit exceeded"
```
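A common client-side mitigation for this scenario is exponential backoff on rate-limit errors. A generic sketch - `RateLimitError` here is a stand-in, not a specific SDK's exception class:

```python
import time

class RateLimitError(Exception):
    """Stand-in for an API client's rate-limit (HTTP 429) exception."""

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

With the defaults, a request that keeps hitting the limit waits 1s, 2s, 4s, and 8s before giving up - usually long enough to reach the next minute window.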
### Scenario 2: Hitting TPM Limit
```
Your account: 100,000 TPM limit
4 concurrent agents:
- Agent 1: ~25,000 tokens (25K/100K used)
- Agent 2: ~25,000 tokens (50K/100K used)
- Agent 3: ~25,000 tokens (75K/100K used)
- Agent 4: ~20,000 tokens (95K/100K used)
Agent 4 completes, you do another review:
- Next analysis needs ~25,000 tokens
- Available: 5,000 tokens
- REJECTED: Exceeds TPM limit
- Wait until: Next minute window
```
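A client-side guard can avoid submitting a request that would blow the TPM budget. A minimal sketch, assuming you track the tokens used in the current minute window yourself:

```python
def fits_tpm_budget(estimated_tokens, used_this_minute, tpm_limit=100_000):
    """True if a request's estimated tokens fit the remaining per-minute budget."""
    return used_this_minute + estimated_tokens <= tpm_limit

# Four agents at ~25K/25K/25K/20K tokens have used 95K of a 100K TPM limit,
# so a fifth ~25K-token analysis should wait for the next minute window.
print(fits_tpm_budget(25_000, 95_000))  # False
print(fits_tpm_budget(5_000, 95_000))   # True
```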
### Scenario 3: Concurrent Request Limit
```
Your account: 20 concurrent requests allowed
4 concurrent agents:
- Agents 1-4: OK (4/20 quota)
Someone else on your account launches 17 more agents:
- Agents 5-20: OK (20/20 quota)
- Agent 21: "Concurrency limit exceeded" ← LIMIT EXCEEDED
- Execution: Queued or failed
```
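To stay under an account-wide concurrent-request cap, all callers can share a semaphore so that no more than 20 requests are ever in flight. A sketch using Python threads:

```python
import threading

MAX_CONCURRENT = 20  # the account-wide cap from the scenario above
api_slots = threading.BoundedSemaphore(MAX_CONCURRENT)

def run_agent(agent_call):
    """Run one agent request, blocking while 20 requests are already in flight."""
    with api_slots:
        return agent_call()

print(run_agent(lambda: "analysis done"))  # analysis done
```

The 21st caller simply blocks until a slot frees up, instead of receiving a "Concurrency limit exceeded" error from the API.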
---
## Understanding "Concurrent Submission"
### What It Looks Like in Code
```python
# Master Orchestrator (pseudo-code)
def run_concurrent_agents():
    # Submit all 4 agents at once (concurrent submission)
    results = launch_all_agents([
        Agent.code_review(context),
        Agent.architecture(context),
        Agent.security(context),
        Agent.multi_perspective(context),
    ])
    # Block until all 4 complete
    return wait_for_all(results)
```
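The same pattern in plain, runnable Python using `concurrent.futures` - here `call_agent` is a placeholder for whatever actually issues the HTTP request, not a real SDK call:

```python
from concurrent.futures import ThreadPoolExecutor

def call_agent(name, context):
    """Placeholder for one agent's API request (network-bound, so threads suffice)."""
    return f"{name} analysis of {context}"

def run_concurrent_agents(context):
    agents = ["code_review", "architecture", "security", "multi_perspective"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submit all 4 requests at once (concurrent submission)...
        futures = [pool.submit(call_agent, name, context) for name in agents]
        # ...then block until all 4 responses are back.
        return [f.result() for f in futures]

results = run_concurrent_agents("release-1.2")
print(len(results))  # 4
```

Note that the threads only make submission concurrent; whether the four requests actually execute in parallel is decided server-side.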
### What Actually Happens at API Level
```
1. Prepare 4 HTTP requests
2. Send all 4 requests to the API at once (concurrent submission)
3. API receives all 4 requests
4. API checks rate limits (RPM, TPM, concurrent limit)
5. API queues them in order available
6. Process requests from queue (could be parallel, could be sequential)
7. Return results as they complete
8. Your code waits for all 4 results (blocking)
9. Continue when all 4 are done
```
### The Key Distinction
```
CONCURRENT SUBMISSION (What we do):
├─ 4 requests submitted at same time
├─ But API decides how to process them
└─ Could be parallel, could be sequential
TRUE PARALLEL (Not what we do):
├─ 4 requests execute on 4 different processors
├─ Guaranteed simultaneous execution
└─ No queueing, no waiting
```
---
## Why We're Not Parallel
### Hardware Reality
```
Your Computer:
├─ CPU: 1-16 cores (for you)
└─ But HTTP requests go to Anthropic's servers
Anthropic's Servers:
├─ Thousands of cores
├─ Processing requests from thousands of customers
├─ Your 4 requests share infrastructure with 10,000+ others
└─ They decide how to allocate resources
```
### Request Processing
```
Your Request ──HTTP──> Anthropic API ──> GPU Cluster
                                         (thousands of queries
                                          being processed)

Your request waits its turn
When capacity is available: it is processed
Response returned ──HTTP──> Your Code
---
## Actual Performance Gains
### Best Case (Off-Peak)
```
Stages 2-5 Duration:
- Sequential: 28-45 minutes
- Concurrent: 18-20 minutes
- Gain: ~40%
But this requires:
- No other users on API
- No rate limiting
- Sufficient TPM budget
- Rare in production
```
### Realistic Case (Normal Load)
```
Stages 2-5 Duration:
- Sequential: 28-45 minutes
- Concurrent: 24-35 minutes
- Gain: ~20-30%
With typical:
- Some API load
- No rate limiting hits
- Normal usage patterns
```
### Worst Case (Peak Load)
```
Stages 2-5 Duration:
- Sequential: 28-45 minutes
- Concurrent: 32-48 minutes
- Gain: Negative (slower)
When:
- High API load
- Rate limiting active
- High token usage
- Results in queueing
```
---
## Calculating Your Expected Speedup
```
Formula:
Time Saved = Base Time × Max Savings × Concurrency Efficiency
Expected Time = Base Time - Time Saved

Where:
- Max Savings ≈ 0.25 (the best realistic saving from concurrency)
- Concurrency Efficiency = fraction of the run during which the agents
  actually execute concurrently (rather than queue)

If agents run concurrently 80% of the time:
- Time Saved = 37 min × 0.25 × 0.8 = 7.4 min
- Expected Time: 37 - 7.4 = 29.6 minutes

If agents run concurrently only 20% of the time (high load):
- Time Saved = 37 min × 0.25 × 0.2 ≈ 1.9 min
- Expected Time: ~35 minutes (almost no speedup)
```
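The same calculation as a small helper function (the 0.25 maximum saving and 37-minute base are the illustrative numbers above, not measurements):

```python
def expected_time(base_minutes, max_savings=0.25, efficiency=1.0):
    """Wall time after concurrency savings: base × (1 − max_savings × efficiency)."""
    return base_minutes * (1 - max_savings * efficiency)

print(round(expected_time(37, efficiency=0.8), 1))  # high concurrency
print(round(expected_time(37, efficiency=0.2), 1))  # heavy queueing
```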
---
## Recommendations
### When to Use Concurrent Agents
1. **Off-peak hours** (guaranteed better concurrency)
2. **Well below rate limits** (room for 4 simultaneous requests)
3. **Token budget permits** (2x cost is acceptable)
4. **Quality > Speed** (primary motivation is thorough review)
5. **Enterprise standards** (multiple expert perspectives required)
### When to Avoid
1. **Peak hours** (queueing dominates)
2. **Near rate limits** (risk of failures)
3. **Limited token budget** (2x cost is expensive)
4. **Speed is primary** (20-30% is not meaningful)
5. **Simple changes** (overkill)
### Monitoring Your API Health
```
# Track your usage:
1. Monitor RPM: requests per minute
2. Monitor TPM: tokens per minute
3. Monitor Response times
4. Track errors from rate limiting
# Good signs for concurrent agents:
- RPM usage < 50% of limit
- TPM usage < 50% of limit
- Response times stable
- No rate limit errors
# Bad signs:
- Frequent rate limit errors
- Response times > 2 seconds
- TPM usage > 70% of limit
- RPM usage > 60% of limit
```
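Those rules of thumb can be expressed as a simple check - the 50% thresholds are the ones suggested above, not universal constants:

```python
def concurrency_looks_safe(rpm_used, rpm_limit, tpm_used, tpm_limit):
    """True when usage leaves enough headroom to launch 4 simultaneous agents."""
    return (rpm_used / rpm_limit) < 0.5 and (tpm_used / tpm_limit) < 0.5

print(concurrency_looks_safe(100, 500, 30_000, 100_000))  # True
print(concurrency_looks_safe(350, 500, 80_000, 100_000))  # False
```

When the check fails, fall back to single-agent mode rather than risk queueing or rejected requests.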
---
## Summary
The Master Orchestrator **submits 4 requests concurrently**, but:
- ✗ NOT true parallel (depends on API queue)
- ✓ Provides context isolation (each agent clean context)
- ✓ Offers multi-perspective analysis (specialization benefits)
- ⚠ Costs 2x tokens (regardless of execution model)
- ⚠ Speedup is typically 20-30%, not 40-50%
- ⚠ Can degrade to sequential during high load
**Use when**: Quality and multiple perspectives matter more than cost/speed.
**Avoid when**: Cost or speed is the primary concern.
See [REALITY.md](REALITY.md) for honest assessment and [TOKEN-USAGE.md](TOKEN-USAGE.md) for detailed cost analysis.