# Apple Intelligence gRPC Server
A Swift-based gRPC server that exposes Apple Intelligence (Foundation Models) over the network, allowing any device on your LAN to send prompts and receive streaming AI responses.
## Features
- **gRPC API** - Standard gRPC interface accessible from any language
- **Streaming Support** - Real-time token streaming for responsive UX
- **Vision Analysis** - Analyze images with text extraction, labeling, and descriptions
- **Text-to-Speech** - Convert text to audio (WAV/MP3) with multiple voices
- **Speech-to-Text** - Transcribe audio files or stream audio in real-time
- **Menu Bar App** - Native macOS app with system tray integration
- **Built-in Chat UI** - Test the AI directly from the app with voice input/output
- **API Key Auth** - Optional bearer token authentication
- **Auto-Start** - Launch at login and auto-start server options
## Requirements
- macOS 26+ (Tahoe)
- Apple Silicon Mac (M1/M2/M3/M4)
- Apple Intelligence enabled in System Settings
- Swift 6.0+
## Installation
### Download Release
Download the latest `.dmg` from the [Releases](../../releases) page, open it, and drag the app to Applications.
### Build from Source
```bash
# Clone the repository
git clone https://github.com/svrnty/apple-intelligence-grpc.git
cd apple-intelligence-grpc
# Build the menu bar app
swift build -c release --product AppleIntelligenceApp
# Or build the CLI server
swift build -c release --product AppleIntelligenceServer
```
## Usage
### Menu Bar App
1. Launch **Apple Intelligence Server** from Applications
2. Click the brain icon in the menu bar
3. Toggle **Start Server** to begin accepting connections
4. Use **Chat** to test the AI directly (supports voice input/output)
5. Configure host, port, and API key in **Settings**
### CLI Server
```bash
# Run with defaults (0.0.0.0:50051)
.build/release/AppleIntelligenceServer
# Custom configuration via environment
GRPC_HOST=127.0.0.1 GRPC_PORT=8080 API_KEY=secret .build/release/AppleIntelligenceServer
```
## API
### Service Definition
```protobuf
service AppleIntelligenceService {
  // AI Completion
  rpc Health(HealthRequest) returns (HealthResponse);
  rpc Complete(CompletionRequest) returns (CompletionResponse);
  rpc StreamComplete(CompletionRequest) returns (stream CompletionChunk);

  // Text-to-Speech
  rpc TextToSpeech(TextToSpeechRequest) returns (TextToSpeechResponse);
  rpc ListVoices(ListVoicesRequest) returns (ListVoicesResponse);

  // Speech-to-Text
  rpc Transcribe(TranscribeRequest) returns (TranscribeResponse);
  rpc StreamTranscribe(stream StreamingTranscribeRequest) returns (stream StreamingTranscribeResponse);
}
```
### Methods
| Method | Type | Description |
|--------|------|-------------|
| `Health` | Unary | Check server and model availability |
| `Complete` | Unary | Generate complete response (supports images) |
| `StreamComplete` | Server Streaming | Stream tokens as they're generated |
| `TextToSpeech` | Unary | Convert text to audio |
| `ListVoices` | Unary | List available TTS voices |
| `Transcribe` | Unary | Transcribe audio file to text |
| `StreamTranscribe` | Bidirectional | Real-time audio transcription |
### Vision Support
The `Complete` and `StreamComplete` methods support image analysis:
```protobuf
message CompletionRequest {
  string prompt = 1;
  optional float temperature = 2;
  optional int32 max_tokens = 3;
  repeated ImageData images = 4;  // Attach images for analysis
  bool include_analysis = 5;      // Return detailed analysis
}

message ImageData {
  bytes data = 1;
  string filename = 2;
  string mime_type = 3;           // image/png, image/jpeg, etc.
}
```
**Supported Image Formats:** PNG, JPEG, GIF, WebP, HEIC
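For example, a client can attach an image by filling the `ImageData` fields from a local file. The sketch below builds a JSON-shaped payload (as used by grpcurl, where `bytes` fields are base64-encoded); the `image_payload` helper and its extension-to-MIME map are illustrative, not part of the API.

```python
import base64
import pathlib

# Extension -> MIME type map for the formats listed above (illustrative).
MIME_BY_EXT = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".heic": "image/heic",
}

def image_payload(path: str) -> dict:
    """Build a JSON-style ImageData payload from a file on disk."""
    p = pathlib.Path(path)
    mime = MIME_BY_EXT.get(p.suffix.lower())
    if mime is None:
        raise ValueError(f"unsupported image format: {p.suffix}")
    return {
        # gRPC's JSON mapping represents bytes fields as base64 strings.
        "data": base64.b64encode(p.read_bytes()).decode("ascii"),
        "filename": p.name,
        "mime_type": mime,
    }
```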
### Text-to-Speech
```protobuf
message TextToSpeechRequest {
  string text = 1;
  AudioFormat output_format = 2;        // WAV or MP3
  optional VoiceConfig voice_config = 3;
}

message VoiceConfig {
  string voice_identifier = 1;          // Voice ID from ListVoices
  optional float speaking_rate = 2;     // 0.0-1.0, default 0.5
  optional float pitch_multiplier = 3;  // 0.5-2.0, default 1.0
  optional float volume = 4;            // 0.0-1.0, default 1.0
}
```
**Output Formats:** WAV, MP3
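Each `VoiceConfig` field has a documented range, so a client may want to clamp values before sending rather than rely on server-side validation. A minimal sketch (the `voice_config` helper is illustrative, not part of the API):

```python
def clamp(value: float, lo: float, hi: float) -> float:
    """Clamp a value into the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))

def voice_config(voice_identifier: str,
                 speaking_rate: float = 0.5,
                 pitch_multiplier: float = 1.0,
                 volume: float = 1.0) -> dict:
    """Build a VoiceConfig payload, clamping each field to its documented range."""
    return {
        "voice_identifier": voice_identifier,
        "speaking_rate": clamp(speaking_rate, 0.0, 1.0),
        "pitch_multiplier": clamp(pitch_multiplier, 0.5, 2.0),
        "volume": clamp(volume, 0.0, 1.0),
    }
```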
### Speech-to-Text
#### File-based Transcription
```protobuf
message TranscribeRequest {
  AudioInput audio = 1;
  optional TranscriptionConfig config = 2;
}

message AudioInput {
  bytes data = 1;
  string mime_type = 2;               // audio/wav, audio/mp3, etc.
  optional int32 sample_rate = 3;
  optional int32 channels = 4;
}

message TranscriptionConfig {
  optional string language_code = 1;  // e.g., "en-US", "fr-CA"
  optional bool enable_punctuation = 2;
  optional bool enable_timestamps = 3;
}
```
**Supported Audio Formats:** WAV, MP3, M4A, AAC, FLAC
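For WAV input, the optional `sample_rate` and `channels` fields can be filled straight from the file header. A sketch using Python's stdlib `wave` module to build a `TranscribeRequest`-shaped JSON payload (the `transcribe_payload` helper is illustrative):

```python
import base64
import wave

def transcribe_payload(wav_path: str, language_code: str = "en-US") -> dict:
    """Build a TranscribeRequest-shaped payload from a WAV file.

    sample_rate and channels are read from the WAV header; the bytes
    field is base64-encoded as required by gRPC's JSON mapping.
    """
    with wave.open(wav_path, "rb") as w:
        sample_rate = w.getframerate()
        channels = w.getnchannels()
    with open(wav_path, "rb") as f:
        data = f.read()
    return {
        "audio": {
            "data": base64.b64encode(data).decode("ascii"),
            "mime_type": "audio/wav",
            "sample_rate": sample_rate,
            "channels": channels,
        },
        "config": {"language_code": language_code},
    }
```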
#### Streaming Transcription
For real-time transcription, use bidirectional streaming:
1. Send `TranscriptionConfig` first to configure the session
2. Send `audio_chunk` messages with PCM audio data (16-bit, 16kHz, mono)
3. Receive `StreamingTranscribeResponse` with partial and final results
```protobuf
message StreamingTranscribeRequest {
  oneof request {
    TranscriptionConfig config = 1;  // Send first
    bytes audio_chunk = 2;           // Then audio chunks
  }
}

message StreamingTranscribeResponse {
  string partial_text = 1;
  bool is_final = 2;
  string final_text = 3;
  repeated TranscriptionSegment segments = 4;
}
```
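Since the stream expects raw PCM (16-bit, 16 kHz, mono), a client needs to slice its audio into `audio_chunk` payloads. A minimal sketch; the 100 ms chunk duration is an arbitrary choice for illustration, not something the API mandates:

```python
# Stream format documented above: 16-bit samples (2 bytes), 16 kHz, mono.
SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2

def pcm_chunks(pcm: bytes, chunk_ms: int = 100):
    """Yield fixed-duration audio_chunk payloads from raw PCM bytes.

    At 100 ms: 16_000 samples/s * 0.1 s * 2 bytes = 3200 bytes per chunk.
    """
    chunk_bytes = SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]
```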
### Quick Test with grpcurl
```bash
# Health check
grpcurl -plaintext localhost:50051 appleintelligence.AppleIntelligenceService/Health

# Text completion
grpcurl -plaintext \
  -d '{"prompt": "What is 2 + 2?"}' \
  localhost:50051 appleintelligence.AppleIntelligenceService/Complete

# Streaming completion
grpcurl -plaintext \
  -d '{"prompt": "Tell me a short story"}' \
  localhost:50051 appleintelligence.AppleIntelligenceService/StreamComplete

# List TTS voices
grpcurl -plaintext \
  -d '{"language_code": "en-US"}' \
  localhost:50051 appleintelligence.AppleIntelligenceService/ListVoices

# Text-to-Speech (the audio_data in the response is base64-encoded)
grpcurl -plaintext \
  -d '{"text": "Hello world", "output_format": 1}' \
  localhost:50051 appleintelligence.AppleIntelligenceService/TextToSpeech

# Transcribe an audio file (the audio data must be base64-encoded)
grpcurl -plaintext \
  -d '{"audio": {"data": "'$(base64 -i audio.wav)'", "mime_type": "audio/wav"}}' \
  localhost:50051 appleintelligence.AppleIntelligenceService/Transcribe
```
## Configuration
| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `GRPC_HOST` | `0.0.0.0` | Host to bind (use `0.0.0.0` for LAN access) |
| `GRPC_PORT` | `50051` | Port to listen on |
| `API_KEY` | *none* | Optional API key for authentication |
## Supported Languages
### Speech Recognition (STT)
- English (US, CA, GB, AU, IN, IE, ZA)
- French (CA, FR)
- Spanish (ES, MX)
- German, Italian, Portuguese, Japanese, Korean, Chinese
- And many more via macOS Speech framework
### Text-to-Speech (TTS)
All voices available in macOS System Settings, including:
- Premium voices (highest quality, requires download)
- Enhanced voices (good quality)
- Default/Compact voices (pre-installed)
## Client Libraries
Connect from any language with gRPC support:
- **Python**: `grpcio`, `grpcio-tools`
- **Node.js**: `@grpc/grpc-js`, `@grpc/proto-loader`
- **Go**: `google.golang.org/grpc`
- **Swift**: `grpc-swift`
- **Rust**: `tonic`
See [docs/grpc-client-guide.md](docs/grpc-client-guide.md) for detailed examples.
## Project Structure
```
apple-intelligence-grpc/
├── Package.swift
├── Proto/
│   └── apple_intelligence.proto      # gRPC service definition
├── Sources/
│   ├── AppleIntelligenceCore/        # Shared gRPC service code
│   │   ├── Config.swift
│   │   ├── Services/
│   │   │   ├── AppleIntelligenceService.swift
│   │   │   ├── TextToSpeechService.swift
│   │   │   ├── SpeechToTextService.swift
│   │   │   └── VisionAnalysisService.swift
│   │   ├── Providers/
│   │   │   └── AppleIntelligenceProvider.swift
│   │   └── Generated/
│   │       ├── apple_intelligence.pb.swift
│   │       └── apple_intelligence.grpc.swift
│   ├── AppleIntelligenceServer/      # CLI executable
│   │   └── main.swift
│   └── AppleIntelligenceApp/         # Menu bar app
│       ├── App.swift
│       ├── ServerManager.swift
│       ├── Models/
│       ├── Views/
│       └── ViewModels/
├── scripts/
│   ├── build-app.sh                  # Build .app bundle
│   └── create-dmg.sh                 # Create DMG installer
└── docs/
    ├── grpc-client-guide.md          # Client connection examples
    ├── macos-runner-setup.md         # CI runner setup
    └── pipeline-configuration.md     # CI/CD configuration
```
## CI/CD
Automated builds are configured with Gitea Actions. When a release is created, the pipeline:
1. Builds the app bundle
2. Signs with Developer ID
3. Notarizes with Apple
4. Uploads DMG to release
See [docs/pipeline-configuration.md](docs/pipeline-configuration.md) for setup instructions.
## Security
- **Local Network**: By default, the server binds to `0.0.0.0`, allowing access from any device on your LAN
- **API Key**: Enable authentication by setting the `API_KEY` environment variable
- **Firewall**: macOS will prompt to allow incoming connections on first run
- **Notarized**: Release builds are signed and notarized by Apple
## Troubleshooting
### Model Not Available
- Ensure Apple Intelligence is enabled: System Settings → Apple Intelligence & Siri
- Requires Apple Silicon Mac with macOS 26+
### Connection Refused
- Check the server is running (brain icon should be filled)
- Verify firewall allows connections on the configured port
- Try `localhost` instead of the IP if testing locally
### Authentication Failed
- Include the API key in the Authorization header: `Bearer YOUR_API_KEY`
- Verify the key matches what's configured in Settings
### Speech Recognition Not Working
- Grant microphone permission when prompted
- Check System Settings → Privacy & Security → Speech Recognition
- Ensure the language is supported
### TTS Voice Quality
- Download Premium/Enhanced voices from System Settings → Accessibility → Read & Speak
- Premium voices are larger (~150-500MB) but sound more natural
## License
MIT