
# Apple Intelligence gRPC Server

A Swift-based gRPC server that exposes Apple Intelligence (Foundation Models) over the network, allowing any device on your LAN to send prompts and receive streaming AI responses.

## Features

- **gRPC API** - Standard gRPC interface accessible from any language
- **Streaming Support** - Real-time token streaming for responsive UX
- **Vision Analysis** - Analyze images with text extraction, labeling, and descriptions
- **Text-to-Speech** - Convert text to audio (WAV/MP3) with multiple voices
- **Speech-to-Text** - Transcribe audio files or stream audio in real time
- **Menu Bar App** - Native macOS app with system tray integration
- **Built-in Chat UI** - Test the AI directly from the app with voice input/output
- **API Key Auth** - Optional bearer token authentication
- **Auto-Start** - Launch at login and auto-start server options

## Requirements

- macOS 26+ (Tahoe)
- Apple Silicon Mac (M1/M2/M3/M4)
- Apple Intelligence enabled in System Settings
- Swift 6.0+

## Installation

### Download Release

Download the latest `.dmg` from the Releases page, open it, and drag the app to Applications.

### Build from Source

```bash
# Clone the repository
git clone https://github.com/svrnty/apple-intelligence-grpc.git
cd apple-intelligence-grpc

# Build the menu bar app
swift build -c release --product AppleIntelligenceApp

# Or build the CLI server
swift build -c release --product AppleIntelligenceServer
```

## Usage

### Menu Bar App

1. Launch **Apple Intelligence Server** from Applications
2. Click the brain icon in the menu bar
3. Toggle **Start Server** to begin accepting connections
4. Use **Chat** to test the AI directly (supports voice input/output)
5. Configure host, port, and API key in **Settings**

### CLI Server

```bash
# Run with defaults (0.0.0.0:50051)
.build/release/AppleIntelligenceServer

# Custom configuration via environment variables
GRPC_HOST=127.0.0.1 GRPC_PORT=8080 API_KEY=secret .build/release/AppleIntelligenceServer
```

## API

### Service Definition

```protobuf
service AppleIntelligenceService {
  // AI Completion
  rpc Health(HealthRequest) returns (HealthResponse);
  rpc Complete(CompletionRequest) returns (CompletionResponse);
  rpc StreamComplete(CompletionRequest) returns (stream CompletionChunk);

  // Text-to-Speech
  rpc TextToSpeech(TextToSpeechRequest) returns (TextToSpeechResponse);
  rpc ListVoices(ListVoicesRequest) returns (ListVoicesResponse);

  // Speech-to-Text
  rpc Transcribe(TranscribeRequest) returns (TranscribeResponse);
  rpc StreamTranscribe(stream StreamingTranscribeRequest) returns (stream StreamingTranscribeResponse);
}
```

### Methods

| Method | Type | Description |
|--------|------|-------------|
| `Health` | Unary | Check server and model availability |
| `Complete` | Unary | Generate a complete response (supports images) |
| `StreamComplete` | Server streaming | Stream tokens as they're generated |
| `TextToSpeech` | Unary | Convert text to audio |
| `ListVoices` | Unary | List available TTS voices |
| `Transcribe` | Unary | Transcribe an audio file to text |
| `StreamTranscribe` | Bidirectional streaming | Real-time audio transcription |

### Vision Support

The `Complete` and `StreamComplete` methods support image analysis:

```protobuf
message CompletionRequest {
  string prompt = 1;
  optional float temperature = 2;
  optional int32 max_tokens = 3;
  repeated ImageData images = 4;      // Attach images for analysis
  bool include_analysis = 5;          // Return detailed analysis
}

message ImageData {
  bytes data = 1;
  string filename = 2;
  string mime_type = 3;               // image/png, image/jpeg, etc.
}
```

**Supported Image Formats:** PNG, JPEG, GIF, WebP, HEIC
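
When calling these methods through grpcurl's JSON interface, proto `bytes` fields must be base64-encoded. A minimal Python sketch of building the `Complete` payload with an attached image; the `build_vision_request` helper is illustrative, not part of the project:

```python
import base64
import json

def build_vision_request(prompt: str, image_bytes: bytes,
                         mime_type: str = "image/png") -> str:
    """Build a JSON payload for Complete/StreamComplete.

    grpcurl expects proto `bytes` fields as base64 strings."""
    payload = {
        "prompt": prompt,
        "images": [{
            "data": base64.b64encode(image_bytes).decode("ascii"),
            "filename": "screenshot.png",
            "mime_type": mime_type,
        }],
        "include_analysis": True,
    }
    return json.dumps(payload)

# Placeholder bytes stand in for a real image read with open(path, "rb")
print(build_vision_request("What text is in this image?", b"\x89PNG\r\n"))
```

The resulting string can be passed directly to grpcurl's `-d` flag.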

### Text-to-Speech

```protobuf
message TextToSpeechRequest {
  string text = 1;
  AudioFormat output_format = 2;       // WAV or MP3
  optional VoiceConfig voice_config = 3;
}

message VoiceConfig {
  string voice_identifier = 1;         // Voice ID from ListVoices
  optional float speaking_rate = 2;    // 0.0-1.0, default 0.5
  optional float pitch_multiplier = 3; // 0.5-2.0, default 1.0
  optional float volume = 4;           // 0.0-1.0, default 1.0
}
```

**Output Formats:** WAV, MP3
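
On the client side, the `audio_data` field of a `TextToSpeechResponse` arrives base64-encoded when using grpcurl's JSON output. A small hypothetical Python helper to decode and save it:

```python
import base64

def save_tts_audio(audio_data_b64: str, path: str) -> int:
    """Decode the base64-encoded audio_data from a TextToSpeechResponse
    and write the raw audio bytes to disk.

    Returns the number of bytes written."""
    audio = base64.b64decode(audio_data_b64)
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)
```

A WAV response should then start with the usual `RIFF` magic bytes, which is a quick sanity check that decoding worked.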

### Speech-to-Text

#### File-based Transcription

```protobuf
message TranscribeRequest {
  AudioInput audio = 1;
  optional TranscriptionConfig config = 2;
}

message AudioInput {
  bytes data = 1;
  string mime_type = 2;               // audio/wav, audio/mp3, etc.
  optional int32 sample_rate = 3;
  optional int32 channels = 4;
}

message TranscriptionConfig {
  optional string language_code = 1;  // e.g., "en-US", "fr-CA"
  optional bool enable_punctuation = 2;
  optional bool enable_timestamps = 3;
}
```

**Supported Audio Formats:** WAV, MP3, M4A, AAC, FLAC
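
As a sketch, a client might pick the `mime_type` from the file extension. The mapping below is an assumption; the server keys off whatever MIME string you send in `AudioInput`, and `build_transcribe_request` is an illustrative helper, not part of the project:

```python
import base64
import json
from pathlib import Path

# Assumed MIME strings for the audio formats listed above
MIME_BY_EXT = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".m4a": "audio/m4a",
    ".aac": "audio/aac",
    ".flac": "audio/flac",
}

def build_transcribe_request(filename: str, audio_bytes: bytes,
                             language_code: str = "en-US") -> str:
    """Build a grpcurl-style JSON payload for Transcribe."""
    ext = Path(filename).suffix.lower()
    payload = {
        "audio": {
            "data": base64.b64encode(audio_bytes).decode("ascii"),
            "mime_type": MIME_BY_EXT[ext],
        },
        "config": {"language_code": language_code,
                   "enable_punctuation": True},
    }
    return json.dumps(payload)
```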

#### Streaming Transcription

For real-time transcription, use bidirectional streaming:

1. Send a `TranscriptionConfig` message first to configure the session
2. Send `audio_chunk` messages with PCM audio data (16-bit, 16 kHz, mono)
3. Receive `StreamingTranscribeResponse` messages with partial and final results

```protobuf
message StreamingTranscribeRequest {
  oneof request {
    TranscriptionConfig config = 1;   // Send first
    bytes audio_chunk = 2;            // Then audio chunks
  }
}

message StreamingTranscribeResponse {
  string partial_text = 1;
  bool is_final = 2;
  string final_text = 3;
  repeated TranscriptionSegment segments = 4;
}
```
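
The ordering above can be sketched as a Python generator. The dicts mirror the `oneof` in `StreamingTranscribeRequest`; a real grpcio client would yield generated protobuf messages instead:

```python
from typing import Iterable, Iterator

def streaming_requests(pcm_chunks: Iterable[bytes],
                       language_code: str = "en-US") -> Iterator[dict]:
    """Yield StreamTranscribe requests in the order the protocol
    requires: one config message first, then raw PCM audio chunks
    (16-bit, 16 kHz, mono)."""
    yield {"config": {"language_code": language_code,
                      "enable_punctuation": True}}
    for chunk in pcm_chunks:
        yield {"audio_chunk": chunk}
```

With grpcio, such a generator is passed directly to the bidirectional stub call, and responses are read off the returned iterator as they arrive.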

## Quick Test with grpcurl

```bash
# Health check
grpcurl -plaintext localhost:50051 appleintelligence.AppleIntelligenceService/Health

# Text completion
grpcurl -plaintext \
  -d '{"prompt": "What is 2 + 2?"}' \
  localhost:50051 appleintelligence.AppleIntelligenceService/Complete

# Streaming completion
grpcurl -plaintext \
  -d '{"prompt": "Tell me a short story"}' \
  localhost:50051 appleintelligence.AppleIntelligenceService/StreamComplete

# List TTS voices
grpcurl -plaintext \
  -d '{"language_code": "en-US"}' \
  localhost:50051 appleintelligence.AppleIntelligenceService/ListVoices

# Text-to-Speech (the audio_data field in the response is base64-encoded)
grpcurl -plaintext \
  -d '{"text": "Hello world", "output_format": 1}' \
  localhost:50051 appleintelligence.AppleIntelligenceService/TextToSpeech

# Transcribe an audio file (bytes fields must be base64-encoded)
grpcurl -plaintext \
  -d '{"audio": {"data": "'$(base64 -i audio.wav)'", "mime_type": "audio/wav"}}' \
  localhost:50051 appleintelligence.AppleIntelligenceService/Transcribe
```

## Configuration

| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| `GRPC_HOST` | `0.0.0.0` | Host to bind (use `0.0.0.0` for LAN access) |
| `GRPC_PORT` | `50051` | Port to listen on |
| `API_KEY` | none | Optional API key for authentication |
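
For client-side tooling, the same resolution order can be mirrored in Python (the server itself is Swift; `load_config` is just an illustration of the table above):

```python
def load_config(env: dict) -> dict:
    """Resolve settings as the table describes: use the environment
    variable when set, otherwise fall back to the documented default."""
    return {
        "host": env.get("GRPC_HOST", "0.0.0.0"),
        "port": int(env.get("GRPC_PORT", "50051")),
        "api_key": env.get("API_KEY"),  # None means auth is disabled
    }
```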

## Supported Languages

### Speech Recognition (STT)

- English (US, CA, GB, AU, IN, IE, ZA)
- French (CA, FR)
- Spanish (ES, MX)
- German, Italian, Portuguese, Japanese, Korean, Chinese
- And many more via the macOS Speech framework

### Text-to-Speech (TTS)

All voices available in macOS System Settings, including:

- **Premium** voices (highest quality, require a download)
- **Enhanced** voices (good quality)
- **Default/Compact** voices (pre-installed)

## Client Libraries

Connect from any language with gRPC support:

- **Python**: `grpcio`, `grpcio-tools`
- **Node.js**: `@grpc/grpc-js`, `@grpc/proto-loader`
- **Go**: `google.golang.org/grpc`
- **Swift**: `grpc-swift`
- **Rust**: `tonic`

See `docs/grpc-client-guide.md` for detailed examples.

## Project Structure

```
apple-intelligence-grpc/
├── Package.swift
├── Proto/
│   └── apple_intelligence.proto      # gRPC service definition
├── Sources/
│   ├── AppleIntelligenceCore/        # Shared gRPC service code
│   │   ├── Config.swift
│   │   ├── Services/
│   │   │   ├── AppleIntelligenceService.swift
│   │   │   ├── TextToSpeechService.swift
│   │   │   ├── SpeechToTextService.swift
│   │   │   └── VisionAnalysisService.swift
│   │   ├── Providers/
│   │   │   └── AppleIntelligenceProvider.swift
│   │   └── Generated/
│   │       ├── apple_intelligence.pb.swift
│   │       └── apple_intelligence.grpc.swift
│   ├── AppleIntelligenceServer/      # CLI executable
│   │   └── main.swift
│   └── AppleIntelligenceApp/         # Menu bar app
│       ├── App.swift
│       ├── ServerManager.swift
│       ├── Models/
│       ├── Views/
│       └── ViewModels/
├── scripts/
│   ├── build-app.sh                  # Build .app bundle
│   └── create-dmg.sh                 # Create DMG installer
└── docs/
    ├── grpc-client-guide.md          # Client connection examples
    ├── macos-runner-setup.md         # CI runner setup
    └── pipeline-configuration.md     # CI/CD configuration
```

## CI/CD

Automated builds are configured with Gitea Actions. When a release is created, the pipeline:

1. Builds the app bundle
2. Signs it with a Developer ID certificate
3. Notarizes it with Apple
4. Uploads the DMG to the release

See `docs/pipeline-configuration.md` for setup instructions.

## Security

- **Local Network**: By default, the server binds to `0.0.0.0`, allowing access from any device on the LAN
- **API Key**: Enable authentication by setting the `API_KEY` environment variable
- **Firewall**: macOS will prompt to allow incoming connections on first run
- **Notarized**: Release builds are signed and notarized by Apple

## Troubleshooting

### Model Not Available

- Ensure Apple Intelligence is enabled: System Settings → Apple Intelligence & Siri
- Requires an Apple Silicon Mac with macOS 26+

### Connection Refused

- Check that the server is running (the brain icon should be filled)
- Verify the firewall allows connections on the configured port
- Try `localhost` instead of the IP address when testing locally

### Authentication Failed

- Include the API key in the `Authorization` header: `Bearer YOUR_API_KEY`
- Verify the key matches what's configured in Settings
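
With grpcio, for example, the header can be attached as call metadata. A hypothetical helper; the header name and `Bearer` scheme follow the bullet above:

```python
def auth_metadata(api_key: str) -> tuple:
    """gRPC call metadata carrying the bearer token the server checks."""
    return (("authorization", f"Bearer {api_key}"),)
```

Pass it per call, e.g. `stub.Complete(request, metadata=auth_metadata("YOUR_API_KEY"))`; with grpcurl, use `-H 'authorization: Bearer YOUR_API_KEY'`.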

### Speech Recognition Not Working

- Grant microphone permission when prompted
- Check System Settings → Privacy & Security → Speech Recognition
- Ensure the language is supported

### TTS Voice Quality

- Download Premium/Enhanced voices from System Settings → Accessibility → Read & Speak
- Premium voices are larger (~150-500 MB) but sound more natural

## License

MIT