Vision-module-auto/README.md
Svrnty 41cecca0e2 Initial release of Claude Vision Auto v1.0.0
Vision-based auto-approval system for Claude Code CLI using MiniCPM-V vision model.

Features:
- Automatic detection and response to approval prompts
- Screenshot capture and vision analysis via Ollama
- Support for multiple screenshot tools (scrot, gnome-screenshot, etc.)
- Configurable timing and behavior
- Debug mode for troubleshooting
- Comprehensive documentation

Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Jean-Philippe Brule <jp@svrnty.io>
2025-10-29 10:09:01 -04:00

# Claude Vision Auto
Vision-based auto-approval system for Claude Code CLI using MiniCPM-V vision model.
## Overview
Claude Vision Auto automatically detects and responds to approval prompts in Claude Code by:
1. Monitoring terminal output for approval keywords
2. Taking screenshots when idle (waiting for input)
3. Analyzing screenshots with MiniCPM-V vision model via Ollama
4. Automatically submitting appropriate responses
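
As a rough illustration of step 3, the sketch below builds a request for Ollama's `/api/generate` endpoint with the screenshot attached as a base64 string. The function names and the prompt wording are hypothetical and are not taken from this project's source. It uses `urllib` from the standard library instead of `requests` only to keep the example self-contained.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default from this README
VISION_MODEL = "minicpm-v:latest"                   # default from this README

# Hypothetical prompt wording; the project's actual prompt is not shown here.
PROMPT = (
    "Look at this terminal screenshot. If it shows an approval prompt, "
    "reply with only the key to press (for example 1 or y). Otherwise reply WAIT."
)


def build_payload(image_bytes, model=VISION_MODEL):
    """Build a non-streaming /api/generate body with one attached image."""
    return {
        "model": model,
        "prompt": PROMPT,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_vision_model(image_bytes, url=OLLAMA_URL, timeout=30.0):
    """POST the screenshot to Ollama and return the model's text reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(image_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"].strip()
```

With `"stream": False`, Ollama returns a single JSON object whose `response` field holds the full reply, which keeps parsing simple for short answers.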
## Features
- **Zero Pattern Matching**: Uses vision AI to read the prompt instead of fragile regex patterns (keywords are only used to decide when to take a screenshot)
- **Universal Compatibility**: Works with any Claude Code prompt format
- **Intelligent Detection**: Only activates when approval keywords are present
- **Configurable**: Environment variables for all settings
- **Lightweight**: Minimal Python dependencies (only `requests`)
- **Debug Mode**: Verbose logging for troubleshooting
## Prerequisites
### Required
1. **Claude Code CLI** - Anthropic's official CLI tool
```bash
npm install -g @anthropic-ai/claude-code
```
2. **Ollama with MiniCPM-V** - Vision model server
```bash
docker pull ollama/ollama
docker run -d -p 11434:11434 --name ollama ollama/ollama
docker exec ollama ollama pull minicpm-v:latest
```
3. **Screenshot Tool** (one of):
- `scrot` (recommended)
- `gnome-screenshot`
- `imagemagick` (provides the `import` command)
- `maim`
### Install Screenshot Tool
```bash
# Debian/Ubuntu (recommended)
sudo apt-get install scrot
# Alternative options
sudo apt-get install gnome-screenshot
sudo apt-get install imagemagick
sudo apt-get install maim
```
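
Since any one of these tools is enough, a wrapper could pick whichever is installed. The sketch below is one possible way to do that, not the project's actual `screenshot.py`; the function name and the preference order are assumptions. Each command line is that tool's basic full-screen capture invocation.

```python
import shutil

# Candidate commands in this README's order of preference.
# "{path}" is replaced with the output file name.
SCREENSHOT_COMMANDS = [
    ("scrot", ["scrot", "{path}"]),
    ("gnome-screenshot", ["gnome-screenshot", "-f", "{path}"]),
    ("import", ["import", "-window", "root", "{path}"]),  # ImageMagick
    ("maim", ["maim", "{path}"]),
]


def pick_screenshot_command(path, which=shutil.which):
    """Return the argv list for the first installed tool, or None if none is found.

    `which` is injectable so the lookup can be tested without the tools installed.
    """
    for binary, template in SCREENSHOT_COMMANDS:
        if which(binary):
            return [arg.format(path=path) for arg in template]
    return None
```

The returned list can be passed to `subprocess.run(cmd, check=True, timeout=5)`, where the timeout value is assumed to correspond to `SCREENSHOT_TIMEOUT`.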
## Installation
### Quick Install
```bash
cd claude-vision-auto
make deps # Install system dependencies
make install # Install the package
```
### Manual Install
```bash
# Install system dependencies
sudo apt-get update
sudo apt-get install -y scrot python3-pip
# Install Python package
pip3 install -e .
```
### Verify Installation
```bash
# Check if command is available
which claude-vision
# Test Ollama connection
curl http://localhost:11434/api/tags
```
## Usage
### Basic Usage
Replace `claude` with `claude-vision`:
```bash
# Instead of:
claude
# Use:
claude-vision
```
### With Prompts
```bash
# Pass prompts directly
claude-vision "create a test file in /tmp"
# Interactive session
claude-vision
```
### Configuration
Set environment variables to customize behavior:
```bash
# Ollama URL (default: http://localhost:11434/api/generate)
export OLLAMA_URL="http://custom-host:11434/api/generate"
# Vision model (default: minicpm-v:latest)
export VISION_MODEL="llama3.2-vision:latest"
# Idle threshold in seconds (default: 3.0)
export IDLE_THRESHOLD="5.0"
# Response delay in seconds (default: 1.0)
export RESPONSE_DELAY="2.0"
# Enable debug mode
export DEBUG="true"
# Run with custom settings
claude-vision
```
## How It Works
1. **Launch**: Spawns `claude` as subprocess
2. **Monitor**: Watches output for approval keywords (Yes, No, Approve, etc.)
3. **Detect Idle**: Treats the session as waiting for input once output has stopped for `IDLE_THRESHOLD` seconds
4. **Screenshot**: Captures terminal window with `scrot` or another installed screenshot tool
5. **Analyze**: Sends to MiniCPM-V via Ollama API
6. **Respond**: Vision model returns "1", "y", or "WAIT"
7. **Submit**: Automatically sends response if not "WAIT"
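
The monitor, idle-detection, and respond steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in `main.py`: the class and function names are hypothetical, the keyword list is a guess based on the examples in step 2, and the buffer counts characters rather than bytes.

```python
import time

# Assumed keyword list; this README only names "Yes, No, Approve, etc."
APPROVAL_KEYWORDS = ("yes", "no", "approve")


class IdleMonitor:
    """Tracks recent output and reports when a screenshot would be warranted."""

    def __init__(self, idle_threshold=3.0, buffer_size=4096, clock=time.monotonic):
        self.idle_threshold = idle_threshold
        self.buffer_size = buffer_size  # characters here; the README specifies bytes
        self.clock = clock              # injectable so tests can control time
        self.buffer = ""
        self.last_output = clock()

    def feed(self, text):
        """Record new subprocess output and reset the idle timer."""
        self.buffer = (self.buffer + text)[-self.buffer_size:]
        self.last_output = self.clock()

    def should_screenshot(self):
        """True once output has been idle long enough and a keyword was seen."""
        idle = self.clock() - self.last_output >= self.idle_threshold
        recent = self.buffer.lower()
        return idle and any(word in recent for word in APPROVAL_KEYWORDS)


def parse_reply(reply):
    """Return the text to submit, or None when the model answered WAIT."""
    answer = reply.strip()
    return None if answer.upper() == "WAIT" or not answer else answer
```

Plain substring matching is deliberately loose (for example, "no" also matches "know"). In this design a false positive only costs one extra screenshot, because the vision model can still answer "WAIT".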
## Configuration Options
| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_URL` | `http://localhost:11434/api/generate` | Ollama API endpoint |
| `VISION_MODEL` | `minicpm-v:latest` | Vision model to use |
| `IDLE_THRESHOLD` | `3.0` | Seconds of idle before screenshot |
| `RESPONSE_DELAY` | `1.0` | Seconds to wait before responding |
| `OUTPUT_BUFFER_SIZE` | `4096` | Bytes of output to buffer |
| `SCREENSHOT_TIMEOUT` | `5` | Screenshot capture timeout |
| `VISION_TIMEOUT` | `30` | Vision analysis timeout |
| `DEBUG` | `false` | Enable verbose logging |
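
For illustration, the table above could be loaded from the environment along these lines. This is a sketch and not the contents of `config.py`: the `Config` class, the `load_config` function, the assumption that both timeouts are in seconds, and the set of strings accepted as "true" for `DEBUG` are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    ollama_url: str
    vision_model: str
    idle_threshold: float
    response_delay: float
    output_buffer_size: int
    screenshot_timeout: float  # assumed to be seconds
    vision_timeout: float      # assumed to be seconds
    debug: bool


def _as_bool(value):
    # Assumption: the README does not say which spellings count as "true".
    return value.strip().lower() in ("1", "true", "yes", "on")


def load_config(env=os.environ):
    """Read settings from `env`, falling back to the documented defaults."""
    return Config(
        ollama_url=env.get("OLLAMA_URL", "http://localhost:11434/api/generate"),
        vision_model=env.get("VISION_MODEL", "minicpm-v:latest"),
        idle_threshold=float(env.get("IDLE_THRESHOLD", "3.0")),
        response_delay=float(env.get("RESPONSE_DELAY", "1.0")),
        output_buffer_size=int(env.get("OUTPUT_BUFFER_SIZE", "4096")),
        screenshot_timeout=float(env.get("SCREENSHOT_TIMEOUT", "5")),
        vision_timeout=float(env.get("VISION_TIMEOUT", "30")),
        debug=_as_bool(env.get("DEBUG", "false")),
    )
```

Passing a plain dictionary, as in `load_config({"IDLE_THRESHOLD": "5.0"})`, makes it possible to check overrides without touching the real environment.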
## Troubleshooting
### Ollama Not Connected
```bash
# Check if Ollama is running
docker ps | grep ollama
# Check if model is available
curl http://localhost:11434/api/tags
```
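
The same check can be scripted. The sketch below parses the JSON returned by `/api/tags`, which lists installed models under a `models` array with a `name` field per entry. The function names are hypothetical, and the base URL assumes the default local setup from this README.

```python
import json
import urllib.request


def model_available(tags_json, model):
    """Return True if `model` is listed in the body returned by GET /api/tags."""
    models = json.loads(tags_json).get("models") or []
    return any(entry.get("name") == model for entry in models)


def check_ollama(base_url="http://localhost:11434", model="minicpm-v:latest", timeout=5.0):
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return model_available(response.read(), model)
```

A connection error from `check_ollama()` points at the container not running, while a `False` result means the server is up but the model still needs to be pulled.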
### Screenshot Fails
```bash
# Test screenshot tool
scrot /tmp/test.png
ls -lh /tmp/test.png
# Install if missing
sudo apt-get install scrot
```
### Debug Mode
```bash
# Run with verbose logging
DEBUG=true claude-vision "test command"
```
### Vision Model Not Found
```bash
# Pull the model
docker exec ollama ollama pull minicpm-v:latest
# Or use alternative vision model
export VISION_MODEL="llava:latest"
claude-vision
```
## Development
### Setup Development Environment
```bash
# Clone repository
git clone https://git.openharbor.io/svrnty/claude-vision-auto.git
cd claude-vision-auto
# Install in development mode
pip3 install -e .
# Run tests
make test
```
### Project Structure
```
claude-vision-auto/
├── README.md                  # This file
├── LICENSE                    # MIT License
├── setup.py                   # Package setup
├── requirements.txt           # Python dependencies
├── Makefile                   # Build automation
├── .gitignore                 # Git ignore rules
├── claude_vision_auto/        # Main package
│   ├── __init__.py            # Package initialization
│   ├── main.py                # CLI entry point
│   ├── config.py              # Configuration
│   ├── screenshot.py          # Screenshot capture
│   └── vision_analyzer.py     # Vision analysis
├── bin/
│   └── claude-vision          # CLI wrapper script
├── tests/                     # Test suite
│   └── test_vision.py
├── docs/                      # Documentation
│   ├── INSTALLATION.md
│   └── USAGE.md
└── examples/                  # Usage examples
    └── example_usage.sh
```
## Supported Vision Models
Tested and working:
- **minicpm-v:latest** (Recommended) - Best for structured output
- **llama3.2-vision:latest** - Good alternative
- **llava:latest** - Fallback option
## Performance
- **Startup**: < 1 second
- **Screenshot**: ~100ms
- **Vision Analysis**: 2-5 seconds (depends on model)
- **Total Response Time**: 3-7 seconds per approval
## Limitations
- **X11 Only**: Requires an X11 display server (`scrot` does not support Wayland)
- **Linux Only**: Currently only tested on Debian/Ubuntu
- **Vision Dependency**: Requires Ollama and vision model
- **Screen Required**: Must have GUI session (no headless support)
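
Because of the X11 and GUI-session requirements above, it can help to check the session type before starting. The sketch below is one way to guess it from common environment variables. The function name is hypothetical, and the project itself may not perform this check.

```python
import os


def display_server(env=os.environ):
    """Best-effort guess of the display server: "x11", "wayland", or "none"."""
    session = env.get("XDG_SESSION_TYPE", "").lower()
    if session in ("x11", "wayland"):
        return session
    # Check Wayland first: XWayland sessions usually set DISPLAY as well.
    if env.get("WAYLAND_DISPLAY"):
        return "wayland"
    if env.get("DISPLAY"):
        return "x11"
    return "none"
```

A result of `"wayland"` or `"none"` suggests that `scrot` will not be able to capture the screen in the current session.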
## Future Enhancements
- [ ] Wayland support (alternative screenshot tools)
- [ ] macOS support
- [ ] Headless mode (API-only, no screenshots)
- [ ] Configurable response patterns
- [ ] Multi-terminal support
- [ ] Session recording and replay
## License
MIT License - See LICENSE file
## Author
**Svrnty**
- Email: jp@svrnty.io
- Repository: https://git.openharbor.io/svrnty/claude-vision-auto
## Contributing
Contributions welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Submit a pull request
## Acknowledgments
- **Anthropic** - Claude Code CLI
- **MiniCPM-V** - Vision model
- **Ollama** - Model serving infrastructure