feat(plugin): STT migration via audio_attachment_processor hook (L1-L6)

Closes Phase 2.A. STT now lives entirely in the plugin via the new
public-API method `api.register_audio_attachment_processor` added to the
loader hook (Rule 1: extended API, no forced-internal). The fork patch stays
minimal (streaming.py gains a small loop that calls registered processors;
loader adds the 1 new method).

Plugin additions:
  routes/transcribe.py             POST /api/transcribe + audio_attachment_processor
    - _external_stt_transcribe: multipart POST to STT endpoint
    - _handle_transcribe: one-shot transcription route
    - _transcribe_audio_attachments: voice-message processor
    - _parse_multipart_file: stdlib email-based multipart
      (Python 3.13 dropped cgi per PEP 594)
  tests/unit/test_transcribe.py    8 tests (register, processor, route, multipart parser)
  tests/evals/test_features.py     + 1 eval (audio processor signature contract)

Config (read at call time, never persisted):
  HERMES_WEBUI_STT_URL    external STT endpoint (OpenAI or WhisperX shape)
  HERMES_WEBUI_STT_KEY    optional bearer token

CONNECTION-MAP regenerated: 9 public-API · 0 forced-internal · 1 frontend.
20/20 tests PASS.

Loader API extended in hermes-webui (next commit there). 7th method:
register_audio_attachment_processor. streaming.py gets a small loop that calls
registered processors before _build_native_multimodal_message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
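
The hook described in this message can be pictured with a minimal sketch. The loader and the streaming.py loop live in hermes-webui and are not part of this commit, so everything below is an assumption: `PluginAPISketch`, its `_audio_processors` list, and `run_audio_processors` are hypothetical names. The sketch relies only on what the message states, namely that a processor takes the attachments list and returns a string, and that each registered processor is called before the message is built.

```python
from typing import Callable, List

# Hypothetical stand-in for the loader's public API. These names are
# illustrative and are not taken from hermes-webui.
AudioProcessor = Callable[[list], str]


class PluginAPISketch:
    def __init__(self) -> None:
        self._audio_processors: List[AudioProcessor] = []

    def register_audio_attachment_processor(self, fn: AudioProcessor) -> None:
        """Remember a callable that maps attachments -> transcript text."""
        self._audio_processors.append(fn)


def run_audio_processors(api: PluginAPISketch, attachments: list) -> str:
    """Assumed shape of the small loop in streaming.py: call every
    registered processor and join the non-empty results."""
    blocks = []
    for fn in api._audio_processors:
        try:
            text = fn(attachments)
        except Exception:
            # Assumption: a failing plugin should not break message delivery.
            text = ""
        if text:
            blocks.append(text)
    return "\n\n".join(blocks)


api = PluginAPISketch()
api.register_audio_attachment_processor(
    lambda atts: "[Voice message transcript]\nhello" if atts else "")
prefix = run_audio_processors(api, [{"path": "/tmp/voice-message-1.webm"}])
```

Keeping the processor contract at "attachments in, string out" is what lets the plugin own all STT logic while the fork only needs the loop.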
2026-05-23 10:14:29 -04:00
commit 37123f570b
parent cbf53a0d55
6 changed files with 314 additions and 36 deletions
--- a/CONNECTION-MAP.md
+++ b/CONNECTION-MAP.md
@@ -2,7 +2,7 @@
 **Upstream version:** v0.51.117  
 **Plugin version:** 0.1.0  
-**Total dependencies:** 7 (6 public API · 0 forced internal · 1 frontend)
+**Total dependencies:** 10 (9 public API · 0 forced internal · 1 frontend)
 > **Auto-generated by `scripts/ast-connection-map.py`. Do not hand-edit.**
 > To change a justification, edit the `# CONNECTION:` comment above the
@@ -18,6 +18,9 @@
 | `plugin.py:34` | `api.register_static` | `api.register_static(STATIC_PREFIX, str(STATIC_DIR))` |
 | `plugin.py:35` | `api.inject_stylesheet` | `api.inject_stylesheet(f"/plugins/{STATIC_PREFIX}/app.css")` |
 | `plugin.py:36` | `api.inject_script` | `api.inject_script(f"/plugins/{STATIC_PREFIX}/app.js")` |
+| `routes/transcribe.py:37` | `api.logger` | `log = api.logger("svrnty.routes.transcribe")` |
+| `routes/transcribe.py:38` | `api.register_route` | `api.register_route("/api/transcribe", "POST", _handle_transcribe)` |
+| `routes/transcribe.py:39` | `api.register_audio_attachment_processor` | `api.register_audio_attachment_processor(_transcribe_audio_attachments)` |
 | `routes/vault_status.py:19` | `api.logger` | `log = api.logger("svrnty.routes.vault_status")` |
 | `routes/vault_status.py:20` | `api.register_route` | `api.register_route("/api/vault/status", "GET", _handle_vault_status)` |
--- a/plugin.py
+++ b/plugin.py
@@ -59,6 +59,6 @@ def _phase2_routes():
    ImportError is logged + swallowed so the plugin loads cleanly.
    """
    return [
-        # "transcribe",      # P2.A — STT (deferred — needs streaming.py integration refactor)
+        "transcribe",        # P2.A — STT + voice-message audio processor ✓
        "vault_status",      # P2.B — vault connections status ✓
    ]
--- a/routes/transcribe.py
+++ b/routes/transcribe.py
@@ -1,37 +1,187 @@
-"""GET /api/transcribe — STT route — DEFERRED MIGRATION (P2.A).
+"""POST /api/transcribe + voice-message audio processor.
-The STT feature in the original fork commit 014b9eef touches THREE upstream
+Migrated from hermes-webui fork commit 014b9eef (now reverted) per Phase 2.1
-modules:
+of the SVRNTY-HERMES Plugin Protocol. Uses the loader's new public API method
+`api.register_audio_attachment_processor` so streaming.py can pull transcripts
+of voice-message attachments into the agent-visible text WITHOUT any further
+fork patch.
-  1. api/upload.py        — handle_transcribe() + _external_stt_transcribe()
+Configuration (read at call time, never persisted):
-  2. api/streaming.py     — _transcribe_audio_attachments() injects transcripts
+  HERMES_WEBUI_STT_URL  external STT endpoint (OpenAI-shape or WhisperX)
-                            into the agent-visible message during streaming
+  HERMES_WEBUI_STT_KEY  optional bearer token
-  3. static/boot.js       — mic button + MediaRecorder fallback (iOS WKWebView)
-Migration #1 is straightforward (route + helper move cleanly). Migrations #2
+Endpoints + processors:
-and #3 cross-cut the streaming engine and the bootstrap JS — refactoring them
+  POST /api/transcribe         direct one-shot transcription
-to live in the plugin requires either:
+  audio_attachment_processor   called by streaming.py before agent receives msg
-  (a) New public-API hooks: api.streaming_hook(name, callback) so the plugin
+Public API surface used: register_route, register_audio_attachment_processor, logger.
-      can register an attachment processor that runs inside the streaming
+No forced internal dependencies.
-      pipeline.  Adds ~50 LOC to the loader + amends Protocol PRD §5.1.
-  (b) Accept STT as a forced-internal dependency.  Adds CONNECTION-MAP entries
-      under forced_internal/ with the streaming.py + boot.js touch points and
-      their rebase-risk notes.
-Phase 2.1 decides between (a) and (b).  Until that's resolved, the STT route
-stays in the fork (commit 014b9eef remains).  This stub exists so the migration
-plan is co-located with the code and tooling can flag the gap.
-Test status: vault_status migration proves the loader works.  STT is a deeper
-integration test for the loader's expressiveness.
 """
 import email
 import email.parser
 import email.policy
 import io
 import json
 import mimetypes
 import os
 import re
 import tempfile
 import urllib.request
 import uuid
-# Intentionally NOT registered yet.  The plugin loader's _phase2_routes() does
+_VOICE_MSG_AUDIO_EXTS = ('.m4a', '.aac', '.oga', '.opus', '.wav', '.mp3', '.flac', '.ogg', '.webm')
-# not include "transcribe" — see plugin.py.
+
-#
+
-# When Phase 2.1 lands, this file will host either:
+def register(api):
-#   - A new route handler using a streaming_hook to register the attachment
+    """Wire route + audio processor."""
-#     processor (option a), or
+    log = api.logger("svrnty.routes.transcribe")
-#   - The route handler + CONNECTION-MAP forced-internal entries for the
+    api.register_route("/api/transcribe", "POST", _handle_transcribe)
-#     remaining touch points (option b).
+    api.register_audio_attachment_processor(_transcribe_audio_attachments)
    log.info("transcribe endpoint + audio processor registered")
 def _external_stt_transcribe(audio_path: str, url: str, api_key: str) -> str:
    """POST audio to an external STT endpoint (multipart `file`).
    Handles OpenAI-shaped servers (top-level `text`) and WhisperX-style servers
    (`segments[].text`). Stdlib only.
    """
    boundary = '----webui' + uuid.uuid4().hex
    fname = os.path.basename(audio_path) or 'audio.webm'
    ctype = mimetypes.guess_type(fname)[0] or 'application/octet-stream'
    with open(audio_path, 'rb') as f:
        audio = f.read()
    body = b''.join([
        ('--' + boundary + '\r\n'
         'Content-Disposition: form-data; name="file"; filename="' + fname + '"\r\n'
         'Content-Type: ' + ctype + '\r\n\r\n').encode(),
        audio,
        ('\r\n--' + boundary + '\r\n'
         'Content-Disposition: form-data; name="model"\r\n\r\nwhisper-1').encode(),
        ('\r\n--' + boundary + '--\r\n').encode(),
    ])
    headers = {'Content-Type': 'multipart/form-data; boundary=' + boundary}
    if api_key:
        headers['Authorization'] = 'Bearer ' + api_key
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req, timeout=300) as resp:
        data = json.loads(resp.read())
    text = str(data.get('text') or '').strip()
    if not text:
        segs = data.get('segments') or []
        text = ' '.join(str(s.get('text', '')).strip() for s in segs).strip()
    return text
 def _transcribe_audio_attachments(attachments) -> str:
    """Audio-attachment processor — registered via the loader.
    Scans attachments for voice-message audio files; transcribes each via the
    configured STT endpoint; returns a single text block to prepend to the
    agent-visible message. Empty string when no audio / STT not configured.
    """
    stt_url = os.environ.get('HERMES_WEBUI_STT_URL', '').strip()
    if not stt_url or not attachments:
        return ''
    stt_key = os.environ.get('HERMES_WEBUI_STT_KEY', '').strip()
    parts = []
    for att in attachments or []:
        if not isinstance(att, dict):
            continue
        path = str(att.get('path') or '')
        mime = str(att.get('mime') or '').lower()
        name = str(att.get('name') or '') or path
        is_audio = (
            os.path.basename(name).startswith('voice-message')
            or mime.startswith('audio/')
            or os.path.splitext(name)[1].lower() in _VOICE_MSG_AUDIO_EXTS
        )
        if not is_audio or not path:
            continue
        try:
            text = _external_stt_transcribe(path, stt_url, stt_key)
        except Exception:
            print(f'[svrnty] voice-message transcription failed for {name}', flush=True)
            text = ''
        if text:
            parts.append(text)
    return '[Voice message transcript]\n' + '\n\n'.join(parts) if parts else ''
 def _handle_transcribe(handler, parsed):
    """POST /api/transcribe — direct one-shot transcription.
    Reads a multipart form with field `file` (the recorded audio blob), writes
    it to a temp file, sends it to the configured STT endpoint, returns
    `{"ok": true, "transcript": "..."}`.
    """
    stt_url = os.environ.get('HERMES_WEBUI_STT_URL', '').strip()
    if not stt_url:
        return _send_json(handler, {'ok': False, 'error': 'HERMES_WEBUI_STT_URL not configured'}, 503)
    ctype = handler.headers.get('Content-Type', '')
    if 'multipart' not in ctype.lower():
        return _send_json(handler, {'ok': False, 'error': 'multipart/form-data required'}, 400)
    length = int(handler.headers.get('Content-Length', '0') or 0)
    if not length:
        return _send_json(handler, {'ok': False, 'error': 'empty body'}, 400)
    body = handler.rfile.read(length)
    file_bytes, fname = _parse_multipart_file(body, ctype, field_name='file')
    if file_bytes is None:
        return _send_json(handler, {'ok': False, 'error': "missing 'file' field"}, 400)
    fname = fname or 'audio.webm'
    suffix = os.path.splitext(fname)[1] or '.webm'
    temp_path = None
    try:
        with tempfile.NamedTemporaryFile(prefix='svrnty-stt-', suffix=suffix, delete=False) as tmp:
            temp_path = tmp.name
            tmp.write(file_bytes)
        transcript = _external_stt_transcribe(
            temp_path, stt_url, os.environ.get('HERMES_WEBUI_STT_KEY', '').strip())
        return _send_json(handler, {'ok': True, 'transcript': transcript}, 200)
    except Exception as e:
        return _send_json(handler, {'ok': False, 'error': str(e)}, 500)
    finally:
        if temp_path and os.path.exists(temp_path):
            try:
                os.remove(temp_path)
            except OSError:
                pass
 def _parse_multipart_file(body: bytes, content_type: str, field_name: str = 'file'):
    """Parse a multipart body and return (file_bytes, filename) for the named field.
    Stdlib only. The cgi module (and cgi.FieldStorage) was removed in Python 3.13
    (PEP 594), so we parse via the email module, one of the documented replacements.
    Returns (None, None) when the named field is absent.
    """
    # Construct a fake email message so email.parser handles the multipart split.
    full = b'Content-Type: ' + content_type.encode() + b'\r\n\r\n' + body
    parser = email.parser.BytesParser(policy=email.policy.default)
    msg = parser.parsebytes(full)
    if not msg.is_multipart():
        return None, None
    for part in msg.iter_parts():
        disp = part.get('Content-Disposition', '')
        m = re.search(r'name="([^"]+)"', disp)
        if not m or m.group(1) != field_name:
            continue
        fn_m = re.search(r'filename="([^"]+)"', disp)
        filename = fn_m.group(1) if fn_m else None
        payload = part.get_payload(decode=True)
        return payload, filename
    return None, None
 def _send_json(handler, payload: dict, status: int) -> bool:
    body = json.dumps(payload).encode('utf-8')
    handler.send_response(status)
    handler.send_header('Content-Type', 'application/json; charset=utf-8')
    handler.send_header('Content-Length', str(len(body)))
    handler.send_header('Cache-Control', 'no-store')
    handler.end_headers()
    handler.wfile.write(body)
    return True
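
`_external_stt_transcribe` accepts two response shapes: a top-level `text` field, or a list of `segments` that each carry `text`. A minimal sketch of that fallback, pulled out as a standalone function so it can be exercised without a network call, might look like the following. The name `extract_transcript` and the sample payloads are hypothetical and not part of the plugin; the payload shapes are assumed from the docstring.

```python
import json


def extract_transcript(raw: bytes) -> str:
    """Return top-level `text` if present, else join `segments[].text`."""
    data = json.loads(raw)
    text = str(data.get("text") or "").strip()
    if not text:
        segs = data.get("segments") or []
        text = " ".join(str(s.get("text", "")).strip() for s in segs).strip()
    return text


# Illustrative payloads only; real servers return additional fields.
openai_style = json.dumps({"text": " hello world "}).encode()
whisperx_style = json.dumps(
    {"segments": [{"text": " hello"}, {"text": "world "}]}).encode()
```

Both sample payloads normalize to the same transcript, and an empty JSON object yields an empty string rather than an error.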
--- a/scripts/ast-connection-map.py
+++ b/scripts/ast-connection-map.py
@@ -34,6 +34,7 @@ MAP_PATH = REPO / "CONNECTION-MAP.md"
 PUBLIC_API = {
    "register_route", "register_static", "inject_script",
    "inject_stylesheet", "config_get", "logger",
+    "register_audio_attachment_processor",
 }
--- a/tests/evals/test_features.py
+++ b/tests/evals/test_features.py
@@ -10,18 +10,18 @@ ROOT = Path(__file__).resolve().parents[2]
 def test_eval_loader_contract_unchanged():
-    """The 6-method public API is the protocol contract — adding methods needs a PRD bump."""
+    """The 7-method public API is the protocol contract — adding methods needs a PRD bump."""
    import sys
    sys.path.insert(0, str(ROOT.parent / "hermes-webui"))
    try:
        from api.svrnty_plugin_loader import _PluginAPI
    except ImportError:
        # If hermes-webui not next to the plugin, skip — integration env.
        import pytest
        pytest.skip("hermes-webui fork not adjacent; loader contract eval skipped")
    api = _PluginAPI()
    required = {"register_route", "register_static", "inject_script",
-                "inject_stylesheet", "config_get", "logger"}
+                "inject_stylesheet", "config_get", "logger",
+                "register_audio_attachment_processor"}
    actual = {m for m in dir(api) if not m.startswith("_")}
    assert required == actual, (
        f"public API drift: expected {required}, got {actual}. "
@@ -29,6 +29,13 @@ def test_eval_loader_contract_unchanged():
    )
+def test_eval_audio_processor_signature_unchanged():
+    """The audio_attachment_processor takes attachments → str. Loader hook + plugin agree."""
+    from routes import transcribe
+    out = transcribe._transcribe_audio_attachments([])
+    assert isinstance(out, str), f"audio processor must return str, got {type(out).__name__}"
 def test_eval_vault_status_payload_shape():
    """Vault status returns {'secrets': [{'name': ...}, ...]} — schema lock."""
    import json
--- a/tests/unit/test_transcribe.py
+++ b/tests/unit/test_transcribe.py
@@ -0,0 +1,117 @@
 """Unit tests for routes/transcribe.py (P3.B + L6).
 Cover the route handler shape + the audio_attachment_processor contract.
 Network calls to the external STT endpoint are mocked.
 """
 import json
 import os
 from unittest.mock import MagicMock, patch
 from routes import transcribe
 class _FakeHandler:
    def __init__(self, body=b"", headers=None):
        self.status = None
        self.headers = headers or {}
        self.body_out = b""
        self.rfile = MagicMock()
        self.rfile.read.return_value = body
    def send_response(self, code):
        self.status = code
    def send_header(self, k, v):
        pass
    def end_headers(self):
        pass
    @property
    def wfile(self):
        h = self
        class _W:
            def write(self_, b): h.body_out += b
        return _W()
 def test_register_wires_route_and_processor():
    api = MagicMock()
    api.logger.return_value = MagicMock()
    transcribe.register(api)
    api.register_route.assert_called_once_with(
        "/api/transcribe", "POST", transcribe._handle_transcribe)
    api.register_audio_attachment_processor.assert_called_once_with(
        transcribe._transcribe_audio_attachments)
 def test_processor_returns_empty_when_stt_url_unset():
    with patch.dict(os.environ, {"HERMES_WEBUI_STT_URL": ""}, clear=False):
        assert transcribe._transcribe_audio_attachments(
            [{"path": "/tmp/foo.webm", "mime": "audio/webm"}]) == ""
 def test_processor_returns_empty_when_no_audio_attachments():
    with patch.dict(os.environ, {"HERMES_WEBUI_STT_URL": "http://stt:8000/transcribe"}):
        assert transcribe._transcribe_audio_attachments([]) == ""
        assert transcribe._transcribe_audio_attachments(
            [{"path": "/tmp/doc.pdf", "mime": "application/pdf"}]) == ""
 def test_processor_transcribes_audio_attachments():
    """End-to-end: audio attachment → STT call → transcript block."""
    attachments = [{
        "path": "/tmp/voice-message-123.webm",
        "mime": "audio/webm",
        "name": "voice-message-123.webm",
    }]
    with patch.dict(os.environ, {"HERMES_WEBUI_STT_URL": "http://stt:8000/v1/audio/transcriptions"}):
        with patch.object(transcribe, "_external_stt_transcribe",
                          return_value="hello world"):
            out = transcribe._transcribe_audio_attachments(attachments)
    assert out.startswith("[Voice message transcript]")
    assert "hello world" in out
 def test_processor_detects_audio_by_filename_prefix():
    """voice-message-* prefix triggers transcription even with non-audio mime."""
    attachments = [{
        "path": "/tmp/voice-message-abc.mp4",
        "mime": "video/mp4",  # browser may upload as video/* per upload handler
        "name": "voice-message-abc.mp4",
    }]
    with patch.dict(os.environ, {"HERMES_WEBUI_STT_URL": "http://stt:8000/v1"}):
        with patch.object(transcribe, "_external_stt_transcribe",
                          return_value="hi"):
            assert "hi" in transcribe._transcribe_audio_attachments(attachments)
 def test_handle_transcribe_503_when_stt_url_missing():
    with patch.dict(os.environ, {"HERMES_WEBUI_STT_URL": ""}, clear=False):
        h = _FakeHandler()
        transcribe._handle_transcribe(h, None)
    assert h.status == 503
 def test_handle_transcribe_400_on_non_multipart():
    with patch.dict(os.environ, {"HERMES_WEBUI_STT_URL": "http://stt:8000/v1"}):
        h = _FakeHandler(headers={"Content-Type": "application/json", "Content-Length": "10"})
        transcribe._handle_transcribe(h, None)
    assert h.status == 400
 def test_multipart_parser_extracts_file_field():
    """_parse_multipart_file pulls the named field's bytes + filename."""
    boundary = "----boundary"
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="hello.wav"\r\n'
        f"Content-Type: audio/wav\r\n\r\n"
        f"FAKEAUDIO\r\n"
        f"--{boundary}--\r\n"
    ).encode()
    data, fname = transcribe._parse_multipart_file(
        body, f"multipart/form-data; boundary={boundary}", "file")
    assert data == b"FAKEAUDIO"
    assert fname == "hello.wav"
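
For manual testing of the new route, a client has to send the audio as a multipart form with a `file` field, which is what `_handle_transcribe` reads. The sketch below builds such a request with the standard library but does not send it. The helper name `build_transcribe_request` and the host and port are placeholders chosen for illustration, not values from this repository.

```python
import urllib.request
import uuid


def build_transcribe_request(base_url: str, audio: bytes,
                             filename: str = "voice-message.webm"):
    """Build (but do not send) a multipart POST for /api/transcribe."""
    boundary = "----client" + uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/transcribe",
        data=head + audio + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# The host and port are placeholders; substitute the real WebUI address.
req = build_transcribe_request("http://localhost:8787", b"FAKEAUDIO")
# Sending would be: urllib.request.urlopen(req, timeout=300)
```

On success the route is documented to answer with `{"ok": true, "transcript": "..."}`, and with a 503 when `HERMES_WEBUI_STT_URL` is unset.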