audio-inbox

Turn a raw voice memo into a clean, concise inbox note with provenance preserved. Orchestrates transcribe-memo for the speech-to-text step, then cleans the result into a structured note.

TEMPORARILY SUSPENDED (2026-04-22)

Audio transcription via transcribe-memo requires ffmpeg and network access to OpenRouter. Claude Code cloud runners currently lack Git LFS support, so audio files in 0_Inbox/Memos/ may be LFS pointers rather than real audio data. Skip all files in 0_Inbox/Memos/ until this is resolved. A workaround (e.g. providing transcripts directly, local transcription, or LFS support) is pending.
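
As a minimal sketch of how a runner could tell an LFS pointer from real audio: a Git LFS pointer is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`. The helper name `is_lfs_pointer` is illustrative and not part of this skill or of transcribe-memo.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the LFS pointer format.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file starts with the Git LFS pointer header."""
    with path.open("rb") as fh:
        head = fh.read(len(LFS_HEADER))
    return head == LFS_HEADER
```

A file for which this returns True contains no audio data and should be skipped rather than passed to transcribe-memo.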

When this blocker is lifted, remove this section and the skill works as documented below.


When to use

  • A new audio file appears in 0_Inbox/Memos/ (any format: M4A, AIFF, MP3, WAV, etc.).
  • The user asks to “clean up this voice memo” or “process this memo”.
  • The user runs /audio-inbox explicitly.
  • Invoked by process-inbox in its Phase 0.

Workflow

Phase 0 — Transcribe

  1. For each audio file in 0_Inbox/Memos/ (any format — M4A, AIFF, MP3, WAV, etc.), invoke the transcribe-memo skill.
    • transcribe-memo converts to WAV, sends to OpenRouter (google/gemini-2.5-flash), saves the raw transcript to 6_Private/Memo/transcripts/<filename>.md, and deletes the audio file.
    • If transcribe-memo fails (no API key, API error), it appends a notification to 7_Agent/notifications.md and preserves the audio file. Stop processing that file — move to the next one, or stop if there are no more.

Phase 1 — Clean and structure

For each successfully transcribed file (i.e., a transcript now exists at 6_Private/Memo/transcripts/<filename>.md):

  1. Read the raw transcript from 6_Private/Memo/transcripts/<filename>.md.
  2. Detect language: de (German), en (English), ja (Japanese), or mixed. If mixed, use the dominant language for the concise version; keep the original in the callout.
  3. Produce three outputs (all in the source language):
    • Cleaned transcript — remove filler words, fix spelling / transcription mistakes (e.g. wrong words). Preserve sentence structure and voice.
    • Concise version — the note as if written by hand instead of rambled into a voice memo. First-person throughout — never third-person (“The user discussed…”). No “Summary:” or “Overview:” prefixes.
    • Filename — content-bearing, no extension. No type words (“summary”, “note”, “transcript”, “memo”) — they’re unnecessary.

Example (given a raw transcript):

Today I went to the park and I was like, um, thinking about how smartphones are like, bad for my brain; they really seem ADHD inducing! Well, from my perspective I should stop using mine so much I guess?

Cleaned transcript:

Today I went to the park and thought about how smartphones are bad for my brain; they really seem ADHD inducing! Well, from my perspective I should stop using mine so much I guess?

Concise version:

I went to the park and thought about how smartphones are bad for the brain and ADHD inducing. I should probably stop using mine as much.

Filename:

Park walk - Smartphones and ADHD

  4. Write the note to 0_Inbox/<filename>.md (root of inbox, not in Memos/). Structure:

    ---
    created: YYYY-MM-DD
    lang: en
    tags: [voice-memo]
    source: voice-memo
    transcript: 6_Private/Memo/transcripts/<original-filename>.md
    ---
    
    <concise version as the body>
    
    > [!note]- Full transcript
    > <cleaned transcript, preserving paragraph breaks>
  5. The audio file is already deleted by transcribe-memo. The raw transcript persists in 6_Private/Memo/transcripts/ as the permanent record.
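
The note structure above can be sketched as a small rendering function. This is an illustration under assumptions, not the skill's actual implementation: the name `render_note` and its parameters are hypothetical, and blank lines in the cleaned transcript are written as a bare `>` so that paragraph breaks stay inside the callout.

```python
from datetime import date


def render_note(concise: str, cleaned: str, lang: str,
                transcript_path: str, created: date) -> str:
    """Build the inbox note: frontmatter, concise body, folded transcript callout."""
    frontmatter = [
        "---",
        f"created: {created.isoformat()}",
        f"lang: {lang}",
        "tags: [voice-memo]",
        "source: voice-memo",
        f"transcript: {transcript_path}",
        "---",
    ]
    # Quote every transcript line; a bare ">" keeps blank lines inside the callout.
    callout = [f"> {line}" if line else ">" for line in cleaned.splitlines()]
    parts = [
        "\n".join(frontmatter),
        "",
        concise.strip(),
        "",
        "> [!note]- Full transcript",
        *callout,
    ]
    return "\n".join(parts) + "\n"
```

The caller would write the returned string to 0_Inbox/<filename>.md and pass the path of the raw transcript as `transcript_path`.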

Rules

  • Do all three outputs in the source language. If the transcript is German, the concise version is German, the filename is German.
  • First person throughout. “I went to the park” — not “The user went to the park” or “The speaker discussed”.
  • No “summary”, “overview”, “notes on”, or “memo” in the filename. The content itself names the note.
  • Do not route. Write to 0_Inbox/ root. process-inbox handles PARA placement.
  • Do not touch other pages. This skill is narrow. Cross-references happen in process-inbox.
  • Preserve furigana syntax if Japanese content already has it (see obsidian-markdown).
  • The transcript: frontmatter field links back to the raw transcript in 6_Private/. This is the provenance chain: inbox note → raw transcript (in 6_Private/) → original audio (deleted, referenced in transcript frontmatter).
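
The filename rules lend themselves to a quick mechanical check. The sketch below is a heuristic under assumptions: the function name `filename_violations` is hypothetical, the word list is taken from the type words named in this document, and it only covers English type words (a German or Japanese equivalent would pass unnoticed).

```python
import re

# Type words named in this document; longer phrases first is not required
# because each phrase is matched on word boundaries.
TYPE_WORDS = ("summary", "overview", "notes on", "note", "transcript", "memo")


def filename_violations(name: str) -> list[str]:
    """Return the rule violations found in a proposed note filename."""
    problems = []
    if name.lower().endswith(".md"):
        problems.append("has an extension")
    for phrase in TYPE_WORDS:
        if re.search(rf"\b{re.escape(phrase)}\b", name, flags=re.IGNORECASE):
            problems.append(f"contains type word: {phrase}")
    return problems
```

An empty list means the name passes these checks; whether the name is actually content-bearing still needs a human or model judgment.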

Blocked cases

  • Transcription failed: transcribe-memo handles notification. Skip the file.
  • Transcript is empty or unintelligible: leave the raw transcript in 6_Private/Memo/transcripts/ and create 0_Inbox/_unprocessable-<timestamp>.md with a one-line note so the user can decide what to do.
  • Transcript references a specific page that you can’t identify: still produce the note; cross-referencing is process-inbox’s job.
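
For the empty-or-unintelligible case, the marker file could be written as in the sketch below. The function name `mark_unprocessable`, the timestamp format, and the wording of the one-line note are all assumptions; this document does not specify them.

```python
from datetime import datetime
from pathlib import Path


def mark_unprocessable(inbox: Path, transcript_path: str, now: datetime) -> Path:
    """Create 0_Inbox/_unprocessable-<timestamp>.md pointing at the raw transcript."""
    stamp = now.strftime("%Y%m%d-%H%M%S")  # assumed timestamp format
    marker = inbox / f"_unprocessable-{stamp}.md"
    marker.write_text(
        f"Transcript empty or unintelligible: {transcript_path}\n",
        encoding="utf-8",
    )
    return marker
```

The raw transcript itself is left untouched, so the user can still inspect it before deciding.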

Verification

After running:

  • 0_Inbox/<new-filename>.md exists with schema-conformant frontmatter and the callout.
  • 0_Inbox/Memos/ has no audio files that were successfully transcribed (they’re deleted by transcribe-memo).
  • 6_Private/Memo/transcripts/<filename>.md exists for each processed memo.
  • The body reads as written-by-hand prose, not a rambled monologue.
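
The mechanical parts of this checklist can be sketched as a script. The function name `verify_note` and its parameters are hypothetical, and "schema-conformant" is approximated here as "all five frontmatter keys from the template are present". The audio-file check is omitted because the mapping from audio filename to transcript filename is not spelled out here, and the prose-quality check cannot be automated.

```python
from pathlib import Path

# Frontmatter keys from the note template in this document.
REQUIRED_KEYS = ("created:", "lang:", "tags:", "source:", "transcript:")


def verify_note(vault: Path, note_name: str, transcript_name: str) -> list[str]:
    """Return a list of failed checks; an empty list means all checks passed."""
    failures = []
    note = vault / "0_Inbox" / f"{note_name}.md"
    transcript = vault / "6_Private" / "Memo" / "transcripts" / f"{transcript_name}.md"

    if not note.is_file():
        failures.append(f"missing note: {note}")
    else:
        text = note.read_text(encoding="utf-8")
        # Frontmatter is the text between the first two "---" lines.
        parts = text.split("---\n", 2)
        has_block = text.startswith("---\n") and len(parts) == 3
        lines = parts[1].splitlines() if has_block else []
        for key in REQUIRED_KEYS:
            if not any(line.startswith(key) for line in lines):
                failures.append(f"frontmatter lacks {key}")
        if "> [!note]- Full transcript" not in text:
            failures.append("missing transcript callout")

    if not transcript.is_file():
        failures.append(f"missing transcript: {transcript}")
    return failures
```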