Audio Scribe AI - KB-whisper + Pyannote

A web application for audio transcription and speaker diarization using KB-whisper (a Swedish-optimized variant of OpenAI Whisper) and Pyannote.audio. Upload audio files or record directly in the browser to get timestamped transcriptions with speaker identification.


πŸ“Έ Application Screenshots

See Audio Scribe AI in action with these screenshots from the actual application interface:

Main Interface & Recording Controls

The main application interface featuring browser recording controls, drag-and-drop file upload, and real-time processing feedback.

Audio Scribe AI - Main Interface

Transcription Results & Speaker Diarization

The results interface showing transcribed text with speaker identification, editable speaker names, and the interactive karaoke-style playback feature.

Audio Scribe AI - Transcription Results

🎯 Key Interface Features Shown

πŸŽ™οΈ Recording Controls

One-click browser recording with real-time duration display and audio format information.

πŸ“ Drag & Drop Upload

Intuitive file upload area supporting multiple audio formats with visual feedback.

πŸ“Š Processing Visualization

Real-time chunk processing progress with visual indicators and current text preview.

πŸ‘₯ Speaker Diarization

Color-coded speaker segments with editable names and precise timestamps.

🎡 Karaoke Player

Interactive playback with synchronized text highlighting and audio controls.

πŸ“€ Export Options

Download complete transcription results as structured JSON with metadata.

Project Overview

Audio Scribe AI is a comprehensive web application that combines the power of KB-whisper (a Swedish-optimized variant of OpenAI Whisper) with Pyannote.audio for advanced speaker diarization. The application runs entirely locally, ensuring privacy and offline processing capabilities. It features a modern, responsive interface with real-time processing feedback and advanced visualization tools.


Key Features

  • πŸŽ™οΈ Browser Recording: Record audio directly in your browser with real-time duration display
  • πŸ“ File Upload: Support for multiple audio formats (MP3, WAV, M4A, FLAC, OGG, WebM) with drag-and-drop
  • πŸ€– AI Transcription: Powered by KB-whisper (Swedish-optimized) and OpenAI Whisper models
  • 🏠 Local Processing: Complete privacy with offline processing capabilities
  • πŸ‘₯ Speaker Diarization: Automatic speaker identification using Pyannote.audio
  • ✏️ Editable Speaker Names: Rename speakers from "Speaker 1, 2, 3..." to actual names
  • ⏱️ Timestamped Results: Precise start/end times for each speaker segment
  • 🎡 Karaoke Player: Interactive playback with synchronized text highlighting
  • πŸ“€ Export Options: Download results as JSON with full metadata
  • 🎨 Modern UI: Clean, responsive interface with dark theme and visual feedback
  • ⚑ GPU Acceleration: Automatic CUDA support for faster processing
  • πŸ“Š Processing Visualization: Real-time chunk processing progress with visual indicators
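
The "editable speaker names" feature amounts to mapping raw diarization labels (`SPEAKER_00`, `SPEAKER_01`, …) onto user-chosen names in every segment. A minimal sketch of that idea, assuming the segment format shown in the example output below; `rename_speakers` is a hypothetical helper, not the app's actual code:

```python
def rename_speakers(segments, name_map):
    """Return copies of segments with raw diarization labels (e.g. "SPEAKER_00")
    replaced by user-chosen names; unmapped labels pass through unchanged."""
    return [
        {**seg, "speaker": name_map.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]

segments = [
    {"start": 0.5, "end": 5.2, "speaker": "SPEAKER_00", "text": "VΓ€lkommen till dagens mΓΆte."},
    {"start": 6.0, "end": 10.8, "speaker": "SPEAKER_01", "text": "Tack."},
]
renamed = rename_speakers(segments, {"SPEAKER_00": "Anna", "SPEAKER_01": "Erik"})
```

Building new dicts instead of mutating in place keeps the original diarization result intact, so renames can be undone or re-applied freely.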

Technical Implementation

Audio Scribe AI is built with a modern FastAPI backend and vanilla JavaScript frontend. The system uses a unified service architecture that supports both local and cloud-based AI models, with automatic fallback to mock services for development.

FastAPI Backend Architecture

The backend uses a modular service-oriented architecture with separate services for audio processing, transcription, and diarization:

```python
# backend/app.py - Main FastAPI application
from fastapi import FastAPI, File, UploadFile, HTTPException

from services.unified_whisper_service import UnifiedWhisperService
from services.pyannote_service import PyannoteService
from services.audio_service import AudioService

app = FastAPI(title="Audio Scribe AI", version="1.0.0")

# Initialize services with automatic fallback
whisper_service = UnifiedWhisperService()
pyannote_service = PyannoteService()
audio_service = AudioService()

@app.post("/api/transcribe/{file_id}")
async def transcribe_audio(file_id: str):
    """Transcribe and diarize audio file with real-time progress"""
    # Implementation handles chunked processing with progress updates
```

Unified Whisper Service

The unified service supports both KB-whisper (Swedish-optimized) and standard OpenAI Whisper models with automatic model switching and local processing capabilities:

```python
# services/unified_whisper_service.py
class UnifiedWhisperService:
    def __init__(self):
        self.current_service = None
        self.local_service = None
        self.openai_service = None

    async def transcribe_with_progress(self, audio_path: str, progress_callback=None):
        """Transcribe with chunked processing and progress updates"""
        chunks = self.split_audio_into_chunks(audio_path)
        results = []
        for i, chunk in enumerate(chunks):
            result = await self.transcribe_chunk(chunk)
            results.append(result)
            if progress_callback:
                await progress_callback({
                    "chunk": i + 1,
                    "total_chunks": len(chunks),
                    "current_text": result.get("text", ""),
                    "progress": (i + 1) / len(chunks) * 100,
                })
        return self.combine_chunk_results(results)
```

Pyannote Speaker Diarization

Advanced speaker diarization using Pyannote.audio with configurable speaker limits and automatic speaker identification:

```python
# services/pyannote_service.py
import os

import torch
from pyannote.audio import Pipeline

class PyannoteService:
    def __init__(self):
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=os.getenv("HUGGINGFACE_TOKEN"),
        )

    def diarize_audio(self, audio_path: str, min_speakers: int = 1, max_speakers: int = 10):
        """Perform speaker diarization with configurable parameters"""
        diarization = self.pipeline(
            audio_path,
            min_speakers=min_speakers,
            max_speakers=max_speakers,
        )
        # Convert to structured format
        speakers = []
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            speakers.append({
                "start": float(turn.start),
                "end": float(turn.end),
                "speaker": speaker,
                "duration": float(turn.end - turn.start),
            })
        return speakers
```

Real-time Processing with WebSockets

The application features real-time progress updates using WebSocket connections for live processing feedback:

```javascript
// frontend/static/js/app.js - Real-time progress handling
class AudioProcessor {
    constructor() {
        this.ws = null;
        this.progressCallback = null;
    }

    async processAudio(fileId) {
        // Establish WebSocket connection for progress updates
        this.ws = new WebSocket(`ws://localhost:8000/ws/progress/${fileId}`);
        this.ws.onmessage = (event) => {
            const progress = JSON.parse(event.data);
            this.updateProgressVisualization(progress);
        };

        // Start processing
        const response = await fetch(`/api/transcribe/${fileId}`, { method: 'POST' });
        return await response.json();
    }

    updateProgressVisualization(progress) {
        // Update chunk visualization and progress bars
        const chunksContainer = document.getElementById('chunksVisualization');
        const progressFill = document.getElementById('overallProgressFill');
        progressFill.style.width = `${progress.progress}%`;
        this.highlightCurrentChunk(progress.chunk);
    }
}
```

Technology Stack

  • Python 3.8+
  • FastAPI
  • PyTorch
  • KB-whisper
  • OpenAI Whisper
  • Pyannote.audio
  • Transformers
  • NumPy
  • FFmpeg
  • Uvicorn
  • JavaScript ES6+
  • WebRTC
  • WebSockets
  • CUDA (optional)

API Endpoints

The application provides a comprehensive REST API for all transcription and diarization operations:

```
# Main Application
GET  /                                # Main application interface
GET  /static/*                        # Static files (CSS, JS, assets)

# Audio Processing
POST /api/upload                      # Upload audio file
POST /api/save-recording              # Save browser recording
POST /api/transcribe/{file_id}        # Transcribe and diarize audio
GET  /api/health                      # Health check endpoint

# Whisper Service Management
GET  /api/whisper/status              # Get detailed Whisper service status
POST /api/whisper/switch-to-local     # Switch to local Whisper service
POST /api/whisper/switch-to-openai    # Switch to OpenAI Whisper service
POST /api/whisper/download-model      # Download a local Whisper model

# WebSocket Endpoints
WS   /ws/progress/{file_id}           # Real-time processing progress updates
```

API Usage Examples

Here are practical examples of how to interact with the API:

```bash
# Upload an audio file
curl -X POST "http://localhost:8000/api/upload" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.mp3"

# Response:
# {
#   "success": true,
#   "file_id": "uuid-string",
#   "filename": "audio.mp3",
#   "size": 1048576,
#   "message": "File uploaded successfully"
# }

# Start transcription and diarization
curl -X POST "http://localhost:8000/api/transcribe/uuid-string"
```
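
The two-step flow above (upload, then transcribe using the returned `file_id`) can be chained from a script. A small sketch of the glue step, using only the upload response format shown above; `transcribe_url` is a hypothetical helper, not part of the project:

```python
def transcribe_url(base_url, upload_response):
    """Extract file_id from an /api/upload response and build the
    /api/transcribe URL to POST to next."""
    if not upload_response.get("success"):
        raise ValueError(upload_response.get("message", "upload failed"))
    return f"{base_url}/api/transcribe/{upload_response['file_id']}"

# Example using the sample response from the curl call above
resp = {"success": True, "file_id": "uuid-string", "filename": "audio.mp3",
        "size": 1048576, "message": "File uploaded successfully"}
url = transcribe_url("http://localhost:8000", resp)
```

Checking `success` before building the URL surfaces upload failures early, instead of POSTing a transcription request for a file that was never stored.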

Example Output

Here's an example of the comprehensive JSON output from the transcription and diarization process:

```json
{
  "success": true,
  "transcription": "VΓ€lkommen till dagens mΓΆte. Vi ska idag diskutera projektets framsteg. Tack. Jag har uppdaterat tidslinjen och vi ligger i fas med planen. UtmΓ€rkt. Kan du gΓ₯ igenom de tekniska utmaningarna vi stΓΆtt pΓ₯?",
  "segments": [
    {
      "start": 0.5,
      "end": 5.2,
      "speaker": "SPEAKER_00",
      "text": "VΓ€lkommen till dagens mΓΆte. Vi ska idag diskutera projektets framsteg."
    },
    {
      "start": 6.0,
      "end": 10.8,
      "speaker": "SPEAKER_01",
      "text": "Tack. Jag har uppdaterat tidslinjen och vi ligger i fas med planen."
    },
    {
      "start": 11.5,
      "end": 16.3,
      "speaker": "SPEAKER_00",
      "text": "UtmΓ€rkt. Kan du gΓ₯ igenom de tekniska utmaningarna vi stΓΆtt pΓ₯?"
    }
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01"],
  "duration": 16.3,
  "processing_time": 8.7,
  "model_info": {
    "whisper_model": "kb-whisper-large",
    "diarization_model": "pyannote/speaker-diarization-3.1"
  }
}
```
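
A result in this shape is easy to render as a readable, speaker-attributed transcript. A minimal sketch that formats the `segments` array above; `format_transcript` is an illustrative helper, not the app's export code:

```python
def format_transcript(result):
    """Render diarized segments as '[start-end] SPEAKER: text' lines."""
    return [
        f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['speaker']}: {seg['text']}"
        for seg in result["segments"]
    ]

# Example with two segments from the sample output above
result = {
    "segments": [
        {"start": 0.5, "end": 5.2, "speaker": "SPEAKER_00",
         "text": "VΓ€lkommen till dagens mΓΆte."},
        {"start": 6.0, "end": 10.8, "speaker": "SPEAKER_01",
         "text": "Tack."},
    ]
}
lines = format_transcript(result)
# lines[0] β†’ "[0.5-5.2] SPEAKER_00: VΓ€lkommen till dagens mΓΆte."
```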

Installation & Setup

Get Audio Scribe AI running on your local machine in just a few steps:

Prerequisites

  • Python 3.8 or higher
  • CUDA-capable GPU (optional, but recommended for faster processing)
  • FFmpeg installed on your system
  • 4GB+ RAM (8GB+ recommended for large models)
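
The checkable prerequisites can be verified from a short standard-library script before installing anything. A sketch only; `check_prerequisites` is a hypothetical helper, and it covers just the Python version and FFmpeg availability (GPU and RAM checks would need extra tooling):

```python
import shutil
import sys

def check_prerequisites():
    """Return a dict mapping each requirement to True/False."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        # FFmpeg must be reachable on PATH for audio conversion
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
    }

if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK ' if ok else 'MISSING'} {name}")
```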

Quick Start

```bash
# 1. Clone the repository
git clone https://github.com/your-repo/kb-whisper-pyannote-transcription-diarization
cd kb-whisper-pyannote-transcription-diarization

# 2. Install Python dependencies
pip install -r requirements.txt

# 3. Install FFmpeg (Windows)
# Download from https://ffmpeg.org/download.html and add to PATH

# 4. Set up environment (optional for some Pyannote models)
cp .env-example .env
# Edit .env and add your HUGGINGFACE_TOKEN if needed

# 5. Start the application
python run.py

# 6. Open your browser
# Navigate to http://localhost:8000
```

Configuration Options

```bash
# Environment variables for customization
export WHISPER_MODEL="base"                 # tiny, base, small, medium, large
export WHISPER_LANGUAGE="auto"              # auto, sv, en, etc.
export WHISPER_USE_LOCAL=true               # Use local models for privacy
export PYANNOTE_MODEL="pyannote/speaker-diarization-3.1"
export MIN_SPEAKERS="1"
export MAX_SPEAKERS="10"
export MAX_FILE_SIZE="104857600"            # 100MB limit
```
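
On the Python side, variables like these are typically read once with sensible defaults (the project keeps its configuration in `backend/utils/config.py`). A minimal sketch of that pattern, assuming the variable names and defaults shown above; `load_config` is illustrative, not the project's actual function:

```python
import os

def load_config(env=os.environ):
    """Read the environment variables above, falling back to defaults
    when a variable is unset."""
    return {
        "whisper_model": env.get("WHISPER_MODEL", "base"),
        "whisper_language": env.get("WHISPER_LANGUAGE", "auto"),
        "pyannote_model": env.get("PYANNOTE_MODEL", "pyannote/speaker-diarization-3.1"),
        "min_speakers": int(env.get("MIN_SPEAKERS", "1")),
        "max_speakers": int(env.get("MAX_SPEAKERS", "10")),
        "max_file_size": int(env.get("MAX_FILE_SIZE", str(100 * 1024 * 1024))),
    }
```

Passing `env` as a parameter keeps the function testable: hand it a plain dict instead of the real environment.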

Applications

Audio Scribe AI has numerous real-world applications across various industries:

  • πŸ“ Meeting Minutes: Automatic generation with speaker attribution and timestamps
  • πŸŽ™οΈ Interview Transcription: Perfect for journalists, researchers, and content creators
  • βš–οΈ Legal Documentation: Court proceedings and legal consultations with speaker identification
  • πŸ₯ Medical Records: Doctor-patient consultations with privacy-focused local processing
  • πŸ“Ί Media Indexing: Content search and accessibility for video/audio libraries
  • β™Ώ Accessibility Services: Real-time transcription for hearing-impaired individuals
  • πŸ“ž Call Analysis: Customer service quality assurance and training
  • πŸŽ“ Educational Content: Lecture transcription and student accessibility

Project Structure

The application follows a clean, modular architecture that separates concerns and enables easy maintenance and extension:

```
kb-whisper-pyannote-transcription-diarization/
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ app.py                         # FastAPI main application
β”‚   β”œβ”€β”€ services/
β”‚   β”‚   β”œβ”€β”€ audio_service.py           # Audio processing utilities
β”‚   β”‚   β”œβ”€β”€ whisper_service.py         # OpenAI Whisper integration
β”‚   β”‚   β”œβ”€β”€ local_whisper_service.py   # Local Whisper models
β”‚   β”‚   β”œβ”€β”€ unified_whisper_service.py # Unified service interface
β”‚   β”‚   β”œβ”€β”€ pyannote_service.py        # Pyannote diarization
β”‚   β”‚   β”œβ”€β”€ pyannote_service_simple.py # Simplified diarization
β”‚   β”‚   β”œβ”€β”€ simple_diarization.py      # Basic speaker separation
β”‚   β”‚   └── mock_services.py           # Development/testing mocks
β”‚   └── utils/
β”‚       └── config.py                  # Configuration management
β”œβ”€β”€ frontend/
β”‚   β”œβ”€β”€ index.html                     # Main application interface
β”‚   └── static/
β”‚       β”œβ”€β”€ css/style.css              # Application styling
β”‚       └── js/
β”‚           β”œβ”€β”€ app.js                 # Core application logic
β”‚           β”œβ”€β”€ karaoke-player.js      # Interactive playback
β”‚           └── recorder.js            # Audio recording
β”œβ”€β”€ screenshots/                       # Application screenshots
β”œβ”€β”€ requirements.txt                   # Python dependencies
β”œβ”€β”€ run.py                             # Application entry point
β”œβ”€β”€ setup.py                           # Package configuration
β”œβ”€β”€ LOCAL_WHISPER_SETUP.md             # Local setup guide
└── README.md                          # Project documentation
```

Performance & Optimization

Audio Scribe AI is optimized for both speed and accuracy:

  • πŸš€ GPU Acceleration: Automatic CUDA detection and utilization
  • ⚑ Chunked Processing: Large files processed in manageable segments
  • πŸ”„ Parallel Processing: Concurrent transcription and diarization
  • πŸ’Ύ Memory Management: Efficient handling of large audio files
  • πŸ“Š Model Selection: Choose optimal model size for your hardware
  • 🎯 Format Optimization: Automatic audio format conversion
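
The chunked-processing point above boils down to splitting a long recording into fixed-length time windows that are transcribed one at a time. A minimal sketch of the splitting step; the 30-second default is an assumption for illustration, not a documented setting of the app:

```python
def chunk_ranges(duration, chunk_seconds=30.0):
    """Split a total duration (seconds) into (start, end) ranges of at
    most chunk_seconds each, covering the whole file without gaps."""
    ranges = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        ranges.append((start, end))
        start = end
    return ranges

# A 75-second file yields two full chunks plus a 15-second remainder
print(chunk_ranges(75.0))  # β†’ [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

Bounding each chunk keeps peak memory roughly constant regardless of file length, which is what makes hour-long recordings feasible on modest hardware.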

Future Enhancements

Planned improvements and features in development:

  • πŸ”΄ Real-time Streaming: Live transcription with WebSocket streaming
  • 🎭 Emotion Analysis: Sentiment and emotion detection in speech
  • πŸ“ Smart Summarization: AI-powered content summarization
  • 🌍 Multi-language Support: Enhanced support for mixed-language content
  • 🎯 Custom Vocabularies: Domain-specific terminology recognition
  • πŸ”Š Noise Reduction: Advanced audio preprocessing
  • πŸ“± Mobile App: Native iOS and Android applications
  • ☁️ Cloud Integration: Optional cloud processing for heavy workloads