Application Screenshots
See Audio Scribe AI in action with these screenshots from the actual application interface:
Main Interface & Recording Controls
The main application interface featuring browser recording controls, drag-and-drop file upload, and real-time processing feedback.
Transcription Results & Speaker Diarization
The results interface showing transcribed text with speaker identification, editable speaker names, and the interactive karaoke-style playback feature.
Key Interface Features Shown
Recording Controls
One-click browser recording with real-time duration display and audio format information.
Drag & Drop Upload
Intuitive file upload area supporting multiple audio formats with visual feedback.
Processing Visualization
Real-time chunk processing progress with visual indicators and current text preview.
Speaker Diarization
Color-coded speaker segments with editable names and precise timestamps.
Karaoke Player
Interactive playback with synchronized text highlighting and audio controls.
Export Options
Download complete transcription results as structured JSON with metadata.
Project Overview
Audio Scribe AI is a comprehensive web application that combines the power of KB-whisper (a Swedish-optimized variant of OpenAI Whisper) with Pyannote.audio for advanced speaker diarization. The application runs entirely locally, ensuring privacy and offline processing capabilities. It features a modern, responsive interface with real-time processing feedback and advanced visualization tools.
Key Features
- Browser Recording: Record audio directly in your browser with real-time duration display
- File Upload: Support for multiple audio formats (MP3, WAV, M4A, FLAC, OGG, WebM) with drag-and-drop
- AI Transcription: Powered by KB-whisper (Swedish-optimized) and OpenAI Whisper models
- Local Processing: Complete privacy with offline processing capabilities
- Speaker Diarization: Automatic speaker identification using Pyannote.audio
- Editable Speaker Names: Rename speakers from "Speaker 1, 2, 3..." to actual names
- Timestamped Results: Precise start/end times for each speaker segment
- Karaoke Player: Interactive playback with synchronized text highlighting
- Export Options: Download results as JSON with full metadata
- Modern UI: Clean, responsive interface with dark theme and visual feedback
- GPU Acceleration: Automatic CUDA support for faster processing
- Processing Visualization: Real-time chunk processing progress with visual indicators
Technical Implementation
Audio Scribe AI is built with a modern FastAPI backend and vanilla JavaScript frontend. The system uses a unified service architecture that supports both local and cloud-based AI models, with automatic fallback to mock services for development.
FastAPI Backend Architecture
The backend uses a modular service-oriented architecture with separate services for audio processing, transcription, and diarization:
# backend/app.py - Main FastAPI application
from fastapi import FastAPI, File, UploadFile, HTTPException

from services.unified_whisper_service import UnifiedWhisperService
from services.pyannote_service import PyannoteService
from services.audio_service import AudioService

app = FastAPI(title="Audio Scribe AI", version="1.0.0")

# Initialize services with automatic fallback
whisper_service = UnifiedWhisperService()
pyannote_service = PyannoteService()
audio_service = AudioService()

@app.post("/api/transcribe/{file_id}")
async def transcribe_audio(file_id: str):
    """Transcribe and diarize an audio file with real-time progress."""
    # Implementation handles chunked processing with progress updates
Unified Whisper Service
The unified service supports both KB-whisper (Swedish-optimized) and standard OpenAI Whisper models with automatic model switching and local processing capabilities:
# services/unified_whisper_service.py
class UnifiedWhisperService:
    def __init__(self):
        self.current_service = None
        self.local_service = None
        self.openai_service = None

    async def transcribe_with_progress(self, audio_path: str, progress_callback=None):
        """Transcribe with chunked processing and progress updates."""
        chunks = self.split_audio_into_chunks(audio_path)
        results = []
        for i, chunk in enumerate(chunks):
            result = await self.transcribe_chunk(chunk)
            results.append(result)
            if progress_callback:
                await progress_callback({
                    "chunk": i + 1,
                    "total_chunks": len(chunks),
                    "current_text": result.get("text", ""),
                    "progress": (i + 1) / len(chunks) * 100,
                })
        return self.combine_chunk_results(results)
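The `split_audio_into_chunks` step is not shown above; the boundary math it implies can be sketched as fixed-length windows, with the last chunk absorbing the remainder so no audio is dropped (the 30-second default is an assumption, not the project's actual setting):

```python
def chunk_boundaries(total_seconds: float, chunk_seconds: float = 30.0):
    """Return (start, end) offsets in seconds covering the full duration.

    Fixed-length chunks; the final chunk is shorter when the duration
    is not an exact multiple of chunk_seconds.
    """
    boundaries = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        boundaries.append((start, end))
        start = end
    return boundaries
```

For a 65-second file, `chunk_boundaries(65.0)` yields two full 30-second windows and a 5-second tail.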
Pyannote Speaker Diarization
Advanced speaker diarization using Pyannote.audio with configurable speaker limits and automatic speaker identification:
# services/pyannote_service.py
import os

import torch
from pyannote.audio import Pipeline

class PyannoteService:
    def __init__(self):
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=os.getenv("HUGGINGFACE_TOKEN"),
        )

    def diarize_audio(self, audio_path: str,
                      min_speakers: int = 1,
                      max_speakers: int = 10):
        """Perform speaker diarization with configurable parameters."""
        diarization = self.pipeline(
            audio_path,
            min_speakers=min_speakers,
            max_speakers=max_speakers,
        )
        # Convert to a structured format
        speakers = []
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            speakers.append({
                "start": float(turn.start),
                "end": float(turn.end),
                "speaker": speaker,
                "duration": float(turn.end - turn.start),
            })
        return speakers
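The diarization turns and the Whisper segments still have to be merged into the speaker-attributed transcript. One common approach, shown here as a sketch rather than the project's actual merge logic, is to label each transcribed segment with the speaker whose turns overlap it the most:

```python
def assign_speakers(segments, turns):
    """Label each transcription segment with the speaker whose
    diarization turns overlap it the most (by total seconds).

    segments: dicts with "start", "end", "text"
    turns:    dicts with "start", "end", "speaker"
    """
    labeled = []
    for seg in segments:
        overlap_by_speaker = {}
        for turn in turns:
            # Overlap of [seg.start, seg.end] with [turn.start, turn.end]
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > 0:
                overlap_by_speaker[turn["speaker"]] = (
                    overlap_by_speaker.get(turn["speaker"], 0.0) + overlap
                )
        speaker = (max(overlap_by_speaker, key=overlap_by_speaker.get)
                   if overlap_by_speaker else "UNKNOWN")
        labeled.append({**seg, "speaker": speaker})
    return labeled
```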
Real-time Processing with WebSockets
The application features real-time progress updates using WebSocket connections for live processing feedback:
// frontend/static/js/app.js - Real-time progress handling
class AudioProcessor {
    constructor() {
        this.ws = null;
        this.progressCallback = null;
    }

    async processAudio(fileId) {
        // Establish WebSocket connection for progress updates
        this.ws = new WebSocket(`ws://localhost:8000/ws/progress/${fileId}`);
        this.ws.onmessage = (event) => {
            const progress = JSON.parse(event.data);
            this.updateProgressVisualization(progress);
        };

        // Start processing
        const response = await fetch(`/api/transcribe/${fileId}`, {
            method: 'POST'
        });
        return await response.json();
    }

    updateProgressVisualization(progress) {
        // Update chunk visualization and progress bars
        const chunksContainer = document.getElementById('chunksVisualization');
        const progressFill = document.getElementById('overallProgressFill');
        progressFill.style.width = `${progress.progress}%`;
        this.highlightCurrentChunk(progress.chunk);
    }
}
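On the backend side, the `/ws/progress/{file_id}` channel needs something to bridge the transcription task and the WebSocket handler. A minimal in-memory broker built on `asyncio.Queue` is one way to do it (a sketch; the class and method names are hypothetical, not the project's actual implementation):

```python
import asyncio
from collections import defaultdict

class ProgressBroker:
    """In-memory progress channel: the transcription task publishes
    per-file updates, the WebSocket handler consumes and forwards them."""

    def __init__(self):
        # One queue per file_id, created on first use
        self._queues = defaultdict(asyncio.Queue)

    async def publish(self, file_id, update):
        await self._queues[file_id].put(update)

    async def next_update(self, file_id):
        return await self._queues[file_id].get()

async def demo():
    broker = ProgressBroker()
    await broker.publish("abc", {"chunk": 1, "total_chunks": 4, "progress": 25.0})
    return await broker.next_update("abc")
```

In the real handler the consumer loop would `await next_update(...)` and `websocket.send_json(...)` until processing completes.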
Technology Stack
Python 3.8+
FastAPI
PyTorch
KB-whisper
OpenAI Whisper
Pyannote.audio
Transformers
NumPy
FFmpeg
Uvicorn
JavaScript ES6+
WebRTC
WebSockets
CUDA (optional)
API Endpoints
The application provides a comprehensive REST API for all transcription and diarization operations:
# Main Application
GET / # Main application interface
GET /static/* # Static files (CSS, JS, assets)
# Audio Processing
POST /api/upload # Upload audio file
POST /api/save-recording # Save browser recording
POST /api/transcribe/{file_id} # Transcribe and diarize audio
GET /api/health # Health check endpoint
# Whisper Service Management
GET /api/whisper/status # Get detailed Whisper service status
POST /api/whisper/switch-to-local # Switch to local Whisper service
POST /api/whisper/switch-to-openai # Switch to OpenAI Whisper service
POST /api/whisper/download-model # Download a local Whisper model
# WebSocket Endpoints
WS /ws/progress/{file_id} # Real-time processing progress updates
API Usage Examples
Here are practical examples of how to interact with the API:
# Upload an audio file (curl sets the multipart Content-Type and boundary automatically)
curl -X POST "http://localhost:8000/api/upload" \
     -F "file=@audio.mp3"

# Response:
{
  "success": true,
  "file_id": "uuid-string",
  "filename": "audio.mp3",
  "size": 1048576,
  "message": "File uploaded successfully"
}

# Start transcription and diarization
curl -X POST "http://localhost:8000/api/transcribe/uuid-string"
Example Output
Here's an example of the comprehensive JSON output from the transcription and diarization process:
{
  "success": true,
  "transcription": "Välkommen till dagens möte. Vi ska idag diskutera projektets framsteg. Tack. Jag har uppdaterat tidslinjen och vi ligger i fas med planen. Utmärkt. Kan du gå igenom de tekniska utmaningarna vi stött på?",
  "segments": [
    {
      "start": 0.5,
      "end": 5.2,
      "speaker": "SPEAKER_00",
      "text": "Välkommen till dagens möte. Vi ska idag diskutera projektets framsteg."
    },
    {
      "start": 6.0,
      "end": 10.8,
      "speaker": "SPEAKER_01",
      "text": "Tack. Jag har uppdaterat tidslinjen och vi ligger i fas med planen."
    },
    {
      "start": 11.5,
      "end": 16.3,
      "speaker": "SPEAKER_00",
      "text": "Utmärkt. Kan du gå igenom de tekniska utmaningarna vi stött på?"
    }
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01"],
  "duration": 16.3,
  "processing_time": 8.7,
  "model_info": {
    "whisper_model": "kb-whisper-large",
    "diarization_model": "pyannote/speaker-diarization-3.1"
  }
}
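Because the export is plain JSON, post-processing is straightforward. For instance, the segments can be rendered as a timestamped transcript, applying the speaker names edited in the UI (a sketch; the function name and name-mapping shape are assumptions):

```python
def format_transcript(result, names=None):
    """Render exported segments as '[start-end] Speaker: text' lines.

    `names` optionally maps raw labels (e.g. "SPEAKER_00") to the
    display names edited in the UI.
    """
    names = names or {}
    lines = []
    for seg in result["segments"]:
        speaker = names.get(seg["speaker"], seg["speaker"])
        lines.append(f'[{seg["start"]:.1f}-{seg["end"]:.1f}] {speaker}: {seg["text"]}')
    return "\n".join(lines)
```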
Installation & Setup
Get Audio Scribe AI running on your local machine in just a few steps:
Prerequisites
- Python 3.8 or higher
- CUDA-capable GPU (optional, but recommended for faster processing)
- FFmpeg installed on your system
- 4GB+ RAM (8GB+ recommended for large models)
Quick Start
# 1. Clone the repository
git clone https://github.com/your-repo/kb-whisper-pyannote-transcription-diarization
cd kb-whisper-pyannote-transcription-diarization
# 2. Install Python dependencies
pip install -r requirements.txt
# 3. Install FFmpeg (Windows)
# Download from https://ffmpeg.org/download.html and add to PATH
# 4. Set up environment (optional for some Pyannote models)
cp .env-example .env
# Edit .env and add your HUGGINGFACE_TOKEN if needed
# 5. Start the application
python run.py
# 6. Open your browser
# Navigate to http://localhost:8000
Configuration Options
# Environment variables for customization
export WHISPER_MODEL="base" # tiny, base, small, medium, large
export WHISPER_LANGUAGE="auto" # auto, sv, en, etc.
export WHISPER_USE_LOCAL=true # Use local models for privacy
export PYANNOTE_MODEL="pyannote/speaker-diarization-3.1"
export MIN_SPEAKERS="1"
export MAX_SPEAKERS="10"
export MAX_FILE_SIZE="104857600" # 100MB limit
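On the backend, these variables can be read once at startup with typed defaults, roughly what `utils/config.py` might do (the exact keys and defaults here are assumptions based on the variables listed above):

```python
import os

def load_config(env=os.environ):
    """Read the environment variables above, applying typed defaults."""
    return {
        "whisper_model": env.get("WHISPER_MODEL", "base"),
        "whisper_language": env.get("WHISPER_LANGUAGE", "auto"),
        "use_local": env.get("WHISPER_USE_LOCAL", "true").lower() == "true",
        "pyannote_model": env.get("PYANNOTE_MODEL", "pyannote/speaker-diarization-3.1"),
        "min_speakers": int(env.get("MIN_SPEAKERS", "1")),
        "max_speakers": int(env.get("MAX_SPEAKERS", "10")),
        "max_file_size": int(env.get("MAX_FILE_SIZE", str(100 * 1024 * 1024))),
    }
```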
Applications
Audio Scribe AI has numerous real-world applications across various industries:
- Meeting Minutes: Automatic generation with speaker attribution and timestamps
- Interview Transcription: Perfect for journalists, researchers, and content creators
- Legal Documentation: Court proceedings and legal consultations with speaker identification
- Medical Records: Doctor-patient consultations with privacy-focused local processing
- Media Indexing: Content search and accessibility for video/audio libraries
- Accessibility Services: Real-time transcription for hearing-impaired individuals
- Call Analysis: Customer service quality assurance and training
- Educational Content: Lecture transcription and student accessibility
Project Structure
The application follows a clean, modular architecture that separates concerns and enables easy maintenance and extension:
kb-whisper-pyannote-transcription-diarization/
├── backend/
│   ├── app.py                          # FastAPI main application
│   ├── services/
│   │   ├── audio_service.py            # Audio processing utilities
│   │   ├── whisper_service.py          # OpenAI Whisper integration
│   │   ├── local_whisper_service.py    # Local Whisper models
│   │   ├── unified_whisper_service.py  # Unified service interface
│   │   ├── pyannote_service.py         # Pyannote diarization
│   │   ├── pyannote_service_simple.py  # Simplified diarization
│   │   ├── simple_diarization.py       # Basic speaker separation
│   │   └── mock_services.py            # Development/testing mocks
│   └── utils/
│       └── config.py                   # Configuration management
├── frontend/
│   ├── index.html                      # Main application interface
│   └── static/
│       ├── css/style.css               # Application styling
│       └── js/
│           ├── app.js                  # Core application logic
│           ├── karaoke-player.js       # Interactive playback
│           └── recorder.js             # Audio recording
├── screenshots/                        # Application screenshots
├── requirements.txt                    # Python dependencies
├── run.py                              # Application entry point
├── setup.py                            # Package configuration
├── LOCAL_WHISPER_SETUP.md              # Local setup guide
└── README.md                           # Project documentation
Performance & Optimization
Audio Scribe AI is optimized for both speed and accuracy:
- GPU Acceleration: Automatic CUDA detection and utilization
- Chunked Processing: Large files processed in manageable segments
- Parallel Processing: Concurrent transcription and diarization
- Memory Management: Efficient handling of large audio files
- Model Selection: Choose the optimal model size for your hardware
- Format Optimization: Automatic audio format conversion
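The automatic CUDA detection mentioned above can be as small as a guarded `torch.cuda.is_available()` check (a sketch; the import is wrapped so the snippet also runs where PyTorch is not installed):

```python
def pick_device():
    """Return "cuda" when PyTorch sees a usable GPU, else "cpu"."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        # PyTorch not installed: fall back to CPU-only processing
        pass
    return "cpu"
```

Services would then move models and tensors with `.to(pick_device())`.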
Future Enhancements
Planned improvements and features in development:
- Real-time Streaming: Live transcription with WebSocket streaming
- Emotion Analysis: Sentiment and emotion detection in speech
- Smart Summarization: AI-powered content summarization
- Multi-language Support: Enhanced support for mixed-language content
- Custom Vocabularies: Domain-specific terminology recognition
- Noise Reduction: Advanced audio preprocessing
- Mobile App: Native iOS and Android applications
- Cloud Integration: Optional cloud processing for heavy workloads