Speech-to-Text Pipeline Cost Estimation

Ganga_Gyatso · May 23, 2025, 5:30am

Cost Analysis: Building a Custom Speaker-Specific Speech-to-Text Model

This document provides a detailed breakdown of the estimated costs associated with each phase of developing a custom speech-to-text (STT) model for a specific speaker.

Phase 1: Cataloging Audio and Video Sources

Cost Category	Details	Estimated Cost (USD)
AWS S3 Storage	Initial storage of raw audio/video files (assuming 100GB)	$2.30/month
Spreadsheet Solution	Google Sheets for catalog management	$0 (free)
Alternative: Custom CMS	Development of dedicated content management system	TBD
Manual Cataloging Labor	Manually organizing catalog entries	TBD
Automated Collection Scripts	Development of scripts for automation	TBD
Total Phase 1		$2.30/month + TBD one-time costs

Note: If opting for a custom CMS instead of Google Sheets, additional development and hosting costs would apply

Note: Storage costs will increase as more raw audio/video is collected
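To see how the storage line grows with the collection, here is a minimal sketch. It assumes the S3 Standard rate of $0.023 per GB-month implied by the table ($2.30/month for 100GB); the function name and the sample sizes are only illustrative, and the current rate for your region should be checked.

```python
# Assumption: $0.023 per GB-month, the rate implied by the Phase 1 table
# ($2.30/month for 100GB). Actual pricing varies by region and tier.
S3_USD_PER_GB_MONTH = 0.023


def monthly_storage_cost(size_gb: float, rate: float = S3_USD_PER_GB_MONTH) -> float:
    """Return the estimated monthly S3 storage cost in USD."""
    return round(size_gb * rate, 2)


# Illustrative sizes only; the source estimate covers the 100GB case.
for size_gb in (100, 250, 500):
    print(f"{size_gb} GB -> ${monthly_storage_cost(size_gb):.2f}/month")
```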

Phase 2: Filtering and Splitting Audio

Note: The costs below assume no custom CMS is used in Phase 1. If a custom CMS is implemented, some of these costs may be integrated into the CMS functionality or may vary depending on the CMS capabilities.

Cost Category	Details	Estimated Cost (USD)
Computing Resources	VastAI instance (8 vCPU, 32GB RAM) for audio processing	$0.25/hour × 40 hours = $10
AWS S3 Storage	Processed audio segments (16kHz WAV, ~50GB)	$1.15/month
S3 Data Transfer	Transferring processed audio out of S3 (uploads into S3 are free; the $0.09/GB rate applies to data leaving S3)	$0.09/GB × 50GB = $4.50
Preliminary Transcription	Hugging Face Inference API with custom 24GB GPU instance	~$1.20-2.00/hour of inference (depends on specific GPU type)
AWS EC2 Instance	t3.large for running processing scripts	$0.08/hour × 40 hours = $3.20
Total Phase 2		~$35-100 + $1.15/month

Note: The total depends on inference speed. It assumes that 100 hours of audio takes 5-10 hours of processing time on a 24GB GPU instance.
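As a cross-check, the itemized Phase 2 lines can be summed directly. This is a rough sketch using only the figures in the table and the note above; the variable and function names are illustrative. The itemized lines come to roughly $24-38, which is below the stated ~$35-100, so the stated range presumably leaves headroom for slower inference or re-runs (that reading is an assumption, not something the table states).

```python
# One-time items taken from the Phase 2 table.
FIXED_ITEMS_USD = {
    "vastai_cpu_instance": 0.25 * 40,  # $0.25/hour x 40 hours
    "s3_data_transfer": 0.09 * 50,     # $0.09/GB x 50GB
    "ec2_t3_large": 0.08 * 40,         # $0.08/hour x 40 hours
}

# Preliminary transcription: hourly rate range and the 5-10 GPU hours
# that the note assumes for 100 hours of audio.
GPU_RATE_USD_PER_HOUR = (1.20, 2.00)
GPU_HOURS = (5, 10)


def phase2_one_time_range() -> tuple[float, float]:
    """Return (low, high) one-time Phase 2 cost in USD from the itemized lines."""
    fixed = sum(FIXED_ITEMS_USD.values())
    low = fixed + GPU_RATE_USD_PER_HOUR[0] * GPU_HOURS[0]
    high = fixed + GPU_RATE_USD_PER_HOUR[1] * GPU_HOURS[1]
    return round(low, 2), round(high, 2)


low, high = phase2_one_time_range()
print(f"Itemized Phase 2 one-time cost: ${low:.2f} - ${high:.2f}")
```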

Phase 3: Transcription and Review

Cost Category	Details	Estimated Cost (INR)	Estimated Cost (USD)
Transcriber Payment	Base cost (5 INR per minute of audio)	10 hours × 60 minutes × 5 INR = 3,000 INR	~$36
	Syllable-based cost (0.35 INR per syllable)	10 hours × 60 minutes × 60 sec × 4 syllables × 0.35 INR = 50,400 INR	~$605
	Total Transcriber Cost	53,400 INR	~$641
Reviewer Payment	25% of transcription cost	53,400 INR × 0.25 = 13,350 INR	~$160
Database Hosting	MongoDB Atlas or similar for transcription database		$57/month
Web Interface	For transcribers/reviewers (App hosting)		$25/month
Total Phase 3		66,750 INR + monthly costs	~$801 + $82/month

Note: USD conversion based on approximate exchange rate of 1 USD = 83.3 INR

Note: Syllable-based calculations assume an average of 4 Tibetan syllables per second of audio
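The Phase 3 arithmetic can be reproduced for any amount of audio. The sketch below uses only the rates and assumptions stated above (5 INR per minute, 0.35 INR per syllable, 4 syllables per second, 25% review fee, 83.3 INR per USD); the function and key names are illustrative.

```python
INR_PER_USD = 83.3                 # approximate exchange rate from the note
BASE_INR_PER_AUDIO_MINUTE = 5      # base transcriber rate
INR_PER_SYLLABLE = 0.35            # syllable-based transcriber rate
SYLLABLES_PER_SECOND = 4           # assumed average for Tibetan speech
REVIEW_FRACTION = 0.25             # reviewer is paid 25% of transcription cost


def transcription_cost_inr(audio_hours: float) -> dict[str, float]:
    """Break down transcriber and reviewer payments (INR) for the given audio."""
    minutes = audio_hours * 60
    base = minutes * BASE_INR_PER_AUDIO_MINUTE
    syllables = minutes * 60 * SYLLABLES_PER_SECOND
    syllable_cost = syllables * INR_PER_SYLLABLE
    transcriber = base + syllable_cost
    reviewer = transcriber * REVIEW_FRACTION
    return {
        "base": round(base, 2),
        "syllable": round(syllable_cost, 2),
        "transcriber": round(transcriber, 2),
        "reviewer": round(reviewer, 2),
        "total": round(transcriber + reviewer, 2),
    }


def to_usd(amount_inr: float) -> float:
    """Convert INR to USD at the approximate rate used in this estimate."""
    return round(amount_inr / INR_PER_USD, 2)


costs = transcription_cost_inr(10)
for name, inr in costs.items():
    print(f"{name:12s} {inr:>10,.0f} INR  ~${to_usd(inr):,.2f}")
```

Note that the syllable-based component dominates: at these rates it is about 17 times the base per-minute payment, so the assumed syllable rate per second is the most sensitive input.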

Transcription Throughput Metrics

Metric	Estimate
Average transcription speed	5-10 minutes per 1 minute of audio
Average audio transcribed per transcriber per week	2-3 hours
Number of active transcribers	5
Total weekly throughput	10-15 hours of audio transcribed (5 transcribers × 2-3 hours each)
Estimated timeline for 10 hours	1 week
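For planning larger batches, the throughput figures can be turned into a timeline. This is a minimal sketch built on the table's numbers (5 transcribers, 2-3 hours of audio each per week, 5-10 minutes of work per minute of audio); the function names and the 100-hour example are illustrative assumptions.

```python
import math


def weeks_to_transcribe(
    audio_hours: float,
    transcribers: int = 5,
    audio_hours_per_transcriber_per_week: float = 2.0,  # low end of 2-3 hours
) -> int:
    """Return whole weeks needed to transcribe the given amount of audio."""
    weekly_throughput = transcribers * audio_hours_per_transcriber_per_week
    return math.ceil(audio_hours / weekly_throughput)


def working_hours_range(audio_hours: float) -> tuple[float, float]:
    """Return (low, high) transcriber working hours at 5-10x real time."""
    return audio_hours * 5, audio_hours * 10


print(weeks_to_transcribe(10))    # the 10-hour batch from the table
print(weeks_to_transcribe(100))   # hypothetical larger batch, low-end pace
print(weeks_to_transcribe(100, audio_hours_per_transcriber_per_week=3.0))
print(working_hours_range(10))
```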

Phase 4: Data Cleaning and Organization

Cost Category	Details	Estimated Cost (USD)
AWS Notebook Instance	ml.t3.medium for data processing	$0.05/hour × 20 hours = $1
Hugging Face Pro Account	For private dataset hosting (if needed)	$9/month
Total Phase 4		~$1 + $9/month

Phase 5: Model Training

Cost Category	Details	Estimated Cost (USD)
GPU Computing	VastAI instance (24GB GPU RAM)	$0.50-1.00/hour × 24 hours = $12-24 per training run
Hugging Face Storage	Model checkpoints and artifacts (10GB)	$0 (included with Hugging Face account)
Evaluation Computing	CPU instance for model evaluation	$0.10/hour × 10 hours = $1
Total Phase 5		~$13-25
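The Phase 5 total covers a single training run. Since fine-tuning usually takes more than one attempt, the sketch below scales the per-run figures from the table. The number of runs is hypothetical, and it assumes one 10-hour evaluation per run, which the table does not specify.

```python
GPU_RATE_USD_PER_HOUR = (0.50, 1.00)  # VastAI 24GB GPU, from the table
HOURS_PER_RUN = 24                    # from the table
EVAL_COST_PER_RUN_USD = 0.10 * 10     # $0.10/hour x 10 hours, from the table


def training_cost_range(runs: int) -> tuple[float, float]:
    """Return (low, high) USD cost for the given number of training runs.

    Assumption: every run is followed by one evaluation pass.
    """
    low = runs * (HOURS_PER_RUN * GPU_RATE_USD_PER_HOUR[0] + EVAL_COST_PER_RUN_USD)
    high = runs * (HOURS_PER_RUN * GPU_RATE_USD_PER_HOUR[1] + EVAL_COST_PER_RUN_USD)
    return round(low, 2), round(high, 2)


for runs in (1, 3, 5):  # 3 and 5 are hypothetical run counts
    low, high = training_cost_range(runs)
    print(f"{runs} run(s): ${low:.2f} - ${high:.2f}")
```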

Total Project Cost Estimation

Phase	One-time Costs (USD)	Recurring Costs (USD/month)
Phase 1: Cataloging	TBD	$2.30
Phase 2: Filtering & Splitting	$35-100	$1.15
Phase 3: Transcription & Review	$801	$82
Phase 4: Data Cleaning	$1	$9
Phase 5: Model Training	$13-25	$0
Grand Total	~$850-927 + TBD	~$94.45/month
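The grand total can be recomputed from the per-phase rows, and the recurring costs projected over the life of the project. A minimal sketch follows; Phase 1 one-time costs are entered as 0 because they are TBD, and the six-month duration is a hypothetical example.

```python
# (one-time low, one-time high, recurring per month), all in USD,
# copied from the summary table. Phase 1 one-time costs are TBD, so 0 here.
PHASES = {
    "Phase 1: Cataloging": (0, 0, 2.30),
    "Phase 2: Filtering & Splitting": (35, 100, 1.15),
    "Phase 3: Transcription & Review": (801, 801, 82),
    "Phase 4: Data Cleaning": (1, 1, 9),
    "Phase 5: Model Training": (13, 25, 0),
}


def grand_total() -> tuple[float, float, float]:
    """Return (one-time low, one-time high, monthly recurring) in USD."""
    low = sum(p[0] for p in PHASES.values())
    high = sum(p[1] for p in PHASES.values())
    monthly = sum(p[2] for p in PHASES.values())
    return round(low, 2), round(high, 2), round(monthly, 2)


def cumulative_cost(months: int) -> tuple[float, float]:
    """Return (low, high) total spend after the given number of months."""
    low, high, monthly = grand_total()
    return round(low + months * monthly, 2), round(high + months * monthly, 2)


print(grand_total())
print(cumulative_cost(6))  # hypothetical six-month project
```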

Cost Distribution Visualization

Note: This visualization shows the relative proportions of one-time costs across project phases. Phase 1 is excluded as costs are TBD.
