Speech-to-Text Pipeline Cost Estimation

Cost Analysis: Building a Custom Speaker-Specific Speech-to-Text Model

This document provides a detailed breakdown of the estimated costs associated with each phase of developing a custom speech-to-text (STT) model for a specific speaker.

Phase 1: Cataloging Audio and Video Sources

| Cost Category | Details | Estimated Cost (USD) |
| --- | --- | --- |
| AWS S3 Storage | Initial storage of raw audio/video files (assuming 100GB) | $2.30/month |
| Spreadsheet Solution | Google Sheets for catalog management | $0 (free) |
| Alternative: Custom CMS | Development of dedicated content management system | TBD |
| Manual Cataloging Labor | Organizing Catalogs | TBD |
| Automated Collection Scripts | Development of scripts for automation | TBD |
| Total Phase 1 | | $2.30/month + TBD one-time costs |

Note: If a custom CMS is used instead of Google Sheets, additional development and hosting costs apply.

Note: Storage costs will increase as more raw audio/video is collected.
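To show how the storage line scales as the collection grows, here is a minimal sketch. It assumes the per-GB rate implied by the table ($2.30 for 100GB, i.e. $0.023 per GB-month); the function name and the sample volumes are illustrative, not part of the original estimate.

```python
# Assumed rate, derived from the table: $2.30 / 100 GB = $0.023 per GB-month.
S3_RATE_USD_PER_GB_MONTH = 0.023


def monthly_storage_cost(size_gb: float, rate: float = S3_RATE_USD_PER_GB_MONTH) -> float:
    """Return the estimated monthly storage cost in USD for size_gb of data."""
    return round(size_gb * rate, 2)


# Hypothetical collection sizes; only the 100 GB figure comes from the table.
for size_gb in (100, 250, 500):
    print(f"{size_gb} GB: ${monthly_storage_cost(size_gb):.2f}/month")
```

Actual AWS pricing varies by region and storage class, so the rate should be checked before budgeting.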

Phase 2: Filtering and Splitting Audio

Note: The costs below assume no custom CMS is used in Phase 1. If a custom CMS is implemented, some of these costs may be integrated into the CMS functionality or may vary depending on the CMS capabilities.

| Cost Category | Details | Estimated Cost (USD) |
| --- | --- | --- |
| Computing Resources | VastAI instance (8 vCPU, 32GB RAM) for audio processing | $0.25/hour × 40 hours = $10 |
| AWS S3 Storage | Processed audio segments (16kHz WAV, ~50GB) | $1.15/month |
| S3 Data Transfer | Uploading processed audio to S3 | $0.09/GB × 50GB = $4.50 |
| Preliminary Transcription | Hugging Face Inference API with custom 24GB GPU instance | ~$1.20-2.00/hour of inference (depends on specific GPU type) |
| AWS EC2 Instance | t3.large for running processing scripts | $0.08/hour × 40 hours = $3.20 |
| Total Phase 2 | | ~$35-100 + $1.15/month |

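Note: The total depends on inference speed. Assuming 100 hours of audio takes 5-10 hours of processing time on a 24GB GPU instance, preliminary transcription costs roughly $6-20.

Note: AWS does not charge for uploads into S3; $0.09/GB is the rate for data transferred out to the internet, so the S3 Data Transfer line effectively budgets for downloading the processed audio later.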
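As a cross-check on the Phase 2 figures, the sketch below adds up the itemized one-time costs and applies the 5-10 hour inference assumption from the note. The names are illustrative. Under these assumptions the line items sum to about $23.70-37.70, which is lower than the stated ~$35-100; the stated range may include headroom that is not itemized in the table.

```python
# Fixed one-time line items from the Phase 2 table (USD).
FIXED_COSTS = {
    "vastai_compute": 0.25 * 40,     # $0.25/hour x 40 hours
    "s3_data_transfer": 0.09 * 50,   # $0.09/GB x 50 GB
    "ec2_t3_large": 0.08 * 40,       # $0.08/hour x 40 hours
}


def phase2_one_time_range(gpu_rate=(1.20, 2.00), gpu_hours=(5, 10)):
    """Return (low, high) one-time cost in USD for Phase 2.

    gpu_rate is the hourly inference price range and gpu_hours the
    processing-time range for 100 hours of audio, both from the table and note.
    """
    fixed = sum(FIXED_COSTS.values())
    low = fixed + gpu_rate[0] * gpu_hours[0]
    high = fixed + gpu_rate[1] * gpu_hours[1]
    return round(low, 2), round(high, 2)


low, high = phase2_one_time_range()
print(f"Itemized Phase 2 one-time cost: ${low:.2f} to ${high:.2f}")
```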

Phase 3: Transcription and Review

| Cost Category | Details | Estimated Cost (INR) | Estimated Cost (USD) |
| --- | --- | --- | --- |
| Transcriber Payment | Base cost (5 INR per minute of audio) | 10 hours × 60 minutes × 5 INR = 3,000 INR | ~$36 |
| | Syllable-based cost (0.35 INR per syllable) | 10 hours × 60 minutes × 60 sec × 4 syllables × 0.35 INR = 50,400 INR | ~$605 |
| | Total Transcriber Cost | 53,400 INR | ~$641 |
| Reviewer Payment | 25% of transcription cost | 53,400 INR × 0.25 = 13,350 INR | ~$160 |
| Database Hosting | MongoDB Atlas or similar for transcription database | | $57/month |
| Web Interface | For transcribers/reviewers (App hosting) | | $25/month |
| Total Phase 3 | | 66,750 INR + monthly costs | ~$801 + $82/month |

Note: USD conversion is based on an approximate exchange rate of 1 USD = 83.3 INR.

Note: Syllable-based calculations assume an average of 4 Tibetan syllables per second of audio.
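The Phase 3 labor figures follow from a few rates, so they are easy to recompute for a different amount of audio. The sketch below reproduces the arithmetic using the rates, exchange rate, and syllable density stated above; the function and key names are illustrative.

```python
INR_PER_USD = 83.3  # approximate exchange rate from the note above


def transcription_cost(audio_hours: float,
                       per_minute_inr: float = 5.0,
                       per_syllable_inr: float = 0.35,
                       syllables_per_second: float = 4.0,
                       reviewer_share: float = 0.25) -> dict:
    """Return the Phase 3 labor costs in INR, plus the total in USD."""
    minutes = audio_hours * 60
    base = minutes * per_minute_inr
    syllable = minutes * 60 * syllables_per_second * per_syllable_inr
    transcriber = base + syllable
    reviewer = transcriber * reviewer_share
    total = transcriber + reviewer
    return {
        "base_inr": base,
        "syllable_inr": syllable,
        "transcriber_inr": transcriber,
        "reviewer_inr": reviewer,
        "total_inr": total,
        "total_usd": total / INR_PER_USD,
    }


costs = transcription_cost(10)
print(f"Total: {costs['total_inr']:,.0f} INR (~${costs['total_usd']:,.0f})")
```

Because the syllable-based component dominates, the total is most sensitive to the assumed syllables per second.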

Transcription Throughput Metrics

| Metric | Estimate |
| --- | --- |
| Average transcription speed | 5-10 minutes per 1 minute of audio |
| Average hours transcribed per transcriber per week | 2-3 hours |
| Number of active transcribers | 5 |
| Total weekly throughput | 10 hours of audio transcribed |
| Estimated timeline for 10 hours | 1 week |
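The timeline follows from the team size and per-person throughput. The sketch below estimates weeks and working hours for other audio volumes. It assumes the "2-3 hours" figure means hours of audio completed per transcriber per week and uses the low end (2 hours) by default, which matches the 10 hours/week in the table; the function names are illustrative.

```python
import math


def weeks_to_transcribe(audio_hours: float,
                        transcribers: int = 5,
                        audio_hours_per_transcriber_per_week: float = 2.0) -> int:
    """Return the number of whole weeks needed to transcribe audio_hours."""
    weekly_throughput = transcribers * audio_hours_per_transcriber_per_week
    return math.ceil(audio_hours / weekly_throughput)


def labor_hours_range(audio_hours: float, speed_ratio=(5, 10)):
    """Return (low, high) working hours, given 5-10 minutes of work per audio minute."""
    return audio_hours * speed_ratio[0], audio_hours * speed_ratio[1]


print(weeks_to_transcribe(10))   # the 10-hour batch from the table
print(labor_hours_range(10))     # total working hours across the team
```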

Phase 4: Data Cleaning and Organization

| Cost Category | Details | Estimated Cost (USD) |
| --- | --- | --- |
| AWS Notebook Instance | ml.t3.medium for data processing | $0.05/hour × 20 hours = $1 |
| Hugging Face Pro Account | For private dataset hosting (if needed) | $9/month |
| Total Phase 4 | | ~$1 + $9/month |

Phase 5: Model Training

| Cost Category | Details | Estimated Cost (USD) |
| --- | --- | --- |
| GPU Computing | VastAI instance (24GB GPU RAM) | $0.50-1.00/hour × 24 hours = $12-24 per training run |
| Hugging Face Storage | Model checkpoints and artifacts (10GB) | $0 (included with Hugging Face account) |
| Evaluation Computing | CPU instance for model evaluation | $0.10/hour × 10 hours = $1 |
| Total Phase 5 | | ~$13-25 |
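The Phase 5 total covers a single training run, and in practice several runs may be needed. The sketch below scales the estimate by the number of runs. It assumes, beyond what the table states, that the $1 evaluation cost recurs with every run; the function name is illustrative.

```python
def training_budget(runs: int,
                    gpu_rate_usd=(0.50, 1.00),
                    hours_per_run: float = 24,
                    eval_cost_usd: float = 1.00):
    """Return (low, high) total cost in USD for the given number of runs.

    Assumption: evaluation ($0.10/hour x 10 hours) is repeated for each run.
    """
    low = runs * (gpu_rate_usd[0] * hours_per_run + eval_cost_usd)
    high = runs * (gpu_rate_usd[1] * hours_per_run + eval_cost_usd)
    return low, high


for runs in (1, 3, 5):
    low, high = training_budget(runs)
    print(f"{runs} run(s): ${low:.0f} to ${high:.0f}")
```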

Total Project Cost Estimation

| Phase | One-time Costs (USD) | Recurring Costs (USD/month) |
| --- | --- | --- |
| Phase 1: Cataloging | TBD | $2.30 |
| Phase 2: Filtering & Splitting | $35-100 | $1.15 |
| Phase 3: Transcription & Review | $801 | $82 |
| Phase 4: Data Cleaning | $1 | $9 |
| Phase 5: Model Training | $13-25 | $0 |
| Grand Total | ~$850-927 + TBD | ~$94.45/month |
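Since the project has both one-time and recurring costs, the overall spend depends on how long the infrastructure stays up. The sketch below sums the per-phase figures from the table and projects the cumulative cost over a number of months. Phase 1 one-time costs are TBD and therefore left out; the names and the 6-month horizon are illustrative.

```python
# (low, high) one-time costs in USD from the summary table; Phase 1 is TBD.
ONE_TIME_USD = {
    "Phase 2": (35, 100),
    "Phase 3": (801, 801),
    "Phase 4": (1, 1),
    "Phase 5": (13, 25),
}

# Recurring costs in USD per month from the summary table.
MONTHLY_USD = {
    "Phase 1": 2.30,
    "Phase 2": 1.15,
    "Phase 3": 82.00,
    "Phase 4": 9.00,
    "Phase 5": 0.00,
}


def cumulative_cost(months: int):
    """Return (low, high) total cost in USD after the given number of months."""
    low = sum(lo for lo, _ in ONE_TIME_USD.values())
    high = sum(hi for _, hi in ONE_TIME_USD.values())
    recurring = sum(MONTHLY_USD.values()) * months
    return round(low + recurring, 2), round(high + recurring, 2)


low, high = cumulative_cost(6)
print(f"After 6 months: ${low:,.2f} to ${high:,.2f} (excluding Phase 1 TBD costs)")
```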

Cost Distribution Visualization


Note: This visualization shows the relative proportions of one-time costs across project phases. Phase 1 is excluded as costs are TBD.
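The proportions behind such a chart can be recomputed from the summary figures. The sketch below does so, assuming the midpoint of each stated range represents that phase (an assumption, not something the original states); Phase 1 is excluded because its one-time costs are TBD.

```python
# (low, high) one-time costs in USD per phase, from the summary table.
ONE_TIME_USD = {
    "Phase 2: Filtering & Splitting": (35, 100),
    "Phase 3: Transcription & Review": (801, 801),
    "Phase 4: Data Cleaning": (1, 1),
    "Phase 5: Model Training": (13, 25),
}


def cost_shares(costs: dict) -> dict:
    """Return each phase's share of one-time costs in percent, using range midpoints."""
    midpoints = {phase: (lo + hi) / 2 for phase, (lo, hi) in costs.items()}
    total = sum(midpoints.values())
    return {phase: 100 * value / total for phase, value in midpoints.items()}


# Print phases from largest to smallest share.
for phase, share in sorted(cost_shares(ONE_TIME_USD).items(), key=lambda kv: -kv[1]):
    print(f"{phase}: {share:.1f}%")
```

Under these assumptions, transcription and review account for roughly 90% of one-time costs.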
