Cost Analysis: Building a Custom Speaker-Specific Speech-to-Text Model
This document provides a detailed breakdown of the estimated costs associated with each phase of developing a custom speech-to-text (STT) model for a specific speaker.
Phase 1: Cataloging Audio and Video Sources
| Cost Category | Details | Estimated Cost (USD) |
|---|---|---|
| AWS S3 Storage | Initial storage of raw audio/video files (assuming 100GB) | $2.30/month |
| Spreadsheet Solution | Google Sheets for catalog management | $0 (free) |
| Alternative: Custom CMS | Development of dedicated content management system | TBD |
| Manual Cataloging Labor | Organizing Catalogs | TBD |
| Automated Collection Scripts | Development of scripts for automation | TBD |
| Total Phase 1 | | $2.30/month + TBD one-time costs |
Note: If a custom CMS is used instead of Google Sheets, additional development and hosting costs will apply.
Note: Storage costs will increase as more raw audio/video is collected.
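As a rough illustration of how the storage line scales, the sketch below recomputes the monthly S3 cost for a few collection sizes. The rate of $0.023 per GB-month is an assumption inferred from the table's $2.30/month for 100GB; the function name and the sample sizes are hypothetical.

```python
# Assumed rate: $0.023 per GB-month, implied by $2.30/month for 100 GB above.
S3_USD_PER_GB_MONTH = 0.023


def s3_monthly_cost(gigabytes: float, rate: float = S3_USD_PER_GB_MONTH) -> float:
    """Return the estimated monthly S3 storage cost in USD, rounded to cents."""
    return round(gigabytes * rate, 2)


# Hypothetical collection sizes, to show how the recurring cost grows.
for size_gb in (100, 250, 500):
    print(f"{size_gb} GB -> ${s3_monthly_cost(size_gb):.2f}/month")
```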
Phase 2: Filtering and Splitting Audio
Note: The costs below assume no custom CMS is used in Phase 1. If a custom CMS is implemented, some of these costs may be integrated into the CMS functionality or may vary depending on the CMS capabilities.
| Cost Category | Details | Estimated Cost (USD) |
|---|---|---|
| Computing Resources | VastAI instance (8 vCPU, 32GB RAM) for audio processing | $0.25/hour × 40 hours = $10 |
| AWS S3 Storage | Processed audio segments (16kHz WAV, ~50GB) | $1.15/month |
| S3 Data Transfer | Transferring processed audio, priced at the $0.09/GB S3 internet egress rate (uploads into S3 are normally free) | $0.09/GB × 50GB = $4.50 |
| Preliminary Transcription | Hugging Face Inference API with custom 24GB GPU instance | ~$1.20-2.00/hour of inference (depends on specific GPU type) |
| AWS EC2 Instance | t3.large for running processing scripts | $0.08/hour × 40 hours = $3.20 |
| Total Phase 2 | | ~$35-100 + $1.15/month |
Note: The total depends on inference speed. It assumes that 100 hours of audio takes 5-10 hours of actual processing time on a 24GB GPU instance.
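The preliminary transcription line is the only Phase 2 item given as an hourly rate rather than a total. A minimal sketch of the implied range, assuming the 5-10 processing hours from the note and the $1.20-2.00 hourly rate from the table (the function name is hypothetical):

```python
def inference_cost_range(hours_low: float, hours_high: float,
                         rate_low: float, rate_high: float) -> tuple[float, float]:
    """Return the (best case, worst case) GPU inference cost in USD."""
    # Best case pairs the shortest run with the cheapest GPU, worst case the opposite.
    return hours_low * rate_low, hours_high * rate_high


low, high = inference_cost_range(5, 10, 1.20, 2.00)
print(f"Preliminary transcription: ${low:.2f} to ${high:.2f}")
```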
Phase 3: Transcription and Review
| Cost Category | Details | Estimated Cost (INR) | Estimated Cost (USD) |
|---|---|---|---|
| Transcriber Payment | Base cost (5 INR per minute of audio) | 10 hours × 60 minutes × 5 INR = 3,000 INR | ~$36 |
| | Syllable-based cost (0.35 INR per syllable) | 10 hours × 60 minutes × 60 sec × 4 syllables × 0.35 INR = 50,400 INR | ~$605 |
| | Total Transcriber Cost | 53,400 INR | ~$641 |
| Reviewer Payment | 25% of transcription cost | 53,400 INR × 0.25 = 13,350 INR | ~$160 |
| Database Hosting | MongoDB Atlas or similar for transcription database | | $57/month |
| Web Interface | For transcribers/reviewers (App hosting) | | $25/month |
| Total Phase 3 | | 66,750 INR + monthly costs | ~$801 + $82/month |
Note: USD conversion based on approximate exchange rate of 1 USD = 83.3 INR
Note: Syllable-based calculations assume an average of 4 Tibetan syllables per second of audio
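The labor figures in the Phase 3 table can be reproduced with a short calculation. This is a sketch using only the rates stated above (5 INR per minute, 0.35 INR per syllable, 4 syllables per second, 25% reviewer share, 83.3 INR per USD); the function and constant names are hypothetical.

```python
INR_PER_USD = 83.3          # approximate exchange rate from the note above
BASE_INR_PER_MINUTE = 5     # base transcriber rate
INR_PER_SYLLABLE = 0.35     # syllable-based transcriber rate
SYLLABLES_PER_SECOND = 4    # assumed average for Tibetan speech
REVIEWER_SHARE = 0.25       # reviewer is paid 25% of the transcription cost


def phase3_labor_inr(audio_hours: float) -> dict[str, float]:
    """Break down transcriber and reviewer payments (INR) for the given audio."""
    minutes = audio_hours * 60
    base = minutes * BASE_INR_PER_MINUTE
    syllable = minutes * 60 * SYLLABLES_PER_SECOND * INR_PER_SYLLABLE
    transcriber = base + syllable
    reviewer = transcriber * REVIEWER_SHARE
    return {
        "base": base,
        "syllable": syllable,
        "transcriber": transcriber,
        "reviewer": reviewer,
        "total": transcriber + reviewer,
    }


costs = phase3_labor_inr(10)
for name, inr in costs.items():
    print(f"{name:<12} {inr:>10,.0f} INR  (~${inr / INR_PER_USD:,.0f})")
```

Because the syllable-based component dominates the total, the assumed speaking rate of 4 syllables per second is the input most worth validating against real recordings.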
Transcription Throughput Metrics
| Metric | Estimate |
|---|---|
| Average transcription speed | 5-10 minutes per 1 minute of audio |
| Average audio transcribed per transcriber per week | 2-3 hours of audio |
| Number of active transcribers | 5 |
| Total weekly throughput | 10-15 hours of audio transcribed (5 transcribers × 2-3 hours) |
| Estimated timeline for 10 hours | About 1 week |
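The timeline follows from the team size and per-transcriber throughput. A minimal sketch for estimating how long other audio volumes would take, assuming throughput stays constant week to week (the function name and the 100-hour example are hypothetical):

```python
import math


def weeks_to_transcribe(audio_hours: float,
                        transcribers: int = 5,
                        audio_hours_per_transcriber_per_week: float = 2.0) -> int:
    """Return the number of whole weeks needed to transcribe the given audio."""
    weekly_throughput = transcribers * audio_hours_per_transcriber_per_week
    return math.ceil(audio_hours / weekly_throughput)


# Conservative end of the 2-3 hours/week range from the table.
print(weeks_to_transcribe(10))
# A larger, hypothetical 100-hour corpus at both ends of the range.
print(weeks_to_transcribe(100), weeks_to_transcribe(100, audio_hours_per_transcriber_per_week=3.0))
```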
Phase 4: Data Cleaning and Organization
| Cost Category | Details | Estimated Cost (USD) |
|---|---|---|
| AWS Notebook Instance | ml.t3.medium for data processing | $0.05/hour × 20 hours = $1 |
| Hugging Face Pro Account | For private dataset hosting (if needed) | $9/month |
| Total Phase 4 | | ~$1 + $9/month |
Phase 5: Model Training
| Cost Category | Details | Estimated Cost (USD) |
|---|---|---|
| GPU Computing | VastAI instance (24GB GPU RAM) | $0.50-1.00/hour × 24 hours = $12-24 per training run |
| Hugging Face Storage | Model checkpoints and artifacts (10GB) | $0 (included with Hugging Face account) |
| Evaluation Computing | CPU instance for model evaluation | $0.10/hour × 10 hours = $1 |
| Total Phase 5 | | ~$13-25 |
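The Phase 5 total covers a single training run plus one evaluation pass. Since the GPU cost is quoted per run, the sketch below scales it to several runs. It assumes every run is followed by its own 10-hour evaluation, which the table does not state; the function name and the run count are hypothetical.

```python
GPU_RATE_RANGE = (0.50, 1.00)   # USD/hour for the 24GB GPU instance
TRAIN_HOURS_PER_RUN = 24
EVAL_RATE = 0.10                # USD/hour for the CPU evaluation instance
EVAL_HOURS_PER_RUN = 10         # assumption: one evaluation pass per training run


def training_cost_range(runs: int) -> tuple[float, float]:
    """Return the (low, high) USD cost for the given number of training runs."""
    evaluation = EVAL_RATE * EVAL_HOURS_PER_RUN
    low = runs * (GPU_RATE_RANGE[0] * TRAIN_HOURS_PER_RUN + evaluation)
    high = runs * (GPU_RATE_RANGE[1] * TRAIN_HOURS_PER_RUN + evaluation)
    return low, high


for runs in (1, 3):
    low, high = training_cost_range(runs)
    print(f"{runs} run(s): ${low:.0f} to ${high:.0f}")
```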
Total Project Cost Estimation
| Phase | One-time Costs (USD) | Recurring Costs (USD/month) |
|---|---|---|
| Phase 1: Cataloging | TBD | $2.30 |
| Phase 2: Filtering & Splitting | $35-100 | $1.15 |
| Phase 3: Transcription & Review | $801 | $82 |
| Phase 4: Data Cleaning | $1 | $9 |
| Phase 5: Model Training | $13-25 | $0 |
| Grand Total | ~$850-927 + TBD | ~$94.45/month |
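The grand total can be checked, and projected over a project duration, by summing the per-phase figures from the table. This is a sketch only: Phase 1 one-time costs are TBD and therefore left out, and the 3-month duration is a hypothetical example.

```python
# One-time costs per phase as (low, high) USD; Phase 1 is TBD and omitted.
ONE_TIME = {
    "Phase 2": (35, 100),
    "Phase 3": (801, 801),
    "Phase 4": (1, 1),
    "Phase 5": (13, 25),
}

# Recurring costs per phase in USD/month.
RECURRING = {
    "Phase 1": 2.30,
    "Phase 2": 1.15,
    "Phase 3": 82.00,
    "Phase 4": 9.00,
    "Phase 5": 0.00,
}


def project_cost(months: int) -> tuple[float, float]:
    """Return (low, high) total USD cost over the given number of months."""
    one_time_low = sum(low for low, _ in ONE_TIME.values())
    one_time_high = sum(high for _, high in ONE_TIME.values())
    recurring_total = months * sum(RECURRING.values())
    return one_time_low + recurring_total, one_time_high + recurring_total


low, high = project_cost(months=3)
print(f"3-month estimate: ${low:,.2f} to ${high:,.2f} (excluding TBD items)")
```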
Cost Distribution Visualization
Note: This visualization shows the relative proportions of one-time costs across project phases. Phase 1 is excluded as costs are TBD.
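The proportions can also be derived directly from the one-time totals. The sketch below assumes the midpoint of each quoted range (for example $67.50 for Phase 2), which is an illustrative choice rather than a figure from the tables, and prints a simple text bar per phase.

```python
# One-time costs per phase as (low, high) USD; Phase 1 is excluded (TBD).
ONE_TIME = {
    "Phase 2: Filtering & Splitting": (35, 100),
    "Phase 3: Transcription & Review": (801, 801),
    "Phase 4: Data Cleaning": (1, 1),
    "Phase 5: Model Training": (13, 25),
}


def cost_shares(costs: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return each phase's percentage share of one-time costs, using range midpoints."""
    midpoints = {phase: (low + high) / 2 for phase, (low, high) in costs.items()}
    total = sum(midpoints.values())
    return {phase: 100 * value / total for phase, value in midpoints.items()}


shares = cost_shares(ONE_TIME)
for phase, share in shares.items():
    bar = "#" * round(share / 2)  # one '#' per 2 percentage points
    print(f"{phase:<32} {share:5.1f}% {bar}")
```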
