Why We’re Doing This
To build a strong Tibetan-to-English translation model, we need lots of real Tibetan text (and audio when available). Our goal is to gather articles from many Tibetan news sites, organize them well, and store them safely. That way, our machine translation systems can learn from a rich dataset of real-world content, and the corpus stays useful for other downstream tasks.
What We’ve Built
We made a set of Python scripts that:
- Collect article links from key Tibetan news websites.
- Download and parse the articles’ text.
- Fetch audio files when the site offers them.
- Save everything in a clear structure, both in our Git repository (code) and in AWS S3 (data).
Every article we scrape, whether it's text-only or text plus audio, gets a standardized `.json` file, plus an `.mp3` file when audio is available.
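For illustration, here is roughly what one of those records might look like; the field names are placeholders, not our exact schema:

```python
import json

# Hypothetical example of a standardized article record; the exact
# fields our scripts write may differ.
article = {
    "title": "བོད་ཀྱི་གསར་འགྱུར།",              # article headline (Tibetan)
    "url": "https://example.com/article/123",  # source page
    "date": "2024-01-15",                      # publication date
    "author": "Unknown",
    "tags": ["news"],
    "body": "...",                             # full article text
    "audio_file": "article_123.mp3",           # None when the site has no audio
}

with open("article_123.json", "w", encoding="utf-8") as f:
    json.dump(article, f, ensure_ascii=False, indent=2)  # keep Tibetan readable
```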
Where the Data Lives
- In the repo under `News_Articles/`, sorted by region (India or Tibet) and by website name.
- On S3, split across two buckets:
  - `s3://tibetan-news-data/` for most sites and audio.
  - `s3://voa-rfa-data/` for Voice of America and Radio Free Asia content.
We use `.json` files for article text and metadata, and `.mp3` for audio.
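As a minimal sketch of how one scraped pair could be pushed to the right bucket with boto3 (only the bucket names come from above; the key layout here is an assumption):

```python
import boto3

s3 = boto3.client("s3")

def upload_article(json_path: str, mp3_path: str | None,
                   site: str, is_voa_rfa: bool) -> None:
    """Upload one article's files to the bucket that owns this site."""
    bucket = "voa-rfa-data" if is_voa_rfa else "tibetan-news-data"
    s3.upload_file(json_path, bucket, f"{site}/{json_path}")
    if mp3_path:  # only some articles have audio
        s3.upload_file(mp3_path, bucket, f"{site}/{mp3_path}")
```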
How the Repository Is Organized
```
tibetan-news-article-scraping/
├── News_Articles/
│   ├── in_india_website/   ← Indian-based Tibetan news
│   └── in_tibet_website/   ← Mainland Tibet news
├── Web Translation/        ← Glossaries from Glosbe & LinguaTools
├── wikipedia/              ← Wikipedia dumps
├── test code/              ← Notebooks for conversion & audio tests
└── README.md
```
Repo link: https://github.com/OpenPecha/tibetan-news-article-scraping
Inside each site folder you'll find:
- A folder for raw data
- An `analysis` file with stats on total articles, audio status, failures, and more
Our Scraping Pipeline
- URL Collection: Crawl site sections, gather links, remove duplicates, and check they work.
- Content Extraction: Parse HTML for text, download audio if it exists, and grab metadata (author, date, tags); a sketch of these first two steps follows this list.
- Data Processing: (In progress) We will later add cleaning, language checks, and formatting steps.
- Storage: Save JSON, push to S3, commit to Git, and verify backups.
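Here is that sketch of steps 1 and 2, using requests and BeautifulSoup; the CSS selectors are placeholders, since each site needs its own:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

HEADERS = {"User-Agent": "Mozilla/5.0"}  # some sites block default clients

def collect_urls(section_url: str) -> list[str]:
    """Step 1: gather article links from a section page and de-duplicate."""
    html = requests.get(section_url, headers=HEADERS, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # resolve relative links, then use a set to remove duplicates
    links = {urljoin(section_url, a["href"]) for a in soup.select("a[href]")}
    return sorted(links)

def extract_article(url: str) -> dict:
    """Step 2: parse an article page for text and basic metadata."""
    html = requests.get(url, headers=HEADERS, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        # real selectors differ per site; "article p" is a common container
        "body": " ".join(p.get_text(strip=True)
                         for p in soup.select("article p")),
    }
```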
Keeping Track of Progress
Each site folder has an `analysis` file. It shows:
- How many articles we downloaded
- Which audio files succeeded or failed
- Word-count and title statistics
- Any errors and why they happened
These stats help us spot problems and measure our coverage.
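A minimal sketch of how such an analysis file could be produced from the raw JSON records (field names are assumed, matching the illustrative record above):

```python
import json
from pathlib import Path

def summarize(site_dir: str) -> dict:
    """Aggregate per-site stats from the raw JSON records in one folder."""
    records = [json.loads(p.read_text(encoding="utf-8"))
               for p in Path(site_dir).glob("*.json")]
    # rough proxy only: Tibetan separates syllables with tsheg (་), not spaces
    word_counts = [len(r.get("body", "").split()) for r in records]
    return {
        "total_articles": len(records),
        "with_audio": sum(1 for r in records if r.get("audio_file")),
        "avg_word_count": sum(word_counts) / len(word_counts) if word_counts else 0,
    }

print(json.dumps(summarize("News_Articles/in_india_website/some_site"), indent=2))
```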
What We’ve Collected So Far
- All "new_news_Articles" (sites under `in_tibet_website/`) on S3:
  - 1,476 text files
  - Total size: ~1.3 GB
  - Measured with `aws s3 ls s3://tibetan-news-data/new_news_Articles/ --recursive --summarize | findstr "Total Size"` (findstr is the Windows counterpart of grep; a boto3 equivalent follows this list)
- All "News Article" (sites under `in_india_website/`) on S3:
  - 200 text files
  - Total size: ~0.7 GB
  - Measured with `aws s3 ls s3://tibetan-news-data/"News Article"/ --recursive --summarize | findstr "Total Size"`
- Radio Free Asia (RFA) data in `s3://voa-rfa-data/RFA_Tibetan/`:
  - Total: ~115 GB
  - Text: ~2 GB (≈47,172 articles)
  - Audio: ~113 GB (43,919 files found)
    - We extracted 16,386 of them (≈113 GB).
    - We skipped the remaining 27,533 large files (≈190 GB more) to save space.
- VOA Tibetan in `s3://voa-rfa-data/VOA_Tibetan/` after cleanup:
  - 10,535 articles (down from 31,119)
  - Total size: ~12.9 GB
  - Audio files: 7,306
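For reference, here is the cross-platform boto3 equivalent of the `aws s3 ls ... --summarize | findstr` pipe used above (a sketch; bucket and prefix names taken from the commands in the list):

```python
import boto3

def prefix_size(bucket: str, prefix: str) -> tuple[int, int]:
    """Return (object_count, total_bytes) for every object under a prefix."""
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    count = total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            count += 1
            total += obj["Size"]
    return count, total

n, size = prefix_size("tibetan-news-data", "new_news_Articles/")
print(f"{n} objects, {size / 1e9:.1f} GB")
```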