Scraping Tibetan News Websites

Why We’re Doing This

To build a strong Tibetan-to-English translation model, we need lots of real Tibetan text (and audio when available). Our goal is to gather articles from many Tibetan news sites, organize them well, and store them safely, so that our machine translation systems can learn from a rich dataset of real-world content and the data remains available for other use cases.

What We’ve Built

We made a set of Python scripts that:

  1. Collect article links from key Tibetan news websites.
  2. Download and parse the articles’ text.
  3. Fetch audio files when the site offers them.
  4. Save everything in a clear structure, both in our Git repository (code) and in AWS S3 (data).

Every article we scrape gets a standardized .json file; articles with audio also get an .mp3 file.
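
For illustration, a minimal version of such a record might be written like this (the field names here are illustrative, not the exact schema used in the repo):

    import json

    # Illustrative article record; the actual field names in the repo may differ.
    article = {
        "url": "https://example-tibetan-news-site.com/article/123",
        "title": "Article title in Tibetan",
        "body": "Full article text...",
        "published_date": "2024-01-15",
        "author": "Staff reporter",
        "tags": ["politics"],
        "audio_file": "article_123.mp3",  # None when the site has no audio
    }

    # ensure_ascii=False keeps the Tibetan script readable in the saved file.
    with open("article_123.json", "w", encoding="utf-8") as f:
        json.dump(article, f, ensure_ascii=False, indent=2)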

Where the Data Lives

  • In the repo under News_Articles/, sorted by region (India or Tibet) and by website name.

  • On S3, split across two buckets:

    • s3://tibetan-news-data/ for most sites (text and audio).
    • s3://voa-rfa-data/ for Voice of America and Radio Free Asia content.

We use .json files for article text and metadata, and .mp3 for audio.
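
To spot-check what has landed in a bucket, a short boto3 script can count the .json and .mp3 objects under a prefix. This is a sketch using the bucket and prefix names above:

    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    json_count = mp3_count = total_bytes = 0
    for page in paginator.paginate(Bucket="tibetan-news-data",
                                   Prefix="new_news_Articles/"):
        for obj in page.get("Contents", []):
            total_bytes += obj["Size"]
            if obj["Key"].endswith(".json"):
                json_count += 1
            elif obj["Key"].endswith(".mp3"):
                mp3_count += 1

    print(f"{json_count} JSON files, {mp3_count} MP3 files, "
          f"{total_bytes / 1e9:.1f} GB")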

How the Repository Is Organized

tibetan-news-article-scraping/
├── News_Articles/
│   ├── in_india_website/   ← India-based Tibetan news
│   └── in_tibet_website/   ← Mainland Tibet news
├── Web Translation/        ← Glossaries from Glosbe & LinguaTools
├── wikipedia/              ← Wikipedia dumps
├── test code/              ← Notebooks for conversion & audio tests
└── README.md

Repo link: https://github.com/OpenPecha/tibetan-news-article-scraping

Inside each site folder you’ll find:

  • A folder for raw data
  • An analysis file with stats on total articles, audio status, failures, and more

Our Scraping Pipeline

  1. URL Collection: Crawl site sections, gather links, remove duplicates, and check that each link resolves.
  2. Content Extraction: Parse the HTML for text, download audio if the site offers it, and grab metadata (author, date, tags).
  3. Data Processing (in progress): cleaning, language checks, and formatting steps will be added later.
  4. Storage: Save the JSON, push it to S3, commit code to Git, and verify backups. (A sketch of steps 1, 2, and 4 follows below.)
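
Here is a minimal sketch of steps 1, 2, and 4 in Python, assuming a requests + BeautifulSoup + boto3 stack; the site URL, CSS selectors, and bucket layout are placeholders, and each real site in the repo needs its own parsing rules:

    import json
    from urllib.parse import urljoin

    import boto3
    import requests
    from bs4 import BeautifulSoup

    def collect_links(section_url):
        # Step 1: gather unique article links from a section page and keep
        # only those that resolve. The selector is a placeholder.
        soup = BeautifulSoup(requests.get(section_url, timeout=30).text, "html.parser")
        links = {urljoin(section_url, a["href"])
                 for a in soup.select("a[href*='/article/']")}
        return [url for url in sorted(links) if requests.head(url, timeout=10).ok]

    def extract_article(url):
        # Step 2: parse the text and metadata; selectors vary per site.
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        return {
            "url": url,
            "title": soup.select_one("h1").get_text(strip=True),
            "body": "\n".join(p.get_text(strip=True)
                              for p in soup.select("div.entry-content p")),
        }

    def store_article(article, name):
        # Step 4: save the JSON locally, then push it to S3.
        # ensure_ascii=False keeps the Tibetan script readable in the file.
        path = f"{name}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(article, f, ensure_ascii=False, indent=2)
        boto3.client("s3").upload_file(path, "tibetan-news-data", path)

    for url in collect_links("https://example-tibetan-news-site.com/news/"):
        store_article(extract_article(url), url.rstrip("/").split("/")[-1])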

Keeping Track of Progress

Each site folder has an analysis file. It shows:

  • How many articles we downloaded
  • Which audio files succeeded or failed
  • Word-count and title statistics
  • Any errors and why they happened

These stats help us spot problems and measure our coverage.
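
These summaries are cheap to recompute from the saved JSON files. Here is a sketch, assuming the illustrative record fields shown earlier:

    import json
    from pathlib import Path

    articles = [json.loads(p.read_text(encoding="utf-8"))
                for p in Path("News_Articles/in_tibet_website").rglob("*.json")]

    # Tibetan script has no spaces between words, so a whitespace split
    # undercounts badly; counting tsheg marks (་) gives a rough syllable count.
    syllable_counts = [a.get("body", "").count("་") for a in articles]
    with_audio = sum(1 for a in articles if a.get("audio_file"))

    print(f"articles downloaded: {len(articles)}")
    print(f"articles with audio: {with_audio}")
    if syllable_counts:
        print(f"average syllables per article: "
              f"{sum(syllable_counts) / len(syllable_counts):.0f}")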

What We’ve Collected So Far

  • All “new_news_Articles” data (sites under in_tibet_website/) on S3:

    • 1,476 text files
    • Total size: ~1.3 GB
    • Measured with: aws s3 ls s3://tibetan-news-data/new_news_Articles/ --recursive --summarize | findstr "Total Size" (findstr is the Windows counterpart of grep)
  • All “News Article” data (sites under in_india_website/) on S3:

    • 200 text files
    • Total size: ~0.7 GB
    • Measured with: aws s3 ls s3://tibetan-news-data/"News Article"/ --recursive --summarize | findstr "Total Size"
  • Radio Free Asia (RFA) data in s3://voa-rfa-data/RFA_Tibetan/:

    • Total: ~115 GB

    • Text: ~2 GB (≈ 47,172 articles)

    • Audio: ~113 GB stored (43,919 audio files found in total)

      • We extracted 16,386 of those files (≈ 113 GB).
      • We skipped 27,533 large files (≈ 190 GB more) to save space.
  • VOA Tibetan in s3://voa-rfa-data/VOA_Tibetan/ after cleanup:

    • 10,535 articles (down from 31,119 before cleanup)
    • Total size: ~12.9 GB
    • Audio files: 7,306

For more information on the data structure and the scraped news sources, see the repository: https://github.com/OpenPecha/tibetan-news-article-scraping