Scraping Tibetan News Websites

Why We’re Doing This

To build a strong Tibetan-to-English translation model, we need lots of real Tibetan text (and audio when available). Our goal is to gather articles from many Tibetan news sites, organize them well, and store them safely, so that our machine translation systems can learn from a rich dataset of real-world content and the data remains available for other use cases.

What We’ve Built

We made a set of Python scripts that:

  1. Collect article links from key Tibetan news websites.
  2. Download and parse the articles’ text.
  3. Fetch audio files when the site offers them.
  4. Save everything in a clear structure, both in our Git repository (code) and in AWS S3 (data).

Every article we scrape gets a standardized .json file; articles with audio also get an .mp3 file.
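
For illustration, a minimal version of such a record might be written like this (the field names here are illustrative, not the exact schema used in the repo):

    import json

    # Illustrative article record; the actual field names in the repo may differ.
    article = {
        "url": "https://example-tibetan-news-site.com/article/123",
        "title": "Article title in Tibetan",
        "body": "Full article text...",
        "published_date": "2024-01-15",
        "author": "Staff reporter",
        "tags": ["politics"],
        "audio_file": "article_123.mp3",  # None when the site has no audio
    }

    # ensure_ascii=False keeps the Tibetan script readable in the saved file.
    with open("article_123.json", "w", encoding="utf-8") as f:
        json.dump(article, f, ensure_ascii=False, indent=2)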

Where the Data Lives

  • In the repo under News_Articles/, sorted by region (India or Tibet) and by website name.

  • On S3, split across two buckets:

    • s3://tibetan-news-data/ for most sites (text and audio).
    • s3://voa-rfa-data/ for Voice of America and Radio Free Asia content.

We use .json files for article text and metadata, and .mp3 for audio.
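
To spot-check what has landed in a bucket, a short boto3 script can count the .json and .mp3 objects under a prefix. This is a sketch using the bucket and prefix names above:

    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    json_count = mp3_count = total_bytes = 0
    for page in paginator.paginate(Bucket="tibetan-news-data",
                                   Prefix="new_news_Articles/"):
        for obj in page.get("Contents", []):
            total_bytes += obj["Size"]
            if obj["Key"].endswith(".json"):
                json_count += 1
            elif obj["Key"].endswith(".mp3"):
                mp3_count += 1

    print(f"{json_count} JSON files, {mp3_count} MP3 files, "
          f"{total_bytes / 1e9:.1f} GB")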

How the Repository Is Organized

tibetan-news-article-scraping/
├── News_Articles/
│   ├── in_india_website/   ← India-based Tibetan news
│   └── in_tibet_website/   ← Mainland Tibet news
├── Web Translation/        ← Glossaries from Glosbe & LinguaTools
├── wikipedia/              ← Wikipedia dumps
├── test code/              ← Notebooks for conversion & audio tests
└── README.md

Repo link: https://github.com/OpenPecha/tibetan-news-article-scraping

Inside each site folder you’ll find:

  • A folder for raw data
  • An analysis file with stats on total articles, audio status, failures, and more

Our Scraping Pipeline

  1. URL Collection: Crawl site sections, gather links, remove duplicates, and check that each link resolves.
  2. Content Extraction: Parse the HTML for text, download audio if the site offers it, and grab metadata (author, date, tags).
  3. Data Processing (in progress): cleaning, language checks, and formatting steps will be added later.
  4. Storage: Save the JSON, push it to S3, commit code to Git, and verify backups. (A sketch of steps 1, 2, and 4 follows below.)
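
Here is a minimal sketch of steps 1, 2, and 4 in Python, assuming a requests + BeautifulSoup + boto3 stack; the site URL, CSS selectors, and bucket layout are placeholders, and each real site in the repo needs its own parsing rules:

    import json
    from urllib.parse import urljoin

    import boto3
    import requests
    from bs4 import BeautifulSoup

    def collect_links(section_url):
        # Step 1: gather unique article links from a section page and keep
        # only those that resolve. The selector is a placeholder.
        soup = BeautifulSoup(requests.get(section_url, timeout=30).text, "html.parser")
        links = {urljoin(section_url, a["href"])
                 for a in soup.select("a[href*='/article/']")}
        return [url for url in sorted(links) if requests.head(url, timeout=10).ok]

    def extract_article(url):
        # Step 2: parse the text and metadata; selectors vary per site.
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        return {
            "url": url,
            "title": soup.select_one("h1").get_text(strip=True),
            "body": "\n".join(p.get_text(strip=True)
                              for p in soup.select("div.entry-content p")),
        }

    def store_article(article, name):
        # Step 4: save the JSON locally, then push it to S3.
        # ensure_ascii=False keeps the Tibetan script readable in the file.
        path = f"{name}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(article, f, ensure_ascii=False, indent=2)
        boto3.client("s3").upload_file(path, "tibetan-news-data", path)

    for url in collect_links("https://example-tibetan-news-site.com/news/"):
        store_article(extract_article(url), url.rstrip("/").split("/")[-1])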

Keeping Track of Progress

Each site folder has an analysis file. It shows:

  • How many articles we downloaded
  • Which audio files succeeded or failed
  • Word-count and title statistics
  • Any errors and why they happened

These stats help us spot problems and measure our coverage.
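
These summaries are cheap to recompute from the saved JSON files. Here is a sketch, assuming the illustrative record fields shown earlier:

    import json
    from pathlib import Path

    articles = [json.loads(p.read_text(encoding="utf-8"))
                for p in Path("News_Articles/in_tibet_website").rglob("*.json")]

    # Tibetan script has no spaces between words, so a whitespace split
    # undercounts badly; counting tsheg marks (་) gives a rough syllable count.
    syllable_counts = [a.get("body", "").count("་") for a in articles]
    with_audio = sum(1 for a in articles if a.get("audio_file"))

    print(f"articles downloaded: {len(articles)}")
    print(f"articles with audio: {with_audio}")
    if syllable_counts:
        print(f"average syllables per article: "
              f"{sum(syllable_counts) / len(syllable_counts):.0f}")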

What We’ve Collected So Far

  • All “new_news_Articles” data (sites under in_tibet_website/) on S3:

    • 1,476 text files
    • Total size: ~1.3 GB
    • Measured with: aws s3 ls s3://tibetan-news-data/new_news_Articles/ --recursive --summarize | findstr "Total Size" (findstr is the Windows counterpart of grep)
  • All “News Article” data (sites under in_india_website/) on S3:

    • 200 text files
    • Total size: ~0.7 GB
    • Measured with: aws s3 ls s3://tibetan-news-data/"News Article"/ --recursive --summarize | findstr "Total Size"
  • Radio Free Asia (RFA) data in s3://voa-rfa-data/RFA_Tibetan/:

    • Total: ~115 GB

    • Text: ~2 GB (≈ 47,172 articles)

    • Audio: ~113 GB stored (43,919 audio files found in total)

      • We extracted 16,386 of those files (≈ 113 GB).
      • We skipped 27,533 large files (≈ 190 GB more) to save space.
  • VOA Tibetan in s3://voa-rfa-data/VOA_Tibetan/ after cleanup:

    • 10,535 articles (down from 31,119 before cleanup)
    • Total size: ~12.9 GB
    • Audio files: 7,306

For more information on the data structure and the scraped news sources, see the repository: https://github.com/OpenPecha/tibetan-news-article-scraping