OpenRefine for Wikimedia

Setting Up and Using OpenRefine for Buddhist Data Population in Wikimedia Projects

OpenRefine is a powerful tool for cleaning, transforming, and uploading data to Wikimedia sister projects like Wikidata and Wikimedia Commons. This blog post will guide you through setting up OpenRefine and using it effectively for Buddhist data population tasks.

1. Introduction to OpenRefine

OpenRefine (formerly Google Refine) is an open-source tool for working with messy data. It’s particularly valuable for:

  • Data cleaning and transformation
  • Linking datasets to knowledge bases like Wikidata
  • Enriching data from external sources
  • Creating and uploading structured data to Wikimedia projects

2. Setting Up OpenRefine

Installation Options

  1. Desktop Installation:
    • Visit the OpenRefine download page
    • Choose the appropriate version for your operating system (Windows, Mac, or Linux)
    • Download and follow installation instructions

3. Main Features of OpenRefine for Data Cleaning

Faceting

Faceting allows you to quickly analyze your data and find patterns:

  1. Text Faceting: Lists each distinct value in a column with its count, so near-duplicates stand out

    • Great for finding variants of Buddhist terms or names
    • Helps identify inconsistencies in transliterations
  2. Numeric Faceting: Filters data by numeric ranges

    • Useful for dates of Buddhist artifacts or historical events
  3. Custom Faceting: Create expressions to facet on computed values

    • Can combine multiple criteria specific to Buddhist datasets
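
To make the idea concrete outside the OpenRefine interface, here is a minimal Python sketch of what a text facet and a numeric facet compute. The sample values and the function names `text_facet` and `numeric_facet` are illustrative assumptions, not OpenRefine APIs.

```python
from collections import Counter

# Hypothetical sample column of tradition names with inconsistent spellings.
values = ["Theravada", "Theravāda", "theravada", "Mahayana", "Theravada", "Mahāyāna"]


def text_facet(column):
    """Count each distinct value, most frequent first, like a text facet."""
    return Counter(column).most_common()


def numeric_facet(column, low, high):
    """Keep only values inside the inclusive range [low, high]."""
    return [v for v in column if low <= v <= high]


facet = text_facet(values)
for value, count in facet:
    print(f"{count:>3}  {value}")

# Hypothetical years for artifacts (negative numbers stand for BCE).
years = [-250, 150, 780, 1203, 1850]
print(numeric_facet(years, 0, 1000))
```

Seeing "Theravada", "Theravāda", and "theravada" listed as separate facet entries is the signal that the column needs clustering or a transformation.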

Clustering

Clustering helps identify and merge similar values that might represent the same entity:

  1. Key Collision Methods:

    • Fingerprint: Normalizes capitalization, punctuation, whitespace, and word order
    • N-Gram Fingerprint: Helps with minor spelling variations in Buddhist terms
  2. Nearest Neighbor Methods:

    • Levenshtein Distance: Identifies strings with minor differences
    • PPM (Prediction by Partial Matching): Good for longer texts
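
The key collision idea can be sketched in a few lines of Python. This is an approximation of the fingerprint method for illustration only, not OpenRefine's own code; the function names and sample names are assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Approximate fingerprint key: strip accents, lowercase,
    drop ASCII punctuation, then sort the unique tokens."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = no_accents.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a key; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical sample column.
names = ["Nāgārjuna", "Nagarjuna", "nagarjuna.", "Vasubandhu", "Lotus Sutra", "Sutra, Lotus"]
print(cluster(names))
```

Values whose keys collide land in the same cluster, which is why diacritics, stray punctuation, and swapped word order stop mattering.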

For Buddhist data, clustering is invaluable for normalizing:

  • Different spellings of the same Buddhist term
  • Variant transliterations of Sanskrit or Pali terms
  • Different naming conventions for historical Buddhist figures
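
For transliteration variants that differ by a letter or two, a nearest neighbor comparison is the better fit. Below is a minimal sketch of Levenshtein distance with a pairwise comparison; the `radius` parameter name and the sample titles are assumptions for illustration, and OpenRefine's own implementation uses blocking to avoid comparing every pair.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def near_pairs(values, radius=2):
    """Return every pair of values whose edit distance is within the radius."""
    pairs = []
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            if levenshtein(a.lower(), b.lower()) <= radius:
                pairs.append((a, b))
    return pairs


# Hypothetical sample: Pali and Sanskrit spellings plus a typo.
terms = ["Dhammapada", "Dhamapada", "Dharmapada", "Vinaya"]
print(near_pairs(terms))
```

A small radius keeps unrelated terms apart, but it can also merge distinct terms that happen to be spelled alike, so review every suggested cluster before merging.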

4. Reconciliation with Wikidata

Reconciliation is the process of matching your data to existing entries in Wikidata.

Reconciliation Process

  1. Set Up Reconciliation Service:

    • In your column’s dropdown menu, select “Reconcile” → “Start reconciling”
  2. Restrict Types (optional):

    • For Buddhist content, you might restrict to specific classes like:
      • Religious text (Q179461)
      • Buddhist temple (Q4414081)
      • Buddhist concept (Q25341675)
      • Buddhist term (Q86691240)
  3. Use Property Matching:

    • Enhance reconciliation by matching additional columns to Wikidata properties
    • Example: Use columns like “location,” “time period,” or “school of Buddhism”
  4. Review and Select Matches:

    • For each row, OpenRefine will suggest potential Wikidata matches
    • Review and select the correct match, or mark as “Create new item” if none exists
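
Behind the scenes, OpenRefine sends each batch of rows to the reconciliation service as a JSON object of queries. The sketch below builds such a batch so you can see how the type restriction and the property matching fit together. It assumes the query shape described in the Reconciliation Service API ("query", "type", "limit", "properties" with "pid" and "v"); the helper name `build_query` and the sample rows are hypothetical, and the class QID is copied from the list above.

```python
import json


def build_query(name, type_qid=None, properties=None, limit=3):
    """Build one reconciliation query for a single cell value."""
    query = {"query": name, "limit": limit}
    if type_qid:
        query["type"] = type_qid  # restrict candidates to one class
    if properties:
        # Extra columns matched against Wikidata properties.
        query["properties"] = [
            {"pid": pid, "v": value} for pid, value in properties.items() if value
        ]
    return query


# Hypothetical rows: a name column plus a country column matched to P17 (country).
rows = [
    {"name": "Borobudur", "country": "Indonesia"},
    {"name": "Mahabodhi Temple", "country": "India"},
]

batch = {
    f"q{i}": build_query(row["name"], type_qid="Q4414081",
                         properties={"P17": row["country"]})
    for i, row in enumerate(rows)
}
payload = json.dumps(batch, ensure_ascii=False, indent=2)
print(payload)
```

You never need to write this by hand; it is shown only to clarify why adding a well-filled supporting column usually improves match scores.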

Creating New Items

When reconciling Buddhist data not yet in Wikidata:

  1. Select “Create new item” during reconciliation
  2. This flags the item for creation when you upload to Wikidata

5. Creating Schemas for Wikidata Upload

Schemas tell OpenRefine how to translate your data into Wikidata statements.

Basic Schema Creation

  1. Access Schema Editor:

    • Click the “Wikidata” button → “Edit Wikidata schema” (recent OpenRefine versions label these “Wikibase” and “Edit Wikibase schema”)
  2. Define Subject Items:

    • Add an item by clicking “+ add item”
    • Drag your reconciled column to the subject field
  3. Add Statements:

    • Click “+ add statement” under each item
    • Select the appropriate property (P number)
    • Drag the relevant column as the value
    • Add qualifiers or references as needed

Example Schema for Buddhist Texts

For a dataset of Buddhist texts, your schema might include:

  • Item: Reconciled “Text Title” column
  • Statements:
    • instance of (P31) → religious text (Q179461)
    • language of work (P407) → appropriate language
    • author (P50) → author column (possibly reconciled)
    • inception (P571) → date column
    • part of (P361) → appropriate collection or canon
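
One way to think about a schema is as a template applied to every row. The following Python sketch mimics that with a small table of property mappings. The column names and the placeholder identifiers (`Q_TEXT`, `Q_LANG`, `Q_CANON`) are hypothetical; only the property numbers and the class QID come from the example above.

```python
# Each entry maps a property either to a constant or to a column of the dataset.
SCHEMA = [
    ("P31", {"constant": "Q179461"}),        # instance of
    ("P407", {"column": "language_qid"}),    # language of work
    ("P50", {"column": "author_qid"}),       # author
    ("P571", {"column": "inception"}),       # inception
    ("P361", {"column": "collection_qid"}),  # part of
]


def statements_for(row):
    """Apply the schema to one row, skipping statements with empty values."""
    subject = row["text_qid"]
    statements = []
    for prop, source in SCHEMA:
        value = source.get("constant") or row.get(source["column"])
        if value:
            statements.append((subject, prop, value))
    return statements


# Hypothetical reconciled row with placeholder identifiers and no known author.
row = {
    "text_qid": "Q_TEXT",
    "language_qid": "Q_LANG",
    "author_qid": "",
    "inception": "0150",
    "collection_qid": "Q_CANON",
}
for statement in statements_for(row):
    print(statement)
```

Skipping empty cells mirrors what you want from the real schema: a missing author should produce no statement rather than a blank one.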

Adding References

Always add references to your statements:

  1. Click “+ add reference” under a statement
  2. Add appropriate reference properties:
    • stated in (P248) → source publication
    • retrieved (P813) → date of data collection
    • reference URL (P854) → source URL
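
Before building the schema, it can help to check that every row actually has something to cite. This is a minimal sketch of such a check in Python; the column names `source_qid` and `source_url` and the helper names are assumptions for illustration.

```python
from datetime import date


def build_reference(row, retrieved=None):
    """Collect the reference properties available for one row.

    Returns an empty dict when the row has neither a source item nor a URL.
    """
    retrieved = retrieved or date.today()
    reference = {}
    if row.get("source_qid"):
        reference["P248"] = row["source_qid"]   # stated in
    if row.get("source_url"):
        reference["P854"] = row["source_url"]   # reference URL
    if reference:
        reference["P813"] = retrieved.isoformat()  # retrieved
    return reference


def rows_missing_reference(rows):
    """Return the indexes of rows that have nothing to cite."""
    return [i for i, row in enumerate(rows) if not build_reference(row)]


# Hypothetical rows; the second one has no source at all.
rows = [
    {"source_url": "https://example.org/catalogue/1"},
    {},
    {"source_qid": "Q_SOURCE"},
]
print(build_reference(rows[0], date(2024, 5, 1)))
print(rows_missing_reference(rows))
```

Rows flagged here are better fixed in the dataset than uploaded without a reference.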

6. Upload Methods

Direct Upload to Wikidata

  1. Preview Edits:

    • Ensure schema is complete and error-free
    • Click “Preview” to see what will be uploaded
  2. Fix Issues:

    • Address any format issues or missing references
    • OpenRefine highlights issues that need correction
  3. Upload:

    • Click “Upload edits to Wikibase…”
    • Log in with your Wikimedia account
    • Provide an edit summary
    • Click “Upload edits”

Export to QuickStatements

QuickStatements is an alternative tool for batch editing Wikidata.

  1. Export from OpenRefine:

    • Click “Export” → “QuickStatements v1” format
    • Copy the generated commands or save to a file
  2. Use QuickStatements:

  3. Autoconfirmed Privileges:

    • For batch uploading, ensure your account is “autoconfirmed”
    • Requirements: Account is more than 4 days old and has at least 50 edits
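
If you want to read or spot-check the exported commands, it helps to know their shape. The sketch below assembles one line in the tab-separated QuickStatements v1 syntax as I understand it: item, property, value, then reference pairs whose property IDs start with "S" instead of "P". The helper names are hypothetical, Q4115189 is the Wikidata Sandbox item, and the values are assumed to contain no tabs or double quotes.

```python
def qs_time(iso_date, precision=11):
    """Format a YYYY-MM-DD date as a time value (precision 11 = day)."""
    return f"+{iso_date}T00:00:00Z/{precision}"


def qs_string(text):
    """Wrap a plain string or URL in double quotes."""
    return f'"{text}"'


def qs_line(subject, prop, value, reference=None):
    """Join one statement and its optional reference into a single command."""
    parts = [subject, prop, value]
    for ref_prop, ref_value in (reference or {}).items():
        parts += ["S" + ref_prop[1:], ref_value]  # P854 becomes S854
    return "\t".join(parts)


line = qs_line(
    "Q4115189",  # Wikidata Sandbox item, safe for experiments
    "P31",
    "Q179461",
    reference={
        "P854": qs_string("https://example.org/catalogue/42"),
        "P813": qs_time("2024-05-01"),
    },
)
print(line)
```

OpenRefine generates these lines for you; the sketch is only a reading aid for reviewing an export before you paste it into QuickStatements.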

7. OpenRefine for Wikimedia Commons

Setting Up for Commons Uploads

  1. Install Commons Extension (optional but recommended):

  2. Prepare Your Dataset:

    • Create columns for all required metadata:
      • File path or URL
      • Desired filename on Commons
      • Description (wikitext)
      • Structured data statements
      • Source and license information
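
If your metadata starts life in a script rather than a spreadsheet, you can write the dataset as a CSV file that OpenRefine imports directly. Here is a minimal sketch; the column names, the sample row, and the `Q_PLACEHOLDER` value are assumptions, so adapt them to your own project.

```python
import csv
import io

# Hypothetical column names covering the metadata listed above.
COLUMNS = ["file_path", "commons_filename", "wikitext", "depicts_qid", "source", "license"]


def write_manifest(rows, handle):
    """Write rows as CSV, refusing any row with an empty required column."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        missing = [column for column in COLUMNS if not row.get(column)]
        if missing:
            raise ValueError(f"{row.get('commons_filename', '?')}: missing {missing}")
        writer.writerow(row)


rows = [{
    "file_path": "/data/scans/borobudur_panel_01.jpg",
    "commons_filename": "Borobudur_relief_panel_depicting_Jataka_tale.jpg",
    "wikitext": "[[Category:Buddhist temples in Indonesia]]",
    "depicts_qid": "Q_PLACEHOLDER",
    "source": "Own work",
    "license": "CC-BY-SA-4.0",
}]

buffer = io.StringIO()  # swap for open("manifest.csv", "w", newline="") to write a file
write_manifest(rows, buffer)
print(buffer.getvalue())
```

Failing early on an empty cell is cheaper than discovering the gap after a partial upload.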

Uploading Images (JPEG)

  1. Create File Path Column:

    • Either local paths or URLs to images
    • For URLs, ensure the domain is on the Commons allowlist for uploads by URL
  2. Create Filename Column:

    • Follow Commons naming conventions
    • Descriptive filenames for Buddhist content, e.g., “Borobudur_relief_panel_depicting_Jataka_tale.jpg”
  3. Create Wikitext Column:

    • Include categories, e.g., “[[Category:Buddhist temples in Indonesia]]”
    • Include license templates, e.g., “{{CC-BY-SA-4.0}}”
    • Include source information
  4. Create Schema:

    • Click “Schema” tab or “Edit Wikibase schema”
    • Click “+ add media”
    • Drag appropriate columns to each field
  5. Upload:

    • Select “Upload edits to Wikibase…”
    • Log in and provide an edit summary
    • Start with a small test batch
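
The filename and wikitext columns are easy to generate with a short script before importing into OpenRefine. The sketch below is one possible approach, not a Commons requirement: the helper names are hypothetical, the list of stripped characters reflects my understanding of what page titles cannot contain, and the wikitext layout assumes the common {{Information}} template.

```python
import re

# Characters assumed to be unsafe in Commons file titles.
FORBIDDEN = re.compile(r'[#<>\[\]|{}:/\\]')


def commons_filename(title, extension="jpg"):
    """Strip unsafe characters and join the words with underscores."""
    cleaned = FORBIDDEN.sub("", title).strip()
    cleaned = re.sub(r"\s+", "_", cleaned)
    return f"{cleaned}.{extension}"


def build_wikitext(description, source, author, license_template, categories):
    """Assemble a file description page: summary, license, then categories."""
    lines = [
        "=={{int:filedesc}}==",
        "{{Information",
        "|description={{en|1=" + description + "}}",
        "|source=" + source,
        "|author=" + author,
        "}}",
        "",
        "=={{int:license-header}}==",
        "{{" + license_template + "}}",
        "",
    ]
    lines += ["[[Category:" + category + "]]" for category in categories]
    return "\n".join(lines)


print(commons_filename("Borobudur relief panel depicting Jataka tale"))
print(build_wikitext(
    description="Relief panel at Borobudur depicting a Jataka tale",
    source="Own work",
    author="Example Photographer",  # hypothetical author
    license_template="CC-BY-SA-4.0",
    categories=["Buddhist temples in Indonesia"],
))
```

Generating both columns from the same source fields keeps filenames and descriptions consistent across a batch.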

Uploading PDF Files

The process is similar to uploading images, with these considerations:

  1. File Size Limits:

    • PDFs on Commons are limited to 100MB
    • Large Buddhist texts may need splitting
  2. Required Metadata:

    • For Buddhist manuscripts, include:
      • Language
      • Script (e.g., Devanagari, Thai script)
      • Time period
      • Origin
  3. OCR Text (when applicable):

    • If the PDF contains machine-readable text, mention this in the description
    • Helps with searchability of Buddhist texts
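
A quick size check before you build the upload batch saves failed uploads later. The sketch below flags files above the limit mentioned above; it assumes 100MB means 100 × 1024 × 1024 bytes, and the helper name `oversized` is hypothetical. The demo writes two tiny temporary files and uses a 1 KB limit so that it runs anywhere.

```python
import os
import tempfile

# Assumption: the 100MB limit is interpreted as 100 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 100 * 1024 * 1024


def oversized(paths, limit=MAX_PDF_BYTES):
    """Return the paths whose file size exceeds the limit."""
    return [path for path in paths if os.path.getsize(path) > limit]


# Demo with throwaway files standing in for real PDFs.
with tempfile.TemporaryDirectory() as folder:
    small = os.path.join(folder, "small.pdf")
    large = os.path.join(folder, "large.pdf")
    with open(small, "wb") as handle:
        handle.write(b"0" * 512)
    with open(large, "wb") as handle:
        handle.write(b"0" * 2048)
    flagged = [os.path.basename(p) for p in oversized([small, large], limit=1024)]

print(flagged)
```

In practice you would pass the values of your file path column and keep the default limit.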

Conclusion

OpenRefine is an invaluable tool for processing and uploading Buddhist data to Wikimedia projects. By mastering data cleaning, reconciliation, schema creation, and the upload process, you can contribute significantly to making Buddhist knowledge more accessible and interconnected within the Wikimedia ecosystem.