Setting Up and Using OpenRefine for Buddhist Data Population in Wikimedia Projects
OpenRefine is a powerful tool for cleaning, transforming, and uploading data to Wikimedia sister projects like Wikidata and Wikimedia Commons. This blog post will guide you through setting up OpenRefine and using it effectively for Buddhist data population tasks.
1. Introduction to OpenRefine
OpenRefine (formerly Google Refine) is an open-source tool for working with messy data. It’s particularly valuable for:
- Data cleaning and transformation
- Linking datasets to knowledge bases like Wikidata
- Enriching data from external sources
- Creating and uploading structured data to Wikimedia projects
2. Setting Up OpenRefine
Installation Options
- Desktop Installation:
  - Visit the OpenRefine download page
  - Choose the appropriate version for your operating system (Windows, Mac, or Linux)
  - Download the package and follow the installation instructions
3. Main Features of OpenRefine for Data Cleaning
Faceting
Faceting summarizes the values in a column so you can spot patterns and filter on them:
- Text Faceting: Groups identical text values and shows how often each occurs
  - Great for finding variants of Buddhist terms or names
  - Helps identify inconsistencies in transliterations
- Numeric Faceting: Filters data by numeric ranges
  - Useful for dates of Buddhist artifacts or historical events
- Custom Faceting: Facets on values computed by an expression you write
  - Can combine multiple criteria specific to Buddhist datasets
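To make the idea concrete, a text facet is essentially a count of distinct values, and a custom facet is the same count taken over a computed value. The following is a minimal Python sketch of that idea outside OpenRefine. The sample values and the function names `text_facet` and `custom_facet` are made up for illustration and are not part of OpenRefine.

```python
from collections import Counter

# Hypothetical sample column; in OpenRefine this would be one column of a project.
terms = ["Dharma", "dharma", "Dhamma", "Dharma ", "Sangha", "Saṅgha", "Sangha"]


def text_facet(values):
    """Count each distinct value, most frequent first (roughly what a text facet shows)."""
    return Counter(values).most_common()


def custom_facet(values, key):
    """Count values after applying a function, similar in spirit to a custom facet."""
    return Counter(key(v) for v in values).most_common()


facet = text_facet(terms)
# Facet on a trimmed, lowercased form to see how many variants collapse together.
trimmed_lower = custom_facet(terms, lambda v: v.strip().lower())

for value, count in trimmed_lower:
    print(f"{value!r}: {count}")
```

Comparing the raw facet with the trimmed, lowercased one shows which differences are only whitespace or capitalization and which are genuine spelling variants.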
Clustering
Clustering helps identify and merge similar values that might represent the same entity:
- Key Collision Methods:
  - Fingerprint: Normalizes capitalization, punctuation, and whitespace
  - N-Gram Fingerprint: Helps with minor spelling variations in Buddhist terms
- Nearest Neighbor Methods:
  - Levenshtein Distance: Identifies strings with minor differences
  - PPM (Prediction by Partial Matching): Good for longer texts
For Buddhist data, clustering is invaluable for normalizing:
- Different spellings of the same Buddhist term
- Variant transliterations of Sanskrit or Pali terms
- Different naming conventions for historical Buddhist figures
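The fingerprint idea can be sketched in a few lines of Python. This is an approximation written for illustration, not OpenRefine's own code: it trims and lowercases the value, strips diacritics and ASCII punctuation, then sorts and de-duplicates the tokens. The function names and sample names are assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key for a string (illustrative, not OpenRefine's code)."""
    text = value.strip().lower()
    # Decompose accented characters and drop the combining marks (e.g. ā -> a).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort and de-duplicate tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with real variants."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(set(g)) > 1]


# Hypothetical sample data with transliteration variants.
names = ["Nāgārjuna", "Nagarjuna", "nagarjuna.", "Vasubandhu", "Asaṅga", "Asanga"]
clusters = cluster(names)
print(clusters)
```

A key-collision method like this only catches variants that normalize to the same key. Pairs such as "Dharma" and "Dhamma" differ in their letters, so they would need a nearest neighbor method instead.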
4. Reconciliation with Wikidata
Reconciliation is the process of matching your data to existing entries in Wikidata.
Reconciliation Process
- Set Up Reconciliation Service:
  - In your column’s dropdown menu, select “Reconcile” → “Start reconciling”, then choose the Wikidata reconciliation service
- Restrict Types (optional):
  - For Buddhist content, you might restrict to specific classes like:
    - Religious text (Q179461)
    - Buddhist temple (Q4414081)
    - Buddhist concept (Q25341675)
    - Buddhist term (Q86691240)
- Use Property Matching:
  - Enhance reconciliation by matching additional columns to Wikidata properties
  - Example: Use columns like “location,” “time period,” or “school of Buddhism”
- Review and Select Matches:
  - For each row, OpenRefine will suggest potential Wikidata matches
  - Review and select the correct match, or mark as “Create new item” if none exists
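Behind the scenes, each row becomes a query containing the cell value, an optional type restriction, and any extra property values. The sketch below builds such a batch in the general shape used by the reconciliation service protocol (`query`, `type`, `limit`, `properties` with `pid` and `v`). It only constructs the JSON and sends nothing. The column names, the helper `build_queries`, and the choice of P17 (country) are assumptions for illustration, and the class ID is the one listed above for Buddhist temple.

```python
import json


def build_queries(rows, name_column, type_qid=None, property_columns=None, limit=3):
    """Build a batch of reconciliation queries, one per row (illustrative helper)."""
    property_columns = property_columns or {}
    queries = {}
    for i, row in enumerate(rows):
        query = {"query": row[name_column], "limit": limit}
        if type_qid:
            query["type"] = type_qid  # restrict candidates to one class
        # Extra columns matched against Wikidata properties; empty cells are skipped.
        props = [
            {"pid": pid, "v": row[col]}
            for col, pid in property_columns.items()
            if row.get(col)
        ]
        if props:
            query["properties"] = props
        queries[f"q{i}"] = query
    return queries


# Hypothetical rows from a dataset of temples.
rows = [
    {"name": "Borobudur", "country": "Indonesia"},
    {"name": "Tōdai-ji", "country": ""},
]
queries = build_queries(
    rows, "name", type_qid="Q4414081", property_columns={"country": "P17"}
)
payload = json.dumps(queries, ensure_ascii=False, indent=2)
print(payload)
```

Printing the payload makes it easier to see why extra columns help: a row with a country value gives the service more to score candidates against than a bare name.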
Creating New Items
When reconciling Buddhist data not yet in Wikidata:
- Select “Create new item” during reconciliation
- This flags the item for creation when you upload to Wikidata
5. Creating Schemas for Wikidata Upload
Schemas tell OpenRefine how to translate your data into Wikidata statements.
Basic Schema Creation
- Access Schema Editor:
  - Click the “Wikidata” extension button (“Wikibase” in recent OpenRefine versions) → “Edit Wikidata schema” (or “Edit Wikibase schema”)
- Define Subject Items:
  - Add an item by clicking “+ add item”
  - Drag your reconciled column to the subject field
- Add Statements:
  - Click “+ add statement” under each item
  - Select the appropriate property (P number)
  - Drag the relevant column as the value
  - Add qualifiers or references as needed
Example Schema for Buddhist Texts
For a dataset of Buddhist texts, your schema might include:
- Item: Reconciled “Text Title” column
- Statements:
  - instance of (P31) → religious text (Q179461)
  - language of work (P407) → appropriate language
  - author (P50) → author column (possibly reconciled)
  - inception (P571) → date column
  - part of (P361) → appropriate collection or canon
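One way to think about the schema is as a fixed mapping from properties to either a constant or a column, applied to every row. Here is a rough Python sketch of that mapping using the property IDs from the example above. The column names, the placeholder values, and the `row_to_statements` helper are hypothetical, and the sketch simply chooses to emit no statement for an empty cell.

```python
# Each property maps to a constant value or to a (hypothetical) column name.
SCHEMA = {
    "P31": {"constant": "Q179461"},  # instance of, class ID from the example above
    "P407": {"column": "language"},  # language of work
    "P50": {"column": "author"},  # author
    "P571": {"column": "date"},  # inception
    "P361": {"column": "collection"},  # part of
}


def row_to_statements(row, schema=SCHEMA):
    """Turn one row into (property, value) pairs, skipping empty cells."""
    statements = []
    for pid, rule in schema.items():
        value = rule.get("constant") or row.get(rule.get("column", ""))
        if value:
            statements.append((pid, value))
    return statements


# Placeholder values; real rows would hold reconciled items and dates.
row = {"language": "Q_LANGUAGE", "author": "", "date": "0150", "collection": None}
statements = row_to_statements(row)
print(statements)
```

Writing the mapping out this way is a quick check that every column you intend to upload has a property, and that rows with gaps will not produce empty statements.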
Adding References
Always add references to your statements:
- Click “+ add reference” under a statement
- Add appropriate reference properties:
  - stated in (P248) → source publication
  - retrieved (P813) → date of data collection
  - reference URL (P854) → source URL
6. Upload Methods
Direct Upload to Wikidata
- Preview Edits:
  - Ensure schema is complete and error-free
  - Click “Preview” to see what will be uploaded
- Fix Issues:
  - Address any format issues or missing references
  - OpenRefine highlights issues that need correction
- Upload:
  - Click “Upload edits to Wikibase…”
  - Log in with your Wikimedia account
  - Provide an edit summary
  - Click “Upload edits”
Export to QuickStatements
QuickStatements is an alternative tool for batch editing Wikidata.
- Export from OpenRefine:
  - Click “Export” → “QuickStatements v1” format
  - Copy the generated commands or save to a file
- Use QuickStatements:
  - Go to QuickStatements
  - Paste your commands
  - Review and run the batch
- Autoconfirmed Privileges:
  - For batch uploading, ensure your account is “autoconfirmed”
  - Requirements: Account is more than 4 days old and has at least 50 edits
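To give a sense of what the exported commands look like, here is a small Python sketch that writes QuickStatements v1 lines for a new item with one referenced statement. In this format fields are tab-separated, `CREATE` and `LAST` refer to a newly created item, and reference properties use an `S` prefix instead of `P`. The helper names, label, and URL are made up for illustration, and OpenRefine's own export may differ in detail.

```python
from datetime import date


def qs_time(d: date) -> str:
    """Format a date as a QuickStatements time value with day precision (/11)."""
    return f"+{d.isoformat()}T00:00:00Z/11"


def qs_statement(subject, pid, value, ref_url=None, retrieved=None):
    """One tab-separated statement line, optionally with S854/S813 references."""
    parts = [subject, pid, value]
    if ref_url:
        parts += ["S854", f'"{ref_url}"']  # reference URL
    if retrieved:
        parts += ["S813", qs_time(retrieved)]  # retrieved
    return "\t".join(parts)


def qs_new_item(label_en, statements, ref_url=None, retrieved=None):
    """Commands that create an item, set its English label, and add statements."""
    lines = ["CREATE", f'LAST\tLen\t"{label_en}"']
    lines += [
        qs_statement("LAST", pid, value, ref_url, retrieved)
        for pid, value in statements
    ]
    return "\n".join(lines)


# Hypothetical label and source URL.
commands = qs_new_item(
    "Example Buddhist text",
    [("P31", "Q179461")],
    ref_url="https://example.org/catalogue/1",
    retrieved=date(2024, 5, 1),
)
print(commands)
```

Reading a few generated lines before running a batch is a cheap way to catch a wrong property or a missing reference.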
7. OpenRefine for Wikimedia Commons
Setting Up for Commons Uploads
- Install Commons Extension (optional but recommended):
  - Download the Commons extension for OpenRefine
  - Place it in your OpenRefine extensions folder
- Prepare Your Dataset:
  - Create columns for all required metadata:
    - File path or URL
    - Desired filename on Commons
    - Description (wikitext)
    - Structured data statements
    - Source and license information
Uploading Images (JPEG)
- Create File Path Column:
  - Either local paths or URLs to images
  - For URLs, ensure the domain is allowed on Commons
- Create Filename Column:
  - Follow Commons naming conventions
  - Use descriptive filenames for Buddhist content, e.g., “Borobudur_relief_panel_depicting_Jataka_tale.jpg”
- Create Wikitext Column:
  - Include categories, e.g., “[[Category:Buddhist temples in Indonesia]]”
  - Include license templates, e.g., “{{CC-BY-SA-4.0}}”
  - Include source information
- Create Schema:
  - Open the “Schema” tab or choose “Edit Wikibase schema”
  - Click “+ add media”
  - Drag appropriate columns to each field
- Upload:
  - Select “Upload edits to Wikibase…”
  - Log in and provide an edit summary
  - Start with a small test batch
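The wikitext column is often easiest to generate before importing the data. Below is a minimal Python sketch that assembles a file description page from a few fields, using the license template and category from the examples above. The `build_wikitext` helper, its parameters, and the sample description, source, and author are assumptions, and the layout is only one common arrangement of a Commons description page.

```python
def build_wikitext(description, source, author, license_template, categories, date=""):
    """Assemble wikitext for a file page: description, license, then categories."""
    lines = [
        "=={{int:filedesc}}==",
        "{{Information",
        f"|description={description}",
        f"|date={date}",
        f"|source={source}",
        f"|author={author}",
        "}}",
        "",
        "=={{int:license-header}}==",
        "{{" + license_template + "}}",
        "",
    ]
    lines += [f"[[Category:{name}]]" for name in categories]
    return "\n".join(lines)


# Hypothetical metadata for one image row.
wikitext = build_wikitext(
    description="Relief panel at Borobudur depicting a Jataka tale",
    source="Own work",
    author="Example Uploader",
    license_template="CC-BY-SA-4.0",
    categories=["Buddhist temples in Indonesia"],
)
print(wikitext)
```

Generating the text from separate fields keeps the source, author, and license in their own columns, so the same data can also feed structured data statements.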
Uploading PDF Files
The process is similar to uploading images, with these considerations:
- File Size Limits:
  - The basic Commons upload form limits files to 100 MB (larger files require chunked uploading)
  - Large Buddhist texts may need splitting
- Required Metadata:
  - For Buddhist manuscripts, include:
    - Language
    - Script (e.g., Devanagari, Thai script)
    - Time period
    - Origin
- OCR Text (when applicable):
  - If the PDF contains machine-readable text, mention this in the description
  - Helps with searchability of Buddhist texts
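Before building the upload batch, it can help to flag files that exceed the size limit. This short Python sketch checks local files against the 100 MB figure mentioned above. The `oversized` helper and the demo file are hypothetical, and the tiny limit in the demo exists only to show the check firing.

```python
import os
import tempfile

MAX_BYTES = 100 * 1024 * 1024  # the 100 MB figure mentioned above


def oversized(paths, limit=MAX_BYTES):
    """Return the paths whose file size exceeds the limit."""
    return [p for p in paths if os.path.getsize(p) > limit]


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.pdf")
    with open(small, "wb") as f:
        f.write(b"%PDF-1.4\n")
    too_big = oversized([small], limit=4)  # artificially tiny limit for the demo
    within_limit = oversized([small])  # real limit: nothing is flagged

print(too_big, within_limit)
```

Files that are flagged can then be split or handled separately before the rest of the batch is uploaded.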
Conclusion
OpenRefine is an invaluable tool for processing and uploading Buddhist data to Wikimedia projects. By mastering data cleaning, reconciliation, schema creation, and the upload process, you can contribute significantly to making Buddhist knowledge more accessible and interconnected within the Wikimedia ecosystem.