Using Topic Modeling to Cluster User Translation (BERTopic)

Introduction

We live in a world overflowing with unstructured text—customer emails, online reviews, tweets, academic papers, and so much more. Believe it or not, unstructured text makes up around 90% of the world’s data. Hidden within this vast pool are countless insights waiting to be discovered. But here’s the catch: analyzing and organizing this data at scale is a massive challenge. That’s where topic modeling steps in.


Why Topic Modeling Matters

Topic modeling is like having a map for a jungle of text. It’s a technique that helps identify recurring themes within documents, making sense of what otherwise seems chaotic. With topic modeling, we can:

  • Manage and sort enormous volumes of text quickly.
  • Spot patterns and trends that are not immediately obvious.
  • Summarize lengthy reports or entire collections of text.
  • Enhance search tools and recommendation systems.
  • Better understand what customers think and feel.

Traditionally, methods like Latent Dirichlet Allocation (LDA) have been used to perform these tasks. While they’ve served well, today’s complex data demands more refined tools. Enter BERTopic, the modern solution for advanced topic modeling.


Why Choose BERTopic?

BERTopic combines advanced technology with ease of use, making it a game-changer. Unlike older techniques, it uses transformer embeddings to group similar documents into dense clusters, then describes each cluster with c-TF-IDF. Here’s what makes BERTopic stand out:

  1. Deeper Understanding: It uses BERT embeddings to grasp context and subtle nuances, capturing what older models might miss.

  2. Minimal Effort: No need to preprocess the text—BERTopic works well with raw input, saving time.

  3. Clear Results: Topics are presented in an easy-to-understand way, with important words highlighted.

  4. Adaptable: You can tweak it to suit specific industries, languages, or needs.

  5. Smart Clustering: Techniques like UMAP and HDBSCAN ensure topics are meaningfully grouped.

  6. Dynamic Tracking: BERTopic lets you see how topics evolve over time, making it perfect for analyzing trends.


Our Case Study: Understanding User Translation Preferences

We’ve embraced BERTopic to tackle a unique challenge: understanding what content our users frequently translate. This goes beyond simply counting words. It’s about uncovering the context and themes behind translations to better serve a diverse audience.

Through BERTopic, we aim to:

  • Understand user translation preferences in depth.
  • Focus on and improve translations for the most relevant topics.
  • Ensure translations maintain correct context and meaning.
  • Continuously enhance our translation model based on real user data.
  • Identify emerging trends and shifts in user translation needs.
  • Optimize resource allocation for translation improvements.

Preparing the Data for BERTopic

Clean data is the backbone of any good analysis. Here’s how we prepared our dataset:

  1. Removing unnecessary characters while keeping important punctuation.
  2. Normalizing spaces and filtering irrelevant documents.
  3. Removing non-English text, including Tibetan and Chinese characters.

This thorough process helped retain 88.18% of our original 339,000 documents, leaving us with high-quality data.
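As an illustration, the cleaning steps above can be sketched in a few lines of Python. The exact character classes and the minimum-length filter here are illustrative assumptions, not our production pipeline:

```python
import re

def clean_document(text):
    """Illustrative cleaning: keep common punctuation, normalize
    whitespace, and drop documents left too short after removing
    non-Latin script (e.g. Tibetan or Chinese characters)."""
    # Keep letters, digits, and common punctuation; replace the rest
    text = re.sub(r"[^A-Za-z0-9.,!?;:'\"()\- ]+", " ", text)
    # Normalize runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Filter out documents that are now too short to be meaningful
    return text if len(text.split()) >= 3 else None

docs = ["Hello,   world! This is fine.", "ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ", "ok"]
cleaned = [c for c in (clean_document(d) for d in docs) if c]
# cleaned == ["Hello, world! This is fine."]
```

In practice the punctuation whitelist and length threshold would be tuned to the actual distribution of user input.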

Data Deduplication and Similarity Analysis:

  1. Cosine Similarity: We compute cosine similarity between entries to identify near-duplicates efficiently.
  2. Threshold Setting: A conservative similarity threshold, initially 0.9, balances uniqueness against redundancy.
  3. Grouping: Highly similar documents within a specific time frame are grouped together; from each group, only the longest text is kept, and the rest are removed.

After deduplication, we ended up with 211k (63%) unique entries—perfect for BERTopic analysis.
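The grouping logic can be illustrated with a small, self-contained sketch. For clarity, cosine similarity is computed here over simple bag-of-words counts; in our pipeline similarity is computed on sentence embeddings, with the 0.9 threshold mirroring the setting above:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    # Dot product over shared terms, divided by the vector norms
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def deduplicate(docs, threshold=0.9):
    """Greedy near-duplicate removal: keep the longest text per group."""
    vecs = [Counter(d.lower().split()) for d in docs]
    kept = []  # indices of retained documents
    for i, v in enumerate(vecs):
        dup_of = next((j for j in kept if cosine(v, vecs[j]) >= threshold), None)
        if dup_of is None:
            kept.append(i)
        elif len(docs[i]) > len(docs[dup_of]):
            kept[kept.index(dup_of)] = i  # replace with the longer text
    return [docs[i] for i in kept]

docs = ["the cat sat on the mat", "the cat sat on the mat today", "dogs bark loudly"]
unique = deduplicate(docs)
# unique == ["the cat sat on the mat today", "dogs bark loudly"]
```

The greedy pass keeps the longest member of each near-duplicate group, matching the grouping rule described above.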


BERTopic Model Configuration:

Our initial BERTopic model configuration is carefully designed to capture the nuances of translation preferences:

  1. Embedding Model Selection: We utilize SentenceTransformer (‘all-MiniLM-L6-v2’) for generating high-quality sentence embeddings that capture semantic meaning.

    # Load a sentence transformer model
    from sentence_transformers import SentenceTransformer
    sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
    
  2. Custom Vectorization: A CountVectorizer with custom settings, including bigrams, is implemented to capture domain-specific language patterns.

    # Create a custom vectorizer with unigrams, bigrams, and custom stop words
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words=custom_stop_words, max_features=5000)
    
  3. Topic Number Optimization: We configure the model for 20 initial topics, striking a balance between granularity and interpretability.

  4. Minimum Topic Size: A minimum topic size of 5 is set to ensure relevance and meaningful clustering.


Training Process and Topic Generation

The BERTopic model training process involves several key steps:

  1. Model Fitting: We fit the model to our preprocessed documents, allowing it to learn the underlying topic structure.

    # Create and fit the BERTopic model
    from bertopic import BERTopic

    topic_model = BERTopic(
        embedding_model=sentence_model,
        vectorizer_model=vectorizer,
        nr_topics=20,
        min_topic_size=5,
        verbose=True,
    )
    # docs is the list of preprocessed, deduplicated documents
    topics, probs = topic_model.fit_transform(docs)
    
  2. Visualize Documents: We can take a more detailed approach by visualizing the documents within the topics to check if they were correctly assigned and if they logically fit the topics. This method is particularly useful because we can directly read the documents clustered under a specific topic simply by hovering over the dots, where each dot represents a document.

    To achieve this, we can use the topic_model.visualize_documents() function. This function recalculates the document embeddings and reduces them to 2-dimensional space, making it easier to visualize.

    [Figure: visualize_documents output]

  3. Topic Extraction: The model generates topics and their associated probabilities, providing insights into the main themes in our translation data. We can create bar charts to show the selected terms for different topics using c-TF-IDF scores for each topic. These charts help us understand and compare the scores within and between topics. You can also compare different topics with each other.

    [Figure: Topic Extraction]

    Each topic (Topic -1 to Topic 18) represents a cluster of related words, ranked by how strongly they are associated with that topic. Each bracketed pair contains the word itself and its score, indicating the strength of its association with that specific topic. In BERTopic, Topic -1 is the outlier topic: it collects documents that do not fit well into any cluster.

  4. Topic Analysis: We examine the generated topics, including top words and representative documents, to understand the emerging themes.

    [Figure: Topic Analysis]

    Insights:

    • Topics -1 and 0 both focus on Tibetan culture, life, and society (keywords: tibetan, tibet, life, lama, love, dalai, chinese, and more)
    • Topic 2 seems focused on geopolitics (keywords: china, chinese, president, us, taiwan, ukraine)
    • Topic 5 is clearly about health (keywords: health, cancer, hospital, blood, medical)
    • Topic 11 is about sports, specifically football (keywords: madrid, football, team, real madrid, neymar, messi)
    • Topic 10 explores biology and neuroscience (keywords: brain, cells, neurons, DNA)
  5. Distribution Analysis: The sizes and distributions of topics are analyzed to ensure a balanced representation of translation preferences.

    • We used the topic analysis to name each topic and count how often each appears across the documents.

    Insights:

    • Tibetan Culture and Society (45.08%) and Tibetan Life and Relationships (29.96%) dominate; combined, these two topics represent 75.04% of all content.
    • The top 5 topics account for 91.46% of the content.
    • There is a strong emphasis on cultural and social themes.
    • The dataset is likely focused on Tibetan-related research or content.
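The per-topic word rankings used throughout these steps come from c-TF-IDF, which treats all documents in a topic as one large document and weights each term by tf(t, c) * log(1 + A / f(t)), where A is the average number of words per topic and f(t) is the term's total frequency. A simplified sketch of that scoring scheme, ignoring BERTopic's exact normalization details and using toy data:

```python
import math
from collections import Counter

def c_tf_idf(classes):
    """Simplified class-based TF-IDF: all documents in a topic are
    treated as one document, then each term is weighted by
    tf(t, c) * log(1 + A / f(t))."""
    # Term frequencies per class (topic)
    tf = {c: Counter(tok for doc in docs for tok in doc)
          for c, docs in classes.items()}
    # f(t): frequency of each term across all classes
    f = Counter()
    for counts in tf.values():
        f.update(counts)
    # A: average number of words per class
    A = sum(sum(cnt.values()) for cnt in tf.values()) / len(tf)
    return {c: {t: n * math.log(1 + A / f[t]) for t, n in counts.items()}
            for c, counts in tf.items()}

# Toy topics with pre-tokenized documents (illustrative data only)
classes = {
    0: [["health", "cancer"], ["health", "hospital"]],
    1: [["football", "madrid"], ["football", "team"]],
}
scores = c_tf_idf(classes)
top_words = {c: max(s, key=s.get) for c, s in scores.items()}
# top_words == {0: "health", 1: "football"}
```

Terms that are frequent within one topic but rare elsewhere score highest, which is why the keyword lists above are so topic-specific.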

Visualization

Here are some visualizations to see how the model has clustered the data and how it represents each topic:

  1. Intertopic Distance Map:

    We chose this type of visualization, inspired by [LDAvis], because it enables interactive exploration of our topic models. By embedding our class-based TF-IDF topic representations into a 2D space using UMAP, it provides a clear, visual context for understanding the relationships between topics. The use of Plotly for visualization allows us to create an interactive display where each circle represents a topic. The size of each circle indicates the frequency of the topic across all documents, making it easy to see which topics are most prevalent. This interactive element helps in comprehending the scope and distribution of topics throughout our dataset, offering a more intuitive way to engage with the model’s output.

    [Figure: Intertopic Distance Map]

  2. Hierarchical Clusters:

    The topics we generate can be organized into a hierarchy. To explore the possible hierarchical structure of these topics, we can use topic_model.visualize_hierarchy(). This function allows us to create clusters of topics and visualize their relationships. This is particularly useful when deciding how many topics we should keep by setting an appropriate number of topics (nr_topics). This approach also becomes valuable when dealing with numerous topics, as it helps us understand how these topics are interrelated. By visualizing the connections between topics, we can uncover deeper insights and see how different themes are grouped together within our dataset.

Conclusion:

BERTopic has proven to be a powerful tool for understanding user translation preferences. By analyzing a large dataset of 211,000 unique entries, we’ve gained valuable insights into the topics our users frequently translate. This approach has allowed us to identify key themes such as Tibetan culture, geopolitics, health, sports, and biology, among others. Our dataset places a strong emphasis on cultural and social themes.

The process involved careful data preparation, including cleaning and deduplication, followed by a standard model configuration using SentenceTransformer and custom vectorization. The resulting topic model provided clear, interpretable results that can guide our translation efforts and resource allocation.

Challenges

One of the biggest challenges we face is the dataset. It’s a user input translation dataset with many grammar and spelling errors, which makes clustering difficult—even when using one of the best clustering techniques available, like the BERTopic model.

Next Steps

To further refine our understanding and improve our topic modeling approach, we will:

  1. Benchmark Multiple Models: We’ll compare different models to determine which produces the best results. This will involve:

    • Creating a custom test dataset
    • Evaluating models based on diversity and distribution of topics
    • Conducting human evaluation for qualitative assessment
  2. Hyperparameter Tuning: We’ll optimize our models through extensive hyperparameter tuning to improve performance.

  3. Advanced Topic Modeling: We’ll expand and refine our approach by:

    • Breaking down topics further and refining the process
    • Developing multiple models for different topic subsets
    • Comparing combined models against a single expanded model
  4. Classification Model Development: We’ll transition from topic modeling to classification by:

    • Preparing data for classification tasks
    • Selecting, training, and evaluating classification models
    • Developing deployment and maintenance strategies

By implementing these steps, we aim to create a more robust and accurate system for understanding and predicting user translation needs, ultimately enhancing our translation services.

Citations

Official Documentation and Resources

  1. BERTopic documentation: Official documentation providing comprehensive information on using the BERTopic library, including installation, basic usage, and advanced features.

  2. BERTopic GitHub repository: The main source code repository for BERTopic, containing the latest updates, issue tracking, and contribution guidelines.

Introductory Articles and Overviews

  1. Topic Modeling with BERT: An introduction to using BERT for topic modeling, explaining the advantages of this approach over traditional methods.

  2. BERTopic: Topic Modeling as You Have Never Seen It Before: An overview of BERTopic’s unique features and capabilities compared to traditional topic modeling approaches.

Advanced Techniques and Applications

  1. Advanced Topic Modeling with BERTopic: A detailed guide on using BERTopic for advanced topic modeling tasks, covering various techniques and best practices.

  2. Topics per Class Using BERTopic: A tutorial on how to analyze topics across different classes or categories using BERTopic, useful for comparative analysis.

  3. Dynamic Topic Modeling with BERTopic: Exploration of using BERTopic for dynamic topic modeling, allowing analysis of topic evolution over time.

Practical Guides and Tutorials

  1. Interactive Topic Modeling with BERTopic: An article exploring how to use BERTopic for interactive topic modeling, demonstrating its flexibility and user-friendly features.

  2. Topic Modeling with BERTopic Cookbook: A practical guide to using BERTopic, providing step-by-step instructions and examples for various use cases.

  3. Tips and Tricks for BERTopic: Collection of useful tips and best practices for optimizing BERTopic usage and improving results.

Visualization Techniques

  1. Visualize Terms in BERTopic: Guide on visualizing terms and their relationships within topics using BERTopic’s built-in visualization tools.

  2. Visualize Hierarchy in BERTopic: Instructions on creating and visualizing hierarchical topic structures using BERTopic.

  3. Visualize Documents in BERTopic: Guide on visualizing document-topic relationships and clustering using BERTopic’s document visualization features.

Integration and Deployment

  1. Introducing BERTopic Integration with Hugging Face Hub: Announcement of BERTopic’s integration with the Hugging Face Hub, enabling easier model sharing and deployment.