Glyph Baseline Marking Model

Dataset Preparation for Baseline Marking Model

The goal is to prepare a dataset for training, validation, and testing by combining cleaned and condition images along the channel dimension. This dataset will train the baseline marking model, which predicts the text baselines in the glyph images.

Process Overview:

  1. Load Images:
  • The cleaned images come from a directory of pre-processed glyph images.
  • The condition images come from a directory where text conditions (glyphs) are rendered using specific fonts.
  • Baseline images represent the target output, where the baselines are marked.
  2. Combine Channels:
  • The cleaned image and condition image are stacked along the channel dimension to form the model input (see the sketch after this list).
  3. Split Dataset:
  • After combining the images, the dataset is split into three sets:
    • Training Set (X_train, y_train): Used to train the model.
    • Validation Set (X_val, y_val): Used to tune hyperparameters and prevent overfitting.
    • Test Set (X_test, y_test): Used to evaluate the model’s performance on unseen data.
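
A minimal sketch of the combine-and-split steps, assuming the cleaned, condition, and baseline images have already been loaded as NumPy arrays of shape (N, 256, 256, 3); the use of scikit-learn and the 70/15/15 split ratio are assumptions, not taken from the source:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders standing in for arrays loaded from the three directories,
# each of shape (N, 256, 256, 3).
cleaned   = np.zeros((8, 256, 256, 3), dtype=np.float32)
condition = np.zeros((8, 256, 256, 3), dtype=np.float32)
baseline  = np.zeros((8, 256, 256, 3), dtype=np.float32)

# Step 2: stack cleaned and condition images along the channel axis -> 6 channels.
X = np.concatenate([cleaned, condition], axis=-1)   # (N, 256, 256, 6)
y = baseline                                         # baseline images are the labels

# Step 3: split into training, validation, and test sets (70/15/15 here).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
```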

Arguments:

  • cleaned_images_dir (str): Directory containing cleaned glyph images.
  • baseline_images_dir (str): Directory containing baseline images (labels for the model).
  • condition_dir (str): Directory containing condition images (rendered glyphs).

Returns:

  • X_train, X_val, X_test: Numpy arrays containing the training, validation, and testing input features.
  • y_train, y_val, y_test: Numpy arrays containing the target labels (baselines) for training, validation, and testing.
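
A sketch of how these arguments and return values might fit together in a single function; the name prepare_baseline_dataset, the use of Pillow for loading, the fixed 256x256 size, and the split ratios are all assumptions:

```python
import os
import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split

def prepare_baseline_dataset(cleaned_images_dir: str,
                             baseline_images_dir: str,
                             condition_dir: str,
                             size=(256, 256)):
    """Load, combine, and split the glyph dataset following the steps above."""

    def load_dir(directory):
        # Load every image in the directory as a float32 RGB array scaled to [0, 1].
        files = sorted(os.listdir(directory))
        imgs = [Image.open(os.path.join(directory, f)).convert("RGB").resize(size)
                for f in files]
        return np.stack([np.asarray(im, dtype=np.float32) / 255.0 for im in imgs])

    cleaned   = load_dir(cleaned_images_dir)
    condition = load_dir(condition_dir)
    baseline  = load_dir(baseline_images_dir)

    # Combine cleaned and condition images along the channel dimension.
    X = np.concatenate([cleaned, condition], axis=-1)   # (N, 256, 256, 6)
    y = baseline

    # Split 70/15/15 into train / validation / test.
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
    return X_train, X_val, X_test, y_train, y_val, y_test
```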

U-Net Model for Baseline Marking

The U-Net model is a type of convolutional neural network (CNN) that is well-suited for image segmentation tasks. Here, we construct a U-Net that takes the combined input (cleaned image + condition image) and predicts the baselines in the glyph images; a minimal Keras sketch follows the architecture overview below.

Architecture Overview:

  1. Input Size:
  • The input tensor shape is (256, 256, 6), which means the model takes images of size 256x256 with 6 channels (cleaned + condition image combined).
  2. Encoder (Contracting Path):
  • 4 Convolutional Blocks: Each block contains 2 convolutional layers followed by a max-pooling layer.
  • The number of filters doubles in each block (starting from 64):
    • Block 1: 64 filters
    • Block 2: 128 filters
    • Block 3: 256 filters
    • Block 4: 512 filters
  3. Bottleneck (Bridge):
  • A middle block that connects the encoder and decoder.
  • 2 Convolutional Layers with 1024 filters.
  4. Decoder (Expanding Path):
  • 4 Upsampling Blocks: Each block contains:
    • 1 upsampling layer (to increase spatial resolution).
    • 1 convolutional layer.
    • Concatenation with the corresponding encoder block.
  • The number of filters halves at each block:
    • Block 1: 512 filters
    • Block 2: 256 filters
    • Block 3: 128 filters
    • Block 4: 64 filters
  5. Output Layer:
  • The final output of the model consists of:
    • 1 convolutional layer with 2 filters and ReLU activation, responsible for generating intermediate feature maps.
    • 1 convolutional layer with 3 filters and sigmoid activation, producing the final segmentation mask (the predicted baseline in this case).
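
A minimal Keras sketch of this architecture, following the block structure and filter counts above; the choice of TensorFlow/Keras, the 3x3 kernels, same-padding, and UpSampling2D followed by a convolution are assumptions, not a definitive implementation:

```python
from tensorflow.keras import layers, Model

def build_baseline_unet(input_shape=(256, 256, 6)):
    inputs = layers.Input(shape=input_shape)

    # Encoder: 4 blocks of two convolutions followed by max pooling;
    # filters double from 64 to 512.
    skips = []
    x = inputs
    for filters in (64, 128, 256, 512):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)

    # Bottleneck: two convolutions with 1024 filters.
    x = layers.Conv2D(1024, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(1024, 3, padding="same", activation="relu")(x)

    # Decoder: 4 blocks of upsampling, one convolution, and concatenation
    # with the matching encoder block; filters halve from 512 to 64.
    for filters, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Conv2D(filters, 2, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, skip])

    # Output head: a 2-filter ReLU convolution for intermediate features,
    # then a 3-filter sigmoid convolution for the predicted baseline mask.
    x = layers.Conv2D(2, 3, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(3, 1, activation="sigmoid")(x)

    return Model(inputs, outputs, name="baseline_unet")
```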

Compilation:

  • Optimizer: The Adam optimizer is used for efficient gradient-based optimization.
  • Loss Function: A custom loss that combines Binary Cross-Entropy and Mean Squared Error is used.
  • Metric: Accuracy is used as the evaluation metric.
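
A hedged sketch of the compilation step, reusing build_baseline_unet from the U-Net sketch above; the exact form and weighting of the combined loss are assumptions:

```python
import tensorflow as tf

def combined_loss(y_true, y_pred, bce_weight=0.5):
    """Weighted sum of binary cross-entropy and mean squared error (weighting assumed)."""
    bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    mse = tf.keras.losses.mean_squared_error(y_true, y_pred)
    return bce_weight * bce + (1.0 - bce_weight) * mse

model = build_baseline_unet()
model.compile(optimizer="adam", loss=combined_loss, metrics=["accuracy"])
```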