Training Data Quality & Curation

Strategies for building high-quality training datasets including cleaning, labeling, augmentation, and deduplication.

36 articles | 375 min total read | Updated Jun 7, 2026


What topics does this domain cover?

6 topics

Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.

Active Learning →

Active learning is a machine learning strategy where the model itself picks the most informative unlabeled examples for …

6 articles

Data Augmentation →

Data augmentation expands a training dataset by creating new examples from existing ones—rotating or cropping images, …

6 articles

Data Deduplication →

Data deduplication finds and removes duplicate or near-duplicate examples from a training dataset before a model learns …

6 articles

Data Labeling and Annotation →

Data labeling and annotation is the process of attaching ground-truth labels to raw data — text, images, audio, or video …

6 articles

Data Preprocessing →

Data preprocessing is the work of cleaning, normalizing, and transforming raw data into a form a machine learning model …

6 articles

Training Data Quality →

Training data quality measures how clean, consistent, and correct the examples used to train a machine learning model …

6 articles

Four perspectives on this domain

MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.

Updated Jun 7, 2026

Diagram of an active learning loop selecting the most informative unlabeled points for human annotation

MONA explainer 12 min Jun 7, 2026

Before Active Learning: Prerequisites, Building Blocks, and the Hard Limits of Query Strategies

Active learning lets a model pick which examples to label instead of sampling at random — but sampling bias and cold-start can make it lose to random.

Three-tier data deduplication pipeline: exact hashing, fuzzy MinHash fingerprint matching, and semantic embedding clustering

MONA explainer 11 min Jun 7, 2026

Exact, Fuzzy, and Semantic Deduplication: The Components and Prerequisites of a Dedup Pipeline

Data deduplication runs in three tiers: exact (hashing), fuzzy (MinHash+LSH), and semantic (embeddings). SemDeDup removed ~50% of web data with minimal loss.
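As a rough illustration of the first (exact) tier only, the sketch below hashes a normalized form of each document and keeps the first occurrence. The normalization rule (lowercasing and collapsing whitespace) and the function names are assumptions made for this example, not something the article prescribes.

```python
import hashlib

def normalize(text):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first document for each distinct normalized hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello  world", "hello world", "Hello there"]
unique_docs = exact_dedup(docs)  # the second entry collapses into the first
```

Exact hashing catches only copies that are identical after normalization; reworded or lightly edited copies need the fuzzy and semantic tiers.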

Two near-identical documents flagged as duplicates while a rare unique example is silently discarded from a training set

MONA explainer 10 min Jun 7, 2026

False Positives, Lost Diversity, and the Technical Limits of Deduplicating Training Data

Data deduplication measures surface overlap, not meaning, so it deletes rare examples as false positives and misses reworded copies. Here are the limits.

Diagram of uncertainty sampling selecting the most confusing data points near a classifier decision boundary

MONA explainer 11 min Jun 7, 2026

Uncertainty Sampling Explained: Entropy, Margin, and Least-Confidence Query Strategies

Uncertainty sampling is an active-learning strategy that labels the data a model is least confident about — via entropy, margin, or least-confidence scores.
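The three scores named above can be sketched in a few lines. This is a minimal example on made-up class probabilities (the `pool` values are hypothetical), meant only to show how each score ranks unlabeled examples.

```python
import math

def least_confidence(probs):
    """1 minus the top class probability: higher means less confident."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes: smaller means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy in nats: higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical predicted class probabilities for three unlabeled examples.
pool = {
    "a": [0.90, 0.05, 0.05],
    "b": [0.40, 0.35, 0.25],
    "c": [0.50, 0.49, 0.01],
}

pick_lc = max(pool, key=lambda k: least_confidence(pool[k]))
pick_margin = min(pool, key=lambda k: margin(pool[k]))
pick_entropy = max(pool, key=lambda k: entropy(pool[k]))
```

On this toy pool the strategies do not agree: least-confidence and entropy select `b`, while margin selects `c`, whose top two classes are nearly tied.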

Geometric scatter of unlabeled points with a few highlighted near a decision boundary

MONA explainer 11 min Jun 7, 2026

What Is Active Learning and How Models Pick the Most Informative Samples to Label

Active learning lets a model query only the most informative unlabeled samples to label, hitting target accuracy with far fewer labels than random sampling.

Near-duplicate training documents collapsed via MinHash signatures and LSH banding for language model data curation

MONA explainer 11 min Jun 7, 2026

What Is Data Deduplication and How MinHash LSH Detects Near-Duplicate Training Samples

Data deduplication removes near-duplicate training samples using MinHash LSH. Lee et al. found dedup cuts verbatim memorization roughly 10x.
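To make the MinHash idea concrete, here is a small stdlib-only sketch that estimates Jaccard similarity between word-shingle sets. The shingle size, the number of hash functions, and the seeded SHA-1 hashing are assumptions chosen for illustration. It omits the LSH banding step, which in practice avoids comparing every pair of signatures.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def _hash(seed, item):
    # A seeded hash stands in for one random permutation of the shingle space.
    return int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest(), "big")

def minhash_signature(items, num_perm=128):
    """One minimum hash value per seeded hash function."""
    return [min(_hash(seed, s) for s in items) for seed in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that match."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(a, b):
    return len(a & b) / len(a | b)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old river bank"
doc_c = "gradient descent updates model weights using the loss derivative each step"

sh_a, sh_b, sh_c = shingles(doc_a), shingles(doc_b), shingles(doc_c)
sig_a, sig_b, sig_c = (minhash_signature(s) for s in (sh_a, sh_b, sh_c))

near = estimated_jaccard(sig_a, sig_b)  # approximates jaccard(sh_a, sh_b)
far = estimated_jaccard(sig_a, sig_c)   # unrelated documents
```

The estimate is noisy, with error shrinking as `num_perm` grows, so a similarity threshold for flagging near-duplicates has to be tuned on the data at hand.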

Diagram showing why splitting data before preprocessing keeps test-set statistics out of the model's learned transforms.

MONA explainer 10 min Jun 6, 2026

Before You Preprocess: Data Types, Distributions, and Train-Test Splits You Need to Understand First

Split data into train and test sets before preprocessing to prevent data leakage. Fitting scalers on the full dataset inflates accuracy and fails in production.
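A tiny worked example of the leak, using a hand-rolled standard scaler so it runs without third-party libraries. The data values and helper names are hypothetical; the point is only that statistics fitted on train plus test differ from those fitted on train alone.

```python
import statistics

def fit_scaler(values):
    """Return (mean, std) learned from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std or 1.0  # guard against zero variance

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]  # a value the model would never have seen at training time

# Correct: fit on the training split, then apply the same parameters to test.
mean_ok, std_ok = fit_scaler(train)
train_scaled = transform(train, mean_ok, std_ok)
test_scaled = transform(test, mean_ok, std_ok)

# Leaky: fitting on everything lets the test value shift the learned mean.
mean_leak, std_leak = fit_scaler(train + test)
```

Here the training-only mean is 2.5, while the leaky mean is 22.0, so every training feature would be scaled using information from the held-out point.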

Diagram of how data leakage inflates validation accuracy when preprocessing runs before the train-test split

MONA explainer 10 min Jun 6, 2026

Data Leakage, Lost Information, and the Technical Limits of Preprocessing Pipelines

Data leakage occurs when information unavailable at prediction time enters training, inflating validation accuracy while production performance collapses.

Raw spreadsheet rows transforming into clean, scaled, and encoded numeric feature columns prepared for model training

MONA explainer 10 min Jun 6, 2026

What Is Data Preprocessing and How Cleaning, Scaling, and Encoding Turn Raw Data into Training Sets

Data preprocessing cleans, scales, and encodes raw data into model-ready features. Fitting transformers before the train-test split causes data leakage.

Visual comparison of geometric transforms, mixup, CutMix, and back-translation as data augmentation techniques

MONA explainer 11 min Jun 3, 2026

Geometric Transforms, Mixup, and Back-Translation: How Core Augmentation Methods Work

Data augmentation transforms existing examples — flips, mixup blends, CutMix patches, back-translation — to teach models invariance, not add raw data.
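As one concrete instance, mixup can be sketched as a convex blend of two examples and their one-hot labels, with the blend weight drawn from a Beta distribution. The `alpha` value and the toy vectors below are assumptions for illustration only.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two (features, one-hot label) pairs with weight lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)  # seeded for repeatability
x_a, y_a = [0.0, 1.0, 2.0], [1.0, 0.0]  # hypothetical class-0 example
x_b, y_b = [4.0, 3.0, 2.0], [0.0, 1.0]  # hypothetical class-1 example

x_mix, y_mix, lam = mixup(x_a, y_a, x_b, y_b, rng=rng)
```

The mixed label stays a valid probability vector because both inputs are blended with the same weight.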

Two annotators labeling the same dataset beside a chance-corrected agreement score chart for label reliability

MONA explainer 11 min Jun 3, 2026

Inter-Annotator Agreement, Annotation Guidelines, and the Building Blocks of a Labeling Project

Inter-annotator agreement measures label quality beyond chance. Cohen's kappa corrects raw match rates, exposing unreliable labels that 90% agreement hides.
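The gap between raw agreement and chance-corrected agreement is easy to reproduce. The sketch below computes Cohen's kappa, (observed - expected) / (1 - expected), on a made-up, heavily imbalanced labeling task; the label lists are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labeled at random with their own label rates.
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Both annotators mark 5 of 100 items "pos", but never the same items.
labels_a = ["pos"] * 5 + ["neg"] * 95
labels_b = ["neg"] * 5 + ["pos"] * 5 + ["neg"] * 90

raw_agreement = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
kappa = cohens_kappa(labels_a, labels_b)
```

Raw agreement is 90%, yet kappa comes out slightly below zero, because two annotators who mostly say "neg" would agree about 90.5% of the time by chance alone.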

Diagram of label noise in training data distorting supervised model accuracy and benchmark leaderboard rankings

MONA explainer 10 min Jun 3, 2026

Label Noise, Annotator Bias, and the Technical Limits of Human Data Annotation

Label noise averages an estimated 3.4% across major ML test sets, distorting supervised model accuracy and even flipping benchmark leaderboard rankings.

How data augmentation transforms existing samples to expand training data and reduce overfitting in machine learning

MONA explainer 9 min Jun 3, 2026

What Is Data Augmentation and How Transforming Samples Expands Training Data

Data augmentation expands training data by transforming existing samples—rotations, mixup, masking—to reduce overfitting without collecting anything new.

Raw images and text converting into labeled ground-truth examples that train a supervised classifier

MONA explainer 11 min Jun 3, 2026

What Is Data Labeling and Annotation, and How Ground-Truth Labels Train Supervised Models

Data labeling assigns ground-truth labels to raw data so supervised models learn a mapping. Label noise propagates into model errors geometrically.

Two overlapping data distributions drifting apart as synthetic training samples push one curve away from the real-world curve

MONA explainer 11 min Jun 3, 2026

When Data Augmentation Helps and When It Hurts: Distribution Shift and Label Corruption

Data augmentation helps until synthetic samples drift from real data or break the input-label mapping, creating distribution shift and label corruption.

Three training-data failures shown in feature space: mislabeled points, skewed class frequencies, and a shifted distribution.

MONA explainer 11 min May 31, 2026

Label Noise, Class Imbalance, and Distribution Shift: What to Know Before Fixing Training Data

Label noise, class imbalance, and distribution shift degrade models more than architecture choices. Understand all three before curating training data.

Diagram tracing how label errors, duplicates, and provenance shape what a machine learning model can learn

MONA explainer 10 min May 31, 2026

What Is Training Data Quality and How It Determines Model Performance

Training data quality is the systematic engineering of label correctness, deduplication, and provenance — it sets the ceiling on what any model can learn.

A dataset as particles where a fraction of labels glow red, showing why curation at scale never reaches zero error

MONA explainer 9 min May 31, 2026

Why Perfectly Clean Data Is Impossible: The Technical Limits of Data Curation at Scale

Cleaning training data at scale hits hard limits: label errors average 3.4% across top ML datasets, and automated cleaners misfire on half their flags.