
bead.evaluation

Metrics and evaluation utilities for convergence detection and inter-annotator agreement.

Convergence Detection

convergence

Convergence detection for active learning.

This module provides tools for detecting when a model has converged to human-level performance, which serves as a stopping criterion for active learning loops.
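As a rough illustration, the detector documented below can serve as the stopping check in an active learning loop. This is a minimal sketch assuming the class is importable as bead.evaluation.convergence.ConvergenceDetector; train_one_iteration and evaluate_accuracy are hypothetical placeholders for your own training and evaluation code.

from bead.evaluation.convergence import ConvergenceDetector

detector = ConvergenceDetector(
    human_agreement_metric='krippendorff_alpha',
    convergence_threshold=0.05,
    min_iterations=3,
)

# Baseline from the human annotations collected so far.
detector.compute_human_baseline({
    'rater1': [1, 1, 0, 1, 0],
    'rater2': [1, 1, 0, 0, 0],
    'rater3': [1, 0, 0, 1, 0],
})

iteration = 0
converged = False
while not converged:
    iteration += 1
    model = train_one_iteration()        # hypothetical: train on labeled data so far
    accuracy = evaluate_accuracy(model)  # hypothetical: accuracy on a held-out set
    converged = detector.check_convergence(accuracy, iteration=iteration)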

ConvergenceReport

Bases: TypedDict

Convergence report structure.

Attributes:

Name Type Description
converged bool

Whether model has converged.

model_accuracy float

Model's current accuracy.

human_agreement float

Human agreement score.

gap float

Difference between human agreement and model accuracy.

required_accuracy float

Minimum accuracy required for convergence.

threshold float

Convergence threshold.

iteration int

Current iteration number.

meets_min_iterations bool

Whether minimum iterations requirement is met.

min_iterations_required int

Minimum iterations required before checking convergence.
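Put together, the report is roughly equivalent to the following TypedDict sketch; the field names and meanings are taken from the attribute list above, so treat it as illustrative rather than the exact source definition.

from typing import TypedDict

class ConvergenceReport(TypedDict):
    converged: bool                # whether the model has converged
    model_accuracy: float          # model's current accuracy
    human_agreement: float         # human agreement score
    gap: float                     # human agreement minus model accuracy
    required_accuracy: float       # minimum accuracy required for convergence
    threshold: float               # convergence threshold
    iteration: int                 # current iteration number
    meets_min_iterations: bool     # whether the minimum-iterations requirement is met
    min_iterations_required: int   # minimum iterations before convergence is checked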

ConvergenceDetector

Detect convergence of model performance to human agreement.

This class monitors model performance and compares it to human inter-annotator agreement to determine when active learning can stop. Convergence is achieved when the model's accuracy matches or exceeds human agreement within a specified threshold.

Parameters:

Name Type Description Default
human_agreement_metric str

Which inter-annotator agreement metric to use as baseline:
- "krippendorff_alpha": Most general (handles missing data, multiple raters)
- "fleiss_kappa": Multiple raters, no missing data
- "cohens_kappa": Two raters only
- "percentage_agreement": Simple agreement rate

"krippendorff_alpha"
convergence_threshold float

Model must be within this threshold of human agreement to converge. For example, 0.05 means model accuracy must be >= (human_agreement - 0.05).

0.05
min_iterations int

Minimum number of iterations before checking convergence. Prevents premature stopping.

3
statistical_test bool

Whether to run statistical significance test comparing model to humans.

True
alpha float

Significance level for statistical tests.

0.05

Attributes:

Name Type Description
human_agreement_metric str

Agreement metric being used.

convergence_threshold float

Threshold for convergence.

min_iterations int

Minimum iterations required.

statistical_test bool

Whether to run significance tests.

alpha float

Significance level.

human_baseline float | None

Computed human agreement baseline (set via compute_human_baseline).

Examples:

>>> detector = ConvergenceDetector(
...     human_agreement_metric='krippendorff_alpha',
...     convergence_threshold=0.05,
...     min_iterations=3
... )
>>> # Compute human baseline from ratings
>>> ratings = {
...     'human1': [1, 1, 0, 1, 0],
...     'human2': [1, 1, 0, 0, 0],
...     'human3': [1, 0, 0, 1, 0]
... }
>>> _ = detector.compute_human_baseline(ratings)
>>> detector.human_baseline > 0.0
True
>>> # Check if model converged
>>> converged = detector.check_convergence(
...     model_accuracy=0.75,
...     iteration=5
... )
>>> isinstance(converged, bool)
True

__init__(human_agreement_metric: str = 'krippendorff_alpha', convergence_threshold: float = 0.05, min_iterations: int = 3, statistical_test: bool = True, alpha: float = 0.05) -> None

Initialize convergence detector.

Parameters:

Name Type Description Default
human_agreement_metric str

Inter-annotator agreement metric to use.

'krippendorff_alpha'
convergence_threshold float

Threshold for convergence (model must be within this of human).

0.05
min_iterations int

Minimum iterations before checking convergence.

3
statistical_test bool

Whether to run statistical tests.

True
alpha float

Significance level for tests.

0.05

Raises:

Type Description
ValueError

If parameters are invalid.

compute_human_baseline(human_ratings: dict[str, list[Label | None]], **kwargs: str | int | float | bool | None) -> float

Compute human inter-rater agreement baseline.

Parameters:

Name Type Description Default
human_ratings dict[str, list[Label | None]]

Dictionary mapping human rater IDs to their ratings. For example: {'rater1': [1, 0, 1, ...], 'rater2': [1, 1, 1, ...]}. Missing ratings can be represented as None.

required
**kwargs str | int | float | bool | None

Additional arguments passed to agreement metric function. For example, metric='nominal' for Krippendorff's alpha.

{}

Returns:

Type Description
float

Human agreement score.

Raises:

Type Description
ValueError

If human_ratings is empty or has fewer than 2 raters.

Examples:

>>> detector = ConvergenceDetector()
>>> ratings = {
...     'human1': [1, 1, 0, 1],
...     'human2': [1, 1, 0, 0],
...     'human3': [1, 0, 0, 1]
... }
>>> baseline = detector.compute_human_baseline(ratings)
>>> 0.0 <= baseline <= 1.0
True

check_convergence(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> bool

Check if model has converged to human performance.

Parameters:

Name Type Description Default
model_accuracy float

Model's accuracy on the task.

required
iteration int

Current iteration number (1-indexed).

required
human_agreement float | None

Human agreement score. If None, uses self.human_baseline (which must have been set via compute_human_baseline).

None

Returns:

Type Description
bool

True if model has converged, False otherwise.

Raises:

Type Description
ValueError

If human_agreement is None and human_baseline not set.

Examples:

>>> detector = ConvergenceDetector(min_iterations=2, convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> # Too early (iteration 1 < min_iterations 2)
>>> detector.check_convergence(0.79, iteration=1)
False
>>> # Still not converged (0.74 < 0.80 - 0.05)
>>> detector.check_convergence(0.74, iteration=3)
False
>>> # Converged (0.77 >= 0.80 - 0.05)
>>> detector.check_convergence(0.77, iteration=3)
True

compute_statistical_test(model_predictions: list[Label], human_consensus: list[Label], test_type: str = 'mcnemar') -> dict[str, float]

Run statistical test comparing model to human performance.

Parameters:

Name Type Description Default
model_predictions list[Label]

Model's predictions.

required
human_consensus list[Label]

Human consensus labels (e.g., majority vote).

required
test_type str

Type of statistical test:
- "mcnemar": McNemar's test for paired nominal data
- "ttest": Paired t-test (requires multiple samples)

"mcnemar"

Returns:

Type Description
dict[str, float]

Dictionary with keys 'statistic' and 'p_value'.

Raises:

Type Description
ValueError

If predictions and consensus have different lengths.

Examples:

>>> detector = ConvergenceDetector()
>>> model_preds = [1, 1, 0, 1, 0]
>>> human_consensus = [1, 1, 0, 0, 0]
>>> result = detector.compute_statistical_test(model_preds, human_consensus)
>>> 'statistic' in result and 'p_value' in result
True
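The documentation does not spell out how the contingency table is formed; one plausible construction for binary labels, shown here as a sketch rather than this module's actual implementation, cross-tabulates model and consensus labels and applies McNemar's test from statsmodels (the module docstring only mentions statsmodels for Fleiss' kappa, so this dependency is an assumption).

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

model_preds = [1, 1, 0, 1, 0]
human_consensus = [1, 1, 0, 0, 0]

# 2x2 cross-tabulation: table[i, j] = number of items the model labels i
# and the human consensus labels j (binary labels assumed).
table = np.zeros((2, 2), dtype=int)
for m, h in zip(model_preds, human_consensus):
    table[m, h] += 1

result = mcnemar(table, exact=True)  # exact binomial test; suitable for small samples
print(result.statistic, result.pvalue)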

get_convergence_report(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> ConvergenceReport

Generate convergence report with status and metrics.

Parameters:

Name Type Description Default
model_accuracy float

Model's current accuracy.

required
iteration int

Current iteration number.

required
human_agreement float | None

Human agreement score (uses baseline if None).

None

Returns:

Type Description
ConvergenceReport

Report with convergence status and metrics.

Examples:

>>> detector = ConvergenceDetector(convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> report = detector.get_convergence_report(0.77, iteration=5)
>>> report['converged']
True
>>> round(report['gap'], 2)
0.03

Inter-Annotator Agreement

interannotator

Inter-annotator agreement metrics.

This module provides inter-annotator agreement metrics for assessing reliability and consistency across multiple human annotators. Uses sklearn.metrics for Cohen's kappa, statsmodels for Fleiss' kappa, and krippendorff package for Krippendorff's alpha.
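For orientation, the direct calls into the libraries named above look roughly like this; the wrappers documented below add input validation and the dict-based interfaces, so treat this as a sketch of the underlying computations rather than the module's code.

import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa

# Cohen's kappa: two raters, categorical labels.
cohen_kappa_score([0, 1, 0, 1, 1], [0, 1, 1, 1, 1])

# Fleiss' kappa: counts matrix of shape (n_items, n_categories).
counts = np.array([[3, 2, 0], [0, 0, 5], [2, 3, 0], [1, 1, 3]])
fleiss_kappa(counts)

# Krippendorff's alpha: rows are raters, columns are items, np.nan marks missing data.
data = np.array([[1, 2, 3, 4, 5],
                 [1, 2, 3, 4, 5],
                 [1, 2, np.nan, 4, 5]], dtype=float)
krippendorff.alpha(reliability_data=data, level_of_measurement='nominal')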

InterAnnotatorMetrics

Inter-annotator agreement metrics for reliability assessment.

Provides static methods for computing various agreement metrics:
- Percentage agreement (simple)
- Cohen's kappa (2 raters, categorical)
- Fleiss' kappa (multiple raters, categorical)
- Krippendorff's alpha (general, multiple data types)
- Pairwise agreement (all pairs of raters)

Examples:

>>> # Cohen's kappa for 2 raters
>>> rater1 = [0, 1, 0, 1, 1]
>>> rater2 = [0, 1, 1, 1, 1]
>>> round(InterAnnotatorMetrics.cohens_kappa(rater1, rater2), 2)
0.55
>>> # Percentage agreement
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8

percentage_agreement(rater1: list[Label], rater2: list[Label]) -> float staticmethod

Compute simple percentage agreement between two raters.

Parameters:

Name Type Description Default
rater1 list[Label]

Ratings from first rater.

required
rater2 list[Label]

Ratings from second rater.

required

Returns:

Type Description
float

Percentage agreement (0.0 to 1.0).

Raises:

Type Description
ValueError

If rater lists have different lengths.

Examples:

>>> rater1 = [1, 2, 3, 1, 2]
>>> rater2 = [1, 2, 2, 1, 2]
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8

cohens_kappa(rater1: list[Label], rater2: list[Label]) -> float staticmethod

Compute Cohen's kappa for two raters.

Cohen's kappa measures agreement between two raters beyond chance. Values range from -1 (complete disagreement) to 1 (perfect agreement), with 0 indicating chance-level agreement.
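Concretely, with p_o the observed agreement rate and p_e the agreement expected by chance from each rater's label frequencies, kappa = (p_o - p_e) / (1 - p_e). In the class-level example above (rater1 = [0, 1, 0, 1, 1], rater2 = [0, 1, 1, 1, 1]), p_o = 0.8 and p_e = 0.56, so kappa = 0.24 / 0.44, about 0.55.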

Parameters:

Name Type Description Default
rater1 list[Label]

Ratings from first rater.

required
rater2 list[Label]

Ratings from second rater.

required

Returns:

Type Description
float

Cohen's kappa coefficient.

Raises:

Type Description
ValueError

If rater lists have different lengths or are empty.

Examples:

>>> # Perfect agreement
>>> rater1 = [0, 1, 0, 1]
>>> rater2 = [0, 1, 0, 1]
>>> InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
1.0
>>> # No agreement beyond chance
>>> rater1 = [0, 0, 1, 1]
>>> rater2 = [1, 1, 0, 0]
>>> kappa = InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
>>> abs(kappa - (-1.0)) < 0.01
True

fleiss_kappa(ratings_matrix: np.ndarray[int, np.dtype[np.int_]]) -> float staticmethod

Compute Fleiss' kappa for multiple raters.

Fleiss' kappa generalizes Cohen's kappa to multiple raters. It measures agreement beyond chance when multiple raters assign categorical ratings to a set of items.

Parameters:

Name Type Description Default
ratings_matrix ndarray

Matrix of shape (n_items, n_categories) where element [i, j] contains the number of raters who assigned item i to category j.

required

Returns:

Type Description
float

Fleiss' kappa coefficient.

Raises:

Type Description
ValueError

If matrix is empty or has wrong shape.

ImportError

If statsmodels is not installed.

Examples:

>>> # 4 items, 3 categories, 5 raters each
>>> # Item 1: 3 raters chose cat 0, 2 chose cat 1, 0 chose cat 2
>>> ratings = np.array([
...     [3, 2, 0],  # Item 1
...     [0, 0, 5],  # Item 2
...     [2, 3, 0],  # Item 3
...     [1, 1, 3],  # Item 4
... ])
>>> kappa = InterAnnotatorMetrics.fleiss_kappa(ratings)
>>> 0.0 <= kappa <= 1.0
True
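If you start from raw per-rater labels instead of a counts matrix, statsmodels' aggregate_raters helper can build the (n_items, n_categories) table; this sketch reproduces the counts used in the example above and is shown only for illustration, not as part of InterAnnotatorMetrics.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters

# Rows are items, columns are raters, entries are category labels (0, 1, 2).
labels = np.array([
    [0, 0, 0, 1, 1],  # item 1: three raters chose 0, two chose 1
    [2, 2, 2, 2, 2],  # item 2: all five chose 2
    [0, 0, 1, 1, 1],  # item 3
    [0, 1, 2, 2, 2],  # item 4
])

counts, categories = aggregate_raters(labels)
# counts == [[3, 2, 0], [0, 0, 5], [2, 3, 0], [1, 1, 3]], matching the matrix above
kappa = InterAnnotatorMetrics.fleiss_kappa(counts)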

krippendorff_alpha(reliability_data: dict[str, list[Label | None]], metric: str = 'nominal') -> float staticmethod

Compute Krippendorff's alpha for multiple raters.

Krippendorff's alpha is the most general inter-rater reliability measure. It handles:
- Any number of raters
- Missing data
- Different data types (nominal, ordinal, interval, ratio)

Parameters:

Name Type Description Default
reliability_data dict[str, list[Label | None]]

Dictionary mapping rater IDs to their ratings. Each rater's ratings list must have the same length (use None for missing values).

required
metric str

Distance metric to use:
- "nominal": for categorical data (default)
- "ordinal": for ordered categories
- "interval": for interval-scaled data
- "ratio": for ratio-scaled data

"nominal"

Returns:

Type Description
float

Krippendorff's alpha coefficient (1.0 = perfect agreement, 0.0 = chance agreement, < 0.0 = systematic disagreement).

Raises:

Type Description
ValueError

If reliability_data is empty or rater lists have different lengths.

Examples:

>>> # 3 raters, 5 items (with one missing value)
>>> data = {
...     'rater1': [1, 2, 3, 4, 5],
...     'rater2': [1, 2, 3, 4, 5],
...     'rater3': [1, 2, None, 4, 5]
... }
>>> alpha = InterAnnotatorMetrics.krippendorff_alpha(data)
>>> alpha > 0.8  # High agreement
True

pairwise_agreement(ratings: dict[str, list[Label]]) -> dict[str, dict[str, float]] staticmethod

Compute pairwise agreement metrics for all rater pairs.

Parameters:

Name Type Description Default
ratings dict[str, list[Label]]

Dictionary mapping rater IDs to their ratings.

required

Returns:

Type Description
dict[str, dict[str, float]]

Nested dictionary keyed first by metric name, then by rater pair:
{
    'percentage_agreement': {('rater1', 'rater2'): 0.85, ...},
    'cohens_kappa': {('rater1', 'rater2'): 0.75, ...}
}

Examples:

>>> ratings = {
...     'rater1': [1, 2, 3],
...     'rater2': [1, 2, 3],
...     'rater3': [1, 2, 2]
... }
>>> result = InterAnnotatorMetrics.pairwise_agreement(ratings)
>>> result['percentage_agreement'][('rater1', 'rater2')]
1.0
>>> result['cohens_kappa'][('rater1', 'rater2')]
1.0
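For reference, the pairwise computation amounts to looping over all unordered rater pairs and applying the two-rater metrics documented above; a minimal sketch of that idea (not the module's own implementation):

from itertools import combinations

def pairwise_agreement_sketch(ratings):
    # ratings: dict mapping rater ID -> list of labels (all lists the same length)
    result = {'percentage_agreement': {}, 'cohens_kappa': {}}
    for r1, r2 in combinations(sorted(ratings), 2):
        pair = (r1, r2)
        result['percentage_agreement'][pair] = (
            InterAnnotatorMetrics.percentage_agreement(ratings[r1], ratings[r2])
        )
        result['cohens_kappa'][pair] = (
            InterAnnotatorMetrics.cohens_kappa(ratings[r1], ratings[r2])
        )
    return result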