bead.evaluation

Metrics and evaluation utilities for convergence detection and inter-annotator agreement.

Convergence Detection

convergence

Convergence detection for active learning.

This module provides tools for detecting when a model has converged to human-level performance, which serves as a stopping criterion for active learning loops.

ConvergenceReport

Bases: TypedDict

Convergence report structure.

Attributes:

Name Type Description
converged bool

Whether model has converged.

model_accuracy float

Model's current accuracy.

human_agreement float

Human agreement score.

gap float

Difference between human agreement and model accuracy.

required_accuracy float

Minimum accuracy required for convergence.

threshold float

Convergence threshold.

iteration int

Current iteration number.

meets_min_iterations bool

Whether minimum iterations requirement is met.

min_iterations_required int

Minimum iterations required before checking convergence.

ConvergenceDetector

Detect convergence of model performance to human agreement.

This class monitors model performance and compares it to human inter-annotator agreement to determine when active learning can stop. Convergence is reached once at least min_iterations iterations have run and the model's accuracy is no more than convergence_threshold below human agreement, that is, model_accuracy >= human_agreement - convergence_threshold.

Parameters:

Name Type Description Default
human_agreement_metric str

Which inter-annotator agreement metric to use as baseline:
- "krippendorff_alpha": Most general (handles missing data, multiple raters)
- "fleiss_kappa": Multiple raters, no missing data
- "cohens_kappa": Two raters only
- "percentage_agreement": Simple agreement rate

"krippendorff_alpha"
convergence_threshold float

Model must be within this threshold of human agreement to converge. For example, 0.05 means model accuracy must be >= (human_agreement - 0.05).

0.05
min_iterations int

Minimum number of iterations before checking convergence. Prevents premature stopping.

3
statistical_test bool

Whether to run a statistical significance test comparing the model to humans.

True
alpha float

Significance level for statistical tests.

0.05

Attributes:

Name Type Description
human_agreement_metric str

Agreement metric being used.

convergence_threshold float

Threshold for convergence.

min_iterations int

Minimum iterations required.

statistical_test bool

Whether to run significance tests.

alpha float

Significance level.

human_baseline float | None

Computed human agreement baseline (set via compute_human_baseline).

Examples:

>>> detector = ConvergenceDetector(
...     human_agreement_metric='krippendorff_alpha',
...     convergence_threshold=0.05,
...     min_iterations=3
... )
>>> # Compute human baseline from ratings
>>> ratings = {
...     'human1': [1, 1, 0, 1, 0],
...     'human2': [1, 1, 0, 0, 0],
...     'human3': [1, 0, 0, 1, 0]
... }
>>> baseline = detector.compute_human_baseline(ratings)
>>> detector.human_baseline > 0.0
True
>>> # Check if model converged
>>> converged = detector.check_convergence(
...     model_accuracy=0.75,
...     iteration=5
... )
>>> isinstance(converged, bool)
True

__init__(human_agreement_metric: str = 'krippendorff_alpha', convergence_threshold: float = 0.05, min_iterations: int = 3, statistical_test: bool = True, alpha: float = 0.05) -> None

Initialize convergence detector.

Parameters:

Name Type Description Default
human_agreement_metric str

Inter-annotator agreement metric to use.

'krippendorff_alpha'
convergence_threshold float

Threshold for convergence (model must be within this of human).

0.05
min_iterations int

Minimum iterations before checking convergence.

3
statistical_test bool

Whether to run statistical tests.

True
alpha float

Significance level for tests.

0.05

Raises:

Type Description
ValueError

If parameters are invalid.

compute_human_baseline(human_ratings: dict[str, list[Label | None]], **kwargs: str | int | float | bool | None) -> float

Compute human inter-rater agreement baseline.

Parameters:

Name Type Description Default
human_ratings dict[str, list[Label | None]]

Dictionary mapping human rater IDs to their ratings. For example: {'rater1': [1, 0, 1, ...], 'rater2': [1, 1, 1, ...]}. Missing ratings can be represented as None.

required
**kwargs str | int | float | bool | None

Additional arguments passed to agreement metric function. For example, metric='nominal' for Krippendorff's alpha.

{}

Returns:

Type Description
float

Human agreement score.

Raises:

Type Description
ValueError

If human_ratings is empty or has fewer than 2 raters.

Examples:

>>> detector = ConvergenceDetector()
>>> ratings = {
...     'human1': [1, 1, 0, 1],
...     'human2': [1, 1, 0, 0],
...     'human3': [1, 0, 0, 1]
... }
>>> baseline = detector.compute_human_baseline(ratings)
>>> 0.0 <= baseline <= 1.0
True

check_convergence(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> bool

Check if model has converged to human performance.

Parameters:

Name Type Description Default
model_accuracy float

Model's accuracy on the task.

required
iteration int

Current iteration number (1-indexed).

required
human_agreement float | None

Human agreement score. If None, uses self.human_baseline (which must have been set via compute_human_baseline).

None

Returns:

Type Description
bool

True if model has converged, False otherwise.

Raises:

Type Description
ValueError

If human_agreement is None and human_baseline has not been set.

Examples:

>>> detector = ConvergenceDetector(min_iterations=2, convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> # Too early (iteration 1 < min_iterations 2)
>>> detector.check_convergence(0.79, iteration=1)
False
>>> # Still not converged (0.74 < 0.80 - 0.05)
>>> detector.check_convergence(0.74, iteration=3)
False
>>> # Converged (0.77 >= 0.80 - 0.05)
>>> detector.check_convergence(0.77, iteration=3)
True
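
The decision rule exercised by these examples can be sketched without the library. This is a minimal stand-in, assuming the check consists only of the two documented conditions (a minimum iteration count and an accuracy floor). `has_converged` is a hypothetical helper, not part of bead, and it leaves out the optional statistical test.

```python
def has_converged(
    model_accuracy: float,
    human_agreement: float,
    iteration: int,
    convergence_threshold: float = 0.05,
    min_iterations: int = 3,
) -> bool:
    """Hypothetical stand-in for the documented convergence rule."""
    # Never report convergence before the minimum number of iterations.
    if iteration < min_iterations:
        return False
    # Converged when accuracy is within the threshold of human agreement.
    return model_accuracy >= human_agreement - convergence_threshold


# Same three cases as the doctest above (human agreement 0.80).
print(has_converged(0.79, 0.80, iteration=1, min_iterations=2))  # False
print(has_converged(0.74, 0.80, iteration=3, min_iterations=2))  # False
print(has_converged(0.77, 0.80, iteration=3, min_iterations=2))  # True
```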

compute_statistical_test(model_predictions: list[Label], human_consensus: list[Label], test_type: str = 'mcnemar') -> dict[str, float]

Run statistical test comparing model to human performance.

Parameters:

Name Type Description Default
model_predictions list[Label]

Model's predictions.

required
human_consensus list[Label]

Human consensus labels (e.g., majority vote).

required
test_type str

Type of statistical test:
- "mcnemar": McNemar's test for paired nominal data
- "ttest": Paired t-test (requires multiple samples)

"mcnemar"

Returns:

Type Description
dict[str, float]

Dictionary with keys 'statistic' and 'p_value'.

Raises:

Type Description
ValueError

If predictions and consensus have different lengths.

Examples:

>>> detector = ConvergenceDetector()
>>> model_preds = [1, 1, 0, 1, 0]
>>> human_consensus = [1, 1, 0, 0, 0]
>>> result = detector.compute_statistical_test(model_preds, human_consensus)
>>> 'statistic' in result and 'p_value' in result
True
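
For intuition about the 'mcnemar' option, here is a rough sketch of McNemar's test on two paired binary label lists, using Edwards' continuity correction and the chi-square distribution with one degree of freedom. How bead builds its contingency table is not documented here, so the pairing below (model label versus consensus label) is an assumption, and `mcnemar_binary` is a hypothetical name.

```python
import math
from typing import Sequence


def mcnemar_binary(first: Sequence[int], second: Sequence[int]) -> dict[str, float]:
    """McNemar's test for paired 0/1 labels (illustrative sketch only)."""
    if len(first) != len(second):
        raise ValueError("inputs must have the same length")
    # Discordant pairs: items on which the two label sources differ.
    b = sum(1 for x, y in zip(first, second) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(first, second) if x == 0 and y == 1)
    if b + c == 0:
        # No discordant pairs: no evidence of a difference.
        return {"statistic": 0.0, "p_value": 1.0}
    # Chi-square statistic with Edwards' continuity correction.
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return {"statistic": statistic, "p_value": p_value}


result = mcnemar_binary([1, 1, 0, 1, 0], [1, 1, 0, 0, 0])
print(result)
```

With only one discordant pair, as in this input, the corrected statistic is zero, so short label lists rarely give the test enough power to detect a difference.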

get_convergence_report(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> ConvergenceReport

Generate convergence report with status and metrics.

Parameters:

Name Type Description Default
model_accuracy float

Model's current accuracy.

required
iteration int

Current iteration number.

required
human_agreement float | None

Human agreement score (uses baseline if None).

None

Returns:

Type Description
ConvergenceReport

Report with convergence status and metrics.

Examples:

>>> detector = ConvergenceDetector(convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> report = detector.get_convergence_report(0.77, iteration=5)
>>> report['converged']
True
>>> round(report['gap'], 2)
0.03
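
The report fields relate to one another roughly as follows. This sketch assumes that `gap` is human agreement minus model accuracy and that `required_accuracy` is human agreement minus the threshold, which matches the field descriptions but is not confirmed against the implementation. `build_report` is a hypothetical stand-in that returns a plain dict.

```python
def build_report(
    model_accuracy: float,
    human_agreement: float,
    iteration: int,
    threshold: float = 0.05,
    min_iterations: int = 3,
) -> dict:
    """Assemble a dict with the documented ConvergenceReport keys (sketch)."""
    required_accuracy = human_agreement - threshold  # assumed definition
    meets_min_iterations = iteration >= min_iterations
    return {
        "converged": meets_min_iterations and model_accuracy >= required_accuracy,
        "model_accuracy": model_accuracy,
        "human_agreement": human_agreement,
        "gap": human_agreement - model_accuracy,  # assumed definition
        "required_accuracy": required_accuracy,
        "threshold": threshold,
        "iteration": iteration,
        "meets_min_iterations": meets_min_iterations,
        "min_iterations_required": min_iterations,
    }


report = build_report(0.77, 0.80, iteration=5)
print(report["converged"], round(report["gap"], 2))
```

The gap is rounded before printing because binary floating point does not represent 0.80 - 0.77 as exactly 0.03.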

Inter-Annotator Agreement

interannotator

Inter-annotator agreement metrics.

This module provides inter-annotator agreement metrics for assessing reliability and consistency across multiple human annotators. It uses sklearn.metrics for Cohen's kappa, statsmodels for Fleiss' kappa, and the krippendorff package for Krippendorff's alpha.

InterAnnotatorMetrics

Inter-annotator agreement metrics for reliability assessment.

Provides static methods for computing various agreement metrics:
- Percentage agreement (simple)
- Cohen's kappa (2 raters, categorical)
- Fleiss' kappa (multiple raters, categorical)
- Krippendorff's alpha (general, multiple data types)
- Pairwise agreement (all pairs of raters)

Examples:

>>> # Cohen's kappa for 2 raters
>>> rater1 = [0, 1, 0, 1, 1]
>>> rater2 = [0, 1, 1, 1, 1]
>>> round(InterAnnotatorMetrics.cohens_kappa(rater1, rater2), 3)
0.545
>>> # Percentage agreement
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8

percentage_agreement(rater1: list[Label], rater2: list[Label]) -> float staticmethod

Compute simple percentage agreement between two raters.

Parameters:

Name Type Description Default
rater1 list[Label]

Ratings from first rater.

required
rater2 list[Label]

Ratings from second rater.

required

Returns:

Type Description
float

Percentage agreement (0.0 to 1.0).

Raises:

Type Description
ValueError

If rater lists have different lengths.

Examples:

>>> rater1 = [1, 2, 3, 1, 2]
>>> rater2 = [1, 2, 2, 1, 2]
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8
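
Percentage agreement is the fraction of positions at which both raters gave the same label. A minimal pure-Python sketch follows; the function mirrors the documented name but is a stand-in, and rejecting empty input is an assumption.

```python
from typing import Hashable, Sequence


def percentage_agreement(rater1: Sequence[Hashable], rater2: Sequence[Hashable]) -> float:
    """Share of items on which two raters agree (illustrative stand-in)."""
    if len(rater1) != len(rater2):
        raise ValueError("rater lists must have the same length")
    if not rater1:
        # Assumption: empty input is rejected rather than returning 0.0.
        raise ValueError("rater lists must not be empty")
    matches = sum(1 for a, b in zip(rater1, rater2) if a == b)
    return matches / len(rater1)


print(percentage_agreement([1, 2, 3, 1, 2], [1, 2, 2, 1, 2]))  # 0.8
```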

cohens_kappa(rater1: list[Label], rater2: list[Label]) -> float staticmethod

Compute Cohen's kappa for two raters.

Cohen's kappa measures agreement between two raters beyond chance. Values range from -1 (complete disagreement) to 1 (perfect agreement), with 0 indicating chance-level agreement.

Parameters:

Name Type Description Default
rater1 list[Label]

Ratings from first rater.

required
rater2 list[Label]

Ratings from second rater.

required

Returns:

Type Description
float

Cohen's kappa coefficient.

Raises:

Type Description
ValueError

If rater lists have different lengths or are empty.

Examples:

>>> # Perfect agreement
>>> rater1 = [0, 1, 0, 1]
>>> rater2 = [0, 1, 0, 1]
>>> InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
1.0
>>> # Complete disagreement (kappa = -1)
>>> rater1 = [0, 0, 1, 1]
>>> rater2 = [1, 1, 0, 0]
>>> kappa = InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
>>> abs(kappa - (-1.0)) < 0.01
True
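
The coefficient itself is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below computes it directly. It is a stand-in for illustration rather than the sklearn-backed implementation, and its handling of the degenerate case p_e = 1 is an assumption.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater1: Sequence[Hashable], rater2: Sequence[Hashable]) -> float:
    """Cohen's kappa from first principles (illustrative stand-in)."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("rater lists must be non-empty and of equal length")
    n = len(rater1)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(1 for a, b in zip(rater1, rater2) if a == b) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[label] * c2[label] for label in c1.keys() | c2.keys()) / (n * n)
    if p_e == 1.0:
        # Assumption: both raters used a single identical label; treat as perfect.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


print(cohens_kappa([0, 1, 0, 1], [0, 1, 0, 1]))  # 1.0
print(cohens_kappa([0, 0, 1, 1], [1, 1, 0, 0]))  # -1.0
```

For rater1 = [0, 1, 0, 1, 1] and rater2 = [0, 1, 1, 1, 1], p_o = 0.8 and p_e = 0.56, giving kappa = 0.24 / 0.44, or about 0.545.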

fleiss_kappa(ratings_matrix: np.ndarray[int, np.dtype[np.int_]]) -> float staticmethod

Compute Fleiss' kappa for multiple raters.

Fleiss' kappa generalizes Cohen's kappa to multiple raters. It measures agreement beyond chance when multiple raters assign categorical ratings to a set of items.

Parameters:

Name Type Description Default
ratings_matrix ndarray

Matrix of shape (n_items, n_categories) where element [i, j] contains the number of raters who assigned item i to category j.

required

Returns:

Type Description
float

Fleiss' kappa coefficient.

Raises:

Type Description
ValueError

If matrix is empty or has wrong shape.

ImportError

If statsmodels is not installed.

Examples:

>>> # 4 items, 3 categories, 5 raters each
>>> # Item 1: 3 raters chose cat 0, 2 chose cat 1, 0 chose cat 2
>>> ratings = np.array([
...     [3, 2, 0],  # Item 1
...     [0, 0, 5],  # Item 2
...     [2, 3, 0],  # Item 3
...     [1, 1, 3],  # Item 4
... ])
>>> kappa = InterAnnotatorMetrics.fleiss_kappa(ratings)
>>> 0.0 <= kappa <= 1.0
True
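
To show what the coefficient measures on a count matrix like the one above, here is a plain-Python sketch of Fleiss' kappa. It assumes every item was rated by the same number of raters. The real method delegates to statsmodels, so this stand-in only illustrates the formula.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa from an (n_items, n_categories) count matrix (sketch)."""
    if not counts or not counts[0]:
        raise ValueError("matrix must be non-empty")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


ratings = [[3, 2, 0], [0, 0, 5], [2, 3, 0], [1, 1, 3]]
print(round(fleiss_kappa(ratings), 4))  # 0.2803
```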

krippendorff_alpha(reliability_data: dict[str, list[Label | None]], metric: str = 'nominal') -> float staticmethod

Compute Krippendorff's alpha for multiple raters.

Krippendorff's alpha is the most general inter-rater reliability measure. It handles:
- Any number of raters
- Missing data
- Different data types (nominal, ordinal, interval, ratio)

Parameters:

Name Type Description Default
reliability_data dict[str, list[Label | None]]

Dictionary mapping rater IDs to their ratings. Each rater's ratings list must have same length (use None for missing values).

required
metric str

Distance metric to use:
- "nominal": for categorical data (default)
- "ordinal": for ordered categories
- "interval": for interval-scaled data
- "ratio": for ratio-scaled data

"nominal"

Returns:

Type Description
float

Krippendorff's alpha coefficient (1.0 = perfect agreement, 0.0 = chance agreement, < 0.0 = systematic disagreement).

Raises:

Type Description
ValueError

If reliability_data is empty or rater lists have different lengths.

Examples:

>>> # 3 raters, 5 items (with one missing value)
>>> data = {
...     'rater1': [1, 2, 3, 4, 5],
...     'rater2': [1, 2, 3, 4, 5],
...     'rater3': [1, 2, None, 4, 5]
... }
>>> alpha = InterAnnotatorMetrics.krippendorff_alpha(data)
>>> alpha > 0.8  # High agreement
True
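
For the nominal case, alpha is 1 - D_o / D_e, where D_o is the observed disagreement within items and D_e is the disagreement expected from the pooled label frequencies. The sketch below covers only the nominal metric and drops items with fewer than two ratings. It is a stand-in with a hypothetical name, whereas the documented method relies on the krippendorff package.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def krippendorff_alpha_nominal(
    reliability_data: dict[str, Sequence[Optional[Hashable]]],
) -> float:
    """Nominal Krippendorff's alpha with None as missing (illustrative sketch)."""
    if not reliability_data:
        raise ValueError("reliability_data must not be empty")
    if len({len(v) for v in reliability_data.values()}) != 1:
        raise ValueError("all raters must rate the same number of items")
    value_totals: Counter = Counter()
    disagreement = 0.0
    for unit in zip(*reliability_data.values()):
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with a single rating carries no pair information
        counts = Counter(values)
        value_totals.update(counts)
        # Ordered disagreeing pairs within the item, weighted by 1 / (m - 1).
        disagreement += (m * m - sum(c * c for c in counts.values())) / (m - 1)
    n = sum(value_totals.values())
    # Ordered disagreeing pairs expected from the pooled label totals.
    expected = n * n - sum(c * c for c in value_totals.values())
    if expected == 0:
        raise ValueError("alpha is undefined when fewer than two labels occur")
    return 1.0 - (n - 1) * disagreement / expected


data = {
    "rater1": [1, 2, 3, 4, 5],
    "rater2": [1, 2, 3, 4, 5],
    "rater3": [1, 2, None, 4, 5],
}
print(krippendorff_alpha_nominal(data))  # 1.0
```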

pairwise_agreement(ratings: dict[str, list[Label]]) -> dict[str, dict[str, float]] staticmethod

Compute pairwise agreement metrics for all rater pairs.

Parameters:

Name Type Description Default
ratings dict[str, list[Label]]

Dictionary mapping rater IDs to their ratings.

required

Returns:

Type Description
dict[str, dict[str, float]]

Nested dictionary with structure: { 'percentage_agreement': {('rater1', 'rater2'): 0.85, ...}, 'cohens_kappa': {('rater1', 'rater2'): 0.75, ...} }

Examples:

>>> ratings = {
...     'rater1': [1, 2, 3],
...     'rater2': [1, 2, 3],
...     'rater3': [1, 2, 2]
... }
>>> result = InterAnnotatorMetrics.pairwise_agreement(ratings)
>>> result['percentage_agreement'][('rater1', 'rater2')]
1.0
>>> result['cohens_kappa'][('rater1', 'rater2')]
1.0
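
The pairwise result can be pictured as one metric value per unordered pair of raters. The sketch below builds such a mapping for percentage agreement only, keyed by tuples of rater IDs as in the example above. The helper name and the sorted pair order are assumptions.

```python
from itertools import combinations
from typing import Hashable, Sequence


def pairwise_percentage_agreement(
    ratings: dict[str, Sequence[Hashable]],
) -> dict[tuple[str, str], float]:
    """Percentage agreement for every pair of raters (illustrative sketch)."""
    result: dict[tuple[str, str], float] = {}
    # combinations() yields each unordered pair exactly once.
    for a, b in combinations(sorted(ratings), 2):
        ra, rb = ratings[a], ratings[b]
        if len(ra) != len(rb) or not ra:
            raise ValueError("rater lists must be non-empty and of equal length")
        result[(a, b)] = sum(1 for x, y in zip(ra, rb) if x == y) / len(ra)
    return result


ratings = {"rater1": [1, 2, 3], "rater2": [1, 2, 3], "rater3": [1, 2, 2]}
for pair, score in pairwise_percentage_agreement(ratings).items():
    print(pair, round(score, 3))
```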

Per-Annotator Reliability

reliability

Per-annotator reliability summaries.

This module complements bead.evaluation.InterAnnotatorMetrics. Where the inter-annotator metrics quantify agreement across raters, this module quantifies the response diversity of each individual rater. Low within-annotator entropy is a flag that the annotator is collapsing the response space (always picking "yes", always picking the midpoint, and so on), which can distort agreement metrics.

The canonical input is a sequence of AnnotationRecord instances, each carrying an annotator_id, item_id, response_label, and question_name. The Shannon entropy of each annotator's per-question response distribution is computed in bits.

AnnotationRecord

Bases: BeadBaseModel

A single annotator response.

Canonical record shape consumed by reliability and inter-annotator metrics. Conforms structurally to bead.protocol.diagnostics.RecordLike.

Attributes:

Name Type Description
annotator_id str

Identifier of the annotator who produced the response.

item_id str

Identifier of the annotation item.

question_name str

Anchor name of the question that was answered.

response_label str

The annotator's response label (must be one of the labels of the corresponding ResponseEncoding).

AnnotatorReliability

Bases: BeadBaseModel

Per-annotator reliability summary.

Captures how diverse a single annotator's responses are within each question. Low entropy means the annotator collapses the response space.

Attributes:

Name Type Description
annotator_id str

The annotator's identifier.

n_responses int

Total responses from this annotator across all questions.

response_distribution dict[str, dict[str, int]]

Per-question distribution of responses, keyed by anchor name and then by response label, with counts as values.

entropy_per_question dict[str, float]

Per-question Shannon entropy in bits. 0.0 when the annotator only used one label for that question.

Examples:

>>> rel = AnnotatorReliability(
...     annotator_id="ann_1",
...     n_responses=4,
...     response_distribution={
...         "completion": {"yes": 2, "no": 2},
...     },
...     entropy_per_question={"completion": 1.0},
... )
>>> rel.entropy("completion")
1.0
>>> rel.entropy("missing") is None
True

entropy(question_name: str) -> float | None

Return the Shannon entropy for one question, or None.

Parameters:

Name Type Description Default
question_name str

Anchor name to look up.

required

Returns:

Type Description
float | None

Entropy in bits, or None if no responses were recorded for this question.

annotator_reliability(records: Sequence[AnnotationRecord], encodings: Mapping[str, ResponseEncoding] | None = None) -> tuple[AnnotatorReliability, ...]

Compute per-annotator reliability summaries.

Groups records by annotator, then by question, and computes Shannon entropy in bits on each annotator-question label distribution. When encodings is supplied, response labels not present in the encoding for a question are silently skipped (a common case after schema evolution).

Parameters:

Name Type Description Default
records Sequence[AnnotationRecord]

All records across questions and annotators.

required
encodings Mapping[str, ResponseEncoding] | None

Per-question encodings used to filter unrecognized labels. When None (the default), every label is counted.

None

Returns:

Type Description
tuple[AnnotatorReliability, ...]

One summary per annotator, sorted by annotator id.

Examples:

>>> records = [
...     AnnotationRecord(annotator_id="a1", item_id="i1",
...                      question_name="q", response_label="yes"),
...     AnnotationRecord(annotator_id="a1", item_id="i2",
...                      question_name="q", response_label="no"),
...     AnnotationRecord(annotator_id="a2", item_id="i1",
...                      question_name="q", response_label="yes"),
...     AnnotationRecord(annotator_id="a2", item_id="i2",
...                      question_name="q", response_label="yes"),
... ]
>>> profiles = annotator_reliability(records)
>>> [(p.annotator_id, p.entropy("q")) for p in profiles]
[('a1', 1.0), ('a2', 0.0)]
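
The grouping and entropy computation described above can be approximated in a few lines. In this sketch, `Record` is a hypothetical NamedTuple standing in for AnnotationRecord, and the result is a nested dict rather than AnnotatorReliability objects. The optional encodings filter is omitted.

```python
import math
from collections import Counter, defaultdict
from typing import NamedTuple, Sequence


class Record(NamedTuple):
    """Hypothetical stand-in for AnnotationRecord."""
    annotator_id: str
    item_id: str
    question_name: str
    response_label: str


def shannon_entropy_bits(counts: Counter) -> float:
    """Shannon entropy in bits of a label-count distribution."""
    total = sum(counts.values())
    # sum p * log2(1 / p); a single-label distribution gives exactly 0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_by_annotator(records: Sequence[Record]) -> dict[str, dict[str, float]]:
    """Map annotator id -> question name -> entropy in bits."""
    grouped: dict[tuple[str, str], Counter] = defaultdict(Counter)
    for r in records:
        grouped[(r.annotator_id, r.question_name)][r.response_label] += 1
    result: dict[str, dict[str, float]] = {}
    for (annotator, question), counts in sorted(grouped.items()):
        result.setdefault(annotator, {})[question] = shannon_entropy_bits(counts)
    return result


records = [
    Record("a1", "i1", "q", "yes"),
    Record("a1", "i2", "q", "no"),
    Record("a2", "i1", "q", "yes"),
    Record("a2", "i2", "q", "yes"),
]
print(entropy_by_annotator(records))  # {'a1': {'q': 1.0}, 'a2': {'q': 0.0}}
```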

low_entropy_annotators(profiles: Sequence[AnnotatorReliability], *, threshold: float, question_name: str | None = None, require_min_responses: int = 1) -> tuple[str, ...]

Return annotator ids whose entropy falls at or below a threshold.

Useful for flagging annotators who collapse the response space. When question_name is supplied, the threshold is checked against that one question's entropy; otherwise it is checked against the minimum per-question entropy across every question the annotator answered.

Parameters:

Name Type Description Default
profiles Sequence[AnnotatorReliability]

Reliability summaries to scan.

required
threshold float

Entropy ceiling in bits. Annotators with entropy at or below this value are returned.

required
question_name str | None

Restrict the check to one question. Defaults to None, in which case the threshold is compared against the minimum entropy across all questions the annotator answered.

None
require_min_responses int

Skip annotators whose response count is below this value. Defaults to 1.

1

Returns:

Type Description
tuple[str, ...]

Annotator ids meeting the criterion, sorted.

Examples:

>>> profiles = (
...     AnnotatorReliability(annotator_id="a1", n_responses=10,
...                          entropy_per_question={"q": 0.0}),
...     AnnotatorReliability(annotator_id="a2", n_responses=10,
...                          entropy_per_question={"q": 0.95}),
... )
>>> low_entropy_annotators(profiles, threshold=0.5)
('a1',)
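
The selection rule can be sketched with plain dicts in place of AnnotatorReliability objects. The hypothetical `flag_low_entropy` below follows the documented behaviour (single-question check or minimum across questions, then a sorted result). Skipping annotators with no entropy value for the requested question is an assumption.

```python
from typing import Any, Mapping, Optional, Sequence


def flag_low_entropy(
    profiles: Sequence[Mapping[str, Any]],
    *,
    threshold: float,
    question_name: Optional[str] = None,
    require_min_responses: int = 1,
) -> tuple[str, ...]:
    """Return ids of annotators whose entropy is at or below the threshold."""
    flagged = []
    for profile in profiles:
        if profile["n_responses"] < require_min_responses:
            continue  # too few responses to judge
        entropies = profile["entropy_per_question"]
        if question_name is not None:
            value = entropies.get(question_name)
        else:
            # Minimum entropy across all questions the annotator answered.
            value = min(entropies.values(), default=None)
        # Assumption: annotators without a value for the question are skipped.
        if value is not None and value <= threshold:
            flagged.append(profile["annotator_id"])
    return tuple(sorted(flagged))


profiles = [
    {"annotator_id": "a2", "n_responses": 10, "entropy_per_question": {"q": 0.95}},
    {"annotator_id": "a1", "n_responses": 10, "entropy_per_question": {"q": 0.0}},
]
print(flag_low_entropy(profiles, threshold=0.5))  # ('a1',)
```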