bead.evaluation¶

Metrics and evaluation utilities for convergence detection, inter-annotator agreement, and per-annotator reliability.

Convergence Detection¶

`convergence` ¶

Convergence detection for active learning.

This module provides tools for detecting when a model has converged to human-level performance, which serves as a stopping criterion for active learning loops.

`ConvergenceReport` ¶

Bases: TypedDict

Convergence report structure.

Attributes:

Name	Type	Description
`converged`	`bool`	Whether model has converged.
`model_accuracy`	`float`	Model's current accuracy.
`human_agreement`	`float`	Human agreement score.
`gap`	`float`	Difference between human agreement and model accuracy.
`required_accuracy`	`float`	Minimum accuracy required for convergence.
`threshold`	`float`	Convergence threshold.
`iteration`	`int`	Current iteration number.
`meets_min_iterations`	`bool`	Whether minimum iterations requirement is met.
`min_iterations_required`	`int`	Minimum iterations required before checking convergence.

`ConvergenceDetector` ¶

Detect convergence of model performance to human agreement.

This class monitors model performance and compares it to human inter-annotator agreement to determine when active learning can stop. Convergence is reached when the model's accuracy is no more than `convergence_threshold` below human agreement (accuracy >= human_agreement - convergence_threshold) and the minimum number of iterations has been completed.

Parameters:

Name	Type	Description	Default
`human_agreement_metric`	`str`	Which inter-annotator agreement metric to use as the baseline: "krippendorff_alpha" (most general; handles missing data and multiple raters), "fleiss_kappa" (multiple raters, no missing data), "cohens_kappa" (two raters only), or "percentage_agreement" (simple agreement rate).	`"krippendorff_alpha"`
`convergence_threshold`	`float`	Model must be within this threshold of human agreement to converge. For example, 0.05 means model accuracy must be >= (human_agreement - 0.05).	`0.05`
`min_iterations`	`int`	Minimum number of iterations before checking convergence. Prevents premature stopping.	`3`
`statistical_test`	`bool`	Whether to run statistical significance test comparing model to humans.	`True`
`alpha`	`float`	Significance level for statistical tests.	`0.05`

Attributes:

Name	Type	Description
`human_agreement_metric`	`str`	Agreement metric being used.
`convergence_threshold`	`float`	Threshold for convergence.
`min_iterations`	`int`	Minimum iterations required.
`statistical_test`	`bool`	Whether to run significance tests.
`alpha`	`float`	Significance level.
`human_baseline`	`float \| None`	Computed human agreement baseline (set via compute_human_baseline).

Examples:

>>> detector = ConvergenceDetector(
...     human_agreement_metric='krippendorff_alpha',
...     convergence_threshold=0.05,
...     min_iterations=3
... )
>>> # Compute human baseline from ratings
>>> ratings = {
...     'human1': [1, 1, 0, 1, 0],
...     'human2': [1, 1, 0, 0, 0],
...     'human3': [1, 0, 0, 1, 0]
... }
>>> baseline = detector.compute_human_baseline(ratings)
>>> detector.human_baseline > 0.0
True
>>> # Check if model converged
>>> converged = detector.check_convergence(
...     model_accuracy=0.75,
...     iteration=5
... )
>>> isinstance(converged, bool)
True

`__init__(human_agreement_metric: str = 'krippendorff_alpha', convergence_threshold: float = 0.05, min_iterations: int = 3, statistical_test: bool = True, alpha: float = 0.05) -> None` ¶

Initialize convergence detector.

Parameters:

Name	Type	Description	Default
`human_agreement_metric`	`str`	Inter-annotator agreement metric to use.	`'krippendorff_alpha'`
`convergence_threshold`	`float`	Threshold for convergence (model must be within this of human).	`0.05`
`min_iterations`	`int`	Minimum iterations before checking convergence.	`3`
`statistical_test`	`bool`	Whether to run statistical tests.	`True`
`alpha`	`float`	Significance level for tests.	`0.05`

Raises:

Type	Description
`ValueError`	If parameters are invalid.

`compute_human_baseline(human_ratings: dict[str, list[Label | None]], **kwargs: str | int | float | bool | None) -> float` ¶

Compute human inter-rater agreement baseline.

Parameters:

Name	Type	Description	Default
`human_ratings`	`dict[str, list[Label \| None]]`	Dictionary mapping human rater IDs to their ratings. For example: {'rater1': [1, 0, 1, ...], 'rater2': [1, 1, 1, ...]}. Missing ratings can be represented as None.	required
`**kwargs`	`str \| int \| float \| bool \| None`	Additional arguments passed to agreement metric function. For example, metric='nominal' for Krippendorff's alpha.	`{}`

Returns:

Type	Description
`float`	Human agreement score.

Raises:

Type	Description
`ValueError`	If human_ratings is empty or has fewer than 2 raters.

Examples:

>>> detector = ConvergenceDetector()
>>> ratings = {
...     'human1': [1, 1, 0, 1],
...     'human2': [1, 1, 0, 0],
...     'human3': [1, 0, 0, 1]
... }
>>> baseline = detector.compute_human_baseline(ratings)
>>> 0.0 <= baseline <= 1.0
True

`check_convergence(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> bool` ¶

Check if model has converged to human performance.

Parameters:

Name	Type	Description	Default
`model_accuracy`	`float`	Model's accuracy on the task.	required
`iteration`	`int`	Current iteration number (1-indexed).	required
`human_agreement`	`float \| None`	Human agreement score. If None, uses self.human_baseline (which must have been set via compute_human_baseline).	`None`

Returns:

Type	Description
`bool`	True if model has converged, False otherwise.

Raises:

Type	Description
`ValueError`	If human_agreement is None and human_baseline not set.

Examples:

>>> detector = ConvergenceDetector(min_iterations=2, convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> # Too early (iteration 1 < min_iterations 2)
>>> detector.check_convergence(0.79, iteration=1)
False
>>> # Still not converged (0.74 < 0.80 - 0.05)
>>> detector.check_convergence(0.74, iteration=3)
False
>>> # Converged (0.77 >= 0.80 - 0.05)
>>> detector.check_convergence(0.77, iteration=3)
True
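
The checks in the examples above can be sketched as a small standalone function. This is an illustration of the documented rule, not the library's implementation: `has_converged` is a hypothetical name, and treating `iteration == min_iterations` as sufficient is an assumption.

```python
def has_converged(
    model_accuracy: float,
    human_agreement: float,
    iteration: int,
    convergence_threshold: float = 0.05,
    min_iterations: int = 3,
) -> bool:
    """Hypothetical stand-in for the documented convergence rule."""
    # Never report convergence before the minimum number of iterations.
    # Assumption: reaching min_iterations exactly is enough.
    if iteration < min_iterations:
        return False
    # The model may trail human agreement by at most the threshold.
    required_accuracy = human_agreement - convergence_threshold
    return model_accuracy >= required_accuracy


print(has_converged(0.79, 0.80, iteration=1, min_iterations=2))  # False
print(has_converged(0.74, 0.80, iteration=3, min_iterations=2))  # False
print(has_converged(0.77, 0.80, iteration=3, min_iterations=2))  # True
```

The `required_accuracy` value here corresponds to the field of the same name in `ConvergenceReport`.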

`compute_statistical_test(model_predictions: list[Label], human_consensus: list[Label], test_type: str = 'mcnemar') -> dict[str, float]` ¶

Run statistical test comparing model to human performance.

Parameters:

Name	Type	Description	Default
`model_predictions`	`list[Label]`	Model's predictions.	required
`human_consensus`	`list[Label]`	Human consensus labels (e.g., majority vote).	required
`test_type`	`str`	Type of statistical test: "mcnemar" (McNemar's test for paired nominal data) or "ttest" (paired t-test; requires multiple samples).	`"mcnemar"`

Returns:

Type	Description
`dict[str, float]`	Dictionary with keys 'statistic' and 'p_value'.

Raises:

Type	Description
`ValueError`	If predictions and consensus have different lengths.

Examples:

>>> detector = ConvergenceDetector()
>>> model_preds = [1, 1, 0, 1, 0]
>>> human_consensus = [1, 1, 0, 0, 0]
>>> result = detector.compute_statistical_test(model_preds, human_consensus)
>>> 'statistic' in result and 'p_value' in result
True
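
For intuition about what a `'mcnemar'` result contains, here is a minimal sketch of McNemar's test on binary 0/1 labels. It is an assumption-laden illustration: the source does not say how the library builds its contingency table, whether it applies a continuity correction, or whether it uses an exact test. The name `mcnemar_sketch` is hypothetical.

```python
import math


def mcnemar_sketch(model_predictions, human_consensus):
    """McNemar's test without continuity correction, for binary 0/1 labels."""
    if len(model_predictions) != len(human_consensus):
        raise ValueError("predictions and consensus must have the same length")
    pairs = list(zip(model_predictions, human_consensus))
    # Only discordant pairs carry information about a systematic difference.
    b = sum(1 for m, h in pairs if m == 1 and h == 0)
    c = sum(1 for m, h in pairs if m == 0 and h == 1)
    if b + c == 0:
        # No disagreements at all: nothing to test.
        return {"statistic": 0.0, "p_value": 1.0}
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return {"statistic": statistic, "p_value": p_value}


result = mcnemar_sketch([1, 1, 0, 1, 0], [1, 1, 0, 0, 0])
print(result["statistic"])          # 1.0
print(round(result["p_value"], 4))  # 0.3173
```

With a single discordant pair the p-value is far above the default `alpha` of 0.05, so this sketch would not flag a significant difference between model and consensus.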

`get_convergence_report(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> ConvergenceReport` ¶

Generate convergence report with status and metrics.

Parameters:

Name	Type	Description	Default
`model_accuracy`	`float`	Model's current accuracy.	required
`iteration`	`int`	Current iteration number.	required
`human_agreement`	`float \| None`	Human agreement score (uses baseline if None).	`None`

Returns:

Type	Description
`ConvergenceReport`	Report with convergence status and metrics.

Examples:

>>> detector = ConvergenceDetector(convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> report = detector.get_convergence_report(0.77, iteration=5)
>>> report['converged']
True
>>> round(report['gap'], 2)
0.03

Inter-Annotator Agreement¶

`interannotator` ¶

Inter-annotator agreement metrics.

This module provides inter-annotator agreement metrics for assessing reliability and consistency across multiple human annotators. It uses `sklearn.metrics` for Cohen's kappa, `statsmodels` for Fleiss' kappa, and the `krippendorff` package for Krippendorff's alpha.

`InterAnnotatorMetrics` ¶

Inter-annotator agreement metrics for reliability assessment.

Provides static methods for computing various agreement metrics:

- Percentage agreement (simple)
- Cohen's kappa (2 raters, categorical)
- Fleiss' kappa (multiple raters, categorical)
- Krippendorff's alpha (general, multiple data types)
- Pairwise agreement (all pairs of raters)

Examples:

>>> # Cohen's kappa for 2 raters
>>> rater1 = [0, 1, 0, 1, 1]
>>> rater2 = [0, 1, 1, 1, 1]
>>> round(InterAnnotatorMetrics.cohens_kappa(rater1, rater2), 2)
0.55
>>> # Percentage agreement
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8

`percentage_agreement(rater1: list[Label], rater2: list[Label]) -> float` `staticmethod` ¶

Compute simple percentage agreement between two raters.

Parameters:

Name	Type	Description	Default
`rater1`	`list[Label]`	Ratings from first rater.	required
`rater2`	`list[Label]`	Ratings from second rater.	required

Returns:

Type	Description
`float`	Percentage agreement (0.0 to 1.0).

Raises:

Type	Description
`ValueError`	If rater lists have different lengths.

Examples:

>>> rater1 = [1, 2, 3, 1, 2]
>>> rater2 = [1, 2, 2, 1, 2]
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8

`cohens_kappa(rater1: list[Label], rater2: list[Label]) -> float` `staticmethod` ¶

Compute Cohen's kappa for two raters.

Cohen's kappa measures agreement between two raters beyond chance. Values range from -1 (complete disagreement) to 1 (perfect agreement), with 0 indicating chance-level agreement.
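
The definition can be made concrete with the standard formula kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The following is a minimal pure-Python sketch of that formula, not the library's code (which, per the module description, delegates to `sklearn.metrics`); `cohens_kappa_sketch` is a hypothetical name.

```python
from collections import Counter


def cohens_kappa_sketch(rater1, rater2):
    """Cohen's kappa computed directly from its definition."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("rater lists must be non-empty and of equal length")
    n = len(rater1)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    counts1, counts2 = Counter(rater1), Counter(rater2)
    p_e = sum(counts1[label] * counts2[label] for label in counts1) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 is a convention of this sketch only.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


print(cohens_kappa_sketch([0, 1, 0, 1], [0, 1, 0, 1]))  # 1.0
print(cohens_kappa_sketch([0, 0, 1, 1], [1, 1, 0, 0]))  # -1.0
```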

Parameters:

Name	Type	Description	Default
`rater1`	`list[Label]`	Ratings from first rater.	required
`rater2`	`list[Label]`	Ratings from second rater.	required

Returns:

Type	Description
`float`	Cohen's kappa coefficient.

Raises:

Type	Description
`ValueError`	If rater lists have different lengths or are empty.

Examples:

>>> # Perfect agreement
>>> rater1 = [0, 1, 0, 1]
>>> rater2 = [0, 1, 0, 1]
>>> InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
1.0
>>> # Complete disagreement
>>> rater1 = [0, 0, 1, 1]
>>> rater2 = [1, 1, 0, 0]
>>> kappa = InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
>>> abs(kappa - (-1.0)) < 0.01
True

`fleiss_kappa(ratings_matrix: np.ndarray[int, np.dtype[np.int_]]) -> float` `staticmethod` ¶

Compute Fleiss' kappa for multiple raters.

Fleiss' kappa generalizes Cohen's kappa to multiple raters. It measures agreement beyond chance when multiple raters assign categorical ratings to a set of items.
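
As a worked illustration of the statistic, the sketch below computes Fleiss' kappa from an items-by-categories count matrix given as nested lists. It follows the textbook formula and assumes every item was rated by the same number of raters; it is not the library's implementation (which, per the module description, uses `statsmodels`), and `fleiss_kappa_sketch` is a hypothetical name.

```python
def fleiss_kappa_sketch(ratings_matrix):
    """ratings_matrix[i][j] = number of raters assigning item i to category j."""
    n_items = len(ratings_matrix)
    if n_items == 0:
        raise ValueError("ratings matrix must not be empty")
    n_raters = sum(ratings_matrix[0])  # assumed identical for every item
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings_matrix):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(ratings_matrix[0])
    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings_matrix
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    total = n_items * n_raters
    p_cats = [
        sum(row[j] for row in ratings_matrix) / total
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when one category holds every rating")
    return (p_bar - p_e) / (1.0 - p_e)


ratings = [[3, 2, 0], [0, 0, 5], [2, 3, 0], [1, 1, 3]]
print(round(fleiss_kappa_sketch(ratings), 3))  # 0.28
```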

Parameters:

Name	Type	Description	Default
`ratings_matrix`	`ndarray`	Matrix of shape (n_items, n_categories) where element [i, j] contains the number of raters who assigned item i to category j.	required

Returns:

Type	Description
`float`	Fleiss' kappa coefficient.

Raises:

Type	Description
`ValueError`	If matrix is empty or has wrong shape.
`ImportError`	If statsmodels is not installed.

Examples:

>>> # 4 items, 3 categories, 5 raters each
>>> # Item 1: 3 raters chose cat 0, 2 chose cat 1, 0 chose cat 2
>>> ratings = np.array([
...     [3, 2, 0],  # Item 1
...     [0, 0, 5],  # Item 2
...     [2, 3, 0],  # Item 3
...     [1, 1, 3],  # Item 4
... ])
>>> kappa = InterAnnotatorMetrics.fleiss_kappa(ratings)
>>> 0.0 <= kappa <= 1.0
True

`krippendorff_alpha(reliability_data: dict[str, list[Label | None]], metric: str = 'nominal') -> float` `staticmethod` ¶

Compute Krippendorff's alpha for multiple raters.

Krippendorff's alpha is the most general inter-rater reliability measure. It handles:

- Any number of raters
- Missing data
- Different data types (nominal, ordinal, interval, ratio)
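
To show how missing data enters the computation, here is a minimal sketch of Krippendorff's alpha for the nominal metric only, built from a coincidence matrix. It is an illustration under stated assumptions, not the library's code (which, per the module description, uses the `krippendorff` package); the function name is hypothetical, and returning 1.0 when only one label ever occurs is a convention of this sketch.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(reliability_data):
    """Nominal alpha; None marks a missing rating."""
    raters = list(reliability_data.values())
    if not raters or len({len(r) for r in raters}) != 1:
        raise ValueError("need at least one rater and equal-length rating lists")
    coincidences = Counter()
    for unit in zip(*raters):
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # units with fewer than two ratings cannot be paired
        # Each ordered pair of ratings within a unit contributes 1 / (m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable ratings")
    # Observed and expected disagreement (nominal: any mismatch counts as 1).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        return 1.0  # sketch-only convention when a single label is used
    return 1.0 - observed / expected


data = {
    "rater1": [1, 2, 3, 4, 5],
    "rater2": [1, 2, 3, 4, 5],
    "rater3": [1, 2, None, 4, 5],
}
print(krippendorff_alpha_nominal(data))  # 1.0
```

The missing value simply reduces the number of pairable ratings for that item; the item still contributes because two raters labeled it.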

Parameters:

Name	Type	Description	Default
`reliability_data`	`dict[str, list[Label \| None]]`	Dictionary mapping rater IDs to their ratings. Each rater's ratings list must have same length (use None for missing values).	required
`metric`	`str`	Distance metric to use: "nominal" for categorical data (default), "ordinal" for ordered categories, "interval" for interval-scaled data, or "ratio" for ratio-scaled data.	`"nominal"`

Returns:

Type	Description
`float`	Krippendorff's alpha coefficient (1.0 = perfect agreement, 0.0 = chance agreement, < 0.0 = systematic disagreement).

Raises:

Type	Description
`ValueError`	If reliability_data is empty or rater lists have different lengths.

Examples:

>>> # 3 raters, 5 items (with one missing value)
>>> data = {
...     'rater1': [1, 2, 3, 4, 5],
...     'rater2': [1, 2, 3, 4, 5],
...     'rater3': [1, 2, None, 4, 5]
... }
>>> alpha = InterAnnotatorMetrics.krippendorff_alpha(data)
>>> alpha > 0.8  # High agreement
True

`pairwise_agreement(ratings: dict[str, list[Label]]) -> dict[str, dict[str, float]]` `staticmethod` ¶

Compute pairwise agreement metrics for all rater pairs.

Parameters:

Name	Type	Description	Default
`ratings`	`dict[str, list[Label]]`	Dictionary mapping rater IDs to their ratings.	required

Returns:

Type	Description
`dict[str, dict[str, float]]`	Nested dictionary with structure: { 'percentage_agreement': {('rater1', 'rater2'): 0.85, ...}, 'cohens_kappa': {('rater1', 'rater2'): 0.75, ...} }

Examples:

>>> ratings = {
...     'rater1': [1, 2, 3],
...     'rater2': [1, 2, 3],
...     'rater3': [1, 2, 2]
... }
>>> result = InterAnnotatorMetrics.pairwise_agreement(ratings)
>>> result['percentage_agreement'][('rater1', 'rater2')]
1.0
>>> result['cohens_kappa'][('rater1', 'rater2')]
1.0
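
The pairing logic behind a result like the one above can be sketched with `itertools.combinations`. This covers percentage agreement only, and the choice to order each key by sorted rater ID is an assumption; both function names are hypothetical and the code is not taken from the library.

```python
from itertools import combinations


def percentage_agreement_sketch(rater1, rater2):
    """Fraction of items on which two raters gave the same label."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("rater lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)


def pairwise_percentage_agreement(ratings):
    """Agreement for every unordered pair of raters, keyed by (id1, id2)."""
    # combinations() yields each unordered rater pair exactly once.
    return {
        (id1, id2): percentage_agreement_sketch(ratings[id1], ratings[id2])
        for id1, id2 in combinations(sorted(ratings), 2)
    }


ratings = {"rater1": [1, 2, 3], "rater2": [1, 2, 3], "rater3": [1, 2, 2]}
result = pairwise_percentage_agreement(ratings)
print(result[("rater1", "rater2")])  # 1.0
print(len(result))                   # 3
```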

Per-Annotator Reliability¶

`reliability` ¶

Per-annotator reliability summaries.

Sits next to `bead.evaluation.InterAnnotatorMetrics`. Where the inter-annotator metrics quantify agreement across raters, this module quantifies response diversity of each individual rater. Low within-annotator entropy is a flag that the annotator is collapsing the response space (always picking "yes", always picking the midpoint, and so on), which biases agreement metrics in misleading directions.

The canonical input is a sequence of `AnnotationRecord` instances, each carrying an `annotator_id`, `item_id`, `response_label`, and `question_name`. The Shannon entropy of each annotator's per-question response distribution is computed in bits.

`AnnotationRecord` ¶

Bases: BeadBaseModel

A single annotator response.

Canonical record shape consumed by reliability and inter-annotator metrics. Conforms structurally to `bead.protocol.diagnostics.RecordLike`.

Attributes:

Name	Type	Description
`annotator_id`	`str`	Identifier of the annotator who produced the response.
`item_id`	`str`	Identifier of the annotation item.
`question_name`	`str`	Anchor name of the question that was answered.
`response_label`	`str`	The annotator's response label (must be one of the labels of the corresponding `ResponseEncoding`).

`AnnotatorReliability` ¶

Bases: BeadBaseModel

Per-annotator reliability summary.

Captures how diverse a single annotator's responses are within each question. Low entropy means the annotator collapses the response space.

Attributes:

Name	Type	Description
`annotator_id`	`str`	The annotator's identifier.
`n_responses`	`int`	Total responses from this annotator across all questions.
`response_distribution`	`dict[str, dict[str, int]]`	Per-question distribution of responses, keyed by anchor name and then by response label, with counts as values.
`entropy_per_question`	`dict[str, float]`	Per-question Shannon entropy in bits. `0.0` when the annotator only used one label for that question.

Examples:

>>> rel = AnnotatorReliability(
...     annotator_id="ann_1",
...     n_responses=4,
...     response_distribution={
...         "completion": {"yes": 2, "no": 2},
...     },
...     entropy_per_question={"completion": 1.0},
... )
>>> rel.entropy("completion")
1.0
>>> rel.entropy("missing") is None
True

`entropy(question_name: str) -> float | None` ¶

Return the Shannon entropy for one question, or None.

Parameters:

Name	Type	Description	Default
`question_name`	`str`	Anchor name to look up.	required

Returns:

Type	Description
`float \| None`	Entropy in bits, or `None` if no responses were recorded for this question.

`annotator_reliability(records: Sequence[AnnotationRecord], encodings: Mapping[str, ResponseEncoding] | None = None) -> tuple[AnnotatorReliability, ...]` ¶

Compute per-annotator reliability summaries.

Groups records by annotator, then by question, and computes Shannon entropy in bits on each annotator-question label distribution. When encodings is supplied, response labels not present in the encoding for a question are silently skipped (a common case after schema evolution).
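
The grouping and entropy computation described above can be sketched without the library's models. In this illustration records are plain `(annotator_id, question_name, response_label)` tuples rather than `AnnotationRecord` instances, the function names are hypothetical, and the optional encoding filter is omitted.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy_bits(labels):
    """Shannon entropy, in bits, of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    # H = sum(p * log2(1 / p)); a single-label distribution gives 0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_by_annotator(records):
    """Group by annotator, then by question, and compute entropy per group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for annotator_id, question_name, response_label in records:
        grouped[annotator_id][question_name].append(response_label)
    return {
        annotator_id: {
            question: shannon_entropy_bits(labels)
            for question, labels in questions.items()
        }
        for annotator_id, questions in sorted(grouped.items())
    }


records = [
    ("a1", "q", "yes"),
    ("a1", "q", "no"),
    ("a2", "q", "yes"),
    ("a2", "q", "yes"),
]
print(entropy_by_annotator(records))
# {'a1': {'q': 1.0}, 'a2': {'q': 0.0}}
```

An even split over two labels yields exactly 1 bit, while an annotator who always answers the same way yields 0 bits.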

Parameters:

Name	Type	Description	Default
`records`	`Sequence[AnnotationRecord]`	All records across questions and annotators.	required
`encodings`	`Mapping[str, ResponseEncoding] \| None`	Per-question encodings used to filter unrecognized labels. When `None` (the default), every label is counted.	`None`

Returns:

Type	Description
`tuple[AnnotatorReliability, ...]`	One summary per annotator, sorted by annotator id.

Examples:

>>> records = [
...     AnnotationRecord(annotator_id="a1", item_id="i1",
...                      question_name="q", response_label="yes"),
...     AnnotationRecord(annotator_id="a1", item_id="i2",
...                      question_name="q", response_label="no"),
...     AnnotationRecord(annotator_id="a2", item_id="i1",
...                      question_name="q", response_label="yes"),
...     AnnotationRecord(annotator_id="a2", item_id="i2",
...                      question_name="q", response_label="yes"),
... ]
>>> profiles = annotator_reliability(records)
>>> [(p.annotator_id, p.entropy("q")) for p in profiles]
[('a1', 1.0), ('a2', 0.0)]

`low_entropy_annotators(profiles: Sequence[AnnotatorReliability], *, threshold: float, question_name: str | None = None, require_min_responses: int = 1) -> tuple[str, ...]` ¶

Return annotator ids whose entropy falls at or below a threshold.

Useful for flagging annotators who collapse the response space. When question_name is supplied, the threshold is checked against that one question's entropy; otherwise it is checked against the minimum per-question entropy across every question the annotator answered.
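
The selection rule described above can be sketched as follows. Profiles are represented here as plain dictionaries instead of `AnnotatorReliability` objects, `low_entropy_ids` is a hypothetical name, and skipping annotators that have no entropy for the requested question is an assumption rather than documented behavior.

```python
def low_entropy_ids(profiles, *, threshold, question_name=None,
                    require_min_responses=1):
    """profiles: dicts with 'annotator_id', 'n_responses', 'entropy_per_question'."""
    flagged = []
    for profile in profiles:
        # Ignore annotators with too few responses to judge.
        if profile["n_responses"] < require_min_responses:
            continue
        entropies = profile["entropy_per_question"]
        if question_name is not None:
            value = entropies.get(question_name)
        else:
            # No question given: use the lowest entropy across all questions.
            value = min(entropies.values(), default=None)
        # Assumption: annotators without a usable entropy value are skipped.
        if value is not None and value <= threshold:
            flagged.append(profile["annotator_id"])
    return tuple(sorted(flagged))


profiles = [
    {"annotator_id": "a1", "n_responses": 10, "entropy_per_question": {"q": 0.0}},
    {"annotator_id": "a2", "n_responses": 10, "entropy_per_question": {"q": 0.95}},
]
print(low_entropy_ids(profiles, threshold=0.5))  # ('a1',)
```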

Parameters:

Name	Type	Description	Default
`profiles`	`Sequence[AnnotatorReliability]`	Reliability summaries to scan.	required
`threshold`	`float`	Entropy ceiling in bits. Annotators with entropy at or below this value are returned.	required
`question_name`	`str \| None`	Restrict the check to one question. Defaults to `None` (all questions, returning the minimum).	`None`
`require_min_responses`	`int`	Skip annotators whose response count is below this value. Defaults to `1`.	`1`

Returns:

Type	Description
`tuple[str, ...]`	Annotator ids meeting the criterion, sorted.

Examples:

>>> profiles = (
...     AnnotatorReliability(annotator_id="a1", n_responses=10,
...                          entropy_per_question={"q": 0.0}),
...     AnnotatorReliability(annotator_id="a2", n_responses=10,
...                          entropy_per_question={"q": 0.95}),
... )
>>> low_entropy_annotators(profiles, threshold=0.5)
('a1',)
