
bead.evaluation

Metrics and evaluation utilities for convergence detection and inter-annotator agreement.

Convergence Detection

convergence

Convergence detection for active learning.

This module provides tools for detecting when a model has converged to human-level performance, which serves as a stopping criterion for active learning loops.
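As a rough illustration, the detector documented below can serve as the stopping check in an active learning loop. This is a minimal sketch assuming the class is importable as bead.evaluation.convergence.ConvergenceDetector; train_one_iteration and evaluate_accuracy are hypothetical placeholders for your own training and evaluation code.

from bead.evaluation.convergence import ConvergenceDetector

detector = ConvergenceDetector(
    human_agreement_metric='krippendorff_alpha',
    convergence_threshold=0.05,
    min_iterations=3,
)

# Baseline from the human annotations collected so far.
detector.compute_human_baseline({
    'rater1': [1, 1, 0, 1, 0],
    'rater2': [1, 1, 0, 0, 0],
    'rater3': [1, 0, 0, 1, 0],
})

iteration = 0
converged = False
while not converged:
    iteration += 1
    model = train_one_iteration()        # hypothetical: train on labeled data so far
    accuracy = evaluate_accuracy(model)  # hypothetical: accuracy on a held-out set
    converged = detector.check_convergence(accuracy, iteration=iteration)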

ConvergenceReport

Bases: TypedDict

Convergence report structure.

Attributes:

Name Type Description
converged bool

Whether model has converged.

model_accuracy float

Model's current accuracy.

human_agreement float

Human agreement score.

gap float

Difference between human agreement and model accuracy.

required_accuracy float

Minimum accuracy required for convergence.

threshold float

Convergence threshold.

iteration int

Current iteration number.

meets_min_iterations bool

Whether minimum iterations requirement is met.

min_iterations_required int

Minimum iterations required before checking convergence.
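Put together, the report is roughly equivalent to the following TypedDict sketch; the field names and meanings are taken from the attribute list above, so treat it as illustrative rather than the exact source definition.

from typing import TypedDict

class ConvergenceReport(TypedDict):
    converged: bool                # whether the model has converged
    model_accuracy: float          # model's current accuracy
    human_agreement: float         # human agreement score
    gap: float                     # human agreement minus model accuracy
    required_accuracy: float       # minimum accuracy required for convergence
    threshold: float               # convergence threshold
    iteration: int                 # current iteration number
    meets_min_iterations: bool     # whether the minimum-iterations requirement is met
    min_iterations_required: int   # minimum iterations before convergence is checked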

ConvergenceDetector

Detect convergence of model performance to human agreement.

This class monitors model performance and compares it to human inter-annotator agreement to determine when active learning can stop. Convergence is achieved when the model's accuracy matches or exceeds human agreement within a specified threshold.

Parameters:

Name Type Description Default
human_agreement_metric str

Which inter-annotator agreement metric to use as baseline:
- "krippendorff_alpha": Most general (handles missing data, multiple raters)
- "fleiss_kappa": Multiple raters, no missing data
- "cohens_kappa": Two raters only
- "percentage_agreement": Simple agreement rate

"krippendorff_alpha"
convergence_threshold float

Model must be within this threshold of human agreement to converge. For example, 0.05 means model accuracy must be >= (human_agreement - 0.05).

0.05
min_iterations int

Minimum number of iterations before checking convergence. Prevents premature stopping.

3
statistical_test bool

Whether to run statistical significance test comparing model to humans.

True
alpha float

Significance level for statistical tests.

0.05

Attributes:

Name Type Description
human_agreement_metric str

Agreement metric being used.

convergence_threshold float

Threshold for convergence.

min_iterations int

Minimum iterations required.

statistical_test bool

Whether to run significance tests.

alpha float

Significance level.

human_baseline float | None

Computed human agreement baseline (set via compute_human_baseline).

Examples:

>>> detector = ConvergenceDetector(
...     human_agreement_metric='krippendorff_alpha',
...     convergence_threshold=0.05,
...     min_iterations=3
... )
>>> # Compute human baseline from ratings
>>> ratings = {
...     'human1': [1, 1, 0, 1, 0],
...     'human2': [1, 1, 0, 0, 0],
...     'human3': [1, 0, 0, 1, 0]
... }
>>> _ = detector.compute_human_baseline(ratings)
>>> detector.human_baseline > 0.0
True
>>> # Check if model converged
>>> converged = detector.check_convergence(
...     model_accuracy=0.75,
...     iteration=5
... )
>>> isinstance(converged, bool)
True

__init__(human_agreement_metric: str = 'krippendorff_alpha', convergence_threshold: float = 0.05, min_iterations: int = 3, statistical_test: bool = True, alpha: float = 0.05) -> None

Initialize convergence detector.

Parameters:

Name Type Description Default
human_agreement_metric str

Inter-annotator agreement metric to use.

'krippendorff_alpha'
convergence_threshold float

Threshold for convergence (model must be within this of human).

0.05
min_iterations int

Minimum iterations before checking convergence.

3
statistical_test bool

Whether to run statistical tests.

True
alpha float

Significance level for tests.

0.05

Raises:

Type Description
ValueError

If parameters are invalid.

compute_human_baseline(human_ratings: dict[str, list[Label | None]], **kwargs: str | int | float | bool | None) -> float

Compute human inter-rater agreement baseline.

Parameters:

Name Type Description Default
human_ratings dict[str, list[Label | None]]

Dictionary mapping human rater IDs to their ratings. For example: {'rater1': [1, 0, 1, ...], 'rater2': [1, 1, 1, ...]}. Missing ratings can be represented as None.

required
**kwargs str | int | float | bool | None

Additional arguments passed to agreement metric function. For example, metric='nominal' for Krippendorff's alpha.

{}

Returns:

Type Description
float

Human agreement score.

Raises:

Type Description
ValueError

If human_ratings is empty or has fewer than 2 raters.

Examples:

>>> detector = ConvergenceDetector()
>>> ratings = {
...     'human1': [1, 1, 0, 1],
...     'human2': [1, 1, 0, 0],
...     'human3': [1, 0, 0, 1]
... }
>>> baseline = detector.compute_human_baseline(ratings)
>>> 0.0 <= baseline <= 1.0
True

check_convergence(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> bool

Check if model has converged to human performance.

Parameters:

Name Type Description Default
model_accuracy float

Model's accuracy on the task.

required
iteration int

Current iteration number (1-indexed).

required
human_agreement float | None

Human agreement score. If None, uses self.human_baseline (which must have been set via compute_human_baseline).

None

Returns:

Type Description
bool

True if model has converged, False otherwise.

Raises:

Type Description
ValueError

If human_agreement is None and human_baseline not set.

Examples:

>>> detector = ConvergenceDetector(min_iterations=2, convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> # Too early (iteration 1 < min_iterations 2)
>>> detector.check_convergence(0.79, iteration=1)
False
>>> # Still not converged (0.74 < 0.80 - 0.05)
>>> detector.check_convergence(0.74, iteration=3)
False
>>> # Converged (0.77 >= 0.80 - 0.05)
>>> detector.check_convergence(0.77, iteration=3)
True

compute_statistical_test(model_predictions: list[Label], human_consensus: list[Label], test_type: str = 'mcnemar') -> dict[str, float]

Run statistical test comparing model to human performance.

Parameters:

Name Type Description Default
model_predictions list[Label]

Model's predictions.

required
human_consensus list[Label]

Human consensus labels (e.g., majority vote).

required
test_type str

Type of statistical test:
- "mcnemar": McNemar's test for paired nominal data
- "ttest": Paired t-test (requires multiple samples)

"mcnemar"

Returns:

Type Description
dict[str, float]

Dictionary with keys 'statistic' and 'p_value'.

Raises:

Type Description
ValueError

If predictions and consensus have different lengths.

Examples:

>>> detector = ConvergenceDetector()
>>> model_preds = [1, 1, 0, 1, 0]
>>> human_consensus = [1, 1, 0, 0, 0]
>>> result = detector.compute_statistical_test(model_preds, human_consensus)
>>> 'statistic' in result and 'p_value' in result
True
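The documentation does not spell out how the contingency table is formed; one plausible construction for binary labels, shown here as a sketch rather than this module's actual implementation, cross-tabulates model and consensus labels and applies McNemar's test from statsmodels (the module docstring only mentions statsmodels for Fleiss' kappa, so this dependency is an assumption).

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

model_preds = [1, 1, 0, 1, 0]
human_consensus = [1, 1, 0, 0, 0]

# 2x2 cross-tabulation: table[i, j] = number of items the model labels i
# and the human consensus labels j (binary labels assumed).
table = np.zeros((2, 2), dtype=int)
for m, h in zip(model_preds, human_consensus):
    table[m, h] += 1

result = mcnemar(table, exact=True)  # exact binomial test; suitable for small samples
print(result.statistic, result.pvalue)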

get_convergence_report(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> ConvergenceReport

Generate convergence report with status and metrics.

Parameters:

Name Type Description Default
model_accuracy float

Model's current accuracy.

required
iteration int

Current iteration number.

required
human_agreement float | None

Human agreement score (uses baseline if None).

None

Returns:

Type Description
ConvergenceReport

Report with convergence status and metrics.

Examples:

>>> detector = ConvergenceDetector(convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> report = detector.get_convergence_report(0.77, iteration=5)
>>> report['converged']
True
>>> round(report['gap'], 2)
0.03

Inter-Annotator Agreement

interannotator

Inter-annotator agreement metrics.

This module provides inter-annotator agreement metrics for assessing reliability and consistency across multiple human annotators. Uses sklearn.metrics for Cohen's kappa, statsmodels for Fleiss' kappa, and krippendorff package for Krippendorff's alpha.
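For orientation, the direct calls into the libraries named above look roughly like this; the wrappers documented below add input validation and the dict-based interfaces, so treat this as a sketch of the underlying computations rather than the module's code.

import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa

# Cohen's kappa: two raters, categorical labels.
cohen_kappa_score([0, 1, 0, 1, 1], [0, 1, 1, 1, 1])

# Fleiss' kappa: counts matrix of shape (n_items, n_categories).
counts = np.array([[3, 2, 0], [0, 0, 5], [2, 3, 0], [1, 1, 3]])
fleiss_kappa(counts)

# Krippendorff's alpha: rows are raters, columns are items, np.nan marks missing data.
data = np.array([[1, 2, 3, 4, 5],
                 [1, 2, 3, 4, 5],
                 [1, 2, np.nan, 4, 5]], dtype=float)
krippendorff.alpha(reliability_data=data, level_of_measurement='nominal')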

InterAnnotatorMetrics

Inter-annotator agreement metrics for reliability assessment.

Provides static methods for computing various agreement metrics:
- Percentage agreement (simple)
- Cohen's kappa (2 raters, categorical)
- Fleiss' kappa (multiple raters, categorical)
- Krippendorff's alpha (general, multiple data types)
- Pairwise agreement (all pairs of raters)

Examples:

>>> # Cohen's kappa for 2 raters
>>> rater1 = [0, 1, 0, 1, 1]
>>> rater2 = [0, 1, 1, 1, 1]
>>> round(InterAnnotatorMetrics.cohens_kappa(rater1, rater2), 2)
0.55
>>> # Percentage agreement
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8

percentage_agreement(rater1: list[Label], rater2: list[Label]) -> float staticmethod

Compute simple percentage agreement between two raters.

Parameters:

Name Type Description Default
rater1 list[Label]

Ratings from first rater.

required
rater2 list[Label]

Ratings from second rater.

required

Returns:

Type Description
float

Percentage agreement (0.0 to 1.0).

Raises:

Type Description
ValueError

If rater lists have different lengths.

Examples:

>>> rater1 = [1, 2, 3, 1, 2]
>>> rater2 = [1, 2, 2, 1, 2]
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8

cohens_kappa(rater1: list[Label], rater2: list[Label]) -> float staticmethod

Compute Cohen's kappa for two raters.

Cohen's kappa measures agreement between two raters beyond chance. Values range from -1 (complete disagreement) to 1 (perfect agreement), with 0 indicating chance-level agreement.
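Concretely, with p_o the observed agreement rate and p_e the agreement expected by chance from each rater's label frequencies, kappa = (p_o - p_e) / (1 - p_e). In the class-level example above (rater1 = [0, 1, 0, 1, 1], rater2 = [0, 1, 1, 1, 1]), p_o = 0.8 and p_e = 0.56, so kappa = 0.24 / 0.44, about 0.55.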

Parameters:

Name Type Description Default
rater1 list[Label]

Ratings from first rater.

required
rater2 list[Label]

Ratings from second rater.

required

Returns:

Type Description
float

Cohen's kappa coefficient.

Raises:

Type Description
ValueError

If rater lists have different lengths or are empty.

Examples:

>>> # Perfect agreement
>>> rater1 = [0, 1, 0, 1]
>>> rater2 = [0, 1, 0, 1]
>>> InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
1.0
>>> # No agreement beyond chance
>>> rater1 = [0, 0, 1, 1]
>>> rater2 = [1, 1, 0, 0]
>>> kappa = InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
>>> abs(kappa - (-1.0)) < 0.01
True

fleiss_kappa(ratings_matrix: np.ndarray[int, np.dtype[np.int_]]) -> float staticmethod

Compute Fleiss' kappa for multiple raters.

Fleiss' kappa generalizes Cohen's kappa to multiple raters. It measures agreement beyond chance when multiple raters assign categorical ratings to a set of items.

Parameters:

Name Type Description Default
ratings_matrix ndarray

Matrix of shape (n_items, n_categories) where element [i, j] contains the number of raters who assigned item i to category j.

required

Returns:

Type Description
float

Fleiss' kappa coefficient.

Raises:

Type Description
ValueError

If matrix is empty or has wrong shape.

ImportError

If statsmodels is not installed.

Examples:

>>> # 4 items, 3 categories, 5 raters each
>>> # Item 1: 3 raters chose cat 0, 2 chose cat 1, 0 chose cat 2
>>> ratings = np.array([
...     [3, 2, 0],  # Item 1
...     [0, 0, 5],  # Item 2
...     [2, 3, 0],  # Item 3
...     [1, 1, 3],  # Item 4
... ])
>>> kappa = InterAnnotatorMetrics.fleiss_kappa(ratings)
>>> 0.0 <= kappa <= 1.0
True
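If you start from raw per-rater labels instead of a counts matrix, statsmodels' aggregate_raters helper can build the (n_items, n_categories) table; this sketch reproduces the counts used in the example above and is shown only for illustration, not as part of InterAnnotatorMetrics.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters

# Rows are items, columns are raters, entries are category labels (0, 1, 2).
labels = np.array([
    [0, 0, 0, 1, 1],  # item 1: three raters chose 0, two chose 1
    [2, 2, 2, 2, 2],  # item 2: all five chose 2
    [0, 0, 1, 1, 1],  # item 3
    [0, 1, 2, 2, 2],  # item 4
])

counts, categories = aggregate_raters(labels)
# counts == [[3, 2, 0], [0, 0, 5], [2, 3, 0], [1, 1, 3]], matching the matrix above
kappa = InterAnnotatorMetrics.fleiss_kappa(counts)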

krippendorff_alpha(reliability_data: dict[str, list[Label | None]], metric: str = 'nominal') -> float staticmethod

Compute Krippendorff's alpha for multiple raters.

Krippendorff's alpha is the most general inter-rater reliability measure. It handles:
- Any number of raters
- Missing data
- Different data types (nominal, ordinal, interval, ratio)

Parameters:

Name Type Description Default
reliability_data dict[str, list[Label | None]]

Dictionary mapping rater IDs to their ratings. Each rater's ratings list must have the same length (use None for missing values).

required
metric str

Distance metric to use:
- "nominal": for categorical data (default)
- "ordinal": for ordered categories
- "interval": for interval-scaled data
- "ratio": for ratio-scaled data

"nominal"

Returns:

Type Description
float

Krippendorff's alpha coefficient (1.0 = perfect agreement, 0.0 = chance agreement, < 0.0 = systematic disagreement).

Raises:

Type Description
ValueError

If reliability_data is empty or rater lists have different lengths.

Examples:

>>> # 3 raters, 5 items (with one missing value)
>>> data = {
...     'rater1': [1, 2, 3, 4, 5],
...     'rater2': [1, 2, 3, 4, 5],
...     'rater3': [1, 2, None, 4, 5]
... }
>>> alpha = InterAnnotatorMetrics.krippendorff_alpha(data)
>>> alpha > 0.8  # High agreement
True

pairwise_agreement(ratings: dict[str, list[Label]]) -> dict[str, dict[str, float]] staticmethod

Compute pairwise agreement metrics for all rater pairs.

Parameters:

Name Type Description Default
ratings dict[str, list[Label]]

Dictionary mapping rater IDs to their ratings.

required

Returns:

Type Description
dict[str, dict[str, float]]

Nested dictionary keyed first by metric name, then by rater pair:
{
    'percentage_agreement': {('rater1', 'rater2'): 0.85, ...},
    'cohens_kappa': {('rater1', 'rater2'): 0.75, ...}
}

Examples:

>>> ratings = {
...     'rater1': [1, 2, 3],
...     'rater2': [1, 2, 3],
...     'rater3': [1, 2, 2]
... }
>>> result = InterAnnotatorMetrics.pairwise_agreement(ratings)
>>> result['percentage_agreement'][('rater1', 'rater2')]
1.0
>>> result['cohens_kappa'][('rater1', 'rater2')]
1.0
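For reference, the pairwise computation amounts to looping over all unordered rater pairs and applying the two-rater metrics documented above; a minimal sketch of that idea (not the module's own implementation):

from itertools import combinations

def pairwise_agreement_sketch(ratings):
    # ratings: dict mapping rater ID -> list of labels (all lists the same length)
    result = {'percentage_agreement': {}, 'cohens_kappa': {}}
    for r1, r2 in combinations(sorted(ratings), 2):
        pair = (r1, r2)
        result['percentage_agreement'][pair] = (
            InterAnnotatorMetrics.percentage_agreement(ratings[r1], ratings[r2])
        )
        result['cohens_kappa'][pair] = (
            InterAnnotatorMetrics.cohens_kappa(ratings[r1], ratings[r2])
        )
    return result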