bead.evaluation¶
Metrics and evaluation utilities for convergence detection and inter-annotator agreement.
Convergence Detection¶
convergence
¶
Convergence detection for active learning.
This module provides tools for detecting when a model has converged to human-level performance, which serves as a stopping criterion for active learning loops.
ConvergenceReport
¶
Bases: TypedDict
Convergence report structure.
Attributes:

| Name | Type | Description |
|---|---|---|
| `converged` | `bool` | Whether the model has converged. |
| `model_accuracy` | `float` | Model's current accuracy. |
| `human_agreement` | `float` | Human agreement score. |
| `gap` | `float` | Difference between human agreement and model accuracy. |
| `required_accuracy` | `float` | Minimum accuracy required for convergence. |
| `threshold` | `float` | Convergence threshold. |
| `iteration` | `int` | Current iteration number. |
| `meets_min_iterations` | `bool` | Whether the minimum-iterations requirement is met. |
| `min_iterations_required` | `int` | Minimum iterations required before checking convergence. |
ConvergenceDetector
¶
Detect convergence of model performance to human agreement.
This class monitors model performance and compares it to human inter-annotator agreement to determine when active learning can stop. Convergence is achieved when the model's accuracy matches or exceeds human agreement within a specified threshold.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `human_agreement_metric` | `str` | Which inter-annotator agreement metric to use as the baseline: `"krippendorff_alpha"` (most general; handles missing data and multiple raters), `"fleiss_kappa"` (multiple raters, no missing data), `"cohens_kappa"` (two raters only), or `"percentage_agreement"` (simple agreement rate). | `"krippendorff_alpha"` |
| `convergence_threshold` | `float` | Model must be within this threshold of human agreement to converge. For example, 0.05 means model accuracy must be >= (human_agreement - 0.05). | `0.05` |
| `min_iterations` | `int` | Minimum number of iterations before checking convergence. Prevents premature stopping. | `3` |
| `statistical_test` | `bool` | Whether to run a statistical significance test comparing the model to humans. | `True` |
| `alpha` | `float` | Significance level for statistical tests. | `0.05` |
Attributes:

| Name | Type | Description |
|---|---|---|
| `human_agreement_metric` | `str` | Agreement metric being used. |
| `convergence_threshold` | `float` | Threshold for convergence. |
| `min_iterations` | `int` | Minimum iterations required. |
| `statistical_test` | `bool` | Whether to run significance tests. |
| `alpha` | `float` | Significance level. |
| `human_baseline` | `float \| None` | Computed human agreement baseline (set via `compute_human_baseline`). |
Examples:
>>> detector = ConvergenceDetector(
... human_agreement_metric='krippendorff_alpha',
... convergence_threshold=0.05,
... min_iterations=3
... )
>>> # Compute human baseline from ratings
>>> ratings = {
... 'human1': [1, 1, 0, 1, 0],
... 'human2': [1, 1, 0, 0, 0],
... 'human3': [1, 0, 0, 1, 0]
... }
>>> detector.compute_human_baseline(ratings)
>>> detector.human_baseline > 0.0
True
>>> # Check if model converged
>>> converged = detector.check_convergence(
... model_accuracy=0.75,
... iteration=5
... )
>>> isinstance(converged, bool)
True
__init__(human_agreement_metric: str = 'krippendorff_alpha', convergence_threshold: float = 0.05, min_iterations: int = 3, statistical_test: bool = True, alpha: float = 0.05) -> None
¶
Initialize convergence detector.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `human_agreement_metric` | `str` | Inter-annotator agreement metric to use. | `'krippendorff_alpha'` |
| `convergence_threshold` | `float` | Threshold for convergence (model must be within this of human agreement). | `0.05` |
| `min_iterations` | `int` | Minimum iterations before checking convergence. | `3` |
| `statistical_test` | `bool` | Whether to run statistical tests. | `True` |
| `alpha` | `float` | Significance level for tests. | `0.05` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If parameters are invalid. |
compute_human_baseline(human_ratings: dict[str, list[Label | None]], **kwargs: str | int | float | bool | None) -> float
¶
Compute human inter-rater agreement baseline.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `human_ratings` | `dict[str, list[Label \| None]]` | Dictionary mapping human rater IDs to their ratings, e.g. `{'rater1': [1, 0, 1, ...], 'rater2': [1, 1, 1, ...]}`. Missing ratings can be represented as `None`. | required |
| `**kwargs` | `str \| int \| float \| bool \| None` | Additional arguments passed to the agreement metric function, e.g. `metric='nominal'` for Krippendorff's alpha. | `{}` |

Returns:

| Type | Description |
|---|---|
| `float` | Human agreement score. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `human_ratings` is empty or has fewer than 2 raters. |
Examples:
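Usage follows the class example above. To illustrate what the baseline captures, here is a minimal stand-in that averages pairwise percentage agreement over all rater pairs, skipping positions where either rater is missing; the real method uses the configured metric (e.g. Krippendorff's alpha), and the helper name below is illustrative, not part of the API:

```python
from itertools import combinations

def pairwise_percentage_baseline(human_ratings):
    """Average percentage agreement over all rater pairs (illustrative stand-in)."""
    if len(human_ratings) < 2:
        raise ValueError("Need at least 2 raters")
    scores = []
    for (_, a), (_, b) in combinations(human_ratings.items(), 2):
        # Only compare positions where both raters provided a rating.
        pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        scores.append(sum(x == y for x, y in pairs) / len(pairs))
    return sum(scores) / len(scores)

ratings = {
    'human1': [1, 1, 0, 1, 0],
    'human2': [1, 1, 0, 0, 0],
    'human3': [1, 0, 0, 1, 0],
}
baseline = pairwise_percentage_baseline(ratings)  # (0.8 + 0.8 + 0.6) / 3 ≈ 0.733
```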
check_convergence(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> bool
¶
Check if model has converged to human performance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_accuracy` | `float` | Model's accuracy on the task. | required |
| `iteration` | `int` | Current iteration number (1-indexed). | required |
| `human_agreement` | `float \| None` | Human agreement score. If `None`, uses `self.human_baseline` (which must have been set via `compute_human_baseline`). | `None` |

Returns:

| Type | Description |
|---|---|
| `bool` | `True` if the model has converged, `False` otherwise. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `human_agreement` is `None` and `human_baseline` is not set. |
Examples:
>>> detector = ConvergenceDetector(min_iterations=2, convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> # Too early (iteration 1 < min_iterations 2)
>>> detector.check_convergence(0.79, iteration=1)
False
>>> # Still not converged (0.74 < 0.80 - 0.05)
>>> detector.check_convergence(0.74, iteration=3)
False
>>> # Converged (0.77 >= 0.80 - 0.05)
>>> detector.check_convergence(0.77, iteration=3)
True
compute_statistical_test(model_predictions: list[Label], human_consensus: list[Label], test_type: str = 'mcnemar') -> dict[str, float]
¶
Run statistical test comparing model to human performance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_predictions` | `list[Label]` | Model's predictions. | required |
| `human_consensus` | `list[Label]` | Human consensus labels (e.g., majority vote). | required |
| `test_type` | `str` | Type of statistical test: `"mcnemar"` (McNemar's test for paired nominal data) or `"ttest"` (paired t-test; requires multiple samples). | `"mcnemar"` |

Returns:

| Type | Description |
|---|---|
| `dict[str, float]` | Dictionary with keys `'statistic'` and `'p_value'`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If predictions and consensus have different lengths. |
Examples:
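As a sketch of what `test_type="mcnemar"` computes, McNemar's statistic (with continuity correction) on the paired disagreements can be written with only the standard library; for one degree of freedom, the chi-square survival function is `erfc(sqrt(x / 2))`. The helper name is illustrative, not part of the API:

```python
import math

def mcnemar_sketch(model_predictions, human_consensus):
    """McNemar's chi-square test with continuity correction (illustrative)."""
    if len(model_predictions) != len(human_consensus):
        raise ValueError("Predictions and consensus have different lengths")
    # b: model says 1 where humans say 0; c: the reverse (discordant pairs).
    b = sum(m == 1 and h == 0 for m, h in zip(model_predictions, human_consensus))
    c = sum(m == 0 and h == 1 for m, h in zip(model_predictions, human_consensus))
    if b + c == 0:
        return {'statistic': 0.0, 'p_value': 1.0}
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return {'statistic': statistic, 'p_value': p_value}

result = mcnemar_sketch([1, 0, 1, 1, 0, 1], [1, 1, 1, 0, 0, 1])
```

Here `b = c = 1`, so the corrected statistic is `(0 - 1)**2 / 2 = 0.5` and the p-value is well above any usual significance level, i.e. the model is not significantly different from the human consensus on these six items.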
get_convergence_report(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> ConvergenceReport
¶
Generate convergence report with status and metrics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_accuracy` | `float` | Model's current accuracy. | required |
| `iteration` | `int` | Current iteration number. | required |
| `human_agreement` | `float \| None` | Human agreement score (uses the stored baseline if `None`). | `None` |

Returns:

| Type | Description |
|---|---|
| `ConvergenceReport` | Report with convergence status and metrics. |
Examples:
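As an illustration of how the report fields relate, the dict below is assembled by hand using the same rule documented for `check_convergence` (model accuracy within `threshold` of human agreement, after `min_iterations`); it is a sketch, not the class's implementation:

```python
def build_report_sketch(model_accuracy, iteration, human_agreement,
                        threshold=0.05, min_iterations=3):
    """Assemble ConvergenceReport-style fields from the convergence rule (illustrative)."""
    required_accuracy = human_agreement - threshold
    meets_min = iteration >= min_iterations
    return {
        'converged': meets_min and model_accuracy >= required_accuracy,
        'model_accuracy': model_accuracy,
        'human_agreement': human_agreement,
        'gap': human_agreement - model_accuracy,
        'required_accuracy': required_accuracy,
        'threshold': threshold,
        'iteration': iteration,
        'meets_min_iterations': meets_min,
        'min_iterations_required': min_iterations,
    }

report = build_report_sketch(0.77, iteration=5, human_agreement=0.80)
# 0.77 >= 0.80 - 0.05 and iteration 5 >= 3, so 'converged' is True.
```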
Inter-Annotator Agreement¶
interannotator
¶
Inter-annotator agreement metrics.
This module provides inter-annotator agreement metrics for assessing reliability and consistency across multiple human annotators. Uses sklearn.metrics for Cohen's kappa, statsmodels for Fleiss' kappa, and the krippendorff package for Krippendorff's alpha.
InterAnnotatorMetrics
¶
Inter-annotator agreement metrics for reliability assessment.
Provides static methods for computing various agreement metrics:

- Percentage agreement (simple)
- Cohen's kappa (2 raters, categorical)
- Fleiss' kappa (multiple raters, categorical)
- Krippendorff's alpha (general, multiple data types)
- Pairwise agreement (all pairs of raters)
Examples:
>>> # Cohen's kappa for 2 raters
>>> rater1 = [0, 1, 0, 1, 1]
>>> rater2 = [0, 1, 1, 1, 1]
>>> kappa = InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
>>> abs(kappa - 0.545) < 0.01
True
>>> # Percentage agreement
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8
percentage_agreement(rater1: list[Label], rater2: list[Label]) -> float
staticmethod
¶
Compute simple percentage agreement between two raters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rater1` | `list[Label]` | Ratings from first rater. | required |
| `rater2` | `list[Label]` | Ratings from second rater. | required |

Returns:

| Type | Description |
|---|---|
| `float` | Percentage agreement (0.0 to 1.0). |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If rater lists have different lengths. |
Examples:
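The computation is simply the fraction of positions on which the two raters agree; a minimal plain-Python equivalent (the helper name is illustrative):

```python
def percentage_agreement_sketch(rater1, rater2):
    """Fraction of items on which two raters agree (illustrative equivalent)."""
    if len(rater1) != len(rater2):
        raise ValueError("Rater lists have different lengths")
    # Count matching positions and normalize by the number of items.
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)

score = percentage_agreement_sketch([0, 1, 0, 1, 1], [0, 1, 1, 1, 1])  # 4 of 5 match -> 0.8
```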
cohens_kappa(rater1: list[Label], rater2: list[Label]) -> float
staticmethod
¶
Compute Cohen's kappa for two raters.
Cohen's kappa measures agreement between two raters beyond chance. Values range from -1 (complete disagreement) to 1 (perfect agreement), with 0 indicating chance-level agreement.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rater1` | `list[Label]` | Ratings from first rater. | required |
| `rater2` | `list[Label]` | Ratings from second rater. | required |

Returns:

| Type | Description |
|---|---|
| `float` | Cohen's kappa coefficient. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If rater lists have different lengths or are empty. |
Examples:
>>> # Perfect agreement
>>> rater1 = [0, 1, 0, 1]
>>> rater2 = [0, 1, 0, 1]
>>> InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
1.0
>>> # No agreement beyond chance
>>> rater1 = [0, 0, 1, 1]
>>> rater2 = [1, 1, 0, 0]
>>> kappa = InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
>>> abs(kappa - (-1.0)) < 0.01
True
fleiss_kappa(ratings_matrix: np.ndarray[int, np.dtype[np.int_]]) -> float
staticmethod
¶
Compute Fleiss' kappa for multiple raters.
Fleiss' kappa generalizes Cohen's kappa to multiple raters. It measures agreement beyond chance when multiple raters assign categorical ratings to a set of items.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ratings_matrix` | `ndarray` | Matrix of shape `(n_items, n_categories)` where element `[i, j]` contains the number of raters who assigned item `i` to category `j`. | required |

Returns:

| Type | Description |
|---|---|
| `float` | Fleiss' kappa coefficient. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the matrix is empty or has the wrong shape. |
| `ImportError` | If statsmodels is not installed. |
Examples:
>>> # 4 items, 3 categories, 5 raters each
>>> # Item 1: 3 raters chose cat 0, 2 chose cat 1, 0 chose cat 2
>>> ratings = np.array([
... [3, 2, 0], # Item 1
... [0, 0, 5], # Item 2
... [2, 3, 0], # Item 3
... [1, 1, 3], # Item 4
... ])
>>> kappa = InterAnnotatorMetrics.fleiss_kappa(ratings)
>>> 0.0 <= kappa <= 1.0
True
krippendorff_alpha(reliability_data: dict[str, list[Label | None]], metric: str = 'nominal') -> float
staticmethod
¶
Compute Krippendorff's alpha for multiple raters.
Krippendorff's alpha is the most general inter-rater reliability measure. It handles:

- Any number of raters
- Missing data
- Different data types (nominal, ordinal, interval, ratio)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `reliability_data` | `dict[str, list[Label \| None]]` | Dictionary mapping rater IDs to their ratings. Each rater's ratings list must have the same length (use `None` for missing values). | required |
| `metric` | `str` | Distance metric to use: `"nominal"` for categorical data (default), `"ordinal"` for ordered categories, `"interval"` for interval-scaled data, or `"ratio"` for ratio-scaled data. | `"nominal"` |

Returns:

| Type | Description |
|---|---|
| `float` | Krippendorff's alpha coefficient (1.0 = perfect agreement, 0.0 = chance agreement, < 0.0 = systematic disagreement). |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `reliability_data` is empty or rater lists have different lengths. |
Examples:
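A sketch of the nominal-metric computation for the complete-data case, using the standard decomposition `alpha = 1 - D_o / D_e` (observed vs. expected disagreement). The `krippendorff` package additionally handles missing values and the other metrics; the helper below is illustrative only:

```python
from collections import Counter

def nominal_alpha_sketch(reliability_data):
    """Krippendorff's alpha, nominal metric, complete data only (illustrative)."""
    ratings = list(reliability_data.values())
    n_units = len(ratings[0])
    m = len(ratings)          # raters per unit (no missing values assumed)
    n = m * n_units           # total number of pairable values
    o_diff = 0.0              # observed within-unit mismatched pairs
    totals = Counter()        # overall category counts
    for u in range(n_units):
        counts = Counter(r[u] for r in ratings)
        totals.update(counts)
        for c in counts:
            for k in counts:
                if c != k:
                    o_diff += counts[c] * counts[k] / (m - 1)
    d_o = o_diff / n
    d_e = sum(totals[c] * totals[k]
              for c in totals for k in totals if c != k) / (n * (n - 1))
    return 1.0 - d_o / d_e

alpha_perfect = nominal_alpha_sketch({'r1': [0, 1, 0, 1], 'r2': [0, 1, 0, 1]})
alpha_opposed = nominal_alpha_sketch({'r1': [0, 0, 1, 1], 'r2': [1, 1, 0, 0]})
```

Perfect agreement yields 1.0; the systematically opposed pair yields -0.75, illustrating that values below 0 indicate worse-than-chance agreement.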
pairwise_agreement(ratings: dict[str, list[Label]]) -> dict[str, dict[str, float]]
staticmethod
¶
Compute pairwise agreement metrics for all rater pairs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ratings` | `dict[str, list[Label]]` | Dictionary mapping rater IDs to their ratings. | required |

Returns:

| Type | Description |
|---|---|
| `dict[str, dict[str, float]]` | Nested dictionary with structure `{'percentage_agreement': {('rater1', 'rater2'): 0.85, ...}, 'cohens_kappa': {('rater1', 'rater2'): 0.75, ...}}`. |
Examples:
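The pair enumeration can be sketched with `itertools.combinations`; only percentage agreement is filled in below (a `'cohens_kappa'` entry would be populated the same way per pair), and the helper name is illustrative:

```python
from itertools import combinations

def pairwise_percentage_sketch(ratings):
    """Percentage agreement for every pair of raters (illustrative)."""
    result = {'percentage_agreement': {}}
    for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
        # Fraction of items on which this pair agrees.
        agree = sum(x == y for x, y in zip(a, b)) / len(a)
        result['percentage_agreement'][(name_a, name_b)] = agree
    return result

out = pairwise_percentage_sketch({
    'rater1': [0, 1, 0, 1],
    'rater2': [0, 1, 1, 1],
    'rater3': [0, 1, 0, 0],
})
# out['percentage_agreement'] -> {('rater1', 'rater2'): 0.75,
#                                 ('rater1', 'rater3'): 0.75,
#                                 ('rater2', 'rater3'): 0.5}
```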