bead.evaluation
Metrics and evaluation utilities for convergence detection and inter-annotator agreement.
Convergence Detection
convergence
Convergence detection for active learning.
This module provides tools for detecting when a model has converged to human-level performance, which serves as a stopping criterion for active learning loops.
ConvergenceReport
Bases: TypedDict
Convergence report structure.
Attributes:
| Name | Type | Description |
|---|---|---|
| converged | bool | Whether the model has converged. |
| model_accuracy | float | Model's current accuracy. |
| human_agreement | float | Human agreement score. |
| gap | float | Difference between human agreement and model accuracy. |
| required_accuracy | float | Minimum accuracy required for convergence. |
| threshold | float | Convergence threshold. |
| iteration | int | Current iteration number. |
| meets_min_iterations | bool | Whether the minimum-iterations requirement is met. |
| min_iterations_required | int | Minimum iterations required before checking convergence. |
ConvergenceDetector
Detect convergence of model performance to human agreement.
This class monitors model performance and compares it to human inter-annotator agreement to determine when active learning can stop. The model is considered converged when its accuracy is at least the human agreement score minus the convergence threshold, and the minimum number of iterations has been reached.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| human_agreement_metric | str | Which inter-annotator agreement metric to use as baseline: "krippendorff_alpha" (most general; handles missing data and multiple raters), "fleiss_kappa" (multiple raters, no missing data), "cohens_kappa" (two raters only), or "percentage_agreement" (simple agreement rate). | "krippendorff_alpha" |
| convergence_threshold | float | Model must be within this threshold of human agreement to converge. For example, 0.05 means model accuracy must be >= (human_agreement - 0.05). | 0.05 |
| min_iterations | int | Minimum number of iterations before checking convergence. Prevents premature stopping. | 3 |
| statistical_test | bool | Whether to run a statistical significance test comparing the model to humans. | True |
| alpha | float | Significance level for statistical tests. | 0.05 |
Attributes:
| Name | Type | Description |
|---|---|---|
| human_agreement_metric | str | Agreement metric being used. |
| convergence_threshold | float | Threshold for convergence. |
| min_iterations | int | Minimum iterations required. |
| statistical_test | bool | Whether to run significance tests. |
| alpha | float | Significance level. |
| human_baseline | float \| None | Computed human agreement baseline (set via compute_human_baseline). |
Examples:
>>> detector = ConvergenceDetector(
... human_agreement_metric='krippendorff_alpha',
... convergence_threshold=0.05,
... min_iterations=3
... )
>>> # Compute human baseline from ratings
>>> ratings = {
... 'human1': [1, 1, 0, 1, 0],
... 'human2': [1, 1, 0, 0, 0],
... 'human3': [1, 0, 0, 1, 0]
... }
>>> detector.compute_human_baseline(ratings)
>>> detector.human_baseline > 0.0
True
>>> # Check if model converged
>>> converged = detector.check_convergence(
... model_accuracy=0.75,
... iteration=5
... )
>>> isinstance(converged, bool)
True
__init__(human_agreement_metric: str = 'krippendorff_alpha', convergence_threshold: float = 0.05, min_iterations: int = 3, statistical_test: bool = True, alpha: float = 0.05) -> None
Initialize convergence detector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| human_agreement_metric | str | Inter-annotator agreement metric to use. | 'krippendorff_alpha' |
| convergence_threshold | float | Threshold for convergence (model must be within this of human). | 0.05 |
| min_iterations | int | Minimum iterations before checking convergence. | 3 |
| statistical_test | bool | Whether to run statistical tests. | True |
| alpha | float | Significance level for tests. | 0.05 |
Raises:
| Type | Description |
|---|---|
| ValueError | If parameters are invalid. |
compute_human_baseline(human_ratings: dict[str, list[Label | None]], **kwargs: str | int | float | bool | None) -> float
Compute human inter-rater agreement baseline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| human_ratings | dict[str, list[Label \| None]] | Dictionary mapping human rater IDs to their ratings. For example: {'rater1': [1, 0, 1, ...], 'rater2': [1, 1, 1, ...]}. Missing ratings can be represented as None. | required |
| **kwargs | str \| int \| float \| bool \| None | Additional arguments passed to the agreement metric function. For example, metric='nominal' for Krippendorff's alpha. | {} |
Returns:
| Type | Description |
|---|---|
| float | Human agreement score. |
Raises:
| Type | Description |
|---|---|
| ValueError | If human_ratings is empty or has fewer than 2 raters. |
Examples:
>>> detector = ConvergenceDetector()
>>> ratings = {
... 'human1': [1, 1, 0, 1],
... 'human2': [1, 1, 0, 0],
... 'human3': [1, 0, 0, 1]
... }
>>> baseline = detector.compute_human_baseline(ratings)
>>> 0.0 <= baseline <= 1.0
True
check_convergence(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> bool
Check if model has converged to human performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_accuracy | float | Model's accuracy on the task. | required |
| iteration | int | Current iteration number (1-indexed). | required |
| human_agreement | float \| None | Human agreement score. If None, uses self.human_baseline (which must have been set via compute_human_baseline). | None |
Returns:
| Type | Description |
|---|---|
| bool | True if model has converged, False otherwise. |
Raises:
| Type | Description |
|---|---|
| ValueError | If human_agreement is None and human_baseline is not set. |
Examples:
>>> detector = ConvergenceDetector(min_iterations=2, convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> # Too early (iteration 1 < min_iterations 2)
>>> detector.check_convergence(0.79, iteration=1)
False
>>> # Still not converged (0.74 < 0.80 - 0.05)
>>> detector.check_convergence(0.74, iteration=3)
False
>>> # Converged (0.77 >= 0.80 - 0.05)
>>> detector.check_convergence(0.77, iteration=3)
True
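The rule illustrated by these examples can be sketched without the library. This is a minimal illustration, assuming the check is exactly "enough iterations have run, and accuracy is at least the human score minus the threshold"; the names has_converged and first_converged_iteration are hypothetical and not part of bead.

```python
def has_converged(model_accuracy: float, human_agreement: float, iteration: int,
                  threshold: float = 0.05, min_iterations: int = 3) -> bool:
    """Assumed rule: iteration >= min_iterations and accuracy >= human - threshold."""
    if iteration < min_iterations:
        return False  # too early to stop, regardless of accuracy
    return model_accuracy >= human_agreement - threshold


def first_converged_iteration(accuracies, human_agreement, **kwargs):
    """Return the first 1-indexed iteration that satisfies the rule, or None."""
    for iteration, accuracy in enumerate(accuracies, start=1):
        if has_converged(accuracy, human_agreement, iteration, **kwargs):
            return iteration
    return None


# Hypothetical per-iteration accuracies from an active learning loop.
stop_at = first_converged_iteration([0.79, 0.74, 0.77], 0.80, min_iterations=2)
```

In a real loop the accuracies would come from evaluating the model after each round of annotation rather than from a fixed list.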
compute_statistical_test(model_predictions: list[Label], human_consensus: list[Label], test_type: str = 'mcnemar') -> dict[str, float]
Run statistical test comparing model to human performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_predictions | list[Label] | Model's predictions. | required |
| human_consensus | list[Label] | Human consensus labels (e.g., majority vote). | required |
| test_type | str | Type of statistical test: "mcnemar" (McNemar's test for paired nominal data) or "ttest" (paired t-test; requires multiple samples). | "mcnemar" |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | Dictionary with keys 'statistic' and 'p_value'. |
Raises:
| Type | Description |
|---|---|
| ValueError | If predictions and consensus have different lengths. |
Examples:
>>> detector = ConvergenceDetector()
>>> model_preds = [1, 1, 0, 1, 0]
>>> human_consensus = [1, 1, 0, 0, 0]
>>> result = detector.compute_statistical_test(model_preds, human_consensus)
>>> 'statistic' in result and 'p_value' in result
True
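For readers unfamiliar with McNemar's test, the following is a minimal sketch of the exact (binomial) variant. It is not the library's implementation: how compute_statistical_test forms its paired table from model_predictions and human_consensus is not documented here, so the sketch assumes a separate reference label list against which two paired sets of predictions are scored. The name mcnemar_exact and its arguments are hypothetical.

```python
from math import comb


def mcnemar_exact(reference, preds_a, preds_b):
    """Exact two-sided McNemar test on the discordant pairs of two predictors."""
    if not (len(reference) == len(preds_a) == len(preds_b)):
        raise ValueError("all label lists must have the same length")
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(1 for r, x, y in zip(reference, preds_a, preds_b) if x == r and y != r)
    c = sum(1 for r, x, y in zip(reference, preds_a, preds_b) if x != r and y == r)
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of a difference.
        return {"statistic": 0.0, "p_value": 1.0}
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return {"statistic": float(k), "p_value": min(1.0, 2 * tail)}
```

The returned dictionary mirrors the 'statistic' and 'p_value' keys described above; reporting the smaller discordant count as the statistic is one common convention for the exact test.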
get_convergence_report(model_accuracy: float, iteration: int, human_agreement: float | None = None) -> ConvergenceReport
Generate convergence report with status and metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_accuracy | float | Model's current accuracy. | required |
| iteration | int | Current iteration number. | required |
| human_agreement | float \| None | Human agreement score (uses baseline if None). | None |
Returns:
| Type | Description |
|---|---|
| ConvergenceReport | Report with convergence status and metrics. |
Examples:
>>> detector = ConvergenceDetector(convergence_threshold=0.05)
>>> detector.human_baseline = 0.80
>>> report = detector.get_convergence_report(0.77, iteration=5)
>>> report['converged']
True
>>> round(report['gap'], 2)
0.03
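How the report fields relate to one another can be sketched as follows. This is an illustration only: it assumes gap is human agreement minus model accuracy and required_accuracy is human agreement minus the threshold, as the field descriptions suggest. ReportSketch and build_report are hypothetical names, not part of bead.

```python
from typing import TypedDict


class ReportSketch(TypedDict):
    converged: bool
    model_accuracy: float
    human_agreement: float
    gap: float
    required_accuracy: float
    threshold: float
    iteration: int
    meets_min_iterations: bool
    min_iterations_required: int


def build_report(model_accuracy: float, human_agreement: float, iteration: int,
                 threshold: float = 0.05, min_iterations: int = 3) -> ReportSketch:
    required = human_agreement - threshold      # assumed definition
    meets_min = iteration >= min_iterations
    return {
        "converged": meets_min and model_accuracy >= required,
        "model_accuracy": model_accuracy,
        "human_agreement": human_agreement,
        "gap": human_agreement - model_accuracy,  # assumed sign convention
        "required_accuracy": required,
        "threshold": threshold,
        "iteration": iteration,
        "meets_min_iterations": meets_min,
        "min_iterations_required": min_iterations,
    }


report = build_report(0.77, 0.80, iteration=5)
```

Because the gap is a difference of floats, comparing it after rounding (for example round(report["gap"], 2)) is safer than testing for exact equality.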
Inter-Annotator Agreement
interannotator
Inter-annotator agreement metrics.
This module provides inter-annotator agreement metrics for assessing reliability and consistency across multiple human annotators. It uses sklearn.metrics for Cohen's kappa, statsmodels for Fleiss' kappa, and the krippendorff package for Krippendorff's alpha.
InterAnnotatorMetrics
Inter-annotator agreement metrics for reliability assessment.
Provides static methods for computing various agreement metrics:
- Percentage agreement (simple)
- Cohen's kappa (2 raters, categorical)
- Fleiss' kappa (multiple raters, categorical)
- Krippendorff's alpha (general, multiple data types)
- Pairwise agreement (all pairs of raters)
Examples:
>>> # Cohen's kappa for 2 raters
>>> rater1 = [0, 1, 0, 1, 1]
>>> rater2 = [0, 1, 1, 1, 1]
>>> round(InterAnnotatorMetrics.cohens_kappa(rater1, rater2), 2)
0.55
>>> # Percentage agreement
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8
percentage_agreement(rater1: list[Label], rater2: list[Label]) -> float
staticmethod
Compute simple percentage agreement between two raters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rater1 | list[Label] | Ratings from first rater. | required |
| rater2 | list[Label] | Ratings from second rater. | required |
Returns:
| Type | Description |
|---|---|
| float | Percentage agreement (0.0 to 1.0). |
Raises:
| Type | Description |
|---|---|
| ValueError | If rater lists have different lengths. |
Examples:
>>> rater1 = [1, 2, 3, 1, 2]
>>> rater2 = [1, 2, 2, 1, 2]
>>> InterAnnotatorMetrics.percentage_agreement(rater1, rater2)
0.8
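The computation behind this metric is simple enough to sketch in a few lines. The function below is a standalone illustration, not the library's code; raising ValueError on empty input is an assumption made here to avoid dividing by zero.

```python
def percentage_agreement(rater1, rater2):
    """Fraction of items on which two raters gave the same label."""
    if len(rater1) != len(rater2):
        raise ValueError("rater lists must have the same length")
    if not rater1:
        raise ValueError("rater lists must not be empty")  # assumption of this sketch
    matches = sum(1 for a, b in zip(rater1, rater2) if a == b)
    return matches / len(rater1)


score = percentage_agreement([1, 2, 3, 1, 2], [1, 2, 2, 1, 2])
```

Note that this measure does not correct for agreement expected by chance, which is what the kappa and alpha coefficients add.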
cohens_kappa(rater1: list[Label], rater2: list[Label]) -> float
staticmethod
Compute Cohen's kappa for two raters.
Cohen's kappa measures agreement between two raters beyond chance. Values range from -1 (complete disagreement) to 1 (perfect agreement), with 0 indicating chance-level agreement.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rater1 | list[Label] | Ratings from first rater. | required |
| rater2 | list[Label] | Ratings from second rater. | required |
Returns:
| Type | Description |
|---|---|
| float | Cohen's kappa coefficient. |
Raises:
| Type | Description |
|---|---|
| ValueError | If rater lists have different lengths or are empty. |
Examples:
>>> # Perfect agreement
>>> rater1 = [0, 1, 0, 1]
>>> rater2 = [0, 1, 0, 1]
>>> InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
1.0
>>> # Complete disagreement
>>> rater1 = [0, 0, 1, 1]
>>> rater2 = [1, 1, 0, 0]
>>> kappa = InterAnnotatorMetrics.cohens_kappa(rater1, rater2)
>>> abs(kappa - (-1.0)) < 0.01
True
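To make the "agreement beyond chance" idea concrete, here is a minimal pure-Python sketch of unweighted Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from each rater's label frequencies. It is an illustration, not the sklearn-backed implementation the module uses; returning 1.0 when p_e equals 1 is a convention chosen for this sketch.

```python
from collections import Counter


def cohens_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("rater lists must be non-empty and of equal length")
    n = len(rater1)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts1, counts2 = Counter(rater1), Counter(rater2)
    expected = sum(counts1[label] * counts2[label] for label in counts1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label (sketch convention)
    return (observed - expected) / (1.0 - expected)


kappa = cohens_kappa([0, 1, 0, 1, 1], [0, 1, 1, 1, 1])
```

For these two lists the observed agreement is 0.8 and the chance agreement is 0.56, so kappa is 0.24 / 0.44, roughly 0.55.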
fleiss_kappa(ratings_matrix: np.ndarray[int, np.dtype[np.int_]]) -> float
staticmethod
Compute Fleiss' kappa for multiple raters.
Fleiss' kappa generalizes Cohen's kappa to multiple raters. It measures agreement beyond chance when multiple raters assign categorical ratings to a set of items.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ratings_matrix | ndarray | Matrix of shape (n_items, n_categories) where element [i, j] contains the number of raters who assigned item i to category j. | required |
Returns:
| Type | Description |
|---|---|
| float | Fleiss' kappa coefficient. |
Raises:
| Type | Description |
|---|---|
| ValueError | If matrix is empty or has wrong shape. |
| ImportError | If statsmodels is not installed. |
Examples:
>>> import numpy as np
>>> # 4 items, 3 categories, 5 raters each
>>> # Item 1: 3 raters chose cat 0, 2 chose cat 1, 0 chose cat 2
>>> ratings = np.array([
... [3, 2, 0], # Item 1
... [0, 0, 5], # Item 2
... [2, 3, 0], # Item 3
... [1, 1, 3], # Item 4
... ])
>>> kappa = InterAnnotatorMetrics.fleiss_kappa(ratings)
>>> 0.0 <= kappa <= 1.0
True
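The count-matrix input format and the kappa formula can be sketched in plain Python. This is an illustration under stated assumptions, not the statsmodels-backed implementation: to_count_matrix and fleiss_kappa_sketch are hypothetical names, every item is assumed to be rated by the same number of raters, and no ratings are missing.

```python
def to_count_matrix(ratings):
    """Turn {rater_id: [label per item]} into an items x categories count matrix."""
    raters = list(ratings.values())
    categories = sorted({label for rater in raters for label in rater})
    column = {category: j for j, category in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in range(len(raters[0]))]
    for rater in raters:
        for i, label in enumerate(rater):
            matrix[i][column[label]] += 1
    return matrix


def fleiss_kappa_sketch(matrix):
    """Fleiss' kappa from a count matrix with a constant number of raters per item."""
    if not matrix or not matrix[0]:
        raise ValueError("matrix must be non-empty")
    n_items, n_raters = len(matrix), sum(matrix[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in matrix):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Per-item agreement: share of rater pairs that agree on the item.
    per_item = [(sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in matrix]
    mean_agreement = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    proportions = [sum(row[j] for row in matrix) / (n_items * n_raters)
                   for j in range(len(matrix[0]))]
    chance = sum(p * p for p in proportions)
    if chance == 1.0:
        return 1.0  # only one category was ever used (sketch convention)
    return (mean_agreement - chance) / (1.0 - chance)


kappa = fleiss_kappa_sketch([[3, 2, 0], [0, 0, 5], [2, 3, 0], [1, 1, 3]])
```

For the four-item matrix from the example above, the mean per-item agreement is 0.525 and the chance agreement is 0.34, which gives a kappa of about 0.28.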
krippendorff_alpha(reliability_data: dict[str, list[Label | None]], metric: str = 'nominal') -> float
staticmethod
Compute Krippendorff's alpha for multiple raters.
Krippendorff's alpha is the most general inter-rater reliability measure. It handles:
- Any number of raters
- Missing data
- Different data types (nominal, ordinal, interval, ratio)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| reliability_data | dict[str, list[Label \| None]] | Dictionary mapping rater IDs to their ratings. Each rater's ratings list must have the same length (use None for missing values). | required |
| metric | str | Distance metric to use: "nominal" for categorical data (default), "ordinal" for ordered categories, "interval" for interval-scaled data, or "ratio" for ratio-scaled data. | "nominal" |
Returns:
| Type | Description |
|---|---|
| float | Krippendorff's alpha coefficient (1.0 = perfect agreement, 0.0 = chance agreement, < 0.0 = systematic disagreement). |
Raises:
| Type | Description |
|---|---|
| ValueError | If reliability_data is empty or rater lists have different lengths. |
Examples:
>>> # 3 raters, 5 items (with one missing value)
>>> data = {
... 'rater1': [1, 2, 3, 4, 5],
... 'rater2': [1, 2, 3, 4, 5],
... 'rater3': [1, 2, None, 4, 5]
... }
>>> alpha = InterAnnotatorMetrics.krippendorff_alpha(data)
>>> alpha > 0.8 # High agreement
True
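The way alpha tolerates missing values can be seen in a small pure-Python sketch of the nominal case, built on the standard coincidence-matrix formulation. It is an illustration only and not the krippendorff package's implementation; the name krippendorff_alpha_nominal is hypothetical, and returning 1.0 when only one label occurs is a convention of this sketch.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(data):
    """Nominal Krippendorff's alpha for {rater_id: [label or None per item]}."""
    raters = list(data.values())
    if not raters or any(len(r) != len(raters[0]) for r in raters):
        raise ValueError("need at least one rater and equal-length rating lists")
    coincidences = Counter()
    for i in range(len(raters[0])):
        values = [r[i] for r in raters if r[i] is not None]
        if len(values) < 2:
            continue  # an item with fewer than two ratings cannot be paired
        for a, b in permutations(values, 2):  # ordered pairs of distinct raters
            coincidences[(a, b)] += 1 / (len(values) - 1)
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("not enough pairable ratings")
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one label occurs, so no disagreement is possible
    return 1.0 - observed / expected


alpha = krippendorff_alpha_nominal({
    "rater1": [1, 2, 3, 4, 5],
    "rater2": [1, 2, 3, 4, 5],
    "rater3": [1, 2, None, 4, 5],
})
```

The item with a missing rating still contributes through its two remaining ratings, which is why the raters above reach an alpha of 1.0 despite the None.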
pairwise_agreement(ratings: dict[str, list[Label]]) -> dict[str, dict[str, float]]
staticmethod
Compute pairwise agreement metrics for all rater pairs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ratings | dict[str, list[Label]] | Dictionary mapping rater IDs to their ratings. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, float]] | Nested dictionary with structure: { 'percentage_agreement': {('rater1', 'rater2'): 0.85, ...}, 'cohens_kappa': {('rater1', 'rater2'): 0.75, ...} } |
Examples:
>>> ratings = {
... 'rater1': [1, 2, 3],
... 'rater2': [1, 2, 3],
... 'rater3': [1, 2, 2]
... }
>>> result = InterAnnotatorMetrics.pairwise_agreement(ratings)
>>> result['percentage_agreement'][('rater1', 'rater2')]
1.0
>>> result['cohens_kappa'][('rater1', 'rater2')]
1.0
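Enumerating rater pairs is the core of this method, and it can be sketched with itertools.combinations. The function below is a hypothetical standalone illustration that computes only percentage agreement and keys the result by rater-ID tuples, as in the example above; the library's ordering of pairs is not documented here, so sorting the rater IDs is an assumption.

```python
from itertools import combinations


def pairwise_percentage_agreement(ratings):
    """Percentage agreement for every unordered pair of raters."""
    result = {}
    for first, second in combinations(sorted(ratings), 2):
        a, b = ratings[first], ratings[second]
        if len(a) != len(b) or not a:
            raise ValueError("rating lists must be non-empty and of equal length")
        result[(first, second)] = sum(x == y for x, y in zip(a, b)) / len(a)
    return result


pairs = pairwise_percentage_agreement({
    "rater1": [1, 2, 3],
    "rater2": [1, 2, 3],
    "rater3": [1, 2, 2],
})
```

With three raters this yields three pairs; in general the number of pairs grows as k(k - 1)/2 for k raters.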
Per-Annotator Reliability
reliability
Per-annotator reliability summaries.
Sits next to bead.evaluation.InterAnnotatorMetrics. Where the inter-annotator metrics quantify agreement across raters, this module quantifies response diversity of each individual rater. Low within-annotator entropy is a flag that the annotator is collapsing the response space (always picking "yes", always picking the midpoint, and so on), which biases agreement metrics in misleading directions.
The canonical input is a sequence of AnnotationRecord instances, each carrying an annotator_id, item_id, response_label, and question_name. The Shannon entropy of each annotator's per-question response distribution is computed in bits.
AnnotationRecord
Bases: BeadBaseModel
A single annotator response.
Canonical record shape consumed by reliability and inter-annotator metrics. Conforms structurally to bead.protocol.diagnostics.RecordLike.
Attributes:
| Name | Type | Description |
|---|---|---|
| annotator_id | str | Identifier of the annotator who produced the response. |
| item_id | str | Identifier of the annotation item. |
| question_name | str | Anchor name of the question that was answered. |
| response_label | str | The annotator's response label (must be one of the labels of the corresponding :class: |
AnnotatorReliability
Bases: BeadBaseModel
Per-annotator reliability summary.
Captures how diverse a single annotator's responses are within each question. Low entropy means the annotator collapses the response space.
Attributes:
| Name | Type | Description |
|---|---|---|
| annotator_id | str | The annotator's identifier. |
| n_responses | int | Total responses from this annotator across all questions. |
| response_distribution | dict[str, dict[str, int]] | Per-question distribution of responses, keyed by anchor name and then by response label, with counts as values. |
| entropy_per_question | dict[str, float] | Per-question Shannon entropy in bits. |
Examples:
>>> rel = AnnotatorReliability(
... annotator_id="ann_1",
... n_responses=4,
... response_distribution={
... "completion": {"yes": 2, "no": 2},
... },
... entropy_per_question={"completion": 1.0},
... )
>>> rel.entropy("completion")
1.0
>>> rel.entropy("missing") is None
True
entropy(question_name: str) -> float | None
Return the Shannon entropy for one question, or None.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| question_name | str | Anchor name to look up. | required |
Returns:
| Type | Description |
|---|---|
| float \| None | Entropy in bits, or None. |
annotator_reliability(records: Sequence[AnnotationRecord], encodings: Mapping[str, ResponseEncoding] | None = None) -> tuple[AnnotatorReliability, ...]
Compute per-annotator reliability summaries.
Groups records by annotator, then by question, and computes Shannon entropy in bits on each annotator-question label distribution. When encodings is supplied, response labels not present in the encoding for a question are silently skipped (a common case after schema evolution).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | Sequence[AnnotationRecord] | All records across questions and annotators. | required |
| encodings | Mapping[str, ResponseEncoding] \| None | Per-question encodings used to filter unrecognized labels. When None, no labels are filtered. | None |
Returns:
| Type | Description |
|---|---|
| tuple[AnnotatorReliability, ...] | One summary per annotator, sorted by annotator id. |
Examples:
>>> records = [
... AnnotationRecord(annotator_id="a1", item_id="i1",
... question_name="q", response_label="yes"),
... AnnotationRecord(annotator_id="a1", item_id="i2",
... question_name="q", response_label="no"),
... AnnotationRecord(annotator_id="a2", item_id="i1",
... question_name="q", response_label="yes"),
... AnnotationRecord(annotator_id="a2", item_id="i2",
... question_name="q", response_label="yes"),
... ]
>>> profiles = annotator_reliability(records)
>>> [(p.annotator_id, p.entropy("q")) for p in profiles]
[('a1', 1.0), ('a2', 0.0)]
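The grouping and entropy computation described above can be sketched without the library's models. This is a minimal illustration, assuming records are plain (annotator_id, question_name, response_label) tuples instead of AnnotationRecord instances; shannon_entropy_bits and entropy_by_annotator are hypothetical names.

```python
from collections import Counter, defaultdict
from math import log2


def shannon_entropy_bits(counts):
    """Shannon entropy H = -sum(p * log2(p)) of a label-count mapping, in bits."""
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values() if c > 0)


def entropy_by_annotator(records):
    """Map (annotator_id, question_name) to the entropy of that label distribution."""
    distributions = defaultdict(Counter)
    for annotator_id, question_name, label in records:
        distributions[(annotator_id, question_name)][label] += 1
    return {key: shannon_entropy_bits(counts)
            for key, counts in sorted(distributions.items())}


entropies = entropy_by_annotator([
    ("a1", "q", "yes"), ("a1", "q", "no"),
    ("a2", "q", "yes"), ("a2", "q", "yes"),
])
```

An annotator who splits evenly between two labels reaches 1 bit, while one who always gives the same label has zero entropy, which is the pattern the module is designed to surface.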
low_entropy_annotators(profiles: Sequence[AnnotatorReliability], *, threshold: float, question_name: str | None = None, require_min_responses: int = 1) -> tuple[str, ...]
Return annotator ids whose entropy falls at or below a threshold.
Useful for flagging annotators who collapse the response space. When question_name is supplied, the threshold is checked against that one question's entropy; otherwise it is checked against the minimum per-question entropy across every question the annotator answered.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| profiles | Sequence[AnnotatorReliability] | Reliability summaries to scan. | required |
| threshold | float | Entropy ceiling in bits. Annotators with entropy at or below this value are returned. | required |
| question_name | str \| None | Restrict the check to one question. Defaults to None. | None |
| require_min_responses | int | Skip annotators whose response count is below this value. Defaults to 1. | 1 |
Returns:
| Type | Description |
|---|---|
| tuple[str, ...] | Annotator ids meeting the criterion, sorted. |
Examples:
>>> profiles = (
... AnnotatorReliability(annotator_id="a1", n_responses=10,
... entropy_per_question={"q": 0.0}),
... AnnotatorReliability(annotator_id="a2", n_responses=10,
... entropy_per_question={"q": 0.95}),
... )
>>> low_entropy_annotators(profiles, threshold=0.5)
('a1',)
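The selection logic described above can be sketched over plain dictionaries. This is an illustration under stated assumptions, not the library's code: flag_low_entropy is a hypothetical name, entropies maps each annotator id to its per-question entropies, n_responses maps each annotator id to a response count, and an annotator with no entropy recorded for the requested question is assumed to be skipped.

```python
def flag_low_entropy(entropies, n_responses, *, threshold,
                     question_name=None, require_min_responses=1):
    """Return sorted annotator ids whose entropy is at or below the threshold."""
    flagged = []
    for annotator_id, per_question in entropies.items():
        if n_responses.get(annotator_id, 0) < require_min_responses:
            continue  # too few responses to judge
        if question_name is not None:
            value = per_question.get(question_name)
        else:
            # Without a question, use the annotator's lowest per-question entropy.
            value = min(per_question.values(), default=None)
        if value is not None and value <= threshold:
            flagged.append(annotator_id)
    return tuple(sorted(flagged))


flagged = flag_low_entropy(
    {"a1": {"q": 0.0}, "a2": {"q": 0.95}},
    {"a1": 10, "a2": 10},
    threshold=0.5,
)
```

Raising require_min_responses is one way to avoid flagging annotators who simply have not answered enough items for their entropy to be meaningful.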