bead.tokenization
Configurable multilingual tokenization for span annotation and UI display.
Configuration
config
Tokenizer configuration model.
TokenizerConfig
Bases: Model
Configuration for display-level tokenization.
Attributes:
| Name | Type | Description |
|---|---|---|
| backend | TokenizerBackend | Tokenization backend to use. |
| language | str | ISO 639 language code (e.g. …). |
| model_name | str \| None | Explicit model name; auto-resolved when … |
Tokenizers
tokenizers
Concrete tokenizer implementations.
Provides display-level tokenizers for span annotation. Each tokenizer
converts raw text into a sequence of DisplayToken objects that carry
rendering metadata (space_after) for artifact-free reconstruction.
DisplayToken
Bases: Model
A word-level token with rendering metadata.
Attributes:
| Name | Type | Description |
|---|---|---|
| text | str | The token text. |
| space_after | bool | Whether whitespace follows this token in the original text. |
| start_char | int | Character offset of the token start in the original text. |
| end_char | int | Character offset of the token end in the original text. |
TokenizedText
Bases: Model
Result of display-level tokenization.
Attributes:
| Name | Type | Description |
|---|---|---|
| tokens | tuple[DisplayToken, ...] | The sequence of display tokens. |
| original_text | str | The original input text. |
token_texts: tuple[str, ...] (property)
Plain token strings (for Item.tokenized_elements).
space_after_flags: tuple[bool, ...] (property)
Per-token space_after flags (for Item.token_space_after).
render() -> str
Reconstruct display text from tokens with correct spacing.
Guarantees identical rendering to original when round-tripped.
Returns:
| Type | Description |
|---|---|
| str | Reconstructed text. |
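The reconstruction described for render() can be sketched as follows. This is a minimal illustration and not bead's implementation: `Token` is a hypothetical stand-in for DisplayToken that keeps only `text` and `space_after`, and the sketch assumes each flagged token is followed by exactly one space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for DisplayToken (text and space_after only)."""

    text: str
    space_after: bool


def render(tokens: tuple[Token, ...]) -> str:
    """Join token texts, emitting a single space after each flagged token."""
    return "".join(t.text + (" " if t.space_after else "") for t in tokens)


tokens = (
    Token("Hello", False),
    Token(",", True),
    Token("world", False),
    Token("!", False),
)
rendered = render(tokens)
```

Under that single-space assumption, punctuation stays attached to the preceding word and no stray spaces appear. Text containing runs of whitespace would need the character offsets as well to round-trip exactly.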
WhitespaceTokenizer
Simple whitespace-split tokenizer.
Fallback for pre-tokenized text or languages not supported by spaCy
or Stanza. Splits on whitespace boundaries and infers space_after
from the original character offsets.
__call__(text: str) -> TokenizedText
Tokenize text by splitting on whitespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text. | required |
Returns:
| Type | Description |
|---|---|
| TokenizedText | Tokenized result. |
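One way to split on whitespace while keeping character offsets is to scan for runs of non-whitespace characters. The sketch below is an illustration under assumptions, not the library's code: `Token` and `whitespace_tokenize` are hypothetical names, and `end_char` is treated as an exclusive offset.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for DisplayToken."""

    text: str
    space_after: bool
    start_char: int
    end_char: int  # assumed exclusive, as in Python slicing


def whitespace_tokenize(text: str) -> list[Token]:
    """Split on whitespace and infer space_after from the offsets."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        end = match.end()
        # Whitespace follows if the next character exists and is whitespace.
        space_after = end < len(text) and text[end].isspace()
        tokens.append(Token(match.group(), space_after, match.start(), end))
    return tokens


tokens = whitespace_tokenize("The cat  sat.")
```

Because the offsets come from the original string, `text[t.start_char:t.end_char] == t.text` holds for every token in this sketch, even when tokens are separated by more than one space.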
SpacyTokenizer
spaCy-based tokenizer.
Supports 49+ languages. Auto-resolves model from language code if
model_name is not specified. Handles punctuation attachment and
multi-word token (MWT) expansion correctly.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| language | str | ISO 639 language code. | 'en' |
| model_name | str \| None | Explicit spaCy model name. When None, uses … | None |
__call__(text: str) -> TokenizedText
Tokenize text using spaCy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text. | required |
Returns:
| Type | Description |
|---|---|
| TokenizedText | Tokenized result with correct … |
StanzaTokenizer
Stanza-based tokenizer.
Supports 80+ languages. Handles multi-word token (MWT) expansion for languages like German, French, and Arabic. Better coverage for low-resource and morphologically rich languages.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| language | str | ISO 639 language code. | 'en' |
| model_name | str \| None | Explicit Stanza model/package name. When None, uses the default package for the language. | None |
__call__(text: str) -> TokenizedText
Tokenize text using Stanza.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text. | required |
Returns:
| Type | Description |
|---|---|
| TokenizedText | Tokenized result with correct … |
spacy_space_after(token: _SpacyTokenProtocol) -> bool
Whether whitespace follows a spaCy token in the source text.
Shared by SpacyTokenizer and SpacyParser, so the logic lives in one canonical place.
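spaCy tokens expose their trailing whitespace as the string attribute `whitespace_`, so a check of this kind can be a one-liner. The sketch below assumes that attribute is what the helper inspects, which the source does not confirm. `HasWhitespace`, `space_after`, and `FakeToken` are hypothetical names used so the example runs without spaCy installed.

```python
from dataclasses import dataclass
from typing import Protocol


class HasWhitespace(Protocol):
    """Hypothetical structural type: anything exposing spaCy's whitespace_."""

    whitespace_: str


def space_after(token: HasWhitespace) -> bool:
    """True when the token is followed by whitespace in the source text."""
    return bool(token.whitespace_)


@dataclass
class FakeToken:
    """Test double standing in for a spaCy token."""

    text: str
    whitespace_: str


flags = [space_after(t) for t in (FakeToken("Hi", " "), FakeToken("!", ""))]
```

Typing the argument as a protocol rather than a concrete spaCy class keeps the helper usable from both a tokenizer and a parser without importing spaCy at module load time.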
create_tokenizer(config: TokenizerConfig) -> Callable[[str], TokenizedText]
Return a tokenization function for the given config.
Lazy-loads the NLP backend (spaCy/Stanza) on first call.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | TokenizerConfig | Tokenizer configuration. | required |
Returns:
| Type | Description |
|---|---|
| Callable[[str], TokenizedText] | A callable that tokenizes text. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the backend is not recognized. |
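A factory with lazy loading might look like the sketch below. It reads "first call" as the first call of the returned function, which the source does not spell out, and it validates the backend name eagerly so that an unrecognized backend raises ValueError at creation time. Every name here (`KNOWN_BACKENDS`, `load_backend`, `create_lazy_tokenizer`) is hypothetical, and `str.split` stands in for an expensive NLP pipeline.

```python
from typing import Callable

KNOWN_BACKENDS = {"whitespace"}  # hypothetical backend registry
load_count = 0  # counts how often the "expensive" loader runs


def load_backend(name: str) -> Callable[[str], list[str]]:
    """Hypothetical expensive loader; here it simply returns str.split."""
    global load_count
    load_count += 1
    return str.split


def create_lazy_tokenizer(backend: str) -> Callable[[str], list[str]]:
    """Validate the backend now, but defer loading until first use."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"Unknown backend: {backend!r}")
    loaded: list[Callable[[str], list[str]]] = []

    def tokenize(text: str) -> list[str]:
        if not loaded:
            loaded.append(load_backend(backend))
        return loaded[0](text)

    return tokenize


tokenize = create_lazy_tokenizer("whitespace")
loads_before = load_count  # nothing has been loaded yet
first = tokenize("a b")
second = tokenize("c d e")
```

Deferring the load keeps importing and configuring cheap, and the closure caches the loaded backend so later calls reuse it.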
Dependency Parsing
Dependency parsers (spaCy, Stanza) produce a per-sentence ParsedSentence of
ParsedToken records. parse_to_spans then projects a parse onto the standoff
Span + SpanRelation models used by bead.items.Item. Each token becomes one
single-token Span, carrying its governor as head_index and its
upos/xpos/lemma/deprel/morphology plus character offsets in
span_metadata. Each syntactic arc becomes one directed head-to-dependent
SpanRelation labeled with the dependency relation.
parsers
Dependency parsing into standoff spans.
Provides dependency parsers (spaCy, Stanza) that produce a per-sentence
ParsedSentence of ParsedToken records (token, lemma, upos, xpos,
morphological features, head, deprel), and parse_to_spans which projects a
parse onto bead's standoff Span + SpanRelation models.
The projection is deliberately aligned with the layers linguistic
annotation model so a parse stored on an Item carries every field a layers
dependency AnnotationLayer/Annotation needs: each token becomes a
single-token Span whose head_index is its governor and whose
span_metadata carries upos/xpos/lemma/deprel/formalism/
tool plus morphological features and character offsets; each syntactic arc
becomes a directed SpanRelation from head to dependent labeled with the
dependency relation. The conventions below (Universal Dependencies labels,
head -> dependent arc direction, retained character offsets) keep that
mapping lossless without coupling bead to layers' wire format.
DependencyParser
Bases: Protocol
A callable that dependency-parses text into sentences.
Carries a tool identifier recorded in the layers-aligned provenance of
any spans projected from its output.
__call__(text: str) -> tuple[ParsedSentence, ...]
Dependency-parse text into sentences.
ParsedToken
Bases: Model
A dependency-parsed token.
A superset of DisplayToken: it adds the syntactic and morphological
fields produced by a dependency parser. Indices are sentence-local and
0-based; head is the 0-based index of the governor token, or None
for the sentence root.
Attributes:
| Name | Type | Description |
|---|---|---|
| index | int | Sentence-local 0-based token index. |
| text | str | Surface form of the token. |
| lemma | str \| None | Lemma of the token. |
| upos | str \| None | Universal part-of-speech tag. |
| xpos | str \| None | Language-specific (treebank) part-of-speech tag. |
| deprel | str \| None | Dependency relation of the token to its head. |
| head | int \| None | Sentence-local 0-based index of the governor token; … |
| morph | dict[str, str] | Morphological features (e.g. …). |
| space_after | bool | Whether whitespace follows this token in the source text. |
| start_char | int | Character offset of the token start in the sentence text. |
| end_char | int | Character offset of the token end in the sentence text. |
ParsedSentence
Bases: Model
A single dependency-parsed sentence.
Attributes:
| Name | Type | Description |
|---|---|---|
| original_text | str | The sentence text. |
| tokens | tuple[ParsedToken, ...] | The parsed tokens, in order. |
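The indexing convention (sentence-local, 0-based indices, with head set to None for the root) makes it straightforward to recover the tree structure from a flat token sequence. The sketch below uses a hypothetical `Tok` dataclass with a subset of the ParsedToken fields and hand-written values, so it is not output from any real parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tok:
    """Hypothetical stand-in carrying a few ParsedToken fields."""

    index: int
    text: str
    head: Optional[int]  # None marks the sentence root
    deprel: str


# Hand-written analysis of "Dogs bark loudly", for illustration only.
sentence = (
    Tok(0, "Dogs", 1, "nsubj"),
    Tok(1, "bark", None, "root"),
    Tok(2, "loudly", 1, "advmod"),
)

# The root is the only token without a governor.
root = next(t for t in sentence if t.head is None)

# Invert the head pointers to get each token's dependents.
dependents: dict[int, list[int]] = {t.index: [] for t in sentence}
for t in sentence:
    if t.head is not None:
        dependents[t.head].append(t.index)
```

Since head values are sentence-local, they index directly into the same sentence's token tuple and need no document-level offset.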
SpacyParser
spaCy-based dependency parser.
Loads a spaCy pipeline with tagger, parser, lemmatizer, and morphologizer
components and yields one ParsedSentence per sentence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| language | str | ISO 639 language code. | 'en' |
| model_name | str \| None | Explicit spaCy model name. When … | None |
__call__(text: str) -> tuple[ParsedSentence, ...]
Parse text into dependency-parsed sentences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text (may contain multiple sentences). | required |
Returns:
| Type | Description |
|---|---|
| tuple[ParsedSentence, ...] | One … |
StanzaParser
Stanza-based dependency parser.
Loads a Stanza pipeline with tokenize,pos,lemma,depparse processors and
yields one ParsedSentence per sentence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| language | str | ISO 639 language code. | 'en' |
| model_name | str \| None | Explicit Stanza package name. When … | None |
__call__(text: str) -> tuple[ParsedSentence, ...]
Parse text into dependency-parsed sentences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text (may contain multiple sentences). | required |
Returns:
| Type | Description |
|---|---|
| tuple[ParsedSentence, ...] | One … |
create_parser(config: TokenizerConfig) -> DependencyParser
Return a dependency-parsing function for the given config.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | TokenizerConfig | Tokenizer configuration. The … | required |
Returns:
| Type | Description |
|---|---|
| DependencyParser | A callable that dependency-parses text into sentences. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the backend cannot produce a dependency parse. |
parse_to_spans(sentence: ParsedSentence, *, element_name: str = 'text', tokenization_id: str, formalism: str = UNIVERSAL_DEPENDENCIES, tool: str) -> tuple[tuple[Span, ...], tuple[SpanRelation, ...]]
Project a parsed sentence onto standoff spans and relations.
Each token becomes a single-token Span (span_type == "token") whose
head_index is the governor index and whose span_metadata carries the
layers-aligned fields. Each non-root token contributes one directed
SpanRelation from its head (source) to itself (target), labeled
with the dependency relation. This function is the single canonical owner of
the span_id scheme and the head -> dependent arc direction.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sentence | ParsedSentence | The parsed sentence to project. | required |
| element_name | str | Rendered-element name the token indices refer to. | 'text' |
| tokenization_id | str | Stable identifier of the tokenization these tokens belong to (mirrors layers' …). | required |
| formalism | str | Dependency formalism slug (default …). | UNIVERSAL_DEPENDENCIES |
| tool | str | Identifier of the parser that produced the analysis. | required |
Returns:
| Type | Description |
|---|---|
| tuple[tuple[Span, ...], tuple[SpanRelation, ...]] | The token spans and the dependency-arc relations. |
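The projection rule (one token span per token, one head-to-dependent relation per non-root token) can be sketched with plain dictionaries. This is an illustration only: the `span_id` format, the `indices` and `label` keys, and the `project` function are hypothetical and do not reflect bead's actual Span or SpanRelation fields beyond what is described above.

```python
from typing import Optional

# (text, head, deprel) triples; head is sentence-local and 0-based,
# None for the root. Hand-written values, not real parser output.
parsed: list[tuple[str, Optional[int], str]] = [
    ("Dogs", 1, "nsubj"),
    ("bark", None, "root"),
    ("loudly", 1, "advmod"),
]


def project(parsed, tokenization_id: str):
    """Build one token span per token and one relation per non-root token."""

    def span_id(i: int) -> str:
        # Hypothetical id scheme, not the one bead uses.
        return f"{tokenization_id}:tok{i}"

    spans = [
        {
            "span_id": span_id(i),
            "span_type": "token",
            "indices": [i],
            "head_index": head,
            "span_metadata": {"deprel": deprel},
        }
        for i, (_text, head, deprel) in enumerate(parsed)
    ]
    # Arcs run from the head (source) to the dependent (target).
    relations = [
        {"source": span_id(head), "target": span_id(i), "label": deprel}
        for i, (_text, head, deprel) in enumerate(parsed)
        if head is not None
    ]
    return spans, relations


spans, relations = project(parsed, "tok-v1")
```

A sentence with n tokens and a single root therefore yields n spans and n - 1 relations, and the root is the only span whose `head_index` is None.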
Display-to-Subword Alignment
alignment
Alignment between display tokens and subword model tokens.
Maps display-token-level span indices to subword-token indices so that active learning models can consume span annotations created in display-token space.
align_display_to_subword(display_tokens: list[str], subword_tokenizer: _PreTrainedTokenizerProtocol) -> list[list[int]]
Map each display token index to its corresponding subword token indices.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| display_tokens | list[str] | Display-level token strings (word-level). | required |
| subword_tokenizer | _PreTrainedTokenizerProtocol | A HuggingFace-compatible tokenizer with … | required |
Returns:
| Type | Description |
|---|---|
| list[list[int]] | A list where … |
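One simple way to build such an alignment is to tokenize each display token on its own and hand out consecutive subword indices. The sketch below assumes that approach and a tokenizer object with a `tokenize(text)` method. `ToySubwordTokenizer` and `align` are hypothetical, and the sketch ignores special tokens and any context-dependent behavior a real subword tokenizer may have.

```python
class ToySubwordTokenizer:
    """Hypothetical stand-in: splits text into pieces of at most 3 characters."""

    def tokenize(self, text: str) -> list[str]:
        return [text[i:i + 3] for i in range(0, len(text), 3)]


def align(display_tokens: list[str], subword_tokenizer) -> list[list[int]]:
    """Map each display-token position to the subword positions it covers."""
    alignment = []
    next_index = 0
    for token in display_tokens:
        pieces = subword_tokenizer.tokenize(token)
        alignment.append(list(range(next_index, next_index + len(pieces))))
        next_index += len(pieces)
    return alignment


alignment = align(["The", "tokenizer", "works"], ToySubwordTokenizer())
```

Here the nine-character word is split into three pieces, so the second display token maps to three subword indices while the first maps to one.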
convert_span_indices(span_indices: list[int], alignment: list[list[int]]) -> list[int]
Convert display-token span indices to subword-token indices.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| span_indices | list[int] | Display-token indices forming the span. | required |
| alignment | list[list[int]] | Alignment from … | required |
Returns:
| Type | Description |
|---|---|
| list[int] | Corresponding subword-token indices. |
Raises:
| Type | Description |
|---|---|
| IndexError | If any span index is out of range of the alignment. |
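Given an alignment of that shape, the conversion amounts to looking up each display-token index and concatenating the results. The sketch below is a hedged illustration with a hypothetical `convert` function. It checks bounds explicitly, which also rejects negative indices; whether the library treats negative indices that way is an assumption.

```python
def convert(span_indices: list[int], alignment: list[list[int]]) -> list[int]:
    """Flatten the subword indices for every display token in the span."""
    subword_indices: list[int] = []
    for i in span_indices:
        if not 0 <= i < len(alignment):
            raise IndexError(
                f"Span index {i} is out of range for alignment "
                f"of length {len(alignment)}"
            )
        subword_indices.extend(alignment[i])
    return subword_indices


# Three display tokens covering 1, 3, and 2 subword tokens respectively.
alignment = [[0], [1, 2, 3], [4, 5]]
converted = convert([1, 2], alignment)
```

A span over the second and third display tokens thus expands to all five of their subword positions, in order.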