bead.tokenization

Configurable multilingual tokenization for span annotation and UI display.

Configuration

config

Tokenizer configuration model.

TokenizerConfig

Bases: Model

Configuration for display-level tokenization.

Attributes:

- backend (TokenizerBackend): Tokenization backend to use. spacy (default) supports 49+ languages; stanza covers 80+ languages, including morphologically rich ones; whitespace is a simple fallback for pre-tokenized text.
- language (str): ISO 639 language code (e.g. "en", "zh").
- model_name (str | None): Explicit model name; auto-resolved when None.
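
The shape of the model can be sketched with a dataclass. This is a stand-in for illustration only: the real TokenizerConfig derives from bead's Model base class, and TokenizerConfigSketch and its defaults are assumptions mirroring the attribute list above.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Allowed backend names, per the backend attribute above.
TokenizerBackend = Literal["spacy", "stanza", "whitespace"]

@dataclass(frozen=True)
class TokenizerConfigSketch:
    backend: TokenizerBackend = "spacy"  # spacy | stanza | whitespace
    language: str = "en"                 # ISO 639 code, e.g. "en", "zh"
    model_name: Optional[str] = None     # auto-resolved when None

# e.g. Stanza for German, with the model auto-resolved:
cfg = TokenizerConfigSketch(backend="stanza", language="de")
```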

Tokenizers

tokenizers

Concrete tokenizer implementations.

Provides display-level tokenizers for span annotation. Each tokenizer converts raw text into a sequence of DisplayToken objects that carry rendering metadata (space_after) for artifact-free reconstruction.

DisplayToken

Bases: Model

A word-level token with rendering metadata.

Attributes:

- text (str): The token text.
- space_after (bool): Whether whitespace follows this token in the original text.
- start_char (int): Character offset of the token start in the original text.
- end_char (int): Character offset of the token end in the original text.

TokenizedText

Bases: Model

Result of display-level tokenization.

Attributes:

- tokens (tuple[DisplayToken, ...]): The sequence of display tokens.
- original_text (str): The original input text.

token_texts: tuple[str, ...] property

Plain token strings (for Item.tokenized_elements).

space_after_flags: tuple[bool, ...] property

Per-token space_after flags (for Item.token_space_after).

render() -> str

Reconstruct display text from tokens with correct spacing.

Round-tripping is lossless: the rendered string is identical to the original input text.

Returns:

- str: Reconstructed text.
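
The round-trip guarantee follows from carrying space_after on every token. A minimal sketch of the reconstruction logic, using a hypothetical Tok stand-in for DisplayToken:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tok:  # minimal stand-in for DisplayToken
    text: str
    space_after: bool

def render(tokens: list[Tok]) -> str:
    # Join token texts, emitting a space only where the original had one.
    return "".join(t.text + (" " if t.space_after else "") for t in tokens)

tokens = [Tok("Hello", False), Tok(",", True), Tok("world", False), Tok("!", False)]
render(tokens)  # "Hello, world!"
```

Note that punctuation attaches cleanly: the comma carries space_after=True while "Hello" does not, so no stray spaces appear.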

WhitespaceTokenizer

Simple whitespace-split tokenizer.

Fallback for pre-tokenized text or languages not supported by spaCy or Stanza. Splits on whitespace boundaries and infers space_after from the original character offsets.
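
Inferring space_after from character offsets can be sketched with re.finditer; the names below are illustrative, not the actual implementation:

```python
import re

def whitespace_tokenize(text: str) -> list[dict]:
    # Each regex match gives the token text plus its character span.
    tokens = []
    for m in re.finditer(r"\S+", text):
        tokens.append({
            "text": m.group(),
            "start_char": m.start(),
            "end_char": m.end(),
            # space_after: does whitespace follow this token?
            "space_after": m.end() < len(text) and text[m.end()].isspace(),
        })
    return tokens

whitespace_tokenize("foo  bar baz")
```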

__call__(text: str) -> TokenizedText

Tokenize text by splitting on whitespace.

Parameters:

- text (str, required): Input text.

Returns:

- TokenizedText: Tokenized result.

SpacyTokenizer

spaCy-based tokenizer.

Supports 49+ languages. Auto-resolves model from language code if model_name is not specified. Handles punctuation attachment and multi-word token (MWT) expansion correctly.

Parameters:

- language (str, default 'en'): ISO 639 language code.
- model_name (str | None, default None): Explicit spaCy model name. When None, uses {language}_core_web_sm for common languages, falling back to a blank model.

__call__(text: str) -> TokenizedText

Tokenize text using spaCy.

Parameters:

- text (str, required): Input text.

Returns:

- TokenizedText: Tokenized result with correct space_after metadata.

StanzaTokenizer

Stanza-based tokenizer.

Supports 80+ languages. Handles multi-word token (MWT) expansion for languages like German, French, and Arabic. Better coverage for low-resource and morphologically rich languages.

Parameters:

- language (str, default 'en'): ISO 639 language code.
- model_name (str | None, default None): Explicit Stanza model/package name. When None, uses the default package for the language.

__call__(text: str) -> TokenizedText

Tokenize text using Stanza.

Parameters:

- text (str, required): Input text.

Returns:

- TokenizedText: Tokenized result with correct space_after metadata.

create_tokenizer(config: TokenizerConfig) -> Callable[[str], TokenizedText]

Return a tokenization function for the given config.

Lazy-loads the NLP backend (spaCy/Stanza) on first call.

Parameters:

- config (TokenizerConfig, required): Tokenizer configuration.

Returns:

- Callable[[str], TokenizedText]: A callable that tokenizes text.

Raises:

- ValueError: If the backend is not recognized.
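
The dispatch-plus-lazy-import pattern described above can be sketched as follows. This is a simplified stand-in returning plain token strings rather than TokenizedText, and the model-loading details are assumptions; only the whitespace branch is exercised here:

```python
import re
from typing import Callable

def create_tokenizer_sketch(backend: str, language: str = "en") -> Callable[[str], list[str]]:
    if backend == "whitespace":
        return lambda text: re.findall(r"\S+", text)
    if backend == "spacy":
        def tokenize(text: str) -> list[str]:
            import spacy  # lazy import: the backend loads on first call
            nlp = spacy.blank(language)
            return [t.text for t in nlp(text)]
        return tokenize
    if backend == "stanza":
        def tokenize(text: str) -> list[str]:
            import stanza  # lazy import
            nlp = stanza.Pipeline(language, processors="tokenize")
            return [w.text for s in nlp(text).sentences for w in s.words]
        return tokenize
    raise ValueError(f"Unknown tokenizer backend: {backend!r}")

tokenizer = create_tokenizer_sketch("whitespace")
tokenizer("a b c")  # ['a', 'b', 'c']
```

Deferring the spacy/stanza imports to the returned closure keeps construction cheap when the heavy backends are never invoked.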

Display-to-Subword Alignment

alignment

Alignment between display tokens and subword model tokens.

Maps display-token-level span indices to subword-token indices so that active learning models can consume span annotations created in display-token space.

align_display_to_subword(display_tokens: list[str], subword_tokenizer: _PreTrainedTokenizerProtocol) -> list[list[int]]

Map each display token index to its corresponding subword token indices.

Parameters:

- display_tokens (list[str], required): Display-level token strings (word-level).
- subword_tokenizer (_PreTrainedTokenizerProtocol, required): A HuggingFace-compatible tokenizer with __call__ and convert_ids_to_tokens methods.

Returns:

- list[list[int]]: A list where entry[i] is the list of subword token indices for display token i. Special tokens (CLS, SEP, etc.) are excluded.
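
The shape of the returned mapping can be illustrated with a toy subword splitter; the real function works against a HuggingFace-compatible tokenizer, so everything below (the splitter and the sketch function) is a hypothetical stand-in:

```python
def toy_subword_split(word: str) -> list[str]:
    # Toy WordPiece-style splitter: 4-char chunks, continuations marked "##".
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def align_display_to_subword_sketch(display_tokens: list[str]) -> list[list[int]]:
    # entry[i] holds the subword indices produced by display token i,
    # in order; special tokens (CLS/SEP) are excluded by construction.
    alignment: list[list[int]] = []
    next_idx = 0
    for word in display_tokens:
        n = len(toy_subword_split(word))
        alignment.append(list(range(next_idx, next_idx + n)))
        next_idx += n
    return alignment

align_display_to_subword_sketch(["tokenization", "is", "fun"])
# [[0, 1, 2], [3], [4]]
```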

convert_span_indices(span_indices: list[int], alignment: list[list[int]]) -> list[int]

Convert display-token span indices to subword-token indices.

Parameters:

- span_indices (list[int], required): Display-token indices forming the span.
- alignment (list[list[int]], required): Alignment from align_display_to_subword.

Returns:

- list[int]: Corresponding subword-token indices.

Raises:

- IndexError: If any span index is out of range of the alignment.
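
Conceptually, the conversion flattens the alignment entries for the selected display tokens. A sketch consistent with the signature above, not the actual implementation:

```python
def convert_span_indices(span_indices: list[int], alignment: list[list[int]]) -> list[int]:
    # Out-of-range span indices raise IndexError via the list lookup.
    return [sub for i in span_indices for sub in alignment[i]]

alignment = [[0], [1, 2], [3], [4, 5]]
convert_span_indices([1, 2], alignment)  # [1, 2, 3]
```

So a span covering display tokens 1 and 2 expands to every subword those tokens produced.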