belt_nlp package

Submodules

belt_nlp.bert module

class belt_nlp.bert.BertClassifier(batch_size: int, learning_rate: float, epochs: int, accumulation_steps: int = 1, tokenizer: PreTrainedTokenizerBase | None = None, neural_network: Module | None = None, pretrained_model_name_or_path: str | None = 'bert-base-uncased', device: str = 'cuda:0', many_gpus: bool = False)

Bases: ABC

The “device” parameter can have the following values:
  • “cpu” - The model will be loaded on the CPU.

  • “cuda” - The model will be loaded on a single GPU.

  • “cuda:i” - The model will be loaded on the single GPU with index i.

It is also possible to use multiple GPUs. In order to do this:
  • Set device to “cuda”.

  • Set the many_gpus flag to True.

  • By default, all available GPUs will be used.

To use only selected GPUs, set the environment variable CUDA_VISIBLE_DEVICES.
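
For example, a minimal multi-GPU setup sketch (the GPU indices and hyperparameter values below are illustrative; BertClassifierTruncated is the concrete subclass documented in belt_nlp.bert_truncated):

   import os

   # Make only GPUs 0 and 1 visible before any CUDA context is created
   # (illustrative indices; adjust to your machine).
   os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

   from belt_nlp.bert_truncated import BertClassifierTruncated

   # device="cuda" together with many_gpus=True spreads the model
   # over all GPUs that remain visible.
   model = BertClassifierTruncated(
       batch_size=16,
       learning_rate=5e-5,
       epochs=3,
       device="cuda",
       many_gpus=True,
   )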

fit(x_train: list[str], y_train: list[bool], epochs: int | None = None) None
classmethod load(model_dir: str, device: str = 'cuda:0', many_gpus: bool = False) BertClassifier
predict(x: list[str], batch_size: int | None = None) list[tuple[bool, float]]
predict_classes(x: list[str], batch_size: int | None = None) list[bool]
predict_scores(x: list[str], batch_size: int | None = None) list[float]
save(model_dir: str) None
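
A minimal usage sketch of these methods, assuming the concrete subclass BertClassifierTruncated; the texts, labels, and model directory are illustrative:

   from belt_nlp.bert_truncated import BertClassifierTruncated

   # Toy data; in practice x_train holds full documents.
   x_train = ["a long positive document ...", "a long negative document ..."]
   y_train = [True, False]

   model = BertClassifierTruncated(batch_size=2, learning_rate=5e-5, epochs=1, device="cpu")
   model.fit(x_train, y_train)

   x_test = ["another document ..."]
   classes = model.predict_classes(x_test)   # list[bool]
   scores = model.predict_scores(x_test)     # list[float]
   pairs = model.predict(x_test)             # list[tuple[bool, float]]

   # Persist and restore the fitted classifier.
   model.save("my_model_dir")
   restored = BertClassifierTruncated.load("my_model_dir", device="cpu")
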
class belt_nlp.bert.BertClassifierNN(model: BertModel | RobertaModel)

Bases: Module

forward(input_ids: Tensor, attention_mask: Tensor) Tensor

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance instead of calling this method directly, since the former takes care of running the registered hooks while the latter silently ignores them.
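
A brief sketch of the preferred calling convention, assuming a default bert-base-uncased backbone (model name and input text are illustrative):

   from transformers import AutoTokenizer, BertModel
   from belt_nlp.bert import BertClassifierNN

   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
   net = BertClassifierNN(BertModel.from_pretrained("bert-base-uncased"))

   batch = tokenizer("a short example", return_tensors="pt")
   # Call the module instance (runs registered hooks) instead of net.forward(...).
   out = net(batch["input_ids"], batch["attention_mask"])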

class belt_nlp.bert.TokenizedDataset(tokens: BatchEncoding, labels: list | None = None)

Bases: Dataset

Dataset for tokens with optional labels.

belt_nlp.bert_truncated module

class belt_nlp.bert_truncated.BertClassifierTruncated(batch_size: int, learning_rate: float, epochs: int, accumulation_steps: int = 1, tokenizer: PreTrainedTokenizerBase | None = None, neural_network: Module | None = None, pretrained_model_name_or_path: str | None = 'bert-base-uncased', device: str = 'cuda:0', many_gpus: bool = False)

Bases: BertClassifier

belt_nlp.bert_with_pooling module

class belt_nlp.bert_with_pooling.BertClassifierWithPooling(batch_size: int, learning_rate: float, epochs: int, chunk_size: int, stride: int, minimal_chunk_length: int, pooling_strategy: str = 'mean', accumulation_steps: int = 1, maximal_text_length: int | None = None, tokenizer: PreTrainedTokenizerBase | None = None, neural_network: Module | None = None, pretrained_model_name_or_path: str | None = 'bert-base-uncased', device: str = 'cuda:0', many_gpus: bool = False)

Bases: BertClassifier

The splitting procedure is the following:
  • Tokenize the whole text (if maximal_text_length=None) or tokenize with truncation to maximal_text_length tokens.

  • Split the tokens into chunks of size chunk_size.

  • Chunks may overlap, depending on the parameter stride.

  • In other words, chunks are obtained by moving a window of size chunk_size in steps of length stride.

  • See the example in https://github.com/google-research/bert/issues/27#issuecomment-435265194.

  • The stride has the same meaning here as in convolutional neural networks.

  • The chunk_size is analogous to kernel_size in CNNs.

  • Chunks shorter than minimal_chunk_length are ignored.

After obtaining the tensor of predictions for all chunks, they are pooled into a single prediction. The aggregation function is specified by the string parameter pooling_strategy; it can be either “mean” or “max”.
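
A construction sketch for the pooled classifier; the hyperparameter values and toy data below are illustrative, not recommendations:

   from belt_nlp.bert_with_pooling import BertClassifierWithPooling

   # chunk_size=510 leaves room for the CLS and SEP tokens added to each chunk;
   # stride=256 makes consecutive chunks overlap by 254 tokens.
   model = BertClassifierWithPooling(
       batch_size=8,
       learning_rate=5e-5,
       epochs=3,
       chunk_size=510,
       stride=256,
       minimal_chunk_length=10,
       pooling_strategy="mean",   # or "max"
       maximal_text_length=None,  # tokenize the whole text
       device="cpu",
   )

   x_train = ["a very long document ...", "another very long document ..."]
   y_train = [True, False]
   model.fit(x_train, y_train)
   scores = model.predict_scores(["a new long document ..."])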

static collate_fn_pooled_tokens(data)

belt_nlp.exceptions module

exception belt_nlp.exceptions.InconsistentSplittingParamsException

Bases: Exception

belt_nlp.splitting module

belt_nlp.splitting.add_padding_tokens(input_id_chunks: list[torch.Tensor], mask_chunks: list[torch.Tensor]) None

Adds padding tokens (token id = 0) at the end of each chunk so that all chunks have exactly 512 tokens.

belt_nlp.splitting.add_special_tokens_at_beginning_and_end(input_id_chunks: list[torch.Tensor], mask_chunks: list[torch.Tensor]) None

Adds the special CLS token (token id = 101) at the beginning and the SEP token (token id = 102) at the end of each chunk. Adds the corresponding attention mask values equal to 1 (the attention mask is boolean).
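
A stand-alone sketch that mirrors the behaviour described for these two in-place helpers (the token ids are arbitrary; this is an illustration, not the library code):

   import torch

   # One chunk of content token ids and its attention mask.
   input_ids = torch.tensor([2023, 2003, 2019, 2742])
   mask = torch.ones(4, dtype=torch.long)

   # add_special_tokens_at_beginning_and_end: prepend CLS (101), append SEP (102),
   # and extend the mask with matching 1s.
   input_ids = torch.cat([torch.tensor([101]), input_ids, torch.tensor([102])])
   mask = torch.cat([torch.tensor([1]), mask, torch.tensor([1])])

   # add_padding_tokens: pad both with zeros up to exactly 512 positions.
   pad = 512 - len(input_ids)
   input_ids = torch.cat([input_ids, torch.zeros(pad, dtype=torch.long)])
   mask = torch.cat([mask, torch.zeros(pad, dtype=torch.long)])
   print(input_ids.shape, mask.shape)  # torch.Size([512]) torch.Size([512])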

belt_nlp.splitting.check_split_parameters_consistency(chunk_size: int, stride: int, minimal_chunk_length: int) None
belt_nlp.splitting.split_overlapping(tensor: Tensor, chunk_size: int, stride: int, minimal_chunk_length: int) list[torch.Tensor]

Helper function for dividing 1-dimensional tensors into overlapping chunks.
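
The windowing semantics can be illustrated with a small stand-alone sketch (this re-implements the described behaviour for illustration only and is not the library code):

   import torch

   def split_overlapping_sketch(tensor, chunk_size, stride, minimal_chunk_length):
       # Move a window of size chunk_size over the tensor in steps of stride,
       # keeping only windows with at least minimal_chunk_length elements.
       chunks = [tensor[i : i + chunk_size] for i in range(0, len(tensor), stride)]
       return [chunk for chunk in chunks if len(chunk) >= minimal_chunk_length]

   tokens = torch.arange(10)
   print(split_overlapping_sketch(tokens, chunk_size=4, stride=3, minimal_chunk_length=2))
   # Three overlapping chunks of length 4; the trailing 1-element window is dropped.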

belt_nlp.splitting.split_tokens_into_smaller_chunks(tokens: BatchEncoding, chunk_size: int, stride: int, minimal_chunk_length: int) tuple[list[torch.Tensor], list[torch.Tensor]]

Splits tokens into overlapping chunks with the given size and stride.

belt_nlp.splitting.stack_tokens_from_all_chunks(input_id_chunks: list[torch.Tensor], mask_chunks: list[torch.Tensor]) tuple[torch.Tensor, torch.Tensor]

Reshapes data to a form compatible with BERT model input.

belt_nlp.splitting.tokenize_text_with_truncation(text: str, tokenizer: PreTrainedTokenizerBase, maximal_text_length: int) BatchEncoding

Tokenizes the text with truncation to maximal_text_length and without special tokens.

belt_nlp.splitting.tokenize_whole_text(text: str, tokenizer: PreTrainedTokenizerBase) BatchEncoding

Tokenizes the entire text without truncation and without special tokens.
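
Both tokenization helpers take a Hugging Face tokenizer; a usage sketch (the model name, text, and maximal length are illustrative):

   from transformers import AutoTokenizer
   from belt_nlp.splitting import tokenize_text_with_truncation, tokenize_whole_text

   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
   text = "a document that may be much longer than 512 tokens ..."

   full = tokenize_whole_text(text, tokenizer)                   # no truncation, no CLS/SEP
   short = tokenize_text_with_truncation(text, tokenizer, 4096)  # truncated to 4096 tokens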

belt_nlp.splitting.transform_list_of_texts(texts: list[str], tokenizer: PreTrainedTokenizerBase, chunk_size: int, stride: int, minimal_chunk_length: int, maximal_text_length: int | None = None) BatchEncoding
belt_nlp.splitting.transform_single_text(text: str, tokenizer: PreTrainedTokenizerBase, chunk_size: int, stride: int, minimal_chunk_length: int, maximal_text_length: int | None) tuple[torch.Tensor, torch.Tensor]

Transforms a single text (the whole text, or its truncation to maximal_text_length) into model input for the BERT model.
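
An end-to-end sketch of turning raw texts into chunked BERT input (the parameter values and texts are illustrative):

   from transformers import AutoTokenizer
   from belt_nlp.splitting import transform_list_of_texts

   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
   texts = ["first long document ...", "second long document ..."]

   encoded = transform_list_of_texts(
       texts,
       tokenizer,
       chunk_size=510,
       stride=510,            # stride equal to chunk_size means no overlap between chunks
       minimal_chunk_length=10,
       maximal_text_length=None,
   )
   # According to the signature, encoded is a BatchEncoding holding the chunked
   # input_ids and attention_mask for every text.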

belt_nlp.tensor_utils module

belt_nlp.tensor_utils.list_of_tensors_deep_equal(list_1: list[torch.Tensor], list_2: list[torch.Tensor]) bool
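
A small sketch, assuming “deep equal” means element-wise equality of the corresponding tensors:

   import torch
   from belt_nlp.tensor_utils import list_of_tensors_deep_equal

   a = [torch.tensor([1, 2]), torch.tensor([3])]
   b = [torch.tensor([1, 2]), torch.tensor([3])]
   c = [torch.tensor([1, 2]), torch.tensor([4])]

   print(list_of_tensors_deep_equal(a, b))  # expected: True (values compared, not identity)
   print(list_of_tensors_deep_equal(a, c))  # expected: False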

Module contents