Welcome to BELT (BERT For Longer Texts)’s documentation!
The project requires Python 3.9+ to run. We recommend training the models on a GPU, so it is necessary to install a torch version compatible with the machine. The required CUDA version depends on the machine - first, check the version of the GPU drivers with the command nvidia-smi and choose the newest CUDA version compatible with these drivers according to this table (e.g.: 11.1). Then we install torch to get the compatible build. Here, we find which torch version is compatible with the CUDA version on our machine.
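For example, assuming the drivers support CUDA 11.8, a possible install command (check the official PyTorch instructions for the wheel index matching your CUDA version) is:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118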
Another option is to use the CPU-only version of torch:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Next, we recommend installing via pip:
pip3 install belt-nlp
If you want to clone the repo in order to run tests or notebooks, you can use the requirements.txt file.
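For example (the repository URL is a placeholder - substitute the actual project repository):
git clone <repository-url>
cd <repository-directory>
pip3 install -r requirements.txt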
BERT modification for longer texts
Motivation
The BERT model can only process texts of a maximal length of 512 tokens (roughly speaking: token = word). This limit is built into the model architecture and cannot be directly changed. A discussion of this issue can be found here.
Method
A method to overcome this issue was proposed by Devlin (one of the authors of BERT) in the previously mentioned discussion: comment.
The procedure of splitting and pooling is determined by the hyperparameters of the class BertClassifierWithPooling. These are maximal_text_length, chunk_size, stride, minimal_chunk_length, and pooling_strategy.
They are used in the following way:
- The parameter maximal_text_length is used to truncate the tokens. It can be either None, which means no truncation, or an integer determining the number of tokens to consider. Standard BERT truncates to 510 tokens because it needs 2 additional tokens at the beginning and the end.
- The integer parameter chunk_size determines the size (in number of tokens) of each chunk. This parameter cannot be larger than 510; otherwise, the chunk would not fit into the input of BERT.
- Tokens may overlap depending on the parameter stride. In other words, we get chunks by moving a window of the size chunk_size by a length equal to stride. The stride cannot be bigger than the chunk size, so chunks must overlap or be adjacent to each other. The stride has the analogous meaning here to that in convolutional neural networks: chunk_size is analogous to kernel_size in 1D CNNs.
- We ignore chunks which are too small - smaller than minimal_chunk_length. This parameter cannot be set larger than chunk_size. See the example in the aforementioned comment. More examples of splitting with different sets of parameters are in test_splitting.
- The string parameter pooling_strategy is used at the end to aggregate the model results. It can be either mean or max.
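For illustration, a hypothetical instantiation of the classifier might look as follows. The splitting and pooling parameters are the ones documented above; the import path and the remaining arguments (batch_size, learning_rate, epochs) are assumptions - check the class signature for the exact interface.
from belt_nlp.bert_with_pooling import BertClassifierWithPooling

# Splitting/pooling hyperparameters as described above; other arguments are assumed.
model = BertClassifierWithPooling(
    batch_size=16,              # assumed training argument
    learning_rate=5e-5,         # assumed training argument
    epochs=3,                   # assumed training argument
    chunk_size=510,
    stride=480,
    minimal_chunk_length=128,
    maximal_text_length=None,   # no truncation
    pooling_strategy="mean",
)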
1. Preparing a single text
We follow this instruction. The main difference is that we allow the text chunks to overlap.
1. Tokenize the whole text (if maximal_text_length=None) or truncate it to the size maximal_text_length.
2. Split the tokens into chunks based on the model hyperparameters chunk_size, stride, and minimal_chunk_length.
3. For each chunk, add special tokens at the beginning and the end.
4. Add padding tokens to make all tokenized sequences the same length.
5. Stack the tensor chunks into one via torch.stack.
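A rough sketch of these steps, using the Hugging Face transformers tokenizer for illustration (this is not the library's exact code; parameter values are examples only):
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
chunk_size, stride, minimal_chunk_length = 510, 480, 16  # example values

text = "some very long document " * 500

# 1. Tokenize the whole text without special tokens or truncation.
input_ids = tokenizer(text, add_special_tokens=False, truncation=False, return_tensors="pt")["input_ids"][0]

# 2. Split into (possibly overlapping) chunks and drop chunks that are too small.
chunks = [input_ids[i:i + chunk_size] for i in range(0, len(input_ids), stride)]
chunks = [c for c in chunks if len(c) >= minimal_chunk_length]

# 3. Add the special [CLS] and [SEP] tokens to each chunk.
cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
chunks = [torch.cat([torch.tensor([cls_id]), c, torch.tensor([sep_id])]) for c in chunks]

# 4. Pad every chunk to the same length.
pad_id = tokenizer.pad_token_id
max_len = max(len(c) for c in chunks)
padded = [torch.cat([c, torch.full((max_len - len(c),), pad_id, dtype=torch.long)]) for c in chunks]

# 5. Stack the chunks into a single tensor of shape [n_chunks, max_len].
batch = torch.stack(padded)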
2. Model evaluation
The stacked tensor is then fed into the model as a mini-batch.
We get N probabilities, one for each text chunk.
We obtain the final probability by applying the aggregation function to these probabilities (this function is the mean or the maximum, depending on the hyperparameter pooling_strategy).
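For example, the aggregation step can be pictured as follows (the probability values are purely illustrative):
import torch

chunk_probabilities = torch.tensor([0.2, 0.9, 0.7])  # one probability per chunk (example values)

pooling_strategy = "mean"  # or "max"
if pooling_strategy == "mean":
    final_probability = chunk_probabilities.mean()
else:
    final_probability = chunk_probabilities.max()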
3. Fine-tuning the classifier
During training, we do the same steps as above. The crucial part is that all the operations of the type cat/stack/split/mean/max must be done on tensors with the attached gradient. That is, we use built-in torch tensor transformations; any intermediate conversions to lists or arrays are not allowed. Otherwise, the crucial backpropagation command loss.backward() won't work. More precisely, we override the standard torch training loop in the method _evaluate_single_batch in bert_with_pooling.py.
Because the number of chunks for a given input text is variable, texts after tokenization are tensors of variable length. The default torch class DataLoader cannot handle this (because it automatically tries to stack the tensors). That is why we create custom dataloaders with an overridden collate_fn method - more details can be found here.
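A minimal sketch of such a dataloader, assuming each dataset item is a pair of (chunked token tensor, label); the names below are illustrative, not the library's exact code:
import torch
from torch.utils.data import DataLoader

def collate_fn_chunked(batch):
    # Each item is (tokens, label), where tokens has shape [n_chunks, seq_len]
    # and n_chunks differs between texts, so we keep a list instead of stacking.
    tokens = [item[0] for item in batch]
    labels = torch.tensor([item[1] for item in batch])
    return tokens, labels

dataset = [
    (torch.randint(0, 100, (3, 512)), 1),  # a text split into 3 chunks
    (torch.randint(0, 100, (5, 512)), 0),  # a text split into 5 chunks
]
dataloader = DataLoader(dataset, batch_size=2, collate_fn=collate_fn_chunked)
for tokens, labels in dataloader:
    # tokens is a list of per-text chunk tensors with different first dimensions
    pass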
Remarks
Because we feed all the text chunks as one mini-batch, the procedure may use a lot of GPU memory to fit all the gradients during fine-tuning, even with batch_size=1. In this case, we recommend setting the parameter maximal_text_length to truncate longer texts. Naturally, this is a trade-off between the amount of context we want the model to see and the available resources. Setting maximal_text_length=510 is equivalent to using the standard BERT model with truncation.
Contents: