Tokenizers
Introduction
Tokenizers convert text inputs to numerical data.
They fall into three categories:
Word-based
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
['Jim', 'Henson', 'was', 'a', 'puppeteer']
Each word is assigned an ID, starting at 0 and going up to one less than the size of the vocabulary. We also need a custom token to represent words that are not in our vocabulary. This is known as the “unknown” token, often represented as “[UNK]” or “<unk>”.
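The ID assignment and the unknown-token fallback can be sketched in a few lines of plain Python. This is only a toy illustration: reserving ID 0 for “[UNK]” and numbering words in order of first appearance are assumptions made here, not rules of any particular library.

```python
# Toy word-based tokenizer: one ID per word, plus an "[UNK]" fallback.
corpus = ["Jim Henson was a puppeteer"]

# Assumption: ID 0 is reserved for the unknown token; the remaining IDs
# follow the order in which words first appear in the corpus.
vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def encode(text):
    """Map each whitespace-separated word to its ID, or to the [UNK] ID."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

print(vocab)
print(encode("Jim Henson was a magician"))  # "magician" is out of vocabulary
```

The last word is not in the vocabulary, so it is mapped to the ID of “[UNK]”.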
Character-based
Compared with word-based tokenization, this has two advantages:
- The vocabulary is much smaller.
- There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.
It also has two limitations:
- A single character is less meaningful in languages written in the Latin alphabet, such as English.
- We’ll end up with a very large number of tokens to be processed by our model.
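The trade-off can be seen with a rough sketch. The character vocabulary below (ASCII letters plus the space character) is an assumption chosen for illustration; a real character-level tokenizer would also cover digits, punctuation, and other scripts.

```python
import string

# Assumed toy character vocabulary: ASCII letters plus the space character.
char_vocab = {ch: i for i, ch in enumerate(string.ascii_letters + " ")}

def char_encode(text):
    """Encode text as one ID per character."""
    return [char_vocab[ch] for ch in text]

text = "Jim Henson was a puppeteer"
word_tokens = text.split()
char_ids = char_encode(text)

print("vocabulary size:", len(char_vocab))
print("word tokens:", len(word_tokens), "character tokens:", len(char_ids))

# A word never seen before still encodes without any unknown token.
print(char_encode("magician"))
```

The vocabulary stays tiny and no word is unknown, but the same sentence needs roughly five times as many tokens as the word-based split.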
Subword-based
Advantages:
- Subwords retain a lot of semantic meaning.
- This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
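As a rough illustration of how a word can be split into subwords, here is a simplified greedy longest-match splitter in the style of WordPiece. The vocabulary is hypothetical and the helper names `tokenize_word` and `tokenize` are made up for this sketch; it is not the implementation used by any real tokenizer.

```python
# Toy subword vocabulary (hypothetical); "##" marks a piece that continues a word.
VOCAB = {"[UNK]", "Using", "a", "Trans", "##former", "##s", "network",
         "is", "simple", "token", "##ization"}

def tokenize_word(word):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in VOCAB:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def tokenize(text):
    """Tokenize whitespace-separated words into subword pieces."""
    return [piece for word in text.split() for piece in tokenize_word(word)]

print(tokenize("Using a Transformer network is simple"))
print(tokenize("tokenization Transformers"))
```

Words that are not in the vocabulary as a whole, such as "Transformers", are still represented by meaningful pieces instead of a single unknown token.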
Others
Unsurprisingly, there are many more techniques out there. To name a few:
- Byte-level BPE, as used in GPT-2
- WordPiece, as used in BERT
- SentencePiece or Unigram, as used in several multilingual models
You should now have sufficient knowledge of how tokenizers work to get started with the API.
Loading and saving
Loading:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")
{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
The result is a dictionary with the following keys:
- input_ids
- token_type_ids
- attention_mask
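To make the three keys concrete, the sketch below rebuilds them by hand from plain token IDs. It assumes BERT's conventions: 101 and 102 are the IDs of the special tokens [CLS] and [SEP] in this vocabulary, 0 is the padding ID, token_type_ids mark which sentence of a pair a token belongs to, and attention_mask is 1 for real tokens and 0 for padding. The helper `build_inputs` is hypothetical and not part of the library.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token IDs (BERT conventions)

def build_inputs(ids_a, ids_b=None, max_length=None):
    """Assemble BERT-style inputs for one sequence or a pair of sequences."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)          # first segment
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)   # second segment
    attention_mask = [1] * len(input_ids)          # real tokens
    if max_length is not None:
        n_pad = max(0, max_length - len(input_ids))
        input_ids += [PAD_ID] * n_pad
        token_type_ids += [0] * n_pad
        attention_mask += [0] * n_pad              # padding is ignored
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]
print(build_inputs(ids))
print(build_inputs(ids, max_length=12)["attention_mask"])
```

With a single unpadded sequence, this reproduces the dictionary printed by the tokenizer above.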
Saving:
tokenizer.save_pretrained("directory_on_my_computer")
This saves:
- the tokenizer's algorithm (a bit like the architecture of the model)
- its vocabulary (a bit like the weights of the model)
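What saving and reloading a vocabulary amounts to can be sketched with a plain text file. The sketch assumes a one-token-per-line layout in which the line number is the token's ID, as in BERT's vocab.txt; the tokens are toy values, and the other files written by save_pretrained are not reproduced here.

```python
import os
import tempfile

tokens = ["[PAD]", "[UNK]", "Using", "a", "Trans", "##former"]  # toy vocabulary

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "vocab.txt")
    # Save: one token per line; the line number is the token's ID.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens) + "\n")
    # Load: rebuild the token -> ID mapping from the line order.
    with open(path, encoding="utf-8") as f:
        vocab = {line.rstrip("\n"): index for index, line in enumerate(f)}

print(vocab)
```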
Encoding
Translating text to numbers is known as encoding. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.
Tokenization
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)
['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
From tokens to input IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[7993, 170, 13809, 23763, 2443, 1110, 3014]
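Conceptually, the second step is a dictionary lookup. The sketch below rebuilds a tiny vocabulary from the token and ID outputs shown above; the function name `tokens_to_ids` and the unknown-token ID of 100 are assumptions made for this illustration, and a real vocabulary has tens of thousands of entries.

```python
# Token -> ID pairs copied from the outputs above.
tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]
vocab = dict(zip(tokens, ids))

UNK_ID = 100  # assumed ID of "[UNK]" in this sketch

def tokens_to_ids(tokens):
    """Look up each token in the vocabulary, falling back to the unknown ID."""
    return [vocab.get(token, UNK_ID) for token in tokens]

print(tokens_to_ids(tokens))
print(tokens_to_ids(["Trans", "##mission"]))  # second piece is not in the toy vocabulary
```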
Decoding
Decoding is going the other way around: from vocabulary indices, we want to get a string.
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
Using a transformer network is simple
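Decoding does more than map IDs back to tokens: pieces that belong to the same word have to be merged again. A minimal sketch, using the token and ID pairs from the Encoding section and a hypothetical `decode` helper (the real method also handles special tokens and punctuation spacing):

```python
# ID -> token pairs taken from the Encoding section.
id_to_token = {
    7993: "Using", 170: "a", 13809: "Trans", 23763: "##former",
    2443: "network", 1110: "is", 3014: "simple",
}

def decode(ids):
    """Map IDs to tokens and glue '##' continuation pieces onto the previous word."""
    words = []
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

print(decode([7993, 170, 13809, 23763, 2443, 1110, 3014]))
# Using a Transformer network is simple
```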
Models
Models only accept tensors as input, so text must first be preprocessed by a tokenizer.
Creating a Transformer
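The AutoModel class used in the previous tutorial can infer from the pretrained checkpoint you give it which Transformer architecture it belongs to, and load the matching class. When you already know which architecture your checkpoint uses, you can also specify the class directly, for example BERT:
# A model built from a config has randomly initialized weights, so it still has to be trained before use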
from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)
print(config)
BertConfig {
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.9.0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
Where:
- hidden_size: the size of the hidden_states vector.
- num_hidden_layers: defines the number of layers the Transformer model has.
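To get a feel for what these values imply, the sketch below estimates the number of parameters from the configuration printed above. It assumes the standard BERT encoder layout (token, position, and segment embeddings; four attention projections and a two-layer feed-forward block per layer, each with biases and a LayerNorm; a final pooler) and no task-specific head. The function `count_bert_parameters` is written for this note and is not a library call.

```python
# Values copied from the printed BertConfig.
config = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
}

def count_bert_parameters(c):
    """Estimate the parameter count, assuming the standard BERT encoder layout."""
    h, ff = c["hidden_size"], c["intermediate_size"]
    layer_norm = 2 * h  # scale and bias
    embeddings = (c["vocab_size"] + c["max_position_embeddings"]
                  + c["type_vocab_size"]) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # query, key, value, output
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + c["num_hidden_layers"] * (attention + feed_forward) + pooler

# Each attention head works on an equal slice of the hidden vector.
head_size = config["hidden_size"] // config["num_attention_heads"]
total = count_bert_parameters(config)

print("size per attention head:", head_size)
print("estimated parameters:", total)
```

Under these assumptions the estimate comes to roughly 110 million parameters, most of them in the 12 encoder layers.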
Different loading methods
# Load a pretrained model directly: no training and no explicit config are needed, the weights come from the pretrained checkpoint
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
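By default (with the transformers 4.9.0 used here), the pretrained “bert-base-cased” checkpoint above is cached in ~/.cache/huggingface/transformers. You can change the cache location by setting the HF_HOME environment variable.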
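A simplified sketch of how the cache location could be derived from HF_HOME. The function `default_cache_dir` is hypothetical and written for this note; the library's own resolution logic may consult other environment variables that are not reproduced here.

```python
import os

def default_cache_dir(env):
    """Resolve the cache directory from a mapping of environment variables."""
    # Assumption: HF_HOME replaces the ~/.cache/huggingface prefix.
    hf_home = env.get("HF_HOME", os.path.join("~", ".cache", "huggingface"))
    return os.path.expanduser(os.path.join(hf_home, "transformers"))

print(default_cache_dir({}))                       # default location
print(default_cache_dir({"HF_HOME": "/data/hf"}))  # custom location
```

Passing `os.environ` instead of a literal dictionary resolves the location for the current process.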
Saving methods
Just as from_pretrained loads a pretrained model, save_pretrained saves one:
model.save_pretrained("directory_on_my_computer")
The command above saves two files:
- config.json: the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.
- pytorch_model.bin: the state dictionary, which contains all your model’s weights.
The two files go hand in hand; the configuration is necessary to know your model’s architecture, while the model weights are your model’s parameters.
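The relationship between the two files can be illustrated with a toy stand-in. Everything here is hypothetical: the configuration values are tiny, and the weights are stored as JSON in a file called weights.json rather than in the real pytorch_model.bin format. The point is only that the shapes in the weights file must agree with the configuration.

```python
import json
import os
import tempfile

config = {"hidden_size": 4, "vocab_size": 3}  # toy values
# Stand-in for a state dictionary: parameter name -> nested list of numbers.
weights = {
    "embeddings.weight": [[0.0] * config["hidden_size"]
                          for _ in range(config["vocab_size"])]
}

with tempfile.TemporaryDirectory() as directory:
    config_path = os.path.join(directory, "config.json")
    weights_path = os.path.join(directory, "weights.json")
    # Save both files side by side.
    with open(config_path, "w") as f:
        json.dump(config, f)
    with open(weights_path, "w") as f:
        json.dump(weights, f)
    saved_files = sorted(os.listdir(directory))
    # Reload them.
    with open(config_path) as f:
        loaded_config = json.load(f)
    with open(weights_path) as f:
        loaded_weights = json.load(f)

# The configuration says what shape each parameter must have.
embedding = loaded_weights["embeddings.weight"]
shape_matches = (len(embedding), len(embedding[0])) == (
    loaded_config["vocab_size"], loaded_config["hidden_size"])

print(saved_files, shape_matches)
```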