Hugging Face Transformers Tutorial Notes (3): Models and Tokenizers

2021/08/18 · NLP · ~5,202 words, about 15 minutes to read

Tokenizers

Introduction

Tokenizers convert text inputs into numerical data that a model can process.

They broadly fall into three categories:

Word-based


tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
['Jim', 'Henson', 'was', 'a', 'puppeteer']

Each word is mapped to an ID, starting at 0 and going up to the size of the vocabulary. We also need a custom token to represent words that are not in our vocabulary. This is known as the “unknown” token, often represented as ”[UNK]” or ”<unk>”.
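As a minimal sketch of the idea (the tiny vocabulary and helper below are made up purely for illustration, not part of the Transformers API), a word-based tokenizer just looks each word up and falls back to the unknown token:

# Toy word-based tokenizer -- illustrative only
vocab = {"[UNK]": 0, "Jim": 1, "Henson": 2, "was": 3, "a": 4, "puppeteer": 5}

def words_to_ids(text):
    # Split on whitespace and map each word to its ID, using [UNK] for unknown words
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

print(words_to_ids("Jim Henson was a great puppeteer"))
# [1, 2, 3, 4, 0, 5] -- "great" is not in the vocabulary, so it becomes [UNK]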

Character-based


Compared with word-based tokenizers, this approach has a couple of advantages:

  • The vocabulary is much smaller.
  • There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

It also has two limitations:

  • Individual characters carry less meaning on their own for Latin-script languages like English.
  • We end up with a very large number of tokens for the model to process (see the quick comparison after this list).
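To make the second point concrete, here is a small sketch (plain Python, not a Transformers tokenizer) comparing word-level and character-level splits of the same sentence:

sequence = "Jim Henson was a puppeteer"

word_tokens = sequence.split()  # word-based: 5 tokens
char_tokens = list(sequence)    # character-based: one token per character, spaces included

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens[:10])
# 5 ['Jim', 'Henson', 'was', 'a', 'puppeteer']
# 26 ['J', 'i', 'm', ' ', 'H', 'e', 'n', 's', 'o', 'n']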

Subword-based


Advantages:

  • Subword tokens retain a lot of semantic meaning (the short demo after this list shows a typical split).
  • This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
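For example, running the WordPiece tokenizer used later in these notes (bert-base-cased) on a single word that is not in its vocabulary shows the subword split, with "##" marking a continuation of the previous token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Transformer"))
# ['Trans', '##former'] -- "Transformer" is split into two subword tokens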

Other techniques

Unsurprisingly, there are many more techniques out there. To name a few (the sketch after this list compares two of them):

  • Byte-level BPE, as used in GPT-2
  • WordPiece, as used in BERT
  • SentencePiece or Unigram, as used in several multilingual models
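As a rough comparison, the sketch below loads a byte-level BPE tokenizer and a WordPiece tokenizer (both are standard Hub checkpoint names) and tokenizes the same sentence; the exact splits differ between the two vocabularies:

from transformers import AutoTokenizer

sentence = "Using a Transformer network is simple"

bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")                   # byte-level BPE
wordpiece_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # WordPiece

print(bpe_tokenizer.tokenize(sentence))        # BPE marks word boundaries with a leading 'Ġ'
print(wordpiece_tokenizer.tokenize(sentence))  # WordPiece marks continuations with '##'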

You should now have sufficient knowledge of how tokenizers work to get started with the API.

Loading and saving

Loading:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")
{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

The result is a dictionary with the following keys (the sketch after this list shows what attention_mask is used for):

  • input_ids
  • token_type_ids
  • attention_mask
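The role of attention_mask is easiest to see when sentences of different lengths are batched together; a minimal sketch using the tokenizer loaded above, with padding enabled:

batch = tokenizer(
    ["Using a Transformer network is simple", "So simple"],
    padding=True,  # pad the shorter sentence to the length of the longest one
)
print(batch["attention_mask"])  # 1 marks real tokens, 0 marks padding tokens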

Saving:

tokenizer.save_pretrained("directory_on_my_computer")

This saves:

  • tokenizer (a bit like the architecture of the model)
  • vocabulary (a bit like the weights of the model)
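The saved tokenizer can be loaded back from that directory with from_pretrained, exactly like a checkpoint from the Hub (reusing the directory name from the example above):

from transformers import BertTokenizer

reloaded_tokenizer = BertTokenizer.from_pretrained("directory_on_my_computer")
print(reloaded_tokenizer("Using a Transformer network is simple")["input_ids"])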

Encoding

Translating text to numbers is known as encoding. Encoding is done in a two-step process: tokenization, followed by the conversion to input IDs.

Tokenization

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)
['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

From tokens to input IDs

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)
[7993, 170, 13809, 23763, 2443, 1110, 3014]
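Note that convert_tokens_to_ids does not add the model's special tokens: the input_ids shown in the Loading and saving section wrap the same IDs in 101 ([CLS]) and 102 ([SEP]). A quick check, reusing the tokenizer, sequence, and ids from above:

ids_with_special_tokens = tokenizer(sequence)["input_ids"]

print(ids)                      # [7993, 170, 13809, 23763, 2443, 1110, 3014]
print(ids_with_special_tokens)  # the same IDs wrapped by 101 ([CLS]) and 102 ([SEP])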

Decoding

Decoding is going the other way around: from vocabulary indices, we want to get a string.

decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
Using a transformer network is simple
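decode turns special tokens back into text as well; when they are not wanted, the skip_special_tokens argument drops them. A short sketch reusing ids_with_special_tokens from the check above:

print(tokenizer.decode(ids_with_special_tokens))                            # includes [CLS] ... [SEP]
print(tokenizer.decode(ids_with_special_tokens, skip_special_tokens=True))  # just the sentence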

Models

Models only accept tensors as input, which is why the preprocessing done by tokenizers is needed.
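For example, passing return_tensors="pt" asks the tokenizer for PyTorch tensors, which can be fed straight into the model. A minimal sketch using the bert-base-cased checkpoint from this note (the shapes in the comments follow from the 9 input IDs and the hidden_size of 768 seen elsewhere in this note):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

inputs = tokenizer("Using a Transformer network is simple", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(inputs["input_ids"].shape)        # torch.Size([1, 9]) -- batch of 1, 9 tokens
print(outputs.last_hidden_state.shape)  # torch.Size([1, 9, 768]) -- 768 is hidden_size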

Creating a Transformer

The AutoModel class used in the previous tutorial figures out which Transformer architecture a given pretrained checkpoint belongs to and imports the corresponding class. If you already know which architecture your pretrained model uses, you can also specify it yourself, for example BERT:

# By default the model's weights are randomly initialized, so it would still need to be trained
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)
print(config)
BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.9.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Among these attributes (overriding them is sketched right after this list):

  • hidden_size: the size of the hidden_states vectors.
  • num_hidden_layers: defines the number of layers the Transformer model has.
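Because the model is built from the config, changing these values changes the architecture; for instance, a smaller, randomly initialized BERT can be sketched by overriding a few of them (the numbers below are arbitrary, chosen only for illustration):

from transformers import BertConfig, BertModel

# A smaller-than-default BERT; weights are still randomly initialized
small_config = BertConfig(hidden_size=256, num_hidden_layers=4, num_attention_heads=4)
small_model = BertModel(small_config)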

Different loading methods

# Here we use a pretrained model directly: no training and no config needed, the weights come from the pretrained checkpoint
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

By default, the pretrained bert-base-cased checkpoint above is cached under ~/.cache/huggingface/transformers. You can change the cache location by setting the HF_HOME environment variable.
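The cache location can also be chosen per call through the cache_dir argument of from_pretrained; a small sketch (the path here is just an example):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased", cache_dir="./my_hf_cache")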

Saving methods

Analogous to loading a pretrained model with from_pretrained, saving one is done with save_pretrained:

model.save_pretrained("directory_on_my_computer")

The command above saves two files:

  • config.json: the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.
  • pytorch_model.bin: the state dictionary; it contains all your model's weights.

The two files go hand in hand; the configuration is necessary to know your model’s architecture, while the model weights are your model’s parameters.
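A model saved this way can be reloaded by pointing from_pretrained at the directory (same directory name as above):

from transformers import BertModel

reloaded_model = BertModel.from_pretrained("directory_on_my_computer")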
