Hugging Face Transformers Tutorial Notes (4): Handling Multiple Sequences

2021/08/25 NLP 6416 characters, about a 19-minute read


  • How do we handle multiple sequences?
  • How do we handle multiple sequences of different lengths?
  • Are vocabulary indices the only inputs that allow a model to work well?
  • Is there such a thing as too long a sequence?

Models expect a batch of inputs

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
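# Convert the list of IDs to a tensor
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)
This fails with an error: dimension out of range. The model expects a batch of sequences (a 2-D tensor), but we passed it a single 1-D sequence.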



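Calling the tokenizer directly returns a 2-D tensor that already has a batch dimension, and it also adds the special tokens 101 ([CLS]) and 102 ([SEP]). The hand-built input_ids is 1-D and has neither:

tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])
print(input_ids)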
tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])
tensor([ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
         2026,  2878,  2166,  1012])


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

# Add the batch dimension ourselves by wrapping the IDs in a list
input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)
Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward>)
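
The batch dimension can also be added to an existing tensor instead of wrapping the Python list. The following is a minimal sketch, assuming PyTorch is installed; the IDs are copied from the output above, and `unsqueeze(0)` is only one of several equivalent ways to do this:

```python
import torch

# Token IDs copied from the printed output above (no special tokens).
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

single = torch.tensor(ids)     # 1-D: one sequence, no batch dimension
batched = single.unsqueeze(0)  # 2-D: a batch that contains one sequence

print(single.shape)
print(batched.shape)
# Wrapping the list, as in the example above, builds the same tensor.
print(torch.equal(batched, torch.tensor([ids])))
```

Either form gives the model the (batch, sequence) layout it expects.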

Padding the inputs

Padding makes sure all our sentences have the same length by adding a special word called the padding token to the sentences with fewer values.

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
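# Pad the shorter sequence with the padding token ID (tokenizer.pad_token_id)
batched_ids = [[200, 200, 200], [200, 200, tokenizer.pad_token_id]]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)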

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

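Looking at the results, the second sequence gives different logits before and after padding. This is because the Transformer's attention layers attend to the padding token together with every other token in the sequence. To get the same result in both cases, the model has to ignore the padding token and compute no attention for it, which is what an attention mask is for.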

Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
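
To see why a masked position stops influencing the result, here is a toy sketch in plain Python. It assumes the common approach of pushing masked attention scores to negative infinity before the softmax; this illustrates the idea and is not the library's actual implementation. `masked_softmax` and the score values are made up for the example.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over `scores`; positions where mask == 0 get weight 0.

    Assumes at least one position is unmasked.
    """
    # Masked scores become -inf, so exp() maps them to exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores; the third position is padding.
scores = [2.0, 1.0, 0.5]

with_padding = masked_softmax(scores, [1, 1, 1])   # padding takes part
masked = masked_softmax(scores, [1, 1, 0])         # padding is ignored
unpadded = masked_softmax(scores[:2], [1, 1])      # sequence without padding

print(with_padding)
print(masked)
print(unpadded)
```

With the mask applied, the weights on the two real tokens match those of the unpadded sequence, and the padding position receives a weight of zero.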

Continuing with the example above, let's see how an attention mask makes the second sequence produce the same logits with and without padding:

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

# 1 = attend to this token, 0 = ignore it (the padding position)
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
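
Writing the padded IDs and the mask by hand does not scale beyond a toy example. The sketch below builds both from a list of sequences of different lengths. `pad_batch` is a hypothetical helper written for these notes, and `pad_id=0` is only a placeholder; in real code use `tokenizer.pad_token_id`.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 for real tokens, 0 for the padding that was just added
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# pad_id=0 is a placeholder; use tokenizer.pad_token_id in practice.
batched_ids, attention_mask = pad_batch([[200, 200, 200], [200, 200]], pad_id=0)
print(batched_ids)     # [[200, 200, 200], [200, 200, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

In practice the tokenizer can do this for you: calling it on a list of sentences with `padding=True` returns both `input_ids` and `attention_mask`.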

Longer sequences

Transformer models usually limit the maximum sequence length, typically to 512 or 1024 tokens. There are two ways to handle longer sequences:

  • Use a model with a longer supported sequence length.
  • Truncate your sequences.



sequence = sequence[:max_sequence_length]
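
Note that a slice like this only truncates by tokens if `sequence` holds tokens or token IDs; slicing the raw string would cut characters instead. The sketch below truncates a list of token IDs and leaves room for the two special tokens. `truncate_ids` is a hypothetical helper, and the defaults 101 and 102 are the [CLS] and [SEP] IDs seen in the tokenizer output earlier, so they are an assumption tied to this checkpoint's vocabulary.

```python
def truncate_ids(ids, max_length, cls_id=101, sep_id=102):
    """Truncate content IDs so the result, with special tokens, fits max_length.

    cls_id and sep_id default to the [CLS]/[SEP] IDs seen earlier in these
    notes; other vocabularies may use different values.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    body = list(ids)[: max_length - 2]
    return [cls_id] + body + [sep_id]

# Token IDs of the example sentence, without special tokens.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

truncated = truncate_ids(ids, max_length=8)
print(truncated)
print(len(truncated))
```

The tokenizer can also handle this directly: passing `truncation=True` together with `max_length` truncates the encoded input and still adds the special tokens.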



