Hugging Face Transformers Tutorial Notes (6): Processing the data

2021/09/08 NLP ~6,347 characters, about a 19-minute read

Processing the data

The next few tutorials demonstrate how to fine-tune a Transformer model. This one focuses on how to process the dataset.

Loading the dataset

In this chapter we use MRPC (Microsoft Research Paraphrase Corpus) as the example. The dataset consists of 5,801 pairs of sentences, with a label indicating whether they are paraphrases (i.e., whether both sentences mean the same thing). MRPC is one of the ten datasets in the GLUE benchmark, which is used to measure model performance on text classification tasks.

Besides Transformer models, Hugging Face also hosts datasets that can be loaded very conveniently:

!pip install datasets transformers[sentencepiece]
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

From the output we can see that MRPC is split into a training set (3,668 pairs), a validation set (408 pairs), and a test set (1,725 pairs). Each example has the fields (sentence1, sentence2, label, idx).

Like models, downloaded datasets are cached, by default under ~/.cache/huggingface/datasets; the location can be changed by setting the HF_HOME environment variable.
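
As a minimal sketch (the cache directory path is just an illustration), the variable should be set before the datasets library is imported so that it takes effect:

```python
import os

# Hypothetical cache location, for illustration only; set this
# before importing the datasets library
os.environ["HF_HOME"] = "/data/hf_cache"

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")  # now cached under /data/hf_cache
```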

Accessing the data:

raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]
{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

Here the label is already an integer. To find out what each integer means, we can inspect the dataset's features:

raw_train_dataset.features
{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}
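
Since label is a ClassLabel feature, its int2str method maps an integer back to the class name. A small sketch:

```python
label_feature = raw_train_dataset.features["label"]

# Map the integer label of the first example back to its name
print(label_feature.int2str(raw_train_dataset[0]["label"]))  # 'equivalent'
```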

Preprocessing the dataset

from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

We can't simply feed tokenized_sentences_1 and tokenized_sentences_2 to the model separately; the two sentences need to be combined into a pair, and the tokenizer itself can do that:

inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs
{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


tokenizer.convert_ids_to_tokens(inputs["input_ids"])
['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']

As we can see, the format is [CLS] sentence1 [SEP] sentence2 [SEP]. The token_type_ids indicate which sentence each token belongs to (0 for the first sentence, 1 for the second):

['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
[      0,      0,    0,     0,       0,          0,   0,       0,      1,    1,     1,        1,     1,   1,       1]

Whether the tokenizer returns token_type_ids depends on how the pretrained model was built; DistilBERT, for example, does not use them.
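
For example (a quick check assuming the distilbert-base-uncased checkpoint), the DistilBERT tokenizer returns only input_ids and attention_mask:

```python
from transformers import AutoTokenizer

# DistilBERT was trained without token type embeddings,
# so its tokenizer does not produce token_type_ids
distil_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
distil_inputs = distil_tokenizer("This is the first sentence.", "This is the second one.")
print(distil_inputs.keys())  # dict_keys(['input_ids', 'attention_mask'])
```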

We could process the training set like this:

tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

This approach has several drawbacks:

  1. The result is a plain dictionary (keys: input_ids, attention_mask, token_type_ids; values: lists of lists), so we lose the Dataset object and its conveniences.
  2. It only works if there is enough RAM to hold the whole tokenized dataset in memory.
  3. The training, validation, and test sets have to be tokenized separately.

To solve these problems, we can use the Dataset.map method. The map method works by applying a function on each element of the dataset. First, define that function:

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

Here example is a dictionary whose keys are the dataset's features, and tokenize_function returns a new dictionary whose keys (attention_mask, input_ids, token_type_ids) are added to the dataset as new features. Recall raw_datasets and compare it with tokenized_datasets below:

raw_datasets
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})

With batched=True, tokenize_function is applied to whole batches of elements rather than one element at a time, which makes tokenization much faster. Also, we did not set padding inside tokenize_function: padding every sample to the dataset-wide maximum length would be very inefficient. It is better to pad per batch, where the maximum length is usually much smaller and processing is much faster.

Besides batching, you can also use multiple processes to speed up preprocessing by passing the num_proc argument to map, as sketched below.

We didn’t do this here because the 🤗 Tokenizers library already uses multiple threads to tokenize our samples faster, but if you are not using a fast tokenizer backed by this library, this could speed up your preprocessing.
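
A sketch of what that could look like (num_proc=4 is just an illustrative value):

```python
# Tokenize with 4 worker processes; mostly useful when the tokenizer
# is not a fast (Rust-backed) one
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4)
```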

Besides adding the three new features, the function passed to map can also modify existing features, as long as it returns a new value for an existing key.
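
For instance, a hypothetical preprocessing function (lowercase_sentence1 is made up for illustration) that overwrites an existing column:

```python
def lowercase_sentence1(example):
    # Returning a value for an existing key ("sentence1") replaces that column
    return {"sentence1": example["sentence1"].lower()}

lowercased_datasets = raw_datasets.map(lowercase_sentence1)
print(lowercased_datasets["train"][0]["sentence1"])
```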

What remains is the padding discussed earlier; because we pad per batch, this is called dynamic padding.

Dynamic padding

In PyTorch, the function that is responsible for putting together samples inside a batch is called a collate function.

Note that the per-batch padding approach mentioned above is not well suited to TPUs, which prefer tensors of a fixed shape.
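
In that case, one option (a sketch assuming a fixed length of 128 is acceptable for MRPC) is to pad to a fixed length inside the preprocessing function instead of per batch:

```python
def tokenize_function_fixed(example):
    # Pad every example to the same fixed length, giving
    # constant-shape tensors (friendlier to TPUs)
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,
        padding="max_length",
        max_length=128,  # assumed value, for illustration
    )
```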

The collate function can be passed as an argument when constructing a DataLoader; the default collate function simply converts samples to tensors and concatenates them (recursively if your elements are lists, tuples, or dictionaries).

In our case we can't use the default collate function because we need to pad per batch. The Hugging Face Transformers library provides DataCollatorWithPadding for this:

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Here is the situation before padding (take the first 8 samples, i.e. one batch, and look at their lengths):

samples = tokenized_datasets["train"][:8]
# strings can't be converted to tensors, so drop the idx, sentence1, sentence2 features
samples = {
    k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]
}
[len(x) for x in samples["input_ids"]]
[50, 59, 47, 67, 59, 50, 62, 32]

After dynamic padding:

batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}
{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 67])}

Everything has been padded to length 67, the maximum length within this batch.
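
To put the pieces together, here is a minimal sketch (assuming PyTorch) of plugging the collator into a DataLoader; the string columns are dropped because they can't be converted to tensors:

```python
from torch.utils.data import DataLoader

# Drop columns the model can't consume, then let the collator
# pad each batch dynamically
train_set = tokenized_datasets["train"].remove_columns(["idx", "sentence1", "sentence2"])

train_dataloader = DataLoader(
    train_set, batch_size=8, shuffle=True, collate_fn=data_collator
)

batch = next(iter(train_dataloader))
print({k: v.shape for k, v in batch.items()})  # sequence length varies per batch
```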
