'房产')
label2id = dict()
id2label = dict()
for i, label in enumerate(set(y_lst)):
    label2id[label] = i
    id2label[i] = label

tokenizer = AutoTokenizer.from_pretrained("./models/bert-base-chinese")
First, encode all of the texts up front rather than converting them later inside the dataset; this avoids re-encoding on every epoch during training and improves efficiency:
token_lens = []
for txt in tqdm(x_lst):
    tokens = tokenizer.encode(txt, max_length=512)
    token_lens.append(len(tokens))
0%| | 0/5900 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly
truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this
strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 5900/5900 [00:07<00:00, 739.64it/s]
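The collected token lengths can guide the choice of MAX_LEN used below. A small check of the distribution (the numpy calls here are an addition for illustration, not from the original post):

import numpy as np

# Inspect the token-length distribution; a high percentile, capped at BERT's
# 512-token limit, is one common way to pick MAX_LEN.
print(f"max length: {max(token_lens)}, mean length: {np.mean(token_lens):.1f}")
print(f"95th percentile: {np.percentile(token_lens, 95):.0f}")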
class NewsDataset(Dataset):
    def __init__(self, x_lst, y_lst, tokenizer, max_len):
        self.x_lst = x_lst
        self.y_lst = y_lst
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.x_lst)

    def __getitem__(self, index):
        """
        index is the sample index; each call returns the index-th example.
        """
        text = str(self.x_lst[index])
        label = label2id[self.y_lst[index]]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=True,
            padding='max_length',      # replaces the deprecated pad_to_max_length=True
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'texts': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
x_train, x_val, y_train, y_val = train_test_split(x_lst, y_lst, test_size=0.15, random_state=RANDOM_SEED)  # split into training and validation sets
# dataset
train_dataset = NewsDataset(x_train, y_train, tokenizer, MAX_LEN)
val_dataset = NewsDataset(x_val, y_val, tokenizer, MAX_LEN)
# dataloader
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
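As a quick sanity check (an addition, not part of the original post), one batch can be pulled from train_dataloader to confirm that the fields returned by NewsDataset have the expected shapes:

# Pull one batch and verify the tensor shapes.
batch = next(iter(train_dataloader))
print(batch['input_ids'].shape)       # torch.Size([BATCH_SIZE, MAX_LEN])
print(batch['attention_mask'].shape)  # torch.Size([BATCH_SIZE, MAX_LEN])
print(batch['labels'].shape)          # torch.Size([BATCH_SIZE])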
6.2 Custom network
Here we take the pretrained BERT model and add a Dropout layer followed by a single linear layer on top, forming the custom network:
class CustomBERTModel(nn.Module):
    def __init__(self, n_classes):
        super(CustomBERTModel, self).__init__()
        self.bert = AutoModel.from_pretrained("./models/bert-base-chinese")
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # With return_dict=False, BERT returns (sequence_output, pooled_output);
        # pooled_output is the pooled [CLS] representation.
        _, pooled_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=False
        )
        output = self.drop(pooled_output)  # dropout
        return self.out(output)            # classification logits
device = set_device(cuda_index=1)
2022-12-20 16:12:39 set_device line 11 out: cuda:1
n_classes = len(label2id)
model = CustomBERTModel(n_classes)
model = model.to(device)
Some weights of the model checkpoint at ./models/bert-base-chinese were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight',
'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight',
'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a
BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification
model from a BertForSequenceClassification model).
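Before training, a quick forward pass on a single batch (again an added sanity check, not from the original post) confirms that the network outputs one logit per class:

# Run one batch through the model and check the output shape.
model.eval()
with torch.no_grad():
    batch = next(iter(train_dataloader))
    logits = model(
        input_ids=batch['input_ids'].to(device),
        attention_mask=batch['attention_mask'].to(device),
    )
print(logits.shape)  # torch.Size([BATCH_SIZE, n_classes])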
6.3 Model training
During training, the model is evaluated on the validation set after each epoch, and the weights are saved whenever the validation accuracy improves:
if val_acc > best_accuracy:
    is_best = True
    torch.save(model.state_dict(), './models/news_classification/best_model_state.bin')
    best_accuracy = val_acc
else:
    is_best = False
t.print_row(epoch, f"{train_acc:.4f}", f"{train_loss:.4f}", f"{val_acc:.4f}", f"{val_loss:.4f}", is_best)
+======+===========+====================+================+===================+===============+=============+
| | epoch | train_accuracy | train_loss | test_accuracy | test_loss | is_best |
+======+===========+====================+================+===================+===============+=============+
| 1 | 0 | 0.6080 | 1.4608 | 0.8893 | 0.5278 | True |
+------+-----------+--------------------+----------------+-------------------+---------------+-------------+
| 2 | 1 | 0.9196 | 0.3766 | 0.9096 | 0.3583 | True |
+------+-----------+--------------------+----------------+-------------------+---------------+-------------+
| 3 | 2 | 0.9589 | 0.2015 | 0.9153 | 0.3413 | True |
+------+-----------+--------------------+----------------+-------------------+---------------+-------------+
| 4 | 3 | 0.9765 | 0.1272 | 0.9153 | 0.3286 | False |
+------+-----------+--------------------+----------------+-------------------+---------------+-------------+
| 5 | 4 | 0.9836 | 0.0919 | 0.9220 | 0.3239 | True |
+------+-----------+--------------------+----------------+-------------------+---------------+-------------+
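The train_acc/train_loss and val_acc/val_loss values printed above come from per-epoch helpers that are not reproduced in this excerpt. A minimal sketch of what they might look like (the names train_epoch and eval_model, the AdamW optimizer, and CrossEntropyLoss are assumptions, not taken from the original post):

loss_fn = nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_epoch(model, dataloader):
    # One pass over the training data; returns (accuracy, mean loss).
    model.train()
    correct, total_loss = 0, 0.0
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        logits = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total_loss += loss.item()
    return correct / len(dataloader.dataset), total_loss / len(dataloader)

def eval_model(model, dataloader):
    # Evaluation pass without gradient updates; returns (accuracy, mean loss).
    model.eval()
    correct, total_loss = 0, 0.0
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            logits = model(input_ids=input_ids, attention_mask=attention_mask)
            total_loss += loss_fn(logits, labels).item()
            correct += (logits.argmax(dim=1) == labels).sum().item()
    return correct / len(dataloader.dataset), total_loss / len(dataloader)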
With the pretrained BERT model plus the custom network, the model already reaches a fairly high accuracy from the very first epochs.
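To use the saved checkpoint for prediction, the weights can be reloaded and the arg-max logit mapped back through id2label. A minimal sketch (the predict helper below is an assumption for illustration, not from the original post):

def predict(text, model, tokenizer, max_len=MAX_LEN):
    # Encode a single text, run the classifier, and map the top logit to its label.
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=max_len,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt',
    )
    model.eval()
    with torch.no_grad():
        logits = model(
            input_ids=encoding['input_ids'].to(device),
            attention_mask=encoding['attention_mask'].to(device),
        )
    return id2label[logits.argmax(dim=1).item()]

# Reload the best checkpoint saved during training, then classify a sample text.
model.load_state_dict(torch.load('./models/news_classification/best_model_state.bin', map_location=device))
print(predict("示例新闻文本", model, tokenizer))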
Source: https://www.cnblogs.com/chenhuabin/p/16997607.html