Training BartForSequenceClassification returns data with non-uniform dimensions
I am trying to fine-tune a BART-base model on my own dataset. The dataset has the columns "id", "text", "label" and "dataset_id". The "text" column holds plain text and is what I want to use as input to the model. "label" is either 0 or 1.
I've already written the training code, using transformers==4.28.0.
This is the code for the dataset class:
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, encodings):
        # encodings: dict of tensors (input_ids, attention_mask, labels)
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])
This is the code for loading and encoding of the data:
import os
import pandas as pd
import torch

def load_data(directory):
    # Concatenate every file in the directory whose name ends with 'train.csv'
    files = os.listdir(directory)
    dfs = []
    for file in files:
        if file.endswith('train.csv'):
            df = pd.read_csv(os.path.join(directory, file))
            dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

print(len(load_data("splitted_data/gender-bias")))

def encode_data(tokenizer, text, labels):
    # Pad or truncate every text to 128 tokens and attach the labels
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
    inputs['labels'] = torch.tensor(labels)
    return inputs
This is the code for the evaluation metric. I use the f1_score function from scikit-learn.
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}
This is the training function:
from transformers import Trainer, TrainingArguments

def train_model(train_dataset, eval_dataset):
    # Define the training arguments
    training_args = TrainingArguments(
        output_dir='./baseline/results',  # output directory
        num_train_epochs=5,               # total number of training epochs
        per_device_train_batch_size=32,   # batch size per device during training
        per_device_eval_batch_size=64,    # batch size for evaluation
        warmup_steps=500,                 # number of warmup steps for learning rate scheduler
        weight_decay=0.01,                # strength of weight decay
        evaluation_strategy="steps",      # evaluate every `eval_steps` training steps
        eval_steps=50,                    # number of training steps between evaluations
        load_best_model_at_end=True,      # load the best model when finished training (defaults to `False`)
        save_strategy='steps',            # save a checkpoint every `save_steps` training steps
        save_steps=500,                   # number of training steps between saves
        metric_for_best_model='f1',       # metric to use to compare models
        greater_is_better=True            # whether a larger metric value is better
    )
    # Define the trainer (uses the globally defined `model`)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics
    )
    # Train the model
    trainer.train()
    return trainer
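This is how I define the model and tokenizer, prepare the datasets, and start training:
from transformers import BartForSequenceClassification, BartTokenizer

model = BartForSequenceClassification.from_pretrained('facebook/bart-base', num_labels=2)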
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
train_df = load_data("splitted_data/gender-bias")
train_encodings = encode_data(tokenizer, train_df['text'].tolist(), train_df['label'].tolist())
# For simplicity, let's split our training data to create a pseudo-evaluation set
train_size = int(0.9 * len(train_encodings['input_ids'])) # 90% for training
train_dataset = {k: v[:train_size] for k, v in train_encodings.items()}
print(train_dataset)
print(len(train_dataset))
eval_dataset = {k: v[train_size:] for k, v in train_encodings.items()} # 10% for evaluation
# Convert the dictionary data to PyTorch Dataset
train_dataset = TextDataset(train_dataset)
eval_dataset = TextDataset(eval_dataset)
trainer = train_model(train_dataset, eval_dataset)
Training itself runs fine. However, when evaluation runs during training, an error is raised from my compute_metrics function, which receives the model's output as its argument. Since this is a binary classification model, I expected its output to contain one score per label for each example.
np.argmax(np.array(logits), axis=-1) 21
ValueError: could not broadcast input array from shape (3208,2) into shape (3208,)
I printed the type of logits, and type(logits) returns tuple. Suspecting that the evaluation dataset is split into batches and that the returned tuple holds one NumPy array per batch, I also tried concatenating the tuple:
def compute_metrics(eval_pred):
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    logits = np.concatenate(logits, axis=0)
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}
But this raised a new error:
packages/numpy/core/overrides.py in concatenate(*args, **kwargs)
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 3 dimension(s)
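The message above suggests that the tuple's elements are not per-batch pieces of one array, since element 0 is 2-D and element 1 is 3-D. A quick way to check is to print the shape of each element. The sketch below uses stand-in NumPy arrays with illustrative shapes, and `describe_predictions` is a hypothetical helper name, not part of transformers:

```python
import numpy as np

def describe_predictions(predictions):
    """Return the shape of every array in a (possibly nested) predictions object."""
    if isinstance(predictions, (tuple, list)):
        return [describe_predictions(p) for p in predictions]
    return np.asarray(predictions).shape

# Stand-in for eval_pred.predictions: a 2-D array followed by a 3-D array
# (shapes are made up for illustration only)
stand_in = (np.zeros((6, 2)), np.zeros((6, 4, 8)))
shapes = describe_predictions(stand_in)
print(shapes)
```

Calling `describe_predictions(eval_pred.predictions)` inside compute_metrics would show whether the first element alone already has the expected (num_examples, num_labels) shape.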
How can I solve this issue?
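One plausible explanation, offered as an assumption rather than something confirmed in this post: the tuple holds the classification logits as its first element, followed by an extra model output such as an encoder hidden state. That would match the 2-D and 3-D shapes in the error above. Under that assumption, a minimal sketch of a fix is to take only the first element before calling argmax. `extract_logits` and `predicted_labels` are hypothetical helper names, and the arrays are stand-ins:

```python
import numpy as np

def extract_logits(predictions):
    """Return the logits, assuming they are the first element when a tuple is given."""
    if isinstance(predictions, (tuple, list)):
        return np.asarray(predictions[0])
    return np.asarray(predictions)

def predicted_labels(predictions):
    """Pick the highest-scoring label for every example."""
    return np.argmax(extract_logits(predictions), axis=-1)

# Stand-in predictions: (logits, extra 3-D output), values made up for illustration
fake_logits = np.array([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]])
fake_extra = np.zeros((3, 4, 8))
preds = predicted_labels((fake_logits, fake_extra))
print(preds)
```

In compute_metrics, this would mean replacing the argmax line with `predictions = predicted_labels(eval_pred.predictions)` and passing the result to f1_score as before.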