Transformers in NLP: A Notebook Approach to Disaster Tweets Kaggle Contest

Using Transformers for Building a Disaster Tweet Classification Model in Kaggle Competitions

Introduction

Twitter has emerged as an important communication channel during times of emergency. With the widespread usage of smartphones, people can quickly report an emergency they are witnessing in real time, enabling disaster relief organizations and news agencies to monitor and respond. However, not all tweets that contain disaster-related words indicate a real disaster; some use such words metaphorically. The challenge is to build a machine-learning model that can distinguish tweets about actual disasters from those that are not. This is where Natural Language Processing (NLP) comes into play. The approach is based on fastai Lesson 4 and Jeremy Howard's "Getting Started with NLP for Absolute Beginners" Kaggle notebook, which provided a strong foundation for my solution. For more information about the competition, visit https://www.kaggle.com/competitions/nlp-getting-started

Setting up environment

Set a path to our data. We use fastkaggle because it makes everything much easier and works automatically whether you're running on your own PC or on Kaggle!

try: import fastkaggle
except ModuleNotFoundError:
    !pip install -q fastkaggle
from fastkaggle import *

comp = 'nlp-getting-started'

# Outside Kaggle, the API needs our credentials in ~/.kaggle
if not iskaggle:
    !cp kaggle.json ~/.kaggle/kaggle.json
    !chmod 600 ~/.kaggle/kaggle.json
path = setup_comp(comp)

Downloading nlp-getting-started.zip to /content

100%|██████████| 593k/593k [00:00<00:00, 648kB/s]

Import and EDA

!pip install -q datasets sentencepiece transformers

The fastai library's imports include pandas, numpy, and matplotlib, so we don't need to import them separately.

from fastai.imports import *
import warnings, logging
warnings.simplefilter('ignore')
logging.disable(logging.WARNING)
!ls {path}

sample_submission.csv test.csv train.csv

Let's look at the dataset

df = pd.read_csv(path/'train.csv')
df.head()
   | id | keyword | location | text | target
 0 | 1  | NaN | NaN | Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all | 1
 1 | 4  | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1
 2 | 5  | NaN | NaN | All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected | 1
 3 | 6  | NaN | NaN | 13,000 people receive #wildfires evacuation orders in California | 1
 4 | 7  | NaN | NaN | Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school | 1

One of the most useful DataFrame features is the describe() method, which returns a summary of the columns.

df.describe(include='object')
       | keyword    | location | text
count  | 7552       | 5080     | 7613
unique | 221        | 3341     | 7503
top    | fatalities | USA      | 11-Year-Old Boy Charged With Manslaughter of Toddler: Report: An 11-year-old boy has been charged with manslaughter over the fatal sh...
freq   | 45         | 104      | 10

We can see that there are 221 unique keywords, 3,341 unique locations, and 7,503 unique texts.

df.isna().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

There are 61 keyword and 2,533 location entries with missing (NaN) values, so we replace them with empty strings using the fillna() method.

df.fillna('',inplace=True)

We represent the input to the model as "TEXT1: <text>; LOC1: <location>; KEYWORD1: <keyword>":

df['input'] = 'TEXT1: '+df.text+'; LOC1: '+df.location+'; KEYWORD1: '+df.keyword
df.input.head()

0    TEXT1: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all; LOC1: ; KEYWORD1:
1    TEXT1: Forest fire near La Ronge Sask. Canada; LOC1: ; KEYWORD1:
2    TEXT1: All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected; LOC1: ; KEYWORD1:
3    TEXT1: 13,000 people receive #wildfires evacuation orders in California; LOC1: ; KEYWORD1:
4    TEXT1: Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school; LOC1: ; KEYWORD1:
Name: input, dtype: object

Tokenization

The transformers library works with Dataset objects, which can be created directly from a pandas DataFrame as follows.

from datasets import Dataset, DatasetDict 

ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['id', 'keyword', 'location', 'text', 'target', 'input'],
    num_rows: 7613
})

We will be using the small variant of DeBERTa v3, as its size allows us to train the model quickly.

model_nm = 'microsoft/deberta-v3-small'

AutoTokenizer will automatically download the tokenizer and vocabulary that match our model.

from transformers import AutoModelForSequenceClassification, AutoTokenizer 

tokz = AutoTokenizer.from_pretrained(model_nm)
tokz.tokenize("Tom can't remember all his passwords, so he keeps them in a list disguised as phone numbers.")

['▁Tom', '▁can', "'", 't', '▁remember', '▁all', '▁his', '▁passwords', ',', '▁so', '▁he', '▁keeps', '▁them', '▁in', '▁a', '▁list', '▁disguised', '▁as', '▁phone', '▁numbers', '.']

def tok_func(x):return tokz(x['input'])

To tokenize the complete dataset, we use the map method, passing it our tokenization function and batched=True so that examples are processed in batches.

tok_ds = ds.map(tok_func, batched=True)

row = tok_ds[0]
row['input'],row['input_ids']

('TEXT1: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all; LOC1: ; KEYWORD1: ', [1, 54453, 435, 294, 581, 65453, 281, 262, 18037, 265, 291, 953, 117831, 903, 4924, 17018, 43632, 381, 305, 346, 57615, 435, 294, 2600, 29908, 67111, 435, 294, 2])
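
As a quick sanity check (a small sketch reusing the tokz and row objects defined above), we can decode the token ids back into text; the result should be our input string wrapped in the tokenizer's special tokens:

tokz.decode(row['input_ids'])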

Transformers expects the target column to be named 'labels' by convention, so we rename it.

tok_ds = tok_ds.rename_columns({'target':'labels'})

Test and Validation set

eval_df = pd.read_csv(path/'test.csv')
eval_df.head()
   | id | keyword | location | text
 0 | 0  | NaN | NaN | Just happened a terrible car crash
 1 | 2  | NaN | NaN | Heard about #earthquake is different cities, stay safe everyone.
 2 | 3  | NaN | NaN | there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all
 3 | 9  | NaN | NaN | Apocalypse lighting. #Spokane #wildfires
 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan

We fill the missing values in the test set the same way, and hold out 25% of the tokenized training data as a validation set:

eval_df.fillna('', inplace=True)
dds = tok_ds.train_test_split(0.25)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5709
    })
    test: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1904
    })
})

We build the same input representation for the test set and tokenize it:

eval_df['input'] = 'TEXT1: '+eval_df.text+'; LOC1: '+eval_df.location+'; KEYWORD1: '+eval_df.keyword
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

Metrics and Correlation

As mentioned in the competition description, submissions are evaluated on the F1 score between the predicted and expected answers.

from sklearn.metrics import f1_score

def corr(preds, labels):
    # Convert raw logits into class predictions before scoring
    preds = np.argmax(preds, axis=1).flatten()
    labels = labels.flatten()
    return f1_score(labels, preds)

def corr_d(eval_pred):
    # eval_pred is the (predictions, labels) tuple the Trainer passes to compute_metrics
    return {'f1_score': corr(*eval_pred)}
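
For intuition, here is a quick sanity check of corr on made-up logits and labels (the numbers below are purely illustrative and not taken from the competition data):

logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])  # fake model outputs for three tweets
labels = np.array([1, 0, 0])                              # the corresponding true targets
corr(logits, labels)  # argmax gives predictions [1, 0, 1]: precision 1/2, recall 1/1, so F1 = 2/3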

Training

from transformers import TrainingArguments, Trainer

bs: Batch Size

epochs: Number of epochs

bs = 128
epochs = 4

lr: Learning rate

lr = 8e-5
args = TrainingArguments(
    'outputs', learning_rate=lr,
    warmup_ratio=0.1, lr_scheduler_type='cosine',
    fp16=True, evaluation_strategy='epoch',
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01,
    report_to='none'
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_nm,
    num_labels = 2
)

trainer = Trainer(
    model, args, train_dataset=dds['train'],
    eval_dataset=dds['test'],
    tokenizer=tokz,
    compute_metrics=corr_d
)
trainer.train();

[180/180 01:22, Epoch 4/4]

Epoch | Training Loss | Validation Loss | F1 Score
1     | No log        | 0.446717        | 0.775607
2     | No log        | 0.504164        | 0.790861
3     | No log        | 0.449196        | 0.792476
4     | No log        | 0.486336        | 0.795580

We achieve a validation F1 score of roughly 0.796.
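
If you want the final validation metrics as a plain dictionary rather than reading them off the table above, the Trainer can re-run evaluation on the held-out split (it reports our metric with an eval_ prefix):

metrics = trainer.evaluate()
metrics['eval_f1_score']

Next, we generate predictions for the test set.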

preds = trainer.predict(eval_ds).predictions.argmax(1)
preds

array([1, 1, 1, ..., 1, 1, 1])

Create a submission file to upload to the Kaggle competition.

import datasets 
submission = datasets.Dataset.from_dict({
    'id':eval_ds['id'],
    'target':preds
})

submission.to_csv('submission.csv',index=False)

22746

!head submission.csv

id,target
0,1
2,1
3,1
9,1
11,1
12,1
21,0
22,0
27,0

Improvements

  1. Data cleaning: Remove emojis, eliminate punctuation, and convert text to lowercase (see the sketch after this list).

  2. Model: Larger models can be used to increase the score; many are available on the Hugging Face Hub.

  3. Tweaking the training parameters: learning rate, batch size, epochs
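
As an illustration of the first improvement, here is a minimal text-cleaning sketch (a hypothetical helper, not part of the original notebook) that could be applied to df.text and eval_df.text before building the input column; whether it actually helps the score should be checked against the validation set:

import re, string

def clean_text(t):
    t = t.lower()                                    # lowercase
    t = re.sub(r'http\S+', '', t)                    # drop URLs
    t = t.encode('ascii', 'ignore').decode()         # crude removal of emojis and other non-ASCII characters
    t = t.translate(str.maketrans('', '', string.punctuation))  # strip punctuation (note: this also removes '#')
    return ' '.join(t.split())                       # normalise whitespace

df['text'] = df.text.map(clean_text)                 # hypothetical usage on the training DataFrame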

Conclusion

In conclusion, fine-tuning the DeBERTa transformer proved to be an effective solution, achieving a score of 83.08% in the Kaggle disaster tweets competition. The approach uses natural language processing and machine learning to classify disaster-related tweets accurately. The solution is not only effective but also simple, and with a GPU it can be trained in under 2 minutes. Overall, its success highlights the potential of modern NLP techniques for solving complex real-world problems.

References

Please consider upvoting my Kaggle notebook if you found my solution helpful. Here's the link to my Kaggle notebook: notebook. Thank you!