Transformers in NLP: A Notebook Approach to Disaster Tweets Kaggle Contest
Using Transformers for Building a Disaster Tweet Classification Model in Kaggle Competitions
Introduction
Nowadays, Twitter has emerged as an important communication channel during times of emergency. With the widespread usage of smartphones, people can now quickly report an emergency they are witnessing in real-time, enabling disaster relief organizations and news agencies to monitor and respond. However, not all tweets that contain disaster-related words indicate a real disaster, as some may use such words metaphorically. The challenge is to build a machine-learning model that can distinguish between tweets related to actual disasters and those that are not. This is where Natural Language Processing (NLP) comes into play. The approach is based on Fastai Lesson 4 and J. Howard's "Getting Started with NLP for Absolute Beginners" Kaggle notebook, which provided a strong foundation for my solution. For more information about the competition, visit https://www.kaggle.com/competitions/nlp-getting-started
Setting up environment
First we set a path to our data. We use fastkaggle because it handles the competition download and setup for us, and works the same way regardless of whether you are running on your own PC or on Kaggle.
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -q fastkaggle
from fastkaggle import *
comp = 'nlp-getting-started'
if not iskaggle:
    !cp kaggle.json ~/.kaggle/kaggle.json
    !chmod 600 ~/.kaggle/kaggle.json
path = setup_comp(comp)
Downloading nlp-getting-started.zip to /content
100%|██████████| 593k/593k [00:00<00:00, 648kB/s]
Import and EDA
!pip install -q datasets sentencepiece transformers
The fastai library's imports module includes pandas, numpy and matplotlib.
from fastai.imports import *
import warnings,logging
warnings.simplefilter('ignore')
logging.disable(logging.WARNING)
!ls {path}
sample_submission.csv test.csv train.csv
Let's look at the dataset
df = pd.read_csv(path/'train.csv')
df.head()
 | id | keyword | location | text | target |
0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all | 1 |
1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected | 1 |
3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation orders in California | 1 |
4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school | 1 |
One of the most useful features of a DataFrame is the describe() method, which returns summary statistics for its columns. Passing include='object' restricts the summary to the string columns.
df.describe(include='object')
 | keyword | location | text |
count | 7552 | 5080 | 7613 |
unique | 221 | 3341 | 7503 |
top | fatalities | USA | 11-Year-Old Boy Charged With Manslaughter of Toddler: Report: An 11-year-old boy has been charged with manslaughter over the fatal sh... |
freq | 45 | 104 | 10 |
We can see that there are 221 unique keywords, 3,341 unique locations and 7,503 unique tweet texts.
df.isna().sum()
id             0
keyword       61
location    2533
text           0
target         0
dtype: int64
There are 61 keyword entries and 2,533 location entries with missing (NaN) values, so we replace them with empty strings using the fillna() method.
df.fillna('',inplace=True)
We represent the input to the model as a single string of the form
TEXT1: text; LOC1: location; KEYWORD1: keyword
df['input'] = 'TEXT1: '+df.text+'; LOC1: '+df.location+'; KEYWORD1: '+df.keyword
df.input.head()
0 TEXT1: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all; LOC1: ; KEYWORD1:
1 TEXT1: Forest fire near La Ronge Sask. Canada; LOC1: ; KEYWORD1:
2 TEXT1: All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected; LOC1: ; KEYWORD1:
3 TEXT1: 13,000 people receive #wildfires evacuation orders in California ; LOC1: ; KEYWORD1:
4 TEXT1: Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school ; LOC1: ; KEYWORD1:
Name: input, dtype: object
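As a minimal sketch of the same formatting applied to a single record, the helper below builds the string for one tweet. build_input is a hypothetical function for illustration only and is not part of the notebook; it treats a missing location or keyword as an empty string, which mirrors the effect of fillna('').

```python
def build_input(text, location=None, keyword=None):
    """Format one tweet the way the 'input' column is built.

    build_input is a hypothetical helper used only for illustration.
    """
    location = location or ''  # missing values become empty strings
    keyword = keyword or ''
    return f'TEXT1: {text}; LOC1: {location}; KEYWORD1: {keyword}'


example = build_input('Forest fire near La Ronge Sask. Canada')
print(example)
```

Keeping the field markers (TEXT1, LOC1, KEYWORD1) in every row, even when a field is empty, gives the model a consistent layout to learn from.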
Tokenization
The Hugging Face Transformers library works with Dataset objects from the datasets library, which can be created from a DataFrame as follows.
from datasets import Dataset, DatasetDict
ds = Dataset.from_pandas(df)
ds
Dataset({
    features: ['id', 'keyword', 'location', 'text', 'target', 'input'],
    num_rows: 7613
})
We will use the small variant of DeBERTa v3, as its size allows us to train our model quickly.
model_nm = 'microsoft/deberta-v3-small'
AutoTokenizer will automatically download the suitable vocabulary for our model.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)
tokz.tokenize("Tom can't remember all his passwords, so he keeps them in a list disguised as phone numbers.")
['▁Tom', '▁can', "'", 't', '▁remember', '▁all', '▁his', '▁passwords', ',', '▁so', '▁he', '▁keeps', '▁them', '▁in', '▁a', '▁list', '▁disguised', '▁as', '▁phone', '▁numbers', '.']
def tok_func(x): return tokz(x['input'])
To tokenize our complete dataset, we will use the map method. Here we pass it two arguments: the function to apply and batched=True, so that the function receives many rows at once instead of one at a time.
tok_ds = ds.map(tok_func, batched=True)
row = tok_ds[0]
row['input'],row['input_ids']
('TEXT1: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all; LOC1: ; KEYWORD1: ', [1, 54453, 435, 294, 581, 65453, 281, 262, 18037, 265, 291, 953, 117831, 903, 4924, 17018, 43632, 381, 305, 346, 57615, 435, 294, 2600, 29908, 67111, 435, 294, 2])
By convention, Transformers expects the target column to be named 'labels', so we rename the column.
tok_ds = tok_ds.rename_columns({'target':'labels'})
Test and Validation set
eval_df = pd.read_csv(path/'test.csv')
eval_df.head()
 | id | keyword | location | text |
0 | 0 | NaN | NaN | Just happened a terrible car crash |
1 | 2 | NaN | NaN | Heard about #earthquake is different cities, stay safe everyone. |
2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all |
3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |
eval_df.fillna('',inplace=True)
dds = tok_ds.train_test_split(0.25)
dds
DatasetDict({
    train: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5709
    })
    test: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1904
    })
})
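The split sizes reported above follow from simple arithmetic. The sketch below assumes that the held-out size is rounded up and the remainder goes to training, which is consistent with the 1904 and 5709 rows shown; the variable names are illustrative.

```python
import math

n_rows = 7613         # rows in train.csv
test_fraction = 0.25  # value passed to train_test_split

# Assumed rounding: held-out rows rounded up, the rest used for training.
n_valid = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_valid

print(n_train, n_valid)
```

Because the split is random, the validation score will vary a little between runs; train_test_split also accepts a seed argument if a reproducible split is needed.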
eval_df['input'] = 'TEXT1: '+eval_df.text+'; LOC1: '+eval_df.location+'; KEYWORD1: '+eval_df.keyword
eval_ds = Dataset.from_pandas(eval_df).map(tok_func,batched=True)
Metrics and Correlation
As stated in the competition, submissions are evaluated using the F1 score between the predicted and expected answers.
from sklearn.metrics import f1_score
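def corr(x, y):
    # x: model logits of shape (n_samples, 2); y: true labels
    x_flat = np.argmax(x, axis=1).flatten()
    y_flat = y.flatten()
    # f1_score expects (y_true, y_pred)
    return f1_score(y_flat, x_flat)
def corr_d(eval_pred):
    # Trainer passes a (predictions, label_ids) pair
    return {'f1_score': corr(*eval_pred)}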
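To make the metric concrete, here is a minimal pure-Python sketch that turns logits into class predictions and computes the binary F1 score from raw counts. The helper names and the toy logits are made up for illustration; in the notebook itself the score comes from sklearn's f1_score.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def binary_f1(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up logits for five tweets: column 0 = not a disaster, column 1 = disaster.
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4], [-0.5, 0.9]]
toy_labels = [1, 0, 0, 1, 1]

toy_preds = argmax_rows(toy_logits)
toy_score = binary_f1(toy_labels, toy_preds)
print(toy_preds, round(toy_score, 3))
```

In this toy case there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3 as well.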
Training
from transformers import TrainingArguments, Trainer
bs: Batch Size
epochs: Number of epochs
bs = 128
epochs = 4
lr: Learning rate
lr = 8e-5
args = TrainingArguments(
    'outputs', learning_rate=lr,
    warmup_ratio=0.1, lr_scheduler_type='cosine',
    fp16=True, evaluation_strategy='epoch',
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01,
    report_to='none'
)
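The combination of warmup_ratio=0.1 and lr_scheduler_type='cosine' means the learning rate first rises linearly and then decays along a cosine curve. The sketch below uses the usual warmup-plus-cosine formula to show the shape; it is an illustration, not the Trainer's own code. lr_at is a hypothetical helper, and total_steps=180 is an assumption taken from the step count this run reports.

```python
import math


def lr_at(step, base_lr=8e-5, total_steps=180, warmup_ratio=0.1):
    """Learning rate at a given optimizer step (illustrative sketch)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))


for step in (0, 9, 18, 99, 180):
    print(step, f'{lr_at(step):.2e}')
```

With these settings the peak learning rate of 8e-5 is reached after 18 steps, and the rate falls back to roughly half of that midway through the decay phase.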
model = AutoModelForSequenceClassification.from_pretrained(
    model_nm,
    num_labels=2
)
trainer = Trainer(
    model, args, train_dataset=dds['train'],
    eval_dataset=dds['test'],
    tokenizer=tokz,
    compute_metrics=corr_d
)
trainer.train();
[180/180 01:22, Epoch 4/4]
Epoch | Training Loss | Validation Loss | F1 Score |
1 | No log | 0.446717 | 0.775607 |
2 | No log | 0.504164 | 0.790861 |
3 | No log | 0.449196 | 0.792476 |
4 | No log | 0.486336 | 0.795580 |
We achieved an F1 score of about 79.6% on the validation set after four epochs.
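The 180 steps shown in the progress counter can be checked with a quick calculation. This sketch assumes a single device and no gradient accumulation; the variable names are illustrative.

```python
import math

train_rows = 5709  # rows in dds['train']
bs = 128           # per-device training batch size
epochs = 4

# The last, partially filled batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_rows / bs)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

The Trainer's logging interval (logging_steps, 500 by default) is larger than 180, which is most likely why the Training Loss column shows "No log" for every epoch.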
preds = trainer.predict(eval_ds).predictions.argmax(1)
preds
array([1, 1, 1, ..., 1, 1, 1])
Create a submission file for submitting to a Kaggle competition.
import datasets
submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'target': preds
})
submission.to_csv('submission.csv', index=False)
22746
!head submission.csv
id,target
0,1
2,1
3,1
9,1
11,1
12,1
21,0
22,0
27,0
Improvements
Data cleaning: Remove emojis, eliminate punctuation, and convert text to lowercase.
Model: Larger models can be used to increase the score. These models can be found at Hugging Face.
Tweaking the training parameters: learning rate, batch size, epochs
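The data-cleaning idea in the list above could be sketched as follows. clean_tweet and the emoji ranges are assumptions for illustration: the ranges cover common emoji blocks but are not exhaustive, and this code was not part of the submitted solution.

```python
import re
import string

# Rough emoji ranges; an assumption, not an exhaustive list.
EMOJI_RE = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_tweet(text):
    """Remove emojis and ASCII punctuation, lowercase, and collapse whitespace."""
    text = EMOJI_RE.sub('', text)
    text = text.translate(PUNCT_TABLE)
    text = text.lower()
    return ' '.join(text.split())


sample = 'Forest FIRE near La Ronge, Sask. \U0001F525'
cleaned = clean_tweet(sample)
print(cleaned)
```

Note that stripping punctuation also removes the '#' from hashtags, and a pretrained tokenizer may already cope well with raw text, so any cleaning step like this should be checked against the validation F1 score before it is adopted.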
Conclusion
In conclusion, fine-tuning the small DeBERTa v3 transformer proved to be an effective solution, achieving a score of 83.08% in the Kaggle disaster tweets competition (the validation F1 score during training was about 79.6%). The approach is also simple and, with a GPU, trains in under 2 minutes. Overall, this shows how far a pretrained transformer with minimal preprocessing can go on a real-world text classification problem.
References
Practical Deep Learning for Coders Lesson 4: Natural Language (NLP)
J. Howard, "Getting Started with NLP for Absolute Beginners," Kaggle, getting-started-with-nlp-for-absolute-beginners.
Please consider upvoting my Kaggle notebook if you found my solution helpful. Thank you!