Predicting the sentiment of Covid-19 Tweets via LDA Classification
Introduction
This tutorial makes use of 3 cleaned data sets, as discussed and prepared in part 1 of this series.
- Dimension reduced sentiment data of 1,600,000 tweets labelled 0 for negative and 1 for positive sentiment.
- 8,981 unseen tweets about covid-19.
- Dimension reduced version of the 8,981 covid-19 tweets.
The goal is to train an LDA model on the sentiment data and use that model to predict the sentiment of each covid-19 tweet.
The code accompanying this tutorial can be found here: https://github.com/aletna/Covid-19-Tweets-Classification.git
We will be using the following libraries:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, confusion_matrix
1. Training labelled sentiment data
While Linear Discriminant Analysis is often used as a dimension reduction tool, it can also be used as a classification model. Since our data is already prepared and does not need further dimensionality reduction, we will be using LDA purely as a classifier in this tutorial.
First, let’s load the sentiment data. In part 1 of this series we prepared it so that it is already divided into input and output (target) variables.
target = np.load('target.npy')/4
sent_input = np.load('sentiment_input.npy')
Here I divide the target data by 4 because the original dataset labelled positive tweets as 4 and negative tweets as 0. Dividing by 4 means every tweet is simply labelled 0 or 1. This is not strictly necessary, but it helps avoid confusion.
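As a quick illustration of the relabelling, here is a minimal sketch using a toy array in place of the real target.npy:

```python
import numpy as np

# Toy stand-in for the raw target array: the original dataset
# labels negative tweets as 0 and positive tweets as 4.
raw_target = np.array([0, 4, 4, 0, 4])

# Dividing by 4 maps the labels to 0 (negative) and 1 (positive).
target = raw_target / 4

print(target)             # [0. 1. 1. 0. 1.]
print(np.unique(target))  # [0. 1.]
```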
Next, let’s divide our input and output variables into training and test sets so we can evaluate our model on unseen test data. Here I’m splitting the data into 75% training data and 25% test data by setting test_size to 0.25. The random_state acts as a seed so the same results can be reproduced.
# Splitting input and output variables into training and test sets
X = sent_input
y = target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)
Using the training data, let’s train the LDA model.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
Once we’ve fitted the model to the training data, we can make predictions on the unseen test data using the predict function.
y_pred = lda.predict(X_test)
Since we trained the classification model on outputs of either 0 or 1, our predictions will take the same form. The variable y_pred now stores a value of either 0 or 1 for each of the y_test tweets.
2. Model Overview
Before proceeding to make predictions on the unseen covid tweets, it is important to look at the results and accuracy of the model. Let’s start off by printing its accuracy score and classification report, which presents scores such as precision, recall, and F1-score.
print(metrics.accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
This particular model has an accuracy of 0.694, which is not perfect, but also not particularly bad: around 7 out of 10 predictions on the test set were accurate. The confusion matrix helps us understand how many true & false positives and true & false negatives occurred.
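To see how these metrics relate, here is a minimal sketch on hand-made labels (not the actual tweet data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made true and predicted labels (0 = negative, 1 = positive)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat  = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Accuracy: fraction of predictions that match the true label
print(accuracy_score(y_true, y_hat))  # 0.75

# Confusion matrix rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_true, y_hat))
```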
print(confusion_matrix(y_test, y_pred))
3. Predicting the sentiment of covid tweets
Now that we have trained our model on the labelled sentiment data, we can apply it to our unseen covid data which was prepared in part 1 of this series as well.
covid_data = np.load('vectorized_100.npy')
To run the predictions we simply have to run the following line of code.
covid_predictions = lda.predict(covid_data)
All our model’s predictions are now stored in covid_predictions. Let’s create a data frame so we can attach it to our non-vectorized covid data for further visualizations later on (part 4).
covid_predicted_values = pd.DataFrame({'Predicted Values': covid_predictions.flatten()})
Below is a snapshot of what this will look like. It can be seen that, for example, the first 5 tweets in our data set were predicted to be negative rather than positive. The last tweet, tweet number 8980, on the other hand seems to be a positive one.

We can, however, get more information on the sentiment of each tweet by using the built-in predict_proba function of sklearn’s LDA. This gives us the probability of each sample belonging to each of the two classes. For example, a tweet that was predicted to be negative could be only 55% likely to be negative, or a more certain 90%. The predict_proba function determines that for us.
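As a small self-contained illustration of what predict_proba returns, here is a sketch with random data standing in for the tweet vectors:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
# 100 toy samples with 5 features, labelled 0 or 1
X_toy = rng.randn(100, 5)
y_toy = rng.randint(0, 2, size=100)

lda_toy = LinearDiscriminantAnalysis().fit(X_toy, y_toy)

# One row per sample, one column per class; each row sums to 1
proba = lda_toy.predict_proba(X_toy[:3])
print(proba.shape)        # (3, 2)
print(proba.sum(axis=1))  # [1. 1. 1.]
```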
covid_predictions_probability = lda.predict_proba(covid_data)
Before adding it to our initial dataframe of all the covid data, let’s create arrays of the negative and positive probabilities for each tweet.
covid_negative_probability = []
covid_positive_probability = []
for i in covid_predictions_probability:
    covid_positive_probability.append(i[1])
    covid_negative_probability.append(i[0])
Let’s import the cleaned, non-vectorized covid-19 dataframe to then link all our prediction data to each corresponding tweet.
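Since predict_proba returns a 2-D NumPy array, the same two arrays can also be obtained with column slicing, which avoids the explicit loop. A sketch, shown here on a toy probability array:

```python
import numpy as np

# Toy stand-in for covid_predictions_probability:
# column 0 holds the negative probability, column 1 the positive one.
probs = np.array([[0.59, 0.41],
                  [0.20, 0.80],
                  [0.95, 0.05]])

# Slice out each class column in one step
covid_negative_probability = probs[:, 0]
covid_positive_probability = probs[:, 1]

print(covid_positive_probability)  # the positive-class column
```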
data_country_random = pd.read_csv('data_country_random.csv')
Now we can attach the predictions as new columns.
data_country_random['predictions'] = covid_predicted_values
data_country_random['positive_probability'] = covid_positive_probability
data_country_random['negative_probability'] = covid_negative_probability
Finally, let’s print the head of the covid dataframe.
data_country_random.head()
Below is a snippet of its head (note: not all columns fit in the screenshot, but the entire table can be found in the github repo).

The last three columns are our new prediction data.
- predictions: shows our model’s prediction of the tweet’s sentiment: 0 for negative and 1 for positive.
- positive_probability: shows the probability of this tweet being positive. The first row for example has a 0.4 probability of being positive.
- negative_probability: similar to positive_probability, this shows the probability of the tweet being negative. The first row entry has a 0.59 probability of being negative, which is why the model decided that it is a negative tweet.
Looking at the positive and negative probability of each tweet shows that the decision between positive and negative is often not clear-cut, and that we should take each prediction with a grain of salt.
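One way to act on this uncertainty is to flag tweets where the model is not very confident, e.g. where the higher of the two class probabilities stays below some threshold. A minimal sketch; the 0.6 threshold and the toy dataframe are assumptions, not part of the original pipeline:

```python
import pandas as pd

# Toy dataframe mimicking the prediction columns built above
df = pd.DataFrame({
    'text': ['tweet a', 'tweet b', 'tweet c'],
    'positive_probability': [0.41, 0.80, 0.55],
    'negative_probability': [0.59, 0.20, 0.45],
})

# A prediction counts as "uncertain" if neither class reaches 0.6
confidence = df[['positive_probability', 'negative_probability']].max(axis=1)
uncertain = df[confidence < 0.6]

print(uncertain['text'].tolist())  # ['tweet a', 'tweet c']
```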
Finally we can inspect some positive and negative tweets to see whether the model accurately classified them.
# printing out first 50 positive tweets
# and the probability of how likely it is positive vs negative
data_country_random[['text','positive_probability','negative_probability']].loc[data_country_random['predictions'] == 1][:50]

# printing out first 50 negative tweets
# and the probability of how likely it is positive vs negative
data_country_random[['text','positive_probability','negative_probability']].loc[data_country_random['predictions'] == 0][:50]

Now that the covid predictions are stored in the dataframe, let’s export it for further analysis and visualizations in part 4.
data_country_random.to_csv("covid_predictions_with_proba.csv")
Conclusion
In conclusion, we took the clean dataset of sentiment-labelled (positive and negative) tweet data, trained an LDA classifier on it, and achieved a classification accuracy of 0.694. Once the model was trained, we used it to predict the sentiment of our unseen but cleaned covid tweets. We not only predicted the classification, but also stored the model’s predicted probability that any given tweet is positive or negative. All this information was added to our overall covid tweet dataframe, which we then saved to be visualized in part 4 of this series.



