Sentiment Analysis Project: Document & Showcase
Introduction:
The purpose of our sentiment analysis project is to develop a machine learning model that can accurately classify the sentiment expressed in text data. Sentiment analysis, also known as opinion mining, is a subfield of natural language processing that aims to determine the emotional tone behind a piece of text, whether it is positive, negative, or neutral. Sentiment analysis has various applications, including social media monitoring, customer feedback analysis, brand reputation management, and market research.
For this project, we utilized a publicly available dataset of movie reviews (the IMDb review dataset). The dataset consists of a collection of reviews along with their corresponding sentiment labels (positive or negative). It contains a total of 50,000 reviews, evenly distributed between positive and negative sentiments.
- Project goal: Developing a sentiment analysis model to classify movie reviews as positive or negative.
- Scope: Using a dataset of movie reviews and focusing on binary sentiment classification.
1. Gathering and preparing the data:
- We’ll use the IMDb movie review dataset, which contains 50,000 movie reviews labeled as positive or negative. We can find the dataset on the IMDb website or on the platform Kaggle.
- After downloading the dataset, we will need to extract it. We can do this using a command-line tool like tar or unzip. Once the dataset is extracted, we will see a folder called aclImdb. This folder contains two subfolders: train and test. The train folder contains the training data, and the test folder contains the test data.
- The next step is to load the data into a data science library like Pandas. Pandas is a Python library that makes it easy to work with data. We can use Pandas to read the data from the aclImdb folder and create a DataFrame.
To create a DataFrame, we can use the below steps:
- Import the necessary libraries: open the Python script or notebook and import the pandas library using import pandas as pd.
- Extract the dataset: use a command-line tool like tar or unzip to extract the dataset if we haven’t already done so. Ensure that the folder aclImdb is present in our working directory.
- Load the data into a DataFrame: use the glob module in Python to retrieve the paths of all the text files inside the pos and neg folders within the test and train folders. This can be achieved using the glob.glob() function. For example:
import glob
pos_files = glob.glob('aclImdb/test/pos/*.txt') + glob.glob('aclImdb/train/pos/*.txt')
neg_files = glob.glob('aclImdb/test/neg/*.txt') + glob.glob('aclImdb/train/neg/*.txt')
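As a quick sanity check (an optional step, not required for the rest of the pipeline), we can confirm that the expected number of files was found before reading them:

# Each of the four folders should contain 12,500 reviews, giving 25,000 per class
print(len(pos_files))  # expected: 25000
print(len(neg_files))  # expected: 25000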
- Create empty lists to store the reviews and labels:
reviews = []
labels = []
- Read the contents of each file, append the review to the reviews list, and assign the corresponding label (‘positive’ or ‘negative’) to the labels list. For example:
for file in pos_files:
    with open(file, 'r', encoding='utf-8') as f:
        review = f.read()
        reviews.append(review)
        labels.append('positive')

for file in neg_files:
    with open(file, 'r', encoding='utf-8') as f:
        review = f.read()
        reviews.append(review)
        labels.append('negative')
- Create a dictionary with ‘review’ and ‘label’ as keys and the corresponding lists as values:
data = {'review': reviews, 'label': labels}
- Create the DataFrame using the pd.DataFrame() function:
df = pd.DataFrame(data)
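As an optional check (not part of the original steps), we can verify the shape of the DataFrame and peek at a single review before exploring it further:

# The DataFrame should have 50,000 rows and two columns: 'review' and 'label'
print(df.shape)
# Print the first 200 characters of the first review as a quick spot check
print(df['review'].iloc[0][:200])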
- Explore the DataFrame:
- To view the first few rows of the DataFrame, use the head() method:
df.head()
- We can also calculate summary statistics using the describe() method:
df.describe()
- Also, we can check the distribution of positive and negative reviews using the value_counts() method:
df['label'].value_counts()
Output:
positive    25000
negative    25000
Name: label, dtype: int64
2. Split the data into training and testing sets
To accomplish this, we can utilize the train_test_split function from the scikit-learn library. This function will randomly shuffle the data and split it into the desired proportions. Here’s how we can perform the split:
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(reviews, labels, test_size=0.2, random_state=42)
In the above code, reviews and labels represent our data and corresponding sentiment labels, respectively. The test_size parameter is set to 0.2, indicating that 20% of the data will be used for testing, while 80% will be used for training. We can adjust this value based on our preference.
The train_test_split function returns four sets of data: X_train, X_test, y_train, and y_test. X_train and y_train represent the training data and labels, respectively, while X_test and y_test represent the testing data and labels.
We can now use X_train and y_train to train our sentiment analysis model, and X_test and y_test to evaluate its performance.
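Because the dataset is evenly balanced, a plain random split already works well, but if we want to guarantee that the 50/50 class ratio is preserved in both sets, we could pass the stratify parameter. This is an optional variation, not part of the split used above:

from sklearn.model_selection import train_test_split

# Stratified split: keeps the proportion of 'positive' and 'negative' labels
# the same in the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, random_state=42, stratify=labels
)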
3. Preprocess the text data
Preprocessing the text data is an important step in sentiment analysis. It involves transforming the raw text into a format that can be understood by the machine learning algorithm.
- Install NLTK: If we haven’t installed NLTK, we can do so by running the following command:
!pip install nltk
Output:
Requirement already satisfied: nltk in c:\users\mathl\anaconda3\lib\site-packages (3.7)
Requirement already satisfied: tqdm in c:\users\mathl\anaconda3\lib\site-packages (from nltk) (4.64.1)
Requirement already satisfied: click in c:\users\mathl\anaconda3\lib\site-packages (from nltk) (8.0.4)
Requirement already satisfied: joblib in c:\users\mathl\anaconda3\lib\site-packages (from nltk) (1.1.1)
Requirement already satisfied: regex>=2021.8.3 in c:\users\mathl\anaconda3\lib\site-packages (from nltk) (2022.7.9)
Requirement already satisfied: colorama in c:\users\mathl\anaconda3\lib\site-packages (from click->nltk) (0.4.6)
- Import NLTK and download resources: Import the NLTK library and download the required resources by running the following code:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
Output:
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mathl\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mathl\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mathl\AppData\Roaming\nltk_data...
True
- Download the “omw-1.4” resource: run the following code:
import nltk
nltk.download('omw-1.4')
Output:
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\mathl\AppData\Roaming\nltk_data...
True
- Perform text preprocessing: Once NLTK is installed and the resources are downloaded, we can use various NLTK functions to preprocess the text data. Here’s an example of how we can perform common preprocessing steps:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Define punctuation and stopwords
punctuation = string.punctuation
stopwords = set(stopwords.words('english'))
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()
# Preprocess function
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove punctuation and convert to lowercase
    tokens = [token.lower() for token in tokens if token not in punctuation]
    # Remove stopwords
    tokens = [token for token in tokens if token not in stopwords]
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    # Join tokens back into a single string
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
# Apply preprocessing to our data
preprocessed_reviews = [preprocess_text(review) for review in X_train]
Output:
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mathl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mathl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mathl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
In the above code, we import the necessary modules from NLTK, including the tokenizer, stopwords, and lemmatizer. We define the punctuation set and stopwords set for English. Then, we initialize the WordNetLemmatizer for lemmatization.
The preprocess_text function takes a text input, tokenizes it into words, removes punctuation, converts the words to lowercase, removes stopwords, lemmatizes the words, and joins them back into a single string. Finally, we apply the preprocessing function to each review in the training data (X_train) to obtain the preprocessed text (preprocessed_reviews).
This preprocessing step helps in reducing noise and focusing on important features of the text, improving the performance of the sentiment analysis model.
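To see what the preprocessing actually does, we can run the function on a short made-up review (the sentence below is purely illustrative):

sample = "This movie was absolutely wonderful, I loved it!"
print(preprocess_text(sample))
# Expected output (roughly): "movie absolutely wonderful loved"
# Stopwords ("this", "was", "i", "it") and punctuation are removed,
# and the remaining tokens are lowercased and lemmatized.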
4. Vectorize the text data
Machine learning algorithms typically require numeric input. In order to use the text data as input for the model, we need to convert it into numerical vectors.
We will follow the process of vectorizing the preprocessed text data using scikit-learn’s CountVectorizer and TfidfVectorizer. These are popular methods for converting text data into numerical feature vectors. Let’s start with the CountVectorizer approach:
- Import the required libraries:
from sklearn.feature_extraction.text import CountVectorizer
- Initialize the CountVectorizer object:
vectorizer = CountVectorizer()
- Fit and transform the preprocessed reviews to create the feature vectors:
X_train_vectorized = vectorizer.fit_transform(preprocessed_reviews)
Here, X_train_vectorized will be a sparse matrix that represents the feature vectors of the preprocessed reviews.
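To get a feel for what this sparse matrix contains, we can inspect its shape and a few of the learned vocabulary terms. The get_feature_names_out() call assumes a recent scikit-learn version (older releases use get_feature_names()):

# Number of reviews x number of unique terms in the vocabulary
print(X_train_vectorized.shape)

# A few of the terms the vectorizer learned from the training reviews
print(vectorizer.get_feature_names_out()[:10])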
Now, let’s move on to the TfidfVectorizer approach. It’s similar to CountVectorizer, but it also takes into account the inverse document frequency (IDF) of each word. IDF assigns higher weights to words that are more informative and less frequent across the entire corpus.
- Import the required library:
from sklearn.feature_extraction.text import TfidfVectorizer
- Initialize the TfidfVectorizer object:
vectorizer = TfidfVectorizer()
- Fit and transform the preprocessed reviews to create the feature vectors:
X_train_vectorized = vectorizer.fit_transform(preprocessed_reviews)
Again, X_train_vectorized will be a sparse matrix that represents the feature vectors of the preprocessed reviews.
Both CountVectorizer and TfidfVectorizer offer various parameters to customize the vectorization process. We can explore the scikit-learn documentation for these classes to learn more about the available options.
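As an example of such customization (the specific values below are illustrative settings to experiment with, not the settings used in this project), we could limit the vocabulary size, include bigrams, and drop very rare words:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=20000,   # keep only the 20,000 most frequent terms
    ngram_range=(1, 2),   # include single words and two-word phrases
    min_df=2              # ignore terms that appear in only one review
)
X_train_vectorized = vectorizer.fit_transform(preprocessed_reviews)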
5. Train a sentiment analysis model
We now move on to training a sentiment analysis model using a suitable machine learning algorithm. We’ll use scikit-learn’s implementation of the chosen algorithm, which provides convenient functions for training and prediction. Let’s go step by step:
- Choose a Machine Learning Algorithm:
- Naive Bayes: It’s a simple and effective algorithm for text classification tasks.
- Logistic Regression: It’s a widely used algorithm that works well for binary classification tasks.
- Support Vector Machines (SVM): It’s a powerful algorithm that can handle complex classification tasks.
For this example, let’s use Naive Bayes as our sentiment analysis algorithm.
- Import the required libraries:
from sklearn.naive_bayes import MultinomialNB
- Initialize the Naive Bayes classifier:
classifier = MultinomialNB()
- Train the classifier using the vectorized features and corresponding labels:
classifier.fit(X_train_vectorized, y_train)
Here, X_train_vectorized represents the vectorized feature matrix, and y_train represents the corresponding labels (sentiments) for the training data.
- Predict the sentiment for new or unseen data:
X_test_preprocessed = [preprocess_text(text) for text in X_test]
X_test_vectorized = vectorizer.transform(X_test_preprocessed)
predictions = classifier.predict(X_test_vectorized)
Here, X_test represents the new or unseen data, and X_test_vectorized represents the vectorized feature matrix for the test data. preprocess_text() is the same function we defined earlier for preprocessing the text.
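For instance, to classify a single brand-new review (the text below is made up for illustration), we preprocess it, vectorize it with the same fitted vectorizer, and call predict():

new_review = "The plot was dull and the acting was even worse."
new_vector = vectorizer.transform([preprocess_text(new_review)])
print(classifier.predict(new_vector))  # e.g. ['negative']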
- Evaluate the model’s performance: We can use various evaluation metrics such as accuracy, precision, recall, and F1-score to assess the model’s performance. Here’s an example of calculating the accuracy:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
Here, y_test represents the true labels (sentiments) for the test data.
Remember to preprocess and vectorize the test data in the same way as the training data to ensure consistency in the feature representation.
Feel free to explore other algorithms like Logistic Regression or Support Vector Machines (SVM). The steps for training and prediction will be similar, with only the algorithm-specific code changing, as sketched below.
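For example, a minimal sketch of the Logistic Regression variant (everything else, including preprocessing and vectorization, stays the same) could look like this:

from sklearn.linear_model import LogisticRegression

# max_iter is raised because the default (100) may not converge on large sparse text data
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_vectorized, y_train)
predictions = classifier.predict(X_test_vectorized)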
6. Evaluate the model
Once the model is trained, we need to evaluate its performance on the testing data to assess how well it generalizes to unseen examples.
Now, we will start the process of evaluating the trained sentiment analysis model using common evaluation metrics. We’ll use scikit-learn’s functions to compute these metrics.
Let’s proceed step by step:
- Import the required libraries:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
- Predict the sentiment for the test data:
predictions = classifier.predict(X_test_vectorized)
Here, X_test_vectorized represents the vectorized feature matrix for the test data, and classifier is the trained sentiment analysis model.
- Compute evaluation metrics:
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')
f1 = f1_score(y_test, predictions, average='weighted')
Here, y_test represents the true labels (sentiments) for the test data.
The average='weighted' parameter calculates metrics for each label and finds their weighted average. For a binary task like ours we could also use average='binary', in which case we need to tell scikit-learn which class counts as positive (for example, pos_label='positive'), since our labels are strings.
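For example, to compute the precision for the positive class alone, a minimal sketch (not part of the evaluation reported below) would be:

# 'pos_label' must match one of our string labels when using average='binary'
precision_positive = precision_score(y_test, predictions, average='binary', pos_label='positive')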
- Print the evaluation results:
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
This will display the computed values for each evaluation metric as shown below:
Final Output:
Accuracy: 0.8633
Precision: 0.8637983603899607
Recall: 0.8633
F1-score: 0.8632660961375841
Analysing the Final Output:
The output above shows how the sentiment analysis model performed on the test data.
- Accuracy: The accuracy of 0.8633 indicates that the model correctly predicted the sentiment for approximately 86.33% of the test reviews.
- Precision: The precision score of about 0.8638 is the weighted average precision, which measures how many of the reviews predicted as a given sentiment actually carry that sentiment.
- Recall: The recall score of 0.8633 is the weighted average recall, which measures how many of the reviews of a given sentiment the model actually identified as such.
- F1-score: The F1-score of about 0.8633 combines precision and recall (as their harmonic mean, weighted across classes), providing a single metric to assess the model’s overall performance.
These evaluation metrics provide insights into the model’s performance and can help us understand how well it generalizes to unseen examples. In this case, the model seems to have achieved reasonably good performance, but further analysis and comparison with other models or baselines can provide a more comprehensive understanding.
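As one example of such further analysis (an optional extension, not part of the original evaluation), we could look at the confusion matrix and the per-class report to see whether errors are concentrated in one class:

from sklearn.metrics import confusion_matrix, classification_report

# Rows are true labels, columns are predicted labels (in the order given by 'labels')
print(confusion_matrix(y_test, predictions, labels=['positive', 'negative']))
print(classification_report(y_test, predictions))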
Findings and Conclusion:
Through our sentiment analysis project, we successfully developed a machine learning model that can accurately classify the sentiment expressed in movie reviews. The model achieved an accuracy of 86.33%, indicating its effectiveness in sentiment classification.
We discovered that preprocessing the text data by removing noise, performing tokenization, stop word removal, and lemmatization significantly improved the performance of the sentiment analysis model.
The Multinomial Naive Bayes classifier, coupled with the Bag-of-Words representation, demonstrated excellent performance in sentiment analysis tasks. It was selected for its simplicity, interpretability, and ability to handle discrete features like word counts.
The project’s findings reveal the strengths of our sentiment analysis model, including its ability to accurately classify sentiment in customer reviews. However, some limitations exist, such as the challenge of handling misspelled words and noisy text. To overcome these limitations, future improvements could include incorporating more advanced preprocessing techniques, such as spell-checking algorithms, and exploring other machine learning algorithms like Support Vector Machines or deep learning models.
Therefore, our sentiment analysis project provides valuable insights into understanding customer sentiment and can be utilized in various domains, such as customer feedback analysis, brand reputation management, and market research.