Naïve Bayes Algorithm in Machine Learning Explained with an example

Welcome to this blog, which will introduce you to the Naive Bayes algorithm in machine learning. Let’s begin.

The Bayes Theorem: A Quick Introduction

The Bayes theorem is a mathematical formula that calculates the probability of the occurrence of an event based on certain conditions. It is expressed as:

P(X|Y) = P(Y|X) × P(X) / P(Y)

In this formula:

  • P(X|Y) is the probability that event X will occur, given that event Y has occurred
  • P(Y|X) is the probability that event Y will occur, given that event X has occurred
  • P(X) is the prior probability of event X
  • P(Y) is the probability of event Y

The Bayes theorem is often used in machine learning to classify data, because it allows us to calculate the probability that an input data point belongs to a particular class, given a set of features.
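
To make the formula concrete, here is a minimal Python sketch with made-up spam-filtering numbers (the probabilities are illustrative, not taken from any real dataset):

# made-up numbers: 1% of emails are spam (P(X)), 80% of spam emails
# contain the word "offer" (P(Y|X)), and 10% of all emails contain it (P(Y))
p_x = 0.01          # P(X): prior probability of spam
p_y_given_x = 0.80  # P(Y|X): probability of "offer" given spam
p_y = 0.10          # P(Y): probability of "offer" in any email

# Bayes' theorem: P(X|Y) = P(Y|X) * P(X) / P(Y)
p_x_given_y = p_y_given_x * p_x / p_y
print(f"P(spam | 'offer') = {p_x_given_y:.2f}")  # prints 0.08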

The Step-Wise Bayes Theorem

Let us consider a situation to understand this theorem better. Suppose we have a dataset of students, with two features: their height and weight. Now, we want to classify these students as either “short” or “tall,” based on their height.

Here are the steps we would follow to use the Bayes theorem to classify the students:

  1. Calculate the probability of each class (i.e., “short” or “tall”). For example, if 50% of the students are classified as “short,” then the probability of the class being “short” is 0.5.
  2. For each class, calculate the probability of each feature (i.e., height and weight). For example, if 75% of the “short” students have a height of less than 5 feet, then the probability of the feature “height” for the class “short” is 0.75.
  3. To get the probability of each class given the features, multiply the probabilities from steps 1 and 2. For example, the probability of the class being “short” given a height of less than 5 feet and a weight of less than 150 pounds would be:

P(short | height < 5 ft, weight < 150 lb) ∝ P(short) × P(height < 5 ft | short) × P(weight < 150 lb | short)

  4. Finally, choose the class with the highest probability as the prediction (a worked sketch follows below).
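
Here is a minimal Python sketch of these four steps, using made-up class priors and feature probabilities (the numbers are illustrative only):

# Step 1: class priors
p_short = 0.5
p_tall = 0.5

# Step 2: per-class feature probabilities (illustrative values)
p_height_lt5_given_short = 0.75
p_weight_lt150_given_short = 0.80
p_height_lt5_given_tall = 0.10
p_weight_lt150_given_tall = 0.30

# Step 3: multiply the prior by the feature probabilities; treating
# height and weight as independent within each class is the "naive" part
score_short = p_short * p_height_lt5_given_short * p_weight_lt150_given_short
score_tall = p_tall * p_height_lt5_given_tall * p_weight_lt150_given_tall

# Step 4: pick the class with the highest score
prediction = "short" if score_short > score_tall else "tall"
print(prediction, score_short, score_tall)  # short 0.3 0.015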

Using Naive Bayes as an Algorithm

The Naive Bayes algorithm is a practical way of classifying data based on Bayes’ theorem. Unlike most classification algorithms, Naive Bayes assumes that the features are independent of one another when determining the likelihood of an event. Although this assumption rarely holds exactly, the algorithm is computationally cheap and proves very effective in practice, helping solve real-world data classification problems.

To use the Naive Bayes algorithm, we first need to identify the features and target variables. The target variable is the one we aim to predict, while the features are the variables that we will use to make the prediction. We should then divide the data into a training set and a test set.

After this, the first step is to encode the target variable if it is categorical. This involves converting the categorical values into numerical values that the model can understand.

Next, we need to convert the categorical features into dummy data. This involves creating a new column for each unique category in the feature. Then, set the value to 1 if the category is present in the data point, and 0 otherwise.

Once the data has been cleaned and prepared, the model can be trained using the training data. The Gaussian variant of the Naive Bayes algorithm assumes a normal distribution for each feature, so we need to make sure that the data roughly fit this assumption. If the data is not normally distributed, we may need to use techniques such as standardization or normalization to transform it.
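
For example, here is a minimal sketch of standardizing the features with scikit-learn’s StandardScaler before training (the X_train and X_test variable names assume the train/test split described above):

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training features only, then apply the same
# transformation to the test features to avoid information leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)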

Now you can use the test data to make predictions and evaluate the model; this tells you the accuracy of the model.

Naive Bayes Applications in Industry

Some applications of the Naive Bayes algorithm include:

  • Identifying spam emails by analysing the words and phrases used and the sender’s history
  • Classifying texts, like news articles or reviews, into predefined categories
  • Recommending items to users based on their past behaviour and the behaviour of other users with similar interests

Types of Naive Bayes Algorithm

There are several types of Naive Bayes classifier, each with its own characteristics and uses.

  • Gaussian Naive Bayes is used when the features are continuous and are assumed to follow a normal distribution, for example, classifying iris flowers based on their petal and sepal measurements.
  • Multinomial Naive Bayes is used when you want to predict the probability that an email contains certain words by looking at the frequency with which those words appear in your training data. It is useful when the features in your data are discrete counts.
  • The Bernoulli Naive Bayes method is a simple way to determine the likelihood of an outcome when the features are binary, that is, they can take on only two values. It’s often used in spam filtering, where the features represent the presence or absence of certain words in an email (a sketch of all three variants follows this list).
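
As a rough sketch of how the three variants are instantiated in scikit-learn (the toy arrays below are illustrative, not from any real dataset):

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])  # toy class labels

# continuous features -> Gaussian Naive Bayes
X_cont = np.array([[1.2, 3.4], [1.0, 3.1], [5.6, 0.2], [5.9, 0.5]])
print(GaussianNB().fit(X_cont, y).predict([[1.1, 3.2]]))

# count features (e.g. word frequencies) -> Multinomial Naive Bayes
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 5]])
print(MultinomialNB().fit(X_counts, y).predict([[2, 0]]))

# binary presence/absence features -> Bernoulli Naive Bayes
X_bin = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(BernoulliNB().fit(X_bin, y).predict([[1, 0]]))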

Benefits of Naive Bayes

The Naive Bayes algorithm in machine learning has several benefits over other methods:

  • It is relatively simple to implement and train, so it is easy for non-specialists to use.
  • It performs well on unstructured data such as text or images.
  • It works well with large datasets because its output is based on probabilities rather than absolute values.
  • It does not require prior knowledge about how to make a decision, because it is based purely on observed outcomes rather than rules or other parameters.

Limitations of Naive Bayes

The Naive Bayes algorithm has many benefits, but it also has some limitations.

  • Assumption of feature independence: As mentioned earlier, this algorithm assumes that all features in the dataset are independent of each other. This assumption is often not true in real-world data, which can lead to less accurate predictions.
  • Sensitivity to missing data: Naive Bayes is sensitive to missing data because it relies on the frequencies of feature values in the training data. If a feature value is missing from a data point, standard implementations cannot use it; and if a feature value never appears with a class in the training data, its estimated probability is zero, which can zero out the entire prediction for that class (see the smoothing sketch after this list).
  • Limited to classification tasks: The Naive Bayes algorithm is limited to classification tasks and cannot be used for regression or clustering tasks.
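
As a hedged sketch of the usual mitigation for the zero-probability issue, scikit-learn’s MultinomialNB applies additive (Laplace) smoothing through its alpha parameter; the toy counts below are illustrative, not from any real dataset:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# word-count features; the word in column 1 never occurs in class 0
X = np.array([[3, 0], [4, 0], [1, 2], [0, 3]])
y = np.array([0, 0, 1, 1])

# alpha=1.0 is Laplace smoothing: every count is incremented by one,
# so no feature probability is estimated as exactly zero
model = MultinomialNB(alpha=1.0)
model.fit(X, y)
print(model.predict_proba([[2, 1]]))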

Naive Bayes Demo Using Python

Let’s see the Naive Bayes algorithm in action. We will build a model using Python, the scikit-learn library, and some basic statistical knowledge.

First, we will start by loading the necessary libraries and the dataset. All of the steps below have already been performed in Colab with the Iris dataset; you can check it here.

Analyzing a Given Dataset

Before we build a model, we need to understand the characteristics of the dataset we are working with. This includes understanding the types of variables in the dataset, the distribution of the data, and any missing values.

We can use the pandas library in Python to load and explore the dataset:

import pandas as pd

# load the dataset
data = pd.read_csv('sample_data.csv')

# view the first few rows of the dataset
print(data.head())

# get summary statistics for the dataset
print(data.describe())

# check for missing values
print(data.isnull().sum())

Checking the Distribution of the Target Variable

The target variable is the variable we want to predict in our machine-learning model. It is important to understand the distribution of the target variable, as this can affect the performance of our model.

We can use the seaborn library in Python to visualize the distribution of the target variable:

import seaborn as sns

# visualize the distribution of the target variable
sns.countplot(x='target', data=data)

Encoding the Target Variable

If the target variable is categorical, we need to encode it as numerical values that the model can understand.

We can use the sklearn library in Python to encode the target variable:

from sklearn.preprocessing import LabelEncoder

# encode the target variable
le = LabelEncoder()
data['target'] = le.fit_transform(data['target'])

Converting Categorical Features to Dummy Data

If the dataset contains categorical features, we need to convert them to numerical values that the model can understand. One way to do this is to create dummy variables for each unique category in the feature.

We can use the pandas library in Python to create dummy variables:

# create dummy variables for the categorical feature
data = pd.get_dummies(data, columns=['categorical_feature'])

Splitting the Data into Train and Test Sets

To build our model, we first need to divide the data into a training set and a test set. The model will be trained on the training set and evaluated on the test set.

We can use the sklearn library in Python to split the data:

from sklearn.model_selection import train_test_split

# split the data into features (predictors) and labels
predictors = data.iloc[:, [0, 3]].values
labels = data.iloc[:, 5].values

# hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(predictors, labels, test_size=0.2, random_state=25)

Assuming a Normal Distribution and Training the Model

Since Gaussian Naive Bayes assumes normally distributed features, we create and train a Gaussian model:

from sklearn.naive_bayes import GaussianNB

# create a Gaussian Naive Bayes model
model = GaussianNB()

# train the model on the training set
model.fit(X_train, y_train)

Calculating the Accuracy Score of the Model

Next, to calculate the accuracy of the model, you need to compare the predicted values with the actual values in the test set. This accuracy tells you how well the model performs. We can use the accuracy_score() function from the sklearn.metrics module to do this:

from sklearn.metrics import accuracy_score

# predict on the test set
y_pred = model.predict(X_test)

# calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Here, y_test holds the actual values in the test set, and y_pred holds the values predicted by the model. The accuracy_score() function returns a score between 0 and 1, where a score of 1 indicates perfect accuracy and a score of 0 indicates that no predictions were correct.
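
The conclusion below also mentions looking at the distribution of predictions with a confusion matrix; here is a minimal sketch using scikit-learn, reusing the y_test and y_pred arrays from above:

from sklearn.metrics import confusion_matrix

# rows are actual classes, columns are predicted classes;
# off-diagonal entries count misclassifications
print(confusion_matrix(y_test, y_pred))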

That’s it! You have seen how to train a Gaussian Naive Bayes model under the normality assumption and how to calculate the accuracy score of a machine learning model in Python.

Conclusion

In conclusion, I have shown you how to apply the Naive Bayes algorithm in machine learning. We used the Iris dataset as an example and built a classifier that assigns instances to one of three flower categories. We achieved an accuracy of 100 per cent on the Iris dataset, and we also examined the distribution of predictions with a confusion matrix.

FAQs

What is Naive Bayes in machine learning for example?

Naive Bayes is a machine learning algorithm that calculates the probability of an event by multiplying the prior probability of the event by the likelihood of the observed evidence given that event. For example, a spam filter multiplies the prior probability that an email is spam by the probability of the email’s words given that it is spam.

What is the Naive Bayes learning algorithm?

Naive Bayes is a machine learning algorithm that uses probability to classify data. For example, Naive Bayes can be used to classify emails as spam or not spam.

How does Naive Bayes work in machine learning?

It learns the probability of certain features occurring in each class from a training set of data, and uses those probabilities to predict the class of new data points.

How would you explain Naive Bayes?

The algorithm is called “naive” because it assumes that the features in a dataset are independent of one another, although this is not always true in practice.
