Machine Learning Tutorial in R: From Data Cleaning to Classification (Using Iris Dataset)#

This notebook demonstrates a complete machine learning workflow in R—from data cleaning to model evaluation—using the Iris dataset.

Step 1: Install the Required Libraries/Packages#

Before we begin working on machine learning tasks in R, we need to install and load several essential packages:

These commands install the following libraries:

  • tidyverse: A collection of R packages designed for data science. It includes tools like ggplot2, dplyr, and readr for data manipulation and visualization.

  • caret: Short for Classification And Regression Training, this package provides a unified interface for building and evaluating machine learning models.

  • randomForest: Implements the Random Forest algorithm, a popular ensemble method for classification and regression tasks.

  • corrplot: A visualization tool to display correlation matrices in a clear and informative way.

install.packages("tidyverse")
install.packages("caret")
install.packages("randomForest")
install.packages("corrplot")

Step 2: Load Libraries#

Once installed, we can load the libraries. This makes the functions from each package available in our R environment so we can use them in the rest of our analysis.

library(tidyverse)
library(caret)
library(randomForest)
library(corrplot)

Step 3: Load Iris Dataset, a built-in dataset in R#

This section begins by loading the built-in Iris dataset into a variable called df. The Iris dataset is a classic dataset in machine learning, containing measurements of sepal length, sepal width, petal length, and petal width for 150 iris flowers across three species. To get an initial sense of the data, the head(df) function is used to display the first six rows, providing a quick preview of the dataset’s structure and values. Similarly, tail(df, n = 10) shows the last ten rows, offering a look at how the dataset ends and helping to identify any potential issues or patterns in the data.

# Load Iris Dataset, a built-in dataset in R
df <- iris

# Display the first 6 rows of a data frame df
head(df)

# Display the last 10 rows of df
tail(df, n = 10)
A data.frame: 6 x 5
    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
    <dbl>         <dbl>        <dbl>         <dbl>        <fct>
1   5.1           3.5          1.4           0.2          setosa
2   4.9           3.0          1.4           0.2          setosa
3   4.7           3.2          1.3           0.2          setosa
4   4.6           3.1          1.5           0.2          setosa
5   5.0           3.6          1.4           0.2          setosa
6   5.4           3.9          1.7           0.4          setosa
A data.frame: 10 x 5
      Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
      <dbl>         <dbl>        <dbl>         <dbl>        <fct>
141   6.7           3.1          5.6           2.4          virginica
142   6.9           3.1          5.1           2.3          virginica
143   5.8           2.7          5.1           1.9          virginica
144   6.8           3.2          5.9           2.3          virginica
145   6.7           3.3          5.7           2.5          virginica
146   6.7           3.0          5.2           2.3          virginica
147   6.3           2.5          5.0           1.9          virginica
148   6.5           3.0          5.2           2.0          virginica
149   6.2           3.4          5.4           2.3          virginica
150   5.9           3.0          5.1           1.8          virginica
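
As an extra check (not shown in the original output), the dataset's size and the per-species counts mentioned above can be confirmed directly:

# Confirm 150 rows, 5 columns, and 50 flowers per species
dim(df)
table(df$Species)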

Step 4: Exploratory Data Analysis#

This block of code provides essential information about the structure and contents of the dataset. The str(df) function reveals the internal structure of the data frame, including the number of rows and columns, the column names, and the data type of each column—helpful for understanding how the data is organized. Next, summary(df) generates descriptive statistics (such as mean, median, min, max, and quartiles) for all numerical features, offering a quick overview of the distribution and range of the values. Finally, to make the dataset more intuitive for machine learning tasks, the fifth column (originally named “Species”) is renamed to "target" using colnames(df)[5] <- "target", clearly designating it as the variable we aim to predict.

#find the number of rows and columns, column names, and datatype of columns
str(df)

# find the summary statistics of all the numerical features
summary(df)

# Rename the target column for consistency; column 5 ("Species") is renamed to "target"
colnames(df)[5] <- "target"
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
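summary(df) pools all 150 rows together. If you also want per-species statistics, a short dplyr sketch (using the tidyverse loaded in Step 2 and the renamed target column) could look like this:

# Mean of each numerical feature, grouped by species (illustrative, not in the original notebook)
df %>%
  group_by(target) %>%
  summarise(across(Sepal.Length:Petal.Width, mean))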

Step 5: Data Cleaning#

This section focuses on exploring and visualizing the dataset to better understand its structure before building a machine learning model. First, anyNA(df) checks for any missing values in the dataset—although the Iris dataset is clean, it’s good practice to confirm this before proceeding. Next, a bar chart is created using ggplot() to visualize the distribution of the target variable (i.e., the flower species). This helps assess class balance, which is important for model training. Lastly, a heatmap of correlations is generated using corrplot(), which calculates pairwise correlations between the four numerical features (sepal and petal measurements) and visually represents their relationships. In a heatmap, brighter or darker colors typically indicate stronger positive or negative correlations, helping you quickly identify which features move together or in opposite directions.

# The Iris dataset has no missing values, but it is good practice to check anyway
anyNA(df)

# Plot the count of each class in the target variable (y)
ggplot(df, aes(x=target)) + geom_bar(fill='steelblue') + theme_minimal() +
  ggtitle("Target Class Distribution")

# Plot a heatmap of the pairwise correlations between the four numerical features (X)
corrplot(cor(df[, 1:4]), method="color", type="upper", tl.cex=0.8)
FALSE
[Figures: bar chart "Target Class Distribution" and correlation heatmap of the four numerical features]
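
The same information can also be inspected numerically rather than graphically; a quick sketch:

# Class counts of the target variable and the rounded correlation matrix of the four features
table(df$target)
round(cor(df[, 1:4]), 2)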

Step 6: Split the Data#

This section prepares the data for machine learning by splitting it into training and testing sets. Using set.seed(42) ensures that the random split is reproducible, meaning you’ll get the same results each time you run the code. The createDataPartition() function from the caret package is used to randomly divide the dataset, while preserving the proportion of each class in the target variable. In this example, 80% of the data is used for training (train) and 20% for testing (test). However, other common split ratios include 70/30 or 75/25, depending on the size of the dataset and the need for training versus evaluation. A larger training set (like 80%) gives the model more data to learn from, while a larger test set (like 30%) provides a more robust evaluation. The ideal split often depends on the problem context and the amount of available data.

set.seed(42) # set a random seed to replicate the results
trainIndex <- createDataPartition(df$target, p = .8, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ] # whatever is left after assigning to "train"
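
To confirm that createDataPartition() preserved the class proportions, and to see how another ratio would be requested, a short sketch (the 70/30 line is illustrative only):

# Each species should make up roughly one third of both train and test
prop.table(table(train$target))
prop.table(table(test$target))

# A 70/30 split only requires changing the proportion argument
# trainIndex70 <- createDataPartition(df$target, p = 0.7, list = FALSE)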

Step 7: Feature Scaling#

This section focuses on feature scaling, an important preprocessing step in many machine learning workflows. Since the numerical features in the Iris dataset have different value ranges—for instance, Petal.Width ranges up to 2.5 while Sepal.Length goes up to 7.9—scaling is used to standardize the values. Using the preProcess() function from the caret package with the "center" and "scale" methods, the features in the training data are transformed so that they have a mean of 0 and a standard deviation of 1. This ensures that no single feature disproportionately influences the model due to its magnitude. The scaling model learned from the training data is then applied to both the training and testing features using predict(). Note that only the input features (columns 1 to 4) are scaled—the target variable (target) remains unchanged.

# Numerical features may be of different magnitudes (for example, the max value of "Petal.Width" is 2.5 whereas the max value of "Sepal.Length" is 7.9).
# Hence, we standardize the features (center to mean 0, scale to standard deviation 1),
# so that features with large numerical values do not dominate the model. Only the feature values are scaled, NOT the target values.
preproc <- preProcess(train[, 1:4], method = c("center", "scale"))
train_scaled <- predict(preproc, train[, 1:4])
test_scaled <- predict(preproc, test[, 1:4])
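
A quick sanity check (not in the original notebook) that the standardization behaved as described, i.e. each training feature now has mean approximately 0 and standard deviation approximately 1:

# Means should be ~0 and standard deviations ~1 on the scaled training features
round(colMeans(train_scaled), 3)
round(apply(train_scaled, 2, sd), 3)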

Step 8: Train Classifier#

This line of code demonstrates how to train a machine learning model—in this case, a Random Forest classifier—using the preprocessed training data. The randomForest() function takes the scaled feature values (train_scaled) as input (x) and the corresponding class labels (train$target) as output (y). Random Forest is a powerful ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting. While this example uses Random Forest, other popular classification algorithms such as Decision Tree, Logistic Regression, Naive Bayes, and XGBoost can also be used depending on the problem and dataset characteristics.

model <- randomForest(x = train_scaled, y = train$target)
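
Because caret offers a unified interface, swapping in one of the other algorithms mentioned above is mostly a one-line change. As an illustrative sketch (assuming the rpart package is installed), a single decision tree could be trained like this:

# Hypothetical alternative model: a decision tree via caret's train() interface
tree_model <- train(x = train_scaled, y = train$target, method = "rpart")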

Step 9: Evaluate Model#

This code uses the trained Random Forest model to make predictions on the scaled test dataset. The predict() function generates predicted class labels (pred) for each example in the test set based on the learned patterns from the training data. Then, confusionMatrix() compares these predictions to the true target values (test$target) to evaluate the model’s performance. The confusion matrix provides detailed insight into how well the model classified each class, showing counts of true positives, false positives, true negatives, and false negatives, which help assess accuracy, precision, recall, and other important metrics.

pred <- predict(model, newdata = test_scaled)
confusionMatrix(pred, test$target)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0          9         1
  virginica       0          1         9

Overall Statistics
                                          
               Accuracy : 0.9333          
                 95% CI : (0.7793, 0.9918)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 8.747e-12       
                                          
                  Kappa : 0.9             
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9000           0.9000
Specificity                 1.0000            0.9500           0.9500
Pos Pred Value              1.0000            0.9000           0.9000
Neg Pred Value              1.0000            0.9500           0.9500
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3000           0.3000
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.9250           0.9250
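
The object returned by confusionMatrix() can also be queried programmatically; for example, a short sketch pulling out the overall accuracy and the per-class sensitivity and specificity:

# Store the result and extract individual metrics
cm <- confusionMatrix(pred, test$target)
cm$overall["Accuracy"]
cm$byClass[, c("Sensitivity", "Specificity")]

# A simple manual check of accuracy should give the same value
mean(pred == test$target)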

Summary of Output#

The model achieved an overall accuracy of approximately 93%, indicating strong performance in correctly classifying the Iris species. Sensitivity and specificity values are high across all classes, showing the model effectively identifies each species with few misclassifications. The balanced accuracy values near or above 90% for all classes further confirm that the model performs consistently well across the different species.