Visualizing Statistical Relationships

Seaborn is a powerful Python library built on top of matplotlib that simplifies the creation of statistical graphics. With its high-level, dataset-oriented API and seamless integration with pandas, Seaborn allows users to quickly explore relationships in data, estimate trends, and visualize distributions with minimal code. It offers smart defaults and flexible customization, making it ideal for both exploratory analysis and polished visualizations.

Seaborn tutorial (https://seaborn.pydata.org/tutorial/relational.html)

Overview of seaborn plotting functions (https://seaborn.pydata.org/tutorial/function_overview.html)

Installing and Importing Packages#

# Install the necessary packages
%pip install numpy pandas matplotlib seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.__version__

Relating Variables with Scatter Plot#

A scatter plot is a fundamental way to visualize the relationship between two continuous variables. Each point on the plot represents an observation, positioned according to its values on the x- and y-axes. Scatter plots help reveal correlations, clusters, or outliers in data, showing trends or patterns that may exist between variables.

In Seaborn, functions like sns.scatterplot() or sns.relplot() are commonly used to create scatter plots, often enhanced by additional semantic variables such as hue (color), size, or style to represent more dimensions of data.

This code begins by setting the overall visual style of Seaborn plots to "darkgrid", which adds a grid to the background to make the plot easier to read. It then loads Seaborn’s built-in tips dataset, which contains information about restaurant bills and tips. Finally, it creates a scatter plot using sns.lmplot() to visualize the relationship between total_bill and tip. Setting fit_reg=False disables the default linear regression line, allowing us to focus solely on the raw data points.

sns.set(style="darkgrid") #style = dict, None, or one of {darkgrid, whitegrid, dark, white, ticks} --- style of axes

tips = sns.load_dataset("tips") #Load an example dataset from the online repository (requires internet)
sns.lmplot(x="total_bill", y="tip", fit_reg=False, data=tips); # the trailing semicolon (;) keeps the notebook from printing the returned object's text representation

../../_images/967113a2b2551ab4544f89439b509bc59b907edfff50dee887ef3377ba80d29f.png

This code sets a clean white background using sns.set(style="white"), then creates a scatter plot showing the relationship between total_bill and tip using Seaborn’s lmplot(). Another dimension is added to the plot through the hue semantic—coloring points based on the smoker variable—so we can visually compare tipping behavior between smokers and non-smokers. The regression line is turned off with fit_reg=False, so the plot displays only the raw data points. By default, Seaborn sets the figure size, but you can manually adjust it using the height and aspect parameters. You can also change the size of the data points using the s parameter in other Seaborn functions like scatterplot().

sns.set(style="white")
sns.lmplot(x="total_bill", y="tip", hue="smoker", fit_reg=False, data=tips);

../../_images/68c88decfd515f6716307f20c968f5cd6a5767ddb6c02af1c8349dff675dc567.png
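As a small illustration of the s parameter mentioned above, the sketch below draws a similar scatter plot with sns.scatterplot() and a fixed marker area. The toy DataFrame is invented for this example (it only borrows column names from tips) so that the block runs without downloading data.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data; only the column names mirror the tips dataset
toy = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.4],
    "tip": [1.5, 3.5, 2.0, 5.0, 3.0, 4.2],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes"],
})

fig, ax = plt.subplots(figsize=(5, 4))
# s is forwarded to matplotlib's scatter and sets the marker area (points squared)
sns.scatterplot(data=toy, x="total_bill", y="tip", hue="smoker", s=80, ax=ax)

# Read the marker areas back from the drawn points
marker_areas = ax.collections[0].get_sizes()
```

Because every point shares the same area here, s is a purely cosmetic setting; it does not encode any variable.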

This code calls Matplotlib’s plt.figure(figsize=(8, 10)) before sns.lmplot() in an attempt to set the figure size. It does not have the intended effect: lmplot() is a figure-level function that creates its own figure, so the figure opened by plt.figure() stays empty, which is why the output reports a figure with 0 Axes. To resize an lmplot(), use its height and aspect parameters instead; plt.figure(figsize=...) only helps with axes-level functions such as scatterplot(). The plot itself is the same scatter of total_bill vs. tip, with data points colored by the smoker category using the hue semantic. To better highlight group differences and improve accessibility, you can also use different marker styles for each class by adding the markers parameter (e.g., markers=["o", "s"]).

from matplotlib import pyplot as plt

plt.figure(figsize=(8,10))
sns.lmplot(x="total_bill", y="tip", hue="smoker", fit_reg=False, data=tips);

<Figure size 800x1000 with 0 Axes>
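A minimal sketch of sizing a figure-level plot through lmplot()'s own height and aspect parameters, combined with the markers parameter. The toy DataFrame is made up for illustration so the block is self-contained; with the real data you would pass data=tips.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data; only the column names mirror the tips dataset
toy = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.4, 12.9, 27.8],
    "tip": [1.5, 3.5, 2.0, 5.0, 3.0, 4.2, 2.2, 4.0],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"],
})

g = sns.lmplot(
    data=toy, x="total_bill", y="tip", hue="smoker",
    markers=["o", "s"],  # one marker per hue level
    fit_reg=False,       # scatter only, no regression line
    height=4,            # height of the facet in inches
    aspect=1.5,          # facet width = height * aspect
)

# lmplot returns a FacetGrid; its figure can be inspected directly
fig_width, fig_height = g.figure.get_size_inches()
```

The legend is placed outside the axes, so the final figure can end up somewhat wider than height * aspect.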

This code uses sns.relplot() to create a scatter plot showing the relationship between total_bill and tip, while simultaneously representing two additional variables. The hue="smoker" parameter colors the points based on smoking status, and style="time" changes the marker shape depending on whether the meal was during lunch or dinner. This allows the plot to display four variables at once. However, this should be done carefully—while color is visually striking and easy to interpret, the human eye is less sensitive to differences in shape, so subtle distinctions may be harder to spot.

sns.relplot(x="total_bill", y="tip", hue="smoker", style="time", data=tips);

../../_images/f83d008f5ee89bd8d3a68a142e528a0b8b4feeedbbeb3cd8433e75cffa6a6ea9.png

In the examples below, sns.scatterplot() is used to plot total_bill vs. tip, with point colors (hue) based on the size of the dining party. Since size is a numeric variable, Seaborn automatically uses a sequential color palette, where lighter and darker colors represent smaller and larger values. This is different from when hue is categorical, like sex or smoker, in which case Seaborn uses a qualitative palette by default.

The second plot shows how to manually change the color palette using the palette parameter. Here, the diverging "coolwarm" palette is used, which makes differences in group size more visually distinct. Adjusting color palettes can help improve clarity, especially when working with numeric hue semantics.

sns.scatterplot(x="total_bill", y="tip", hue="size", data=tips);

#changing the color palette
sns.scatterplot(x="total_bill", y="tip", hue="size", palette="coolwarm", data=tips);

../../_images/37b3adaff257327cd787c138a363092c45bec13ceb2ccda2a9632bd80037955b.png
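Because scatterplot() is an axes-level function, two calls in the same cell draw onto the same Axes. One way to compare the two palettes without one plot covering the other is to give each call its own Axes through the ax parameter. The sketch below does this with invented toy data (only the column names are borrowed from tips).

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data; "size" plays the role of the party size
toy = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.4],
    "tip": [1.5, 3.5, 2.0, 5.0, 3.0, 4.2],
    "size": [1, 2, 3, 4, 6, 5],
})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Default palette for a numeric hue
sns.scatterplot(data=toy, x="total_bill", y="tip", hue="size", ax=ax_left)
ax_left.set_title("default palette")

# Same data with an explicitly chosen palette
sns.scatterplot(data=toy, x="total_bill", y="tip", hue="size",
                palette="coolwarm", ax=ax_right)
ax_right.set_title('palette="coolwarm"')
```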

This code uses sns.relplot() to create a scatter plot where the size of each point is determined by the size variable, adding a third semantic dimension to the plot. Instead of using the raw numeric values directly as point sizes, Seaborn normalizes the range of size values into an area range for the plot markers.

In the second example, the sizes=(15, 200) parameter customizes this area range, specifying that points will have sizes scaled between 15 and 200 (in area units), which can help make differences in data more visually distinct. This semantic mapping of size is a powerful way to encode numeric information in scatter plots.

sns.relplot(x="total_bill", y="tip", size="size", data=tips);

#the literal value of the variable is not used to pick the area of the point. Instead, the range of values in data units is
#normalized into a range in area units. This range can be customized:
sns.relplot(x="total_bill", y="tip", size="size", sizes=(15, 200), data=tips);

../../_images/880f6068d84e6c2621d4dc22af8d72586dcd186fa353bab5f0c973034f150aba.png

../../_images/63ff7cf38995ad61da23c7bb6f8f402cb690969905eab1c598038ed45284ddd3.png
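To make the normalization concrete, the sketch below maps a numeric column to marker areas with sizes=(15, 200) and reads the resulting areas back from the plot. The toy data is invented for illustration, and the "expected" column assumes the mapping is a simple linear rescaling of the data range onto the area range.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data; "size" plays the role of the party size
toy = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.3, 30.2, 18.7, 25.4],
    "tip": [1.5, 3.5, 2.0, 5.0, 3.0, 4.2],
    "size": [1, 2, 3, 4, 6, 5],
})

fig, ax = plt.subplots(figsize=(5, 4))
sns.scatterplot(data=toy, x="total_bill", y="tip",
                size="size", sizes=(15, 200), ax=ax)

# Marker areas actually used for the points, in row order
areas = ax.collections[0].get_sizes()

# Linear rescaling of the data range [min, max] onto the area range [15, 200]
lo, hi = 15, 200
span = toy["size"].max() - toy["size"].min()
expected = lo + (toy["size"] - toy["size"].min()) / span * (hi - lo)

comparison = pd.DataFrame({"size": toy["size"], "area": areas, "expected": expected})
print(comparison)
```

The smallest party size lands on the lower bound of the area range and the largest on the upper bound, with the values in between spaced proportionally.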

This plot combines multiple semantic variables using sns.relplot(). Both the color (hue) and the size of the points represent the size variable, giving two visual cues to its values. The color palette "BuGn_r" (a reversed blue-green sequential palette) is used to emphasize this numeric variable, while sizes=(30, 200) controls the range of marker areas. Setting legend="full" gives every distinct value of size its own legend entry instead of a sample of evenly spaced values, making it easier to read off how party size relates to total_bill and tip.

sns.relplot(x="total_bill", y="tip", hue="size", size="size", palette="BuGn_r",
            sizes=(30, 200), legend="full", data=tips)

<seaborn.axisgrid.FacetGrid at 0x7f894b562590>

../../_images/e740763db53d80e25585db9211e97feb78b30268ef8612777cac07b1575879dd.png

Pairwise plots#

Pairwise plots are a convenient way to visualize relationships between multiple variables at once. Using Seaborn’s pairplot(), you can create a grid of scatter plots showing every pairwise combination of numerical variables in a dataset, along with histograms or KDE plots on the diagonal to display each variable’s distribution. This helps quickly spot correlations, patterns, or clusters across variables.

FacetGrid#

Facet Grids are powerful tools for visualizing data subsets across different categories. They allow you to split your dataset into multiple smaller plots (facets), arranged by one or more categorical variables. A FacetGrid can organize these plots by rows, columns, and additionally use hue for color coding, effectively creating up to three dimensions of categorization.

Each facet shows the distribution or relationships for the subset of data corresponding to a specific category or combination of categories. This helps reveal patterns or differences that might be hidden in the overall dataset.

This code creates a FacetGrid using the tips dataset, splitting the data into separate plots based on the time variable (e.g., lunch or dinner). For each subset, it maps a histogram of the tip variable, allowing you to compare the distribution of tips between different meal times side-by-side. The function plt.hist from Matplotlib is used to generate these histograms—it takes the tip data for each facet and plots its frequency distribution as bars. This visual separation helps identify differences in tipping behavior depending on the time of day.

#sns.set(style="white")
g = sns.FacetGrid(tips, col="time")
g.map(plt.hist, "tip");

../../_images/a8d242f335c88eed103308e4e3f4284054afe8488953bbb559397aea514374d5.png

This code creates a FacetGrid that divides the tips dataset into separate plots based on the time variable (col="time"), producing one plot for lunch and one for dinner. Within each plot, points are colored by the smoker status using the hue semantic, with the "husl" color palette for clear distinction. It uses plt.scatter to map a scatter plot of total_bill versus tip on each facet, with some transparency (alpha=.7) and white edges around points for better visibility. The add_legend() call adds a legend explaining the smoker categories. This visualization helps compare tipping behavior across meal times and smoking status simultaneously.

g = sns.FacetGrid(tips, col="time", hue="smoker", palette="husl")
g.map(plt.scatter, "total_bill", "tip", alpha=.7, edgecolor="w")
g.add_legend();

../../_images/e1c080fda0a0af21fa99a0dcc1da4f0c25dab11db3edb16fd4da194af10d7a16.png

This line loads the famous Iris dataset using Seaborn’s built-in dataset loader. The Iris dataset contains measurements of sepal length, sepal width, petal length, and petal width for three species of iris flowers, making it a classic example for practicing data visualization and classification tasks.

iris = sns.load_dataset("iris")

This code outputs the column names of the iris DataFrame, showing the variables available in the dataset—such as measurements of flower parts and the species label.

iris.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

This line creates a pair plot of the iris dataset, plotting pairwise combinations of its numerical features. The hue="sepal_length" argument colors the data points by their sepal length values; because sepal_length is a continuous variable, Seaborn maps it to a sequential color palette rather than to discrete categories.

sns.pairplot(iris, hue="sepal_length", height=2.5);

../../_images/5c88f1159eb393fbf3c6e6e262adfd75ca2f756ec1ccd01f9129991be244d485.png

PairGrid#

A PairGrid is a flexible way to create a grid of subplots that shows pairwise relationships in a dataset. Unlike pairplot(), which is a quick, high-level function, PairGrid offers more control over how plots are drawn on the diagonal, upper, and lower parts of the grid. You can customize each section with different kinds of plots (e.g., histograms, scatter plots, KDEs) using .map(), .map_diag(), and .map_upper()/.map_lower().

This makes PairGrid especially useful when you want to tailor the appearance or behavior of pairwise plots beyond the defaults of pairplot().

This code creates a PairGrid using the iris dataset, with points colored by the petal_length variable. Although petal_length is continuous, PairGrid treats each distinct value as its own hue level when the plotting function (here plt.scatter) has no hue parameter, so every unique petal length receives one color from the "husl" palette, whose hues are evenly spaced around the color wheel. The g.map(plt.scatter, edgecolor="w") line maps a scatter plot to every pairwise combination of the grid's numeric variables, using white edges around the points to improve contrast and visibility. This customized PairGrid gives you fine control over how each subplot is rendered, compared to the simpler pairplot() function.

g = sns.PairGrid(iris, hue="petal_length", palette="husl")
g.map(plt.scatter, edgecolor="w");

../../_images/17a52a11e97cb4a1876cc424163641c9122d52b445480c801e01fa7190805ab2.png

Heatmaps#

A heatmap is a two-dimensional graphical representation of data where the individual values in a matrix or DataFrame are represented as colors. Each cell in the grid corresponds to a value, and its color indicates the magnitude—typically using hue, intensity, or both. This makes heatmaps especially useful for visualizing patterns, clusters, correlations, or missing data in a dataset.

The variation in color provides quick visual cues to help the reader understand how values are distributed or grouped, making complex data more interpretable at a glance.

This code creates a heatmap to visualize airline passenger counts over time using Seaborn’s built-in flights dataset. First, the data is reshaped using .pivot() so that months form the rows, years form the columns, and passenger numbers fill the matrix values.

The sns.heatmap(flights) call generates a color-coded grid where each cell’s color represents the number of passengers for a given month and year. This makes it easy to spot seasonal trends and changes over time. The plot is displayed inside a Matplotlib figure sized at 8 by 6 inches.

You can customize the color scheme with the cmap parameter (e.g., cmap="rocket_r" for a reversed rocket palette), and control tick label orientation with ax.tick_params(axis='x', labelrotation=90).

# Heatmap with default seaborn settings
plt.figure(figsize=(8,6))

flights = sns.load_dataset("flights")
flights = flights.pivot(index="month", columns="year", values="passengers")
ax = sns.heatmap(flights) # Returns ax, a matplotlib axes
# ax  = sns.heatmap(flights, cmap="rocket_r") # Reversed color palette

#ax.tick_params(axis = 'x', labelrotation=90)

../../_images/7592cdbb902bf29e43ac73b68456c138db75144e4984c93fbcb16ee2ba4a13b5.png
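The reshaping step is easier to follow on a tiny table. The sketch below pivots an invented long-form table (the years and passenger counts are made up for illustration) into the month-by-year matrix that sns.heatmap() expects. Turning month into an ordered categorical is an assumption added here so that the rows stay in calendar order rather than alphabetical order.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical long-form data: one row per (year, month) observation
long_form = pd.DataFrame({
    "year": [2001, 2001, 2001, 2002, 2002, 2002],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "passengers": [10, 12, 15, 11, 14, 18],
})

# Keep calendar order instead of alphabetical order for the row labels
long_form["month"] = pd.Categorical(
    long_form["month"], categories=["Jan", "Feb", "Mar"], ordered=True
)

# Wide form: months as rows, years as columns, passenger counts as cell values
wide = long_form.pivot(index="month", columns="year", values="passengers")
print(wide)

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(wide, ax=ax)  # one colored cell per (month, year) pair
```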

This code creates a more detailed heatmap using the flights dataset. The color palette is set to "YlGnBu" (yellow-green-blue), which is suitable for representing increasing values. The linewidths=.5 argument adds thin lines between the cells to improve readability.

The annot=True parameter displays the actual passenger counts in each cell, and fmt="d" formats those numbers as integers. Together, these additions make it easier to interpret exact values while still benefiting from the overall color-based visualization.

# Changing color palette, adding space between each square, adding values etc
plt.figure(figsize=(8,6))
sns.heatmap(flights, cmap="YlGnBu", linewidths=.5, annot=True, fmt="d");

../../_images/7d8cbe1f6351f7a3cc32a581c69edbb8eb6b9ed34ccdf91232fe08441f95b947.png

Boxplots#

A boxplot (or box-and-whisker plot) is a standardized way of displaying the distribution of a dataset based on five summary statistics: the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. It also highlights outliers—points that fall significantly outside the typical range.

Boxplots are useful for comparing distributions across different categories. In Seaborn, you can create them easily with sns.boxplot(). The central box spans the interquartile range (IQR = Q3 − Q1), the line inside the box is the median, and the “whiskers” extend to the lowest and highest observations that lie within 1.5×IQR of the box edges (Q1 and Q3). Any points beyond that are plotted individually as outliers.
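The arithmetic behind these elements can be sketched with NumPy on a small made-up sample. This uses NumPy's default (linear interpolation) percentile method; a plotting library may compute quartiles slightly differently, so treat the numbers as an illustration of the rule rather than a guarantee of what a given plot shows.

```python
import numpy as np

# Hypothetical sample with one unusually large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)         # 3.25 5.5 7.75 4.5
print(lower_fence, upper_fence)    # -3.5 14.5
print(whisker_low, whisker_high)   # 1 9
print(outliers)                    # [30]
```

Note that the upper whisker stops at 9, the largest observation inside the fence, not at the fence value 14.5 itself.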

This code loads the tips dataset and creates a boxplot of the total_bill variable. The boxplot summarizes the distribution of total bills by displaying the median, quartiles, and potential outliers. This visualization helps quickly understand the spread and skewness of the billing amounts in the dataset.

# Load the tips dataset (returned as a pandas DataFrame)
tips = sns.load_dataset("tips")
ax = sns.boxplot(x=tips["total_bill"])

../../_images/661a5bae40daecd8db402fb08980b66fdb7a44ce7ec250da0a80dd08df29742f.png

This code creates a boxplot showing the distribution of total_bill amounts for each day recorded in the tips dataset (Thursday through Sunday). By grouping the data by the day variable on the x-axis, it allows for easy comparison of billing patterns across different days. Each box displays the median, quartiles, and potential outliers for the total bill amounts on that day.

ax = sns.boxplot(x="day", y="total_bill", data=tips)

../../_images/abe38f61da1b9b07740830d074a39a2e963d861fc426eb6c7d8e814eb43b492e.png

This code creates a grouped boxplot using the tips dataset, showing the distribution of total_bill amounts for each day in the dataset (x="day"). The hue="smoker" parameter adds a second layer of grouping, splitting each day’s data into smokers and non-smokers, with different colors assigned by the "Set3" palette. This visualization allows you to compare how the total bill varies not only by day but also between smokers and non-smokers, revealing differences in spending patterns across these categories.

ax = sns.boxplot(x="day", y="total_bill", hue="smoker", data=tips, palette="Set3")

../../_images/790c4bc1de82ca0310bfece7bec4d845eee13138f4d37f9690716ebde4870bde.png