Data Cleaning Basics#
Step 1: Reading Data#
Data can be read from various file formats such as CSV, JSON, and Excel files.
This code demonstrates how to load and customize a dataset using the Pandas library. It shows how to read a full CSV file into a DataFrame or, alternatively, import only specific columns using the usecols parameter to focus on relevant fields such as year, leading cause, and number of deaths. Another option illustrated is removing unnecessary columns after loading with the drop() method to streamline the data for analysis.
import pandas as pd
import numpy as np
df = pd.read_csv("NYC_Leading_Death_Causes.csv")
# If we want to load only certain columns of the dataset --- example: Year, Leading Cause and Deaths
#new_df = pd.read_csv('NYC_Leading_Death_Causes.csv', usecols = ['Year','Leading Cause', 'Deaths'])
# OR
# pandas drop columns using list of column names
#new_df = df.drop(['Sex','Death Rate', 'Age Adjusted Death Rate'], axis=1)
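For reference, the JSON and Excel formats mentioned above load in much the same way; a minimal sketch (the filenames here are hypothetical placeholders):
# read a JSON file into a DataFrame (hypothetical filename)
#df_json = pd.read_json("some_data.json")
# read an Excel file into a DataFrame (hypothetical filename; needs an Excel engine such as openpyxl)
#df_xlsx = pd.read_excel("some_data.xlsx")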
Step 2: Exploring Data#
The describe() function is used to generate summary statistics for all numerical columns in the dataset. It provides useful metrics such as count, mean, standard deviation, minimum and maximum values, as well as the 25th, 50th (median), and 75th percentiles. This summary offers a quick overview of the distribution, central tendency, and variability of the numerical data, helping identify patterns or potential outliers.
# to get summary statistics of data(all numerical columns ONLY)
df.describe()
 | Year |
---|---|
count | 1096.000000 |
mean | 2010.474453 |
std | 2.292191 |
min | 2007.000000 |
25% | 2008.000000 |
50% | 2010.000000 |
75% | 2012.000000 |
max | 2014.000000 |
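If a summary of the non-numerical columns is also wanted, describe() accepts an include argument; a minimal sketch:
# include="all" adds count, unique, top and freq statistics for the object (text) columns as well
df.describe(include="all")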
The shape attribute is used to find the number of rows and columns in a DataFrame. It returns a tuple where the first value represents the number of rows and the second value represents the number of columns. This provides a quick way to understand the overall size and structure of the dataset.
# find the number of rows and columns of data frame
df.shape
(1096, 7)
By default, head() returns the top 5 rows, but specifying a number in parentheses lets you control how many rows are shown. So head(10) displays the first 10 rows of the DataFrame, allowing you to quickly view a sample of the dataset. This is helpful for getting an initial look at the data’s structure, column names, and sample values.
# print top 10 rows of the dataframe. "head()" will print top 5 rows
df.head(10)
 | Year | Leading Cause | Sex | Race Ethnicity | Deaths | Death Rate | Age Adjusted Death Rate |
---|---|---|---|---|---|---|---|
0 | 2010 | Influenza (Flu) and Pneumonia (J09-J18) | F | Hispanic | 228 | 18.7 | 23.1 |
1 | 2008 | Accidents Except Drug Posioning (V01-X39, X43,... | F | Hispanic | 68 | 5.8 | 6.6 |
2 | 2013 | Accidents Except Drug Posioning (V01-X39, X43,... | M | White Non-Hispanic | 271 | 20.1 | 17.9 |
3 | 2010 | Cerebrovascular Disease (Stroke: I60-I69) | M | Hispanic | 140 | 12.3 | 21.4 |
4 | 2009 | Assault (Homicide: Y87.1, X85-Y09) | M | Black Non-Hispanic | 255 | 30 | 30 |
5 | 2012 | Mental and Behavioral Disorders due to Acciden... | F | Other Race/ Ethnicity | . | . | . |
6 | 2012 | Cerebrovascular Disease (Stroke: I60-I69) | F | Asian and Pacific Islander | 102 | 17.5 | 20.7 |
7 | 2009 | Essential Hypertension and Renal Diseases (I10... | M | Asian and Pacific Islander | 26 | 5.1 | 7.2 |
8 | 2010 | All Other Causes | F | White Non-Hispanic | 2140 | 149.7 | 93.9 |
9 | 2009 | Alzheimer's Disease (G30) | F | Other Race/ Ethnicity | . | . | . |
By default, tail() shows the last 5 rows, but you can specify any number inside the parentheses to customize how many rows you want to see. So tail(12) displays the last 12 rows of the DataFrame, giving you a look at the most recent or bottom entries in the dataset. This is useful for reviewing the end of the dataset, checking for missing data, or understanding how the data concludes.
# display the last 12 rows; tail() by itself displays the last 5 rows
df.tail(12)
 | Year | Leading Cause | Sex | Race Ethnicity | Deaths | Death Rate | Age Adjusted Death Rate |
---|---|---|---|---|---|---|---|
1084 | 2008 | Nephritis, Nephrotic Syndrome and Nephrisis (N... | F | Asian and Pacific Islander | 13 | 2.4 | 3.3 |
1085 | 2014 | Chronic Lower Respiratory Diseases (J40-J47) | F | Not Stated/Unknown | 12 | . | . |
1086 | 2007 | Human Immunodeficiency Virus Disease (HIV: B20... | F | Black Non-Hispanic | 248 | 23.6 | 22.1 |
1087 | 2010 | Atherosclerosis (I70) | F | Not Stated/Unknown | . | . | . |
1088 | 2014 | All Other Causes | F | Black Non-Hispanic | 1536 | 146.4 | 126.4 |
1089 | 2009 | Nephritis, Nephrotic Syndrome and Nephrisis (N... | F | Black Non-Hispanic | 75 | 7.2 | 6.7 |
1090 | 2008 | Chronic Liver Disease and Cirrhosis (K70, K73) | F | Other Race/ Ethnicity | . | . | . |
1091 | 2012 | Influenza (Flu) and Pneumonia (J09-J18) | F | Not Stated/Unknown | 6 | . | . |
1092 | 2014 | Accidents Except Drug Posioning (V01-X39, X43,... | F | White Non-Hispanic | 169 | 11.9 | 7.4 |
1093 | 2009 | Malignant Neoplasms (Cancer: C00-C97) | M | White Non-Hispanic | 3236 | 240.5 | 205.6 |
1094 | 2009 | Intentional Self-Harm (Suicide: X60-X84, Y87.0) | M | White Non-Hispanic | 191 | 14.2 | 13 |
1095 | 2013 | Essential Hypertension and Renal Diseases (I10... | M | Black Non-Hispanic | 148 | 17.2 | 20.9 |
The columns attribute is used to display all the column names in a DataFrame. It returns an Index object containing the names of each column, which helps you understand the structure and available variables in the dataset. This is especially useful when working with large datasets or when you want to reference specific columns for further analysis.
# to display all column names
df.columns
Index(['Year', 'Leading Cause', 'Sex', 'Race Ethnicity', 'Deaths',
'Death Rate', 'Age Adjusted Death Rate'],
dtype='object')
The rename() function is used to change column names in a DataFrame. In the first example, the column 'Sex' is renamed to 'Gender', and the inplace=True argument ensures the change is applied directly to the existing DataFrame without needing to assign the result to a new variable. The second (commented-out) example shows how to rename multiple columns at once by passing a dictionary where each key is the current column name and each value is the new name. This is especially helpful for making column names more descriptive or easier to work with (e.g., removing spaces or standardizing naming conventions). After renaming, df.columns lets you confirm the changes.
#Rename one column
df.rename(columns = {'Sex':'Gender'}, inplace = True)
df.columns
# Rename multiple columns
#df.rename(columns = {'Leading Cause':'Leading_Cause', 'Race Ethnicity':'Race_Ethnicity',
#'Death Rate':'Death_Rate', 'Age Adjusted Death Rate': 'Age_Adjusted_Death_Rate'}, inplace = True)
Index(['Year', 'Leading Cause', 'Gender', 'Race Ethnicity', 'Deaths',
'Death Rate', 'Age Adjusted Death Rate'],
dtype='object')
The count() function is used to find the number of non-null (non-missing) entries in each column of the DataFrame. It returns a count for every column, helping you identify whether there are any missing values. This is useful for quickly checking data completeness and ensuring that columns contain the expected number of entries.
# find the count of non-null entries in each column
df.count()
Year 1096
Leading Cause 1096
Gender 1096
Race Ethnicity 1096
Deaths 1096
Death Rate 1096
Age Adjusted Death Rate 1096
dtype: int64
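Here every column shows 1096 non-null entries, so no values are missing yet. A common companion check counts the missing entries directly; a minimal sketch:
# count the number of missing (NaN) entries in each column
df.isnull().sum()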
The iloc[:, :] indexer is used to select all rows and all columns of a DataFrame by integer position. The colon before the comma selects all rows, and the colon after the comma selects all columns. This is helpful when you want to view the entire dataset explicitly, although in practice large DataFrames may be truncated in the output display for performance reasons.
# print ALL rows and ALL columns by index
df.iloc[:,:]
 | Year | Leading Cause | Gender | Race Ethnicity | Deaths | Death Rate | Age Adjusted Death Rate |
---|---|---|---|---|---|---|---|
0 | 2010 | Influenza (Flu) and Pneumonia (J09-J18) | F | Hispanic | 228 | 18.7 | 23.1 |
1 | 2008 | Accidents Except Drug Posioning (V01-X39, X43,... | F | Hispanic | 68 | 5.8 | 6.6 |
2 | 2013 | Accidents Except Drug Posioning (V01-X39, X43,... | M | White Non-Hispanic | 271 | 20.1 | 17.9 |
3 | 2010 | Cerebrovascular Disease (Stroke: I60-I69) | M | Hispanic | 140 | 12.3 | 21.4 |
4 | 2009 | Assault (Homicide: Y87.1, X85-Y09) | M | Black Non-Hispanic | 255 | 30 | 30 |
... | ... | ... | ... | ... | ... | ... | ... |
1091 | 2012 | Influenza (Flu) and Pneumonia (J09-J18) | F | Not Stated/Unknown | 6 | . | . |
1092 | 2014 | Accidents Except Drug Posioning (V01-X39, X43,... | F | White Non-Hispanic | 169 | 11.9 | 7.4 |
1093 | 2009 | Malignant Neoplasms (Cancer: C00-C97) | M | White Non-Hispanic | 3236 | 240.5 | 205.6 |
1094 | 2009 | Intentional Self-Harm (Suicide: X60-X84, Y87.0) | M | White Non-Hispanic | 191 | 14.2 | 13 |
1095 | 2013 | Essential Hypertension and Renal Diseases (I10... | M | Black Non-Hispanic | 148 | 17.2 | 20.9 |
1096 rows × 7 columns
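If you genuinely need every row printed instead of the truncated view, the pandas display options can be adjusted; a sketch (use with care on large datasets):
# raise the display limit so no rows are elided when printing
#pd.set_option("display.max_rows", None)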
This code uses iloc to access a specific subset of the DataFrame by integer position. It selects rows from index 100 to 124 (which corresponds to the 101st through 125th rows, since indexing starts at 0) and columns from index 4 to 6 (the 5th, 6th, and 7th columns). This allows you to extract and work with a precise slice of the data based on row and column positions rather than labels.
# Access the 101st - 125th rows and the 5th, 6th and 7th columns by index. Both row and column index start from 0
df.iloc[100:125,4:7]
 | Deaths | Death Rate | Age Adjusted Death Rate |
---|---|---|---|
100 | 1940 | 226.8 | 280.7 |
101 | 3366 | 234.5 | 160.5 |
102 | 563 | 39.7 | 20.4 |
103 | . | . | . |
104 | 180 | 12.7 | 6.7 |
105 | 7 | . | . |
106 | 47 | . | . |
107 | . | . | . |
108 | 63 | 5.3 | 6.2 |
109 | 10 | . | . |
110 | 36 | . | . |
111 | 1230 | 116.8 | 113.4 |
112 | 279 | 26.5 | 25.3 |
113 | 37 | 6.7 | 8 |
114 | 6 | . | . |
115 | 1852 | 176.5 | 148.4 |
116 | 530 | 39.6 | 32.9 |
117 | 11 | . | . |
118 | 554 | 96.5 | 118.5 |
119 | . | . | . |
120 | . | . | . |
121 | 186 | 21.5 | 21.5 |
122 | . | . | . |
123 | 235 | 20.6 | 38.9 |
124 | 2282 | 217.7 | 194.2 |
The loc accessor is used to select data by label. Here, df.loc[:, "Leading Cause"] retrieves all rows (:) but only the column named "Leading Cause". This method is useful when you want to access one or more columns directly by their names instead of by their integer positions.
# Access columns by column name
df.loc[:,"Leading Cause"]
0 Influenza (Flu) and Pneumonia (J09-J18)
1 Accidents Except Drug Posioning (V01-X39, X43,...
2 Accidents Except Drug Posioning (V01-X39, X43,...
3 Cerebrovascular Disease (Stroke: I60-I69)
4 Assault (Homicide: Y87.1, X85-Y09)
...
1091 Influenza (Flu) and Pneumonia (J09-J18)
1092 Accidents Except Drug Posioning (V01-X39, X43,...
1093 Malignant Neoplasms (Cancer: C00-C97)
1094 Intentional Self-Harm (Suicide: X60-X84, Y87.0)
1095 Essential Hypertension and Renal Diseases (I10...
Name: Leading Cause, Length: 1096, dtype: object
Here, df.Year retrieves the entire "Year" column directly using dot notation by referencing the column name as an attribute of the DataFrame. This is a quick and convenient way to access columns with simple names, though it only works when the column name is a valid Python identifier (no spaces or special characters).
# Another way to access columns (by column name)
df.Year
0 2010
1 2008
2 2013
3 2010
4 2009
...
1091 2012
1092 2014
1093 2009
1094 2009
1095 2013
Name: Year, Length: 1096, dtype: int64
The code df.Leading Cause attempts to access a column using dot notation, but it causes an error because the column name contains a space. Dot notation only works for column names that are valid Python identifiers, meaning no spaces or special characters. For columns with spaces (like "Leading Cause"), you need to use bracket notation instead, such as df["Leading Cause"].
# Another way to access columns (by column name). The only problem is that it does not work if
# the column name has a space in it
df.Leading Cause
Input In [31]
df.Leading Cause
^
SyntaxError: invalid syntax
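The fix is straightforward; a minimal sketch using bracket notation:
# bracket notation works regardless of spaces in the column name
df["Leading Cause"]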
Data Cleaning#
This code handles duplicate rows in the DataFrame by identifying all rows that are duplicates except for their first occurrence. The duplicated() function returns a Boolean series marking True for every duplicate row after the first one, based on all columns by default. By using this Boolean series to filter df, the code creates a new DataFrame dupRowsDF containing only these duplicate rows. Printing dupRowsDF lets you see which rows have repeated values across the entire dataset, which is useful for data cleaning and ensuring data quality.
# [A] Handling duplicate rows
# Select duplicate rows except first occurrence based on all columns
dupRowsDF = df[df.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
print(dupRowsDF)
Duplicate Rows except first occurrence based on all columns are :
Year Leading Cause Gender Race Ethnicity \
32 2009 Assault (Homicide: Y87.1, X85-Y09) M Black Non-Hispanic
149 2009 Assault (Homicide: Y87.1, X85-Y09) M Black Non-Hispanic
Deaths Death Rate Age Adjusted Death Rate
32 255 30 30
149 255 30 30
This code removes all duplicate rows from the DataFrame df while keeping only the first occurrence of each duplicated row. The drop_duplicates() function returns a new DataFrame df1 without duplicates. The length of this cleaned DataFrame (len(df1)) is printed to show how many unique rows remain after duplicates are removed, helping you understand the impact of duplicate data on the dataset size.
### dropping ALL duplicate values except first
df1 = df.drop_duplicates()
print(len(df1))
### displaying data
print("the length of the df after deleting duplicate rows is", len(df1))
1094
the length of the df after deleting duplicate rows is 1094
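drop_duplicates() also accepts keep and subset arguments if different behavior is needed; a sketch (the column choice here is just an illustration):
# keep="last" retains the final occurrence instead of the first;
# subset=[...] looks for duplicates in the listed columns only
#df1_alt = df.drop_duplicates(subset=["Year", "Leading Cause", "Gender"], keep="last")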
The dtypes attribute is used to find the data types of all columns in the DataFrame df1. It returns a series where each column name is paired with its corresponding data type, such as int64 for integers or object for text. This information is important for understanding how data is stored and for choosing appropriate methods for analysis or manipulation.
# find the data types of all columns
df1.dtypes
Year int64
Leading Cause object
Gender object
Race Ethnicity object
Deaths object
Death Rate object
Age Adjusted Death Rate object
dtype: object
This code attempts to find the total number of deaths over 8 years by summing the values in the "Deaths" column. However, since the "Deaths" column is stored as an object type (meaning it is treated as text), using .sum() concatenates the values like strings instead of performing a numeric addition. As a result, the output is a meaningless string rather than a valid total. To get an accurate sum, the column should first be converted to a numeric data type using pd.to_numeric().
# Find the total number of deaths in 8 years; datatype of "Deaths" column is "object"
# without converting the column to numeric, the mathematical operation "sum" returns a garbage result
df1["Deaths"].sum()
'22868271140255.102262140.191281761770.102121..9018313123554845929132610746635616803222. ...' (output truncated: one long string of concatenated "Deaths" values)
This code prepares the "Deaths" column for numerical analysis by cleaning and converting its values. First, df1.replace('.', np.nan) replaces any occurrences of the string "." in the entire DataFrame with NaN (Not a Number), which represents missing or undefined values. Then, pd.to_numeric(df1["Deaths"]) converts the "Deaths" column from an object (text) type to a numeric type (float64 by default), allowing for accurate mathematical operations like summing or averaging. This step is essential when the column contains non-numeric characters that would otherwise interfere with calculations.
#replace "." with "NaN" and change datatype of "Deaths" to float
df1 = df1.replace('.', np.nan)
df1["Deaths"] = pd.to_numeric(df1["Deaths"])
This code calculates the total number of deaths by summing the values in the "Deaths" column after it has been converted to a numeric data type. Unlike earlier, when the column was stored as an object and the result was a string concatenation, this time the .sum() function performs a proper mathematical operation. The result is the accurate total number of deaths across the dataset, which can now be used for further analysis or reporting.
# Finding the total number of deaths after converting the datatype of "Deaths" column into Numeric, returns correct result
df1["Deaths"].sum()
424998.0
Data Analysis and Visualization#
This code identifies all the unique causes of death listed in the "Leading Cause" column. The .unique() function returns an array of distinct labels (i.e., categories) found in that column, which helps in understanding the different qualitative or categorical values present in the data. The second line prints the total number of unique causes by using len(causeDeath), which counts how many distinct death causes are represented in the dataset. This is useful for summarizing and preparing for grouped analysis or visualization.
# Find the unique causes of death --- find the unique labels of categorical/qualitative data in the column
causeDeath = df1["Leading Cause"].unique()
print(causeDeath)
# count of unique labels of "leading cause"
print("\n The number of unique causes of deaths are:", len(causeDeath))
['Influenza (Flu) and Pneumonia (J09-J18)'
'Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)'
'Cerebrovascular Disease (Stroke: I60-I69)'
'Assault (Homicide: Y87.1, X85-Y09)'
'Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)'
'Essential Hypertension and Renal Diseases (I10, I12)' 'All Other Causes'
"Alzheimer's Disease (G30)"
'Diseases of Heart (I00-I09, I11, I13, I20-I51)'
'Intentional Self-Harm (Suicide: X60-X84, Y87.0)'
'Human Immunodeficiency Virus Disease (HIV: B20-B24)'
'Chronic Liver Disease and Cirrhosis (K70, K73)'
'Malignant Neoplasms (Cancer: C00-C97)'
'Chronic Lower Respiratory Diseases (J40-J47)'
'Diabetes Mellitus (E10-E14)'
'Certain Conditions originating in the Perinatal Period (P00-P96)'
'Septicemia (A40-A41)' 'Aortic Aneurysm and Dissection (I71)'
'Mental and Behavioral Disorders due to Use of Alcohol (F10)'
'Insitu or Benign / Uncertain Neoplasms (D00-D48)'
"Parkinson's Disease (G20)"
'Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)'
'Viral Hepatitis (B15-B19)'
'Congenital Malformations, Deformations, and Chromosomal Abnormalities (Q00-Q99)'
'Atherosclerosis (I70)' 'Tuberculosis (A16-A19)']
The number of unique causes of deaths are: 26
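If only the number of distinct labels is needed, nunique() returns it directly; a minimal sketch:
# count distinct labels without building the array first
df1["Leading Cause"].nunique()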
This code calculates the total number of deaths using the groupby() function to aggregate data based on specific categories. First, it groups the data by "Year" and then sums the "Deaths" column within each year, showing the total number of deaths for every year in the dataset. This helps reveal trends over time. Next, it groups the data by "Gender" (previously renamed from "Sex") and sums the deaths for each gender, giving a comparison of the total number of deaths among males and females. This grouped analysis is useful for identifying patterns and making category-based comparisons in the data.
#FIND THE TOTAL NUMBER OF DEATHS.................using "groupby" function
#By year
totalDeaths_year = df1.groupby("Year")["Deaths"].sum()
print("The total number of deaths every year:\n", totalDeaths_year)
#By Gender
totalDeaths_gender = df1.groupby("Gender")["Deaths"].sum()
print("\n The total number of deaths of males and females are:\n", totalDeaths_gender)
The total number of deaths every year:
Year
2007 53996.0
2008 54138.0
2009 52820.0
2010 52505.0
2011 52726.0
2012 52420.0
2013 53387.0
2014 53006.0
Name: Deaths, dtype: float64
The total number of deaths of males and females are:
Gender
F 217071.0
M 207927.0
Name: Deaths, dtype: float64
This code calculates and visualizes the total number of deaths by race/ethnicity using grouping and sorting techniques. First, it groups the DataFrame by the "Race Ethnicity" column and sums the values in the "Deaths" column for each group. The .sort_values(ascending=False) function then arranges the results in descending order, showing which ethnic groups had the highest to lowest number of deaths. Then, totalDeaths_Race.plot.bar() creates a bar chart to visually represent these totals, with the title "Total number of deaths by Ethnicity". This visualization makes it easier to compare death counts across different racial and ethnic groups.
#FIND THE TOTAL NUMBER OF DEATHS.................
#By Race Ethnicity and sort the answers in descending order
totalDeaths_Race = df1.groupby("Race Ethnicity")["Deaths"].sum().sort_values(ascending=False)
print("The total number of deaths by Ethnicity:\n", totalDeaths_Race)
# Make a bar diagram for the total deaths by Race
totalDeaths_Race.plot.bar(title = "Total number of deaths by Ethnicity")
The total number of deaths by Ethnicity:
Race Ethnicity
White Non-Hispanic 206487.0
Black Non-Hispanic 111116.0
Hispanic 74802.0
Asian and Pacific Islander 26355.0
Not Stated/Unknown 4099.0
Other Race/ Ethnicity 2139.0
Name: Deaths, dtype: float64
<AxesSubplot:title={'center':'Total number of deaths by Ethnicity'}, xlabel='Race Ethnicity'>
[Bar chart: Total number of deaths by Ethnicity]
This code calculates the total number of deaths for each leading cause of death, identifies the top 5 causes, saves the results to a new CSV file, and visualizes them in a bar chart. It groups the dataset by "Leading Cause" and sums the deaths in each category. The .sort_values(ascending=False).head() part sorts these totals in descending order and selects the top 5 leading causes with the highest death counts. Then, totalDeaths_LC.to_csv("Deaths_LC.csv") writes this summarized result to a new CSV file named "Deaths_LC.csv". Finally, totalDeaths_LC.plot.bar() creates a bar chart that visually displays these top 5 causes of death, making it easier to interpret and compare their impact.
#FIND THE TOTAL NUMBER OF DEATHS.................
# By Leading Cause of Death and write the answer to a separate file "Deaths_LC.csv". Find only top 5 causes
totalDeaths_LC = df1.groupby("Leading Cause")["Deaths"].sum().sort_values(ascending=False).head()
totalDeaths_LC.to_csv("Deaths_LC.csv") # write the result in another file
totalDeaths_LC.plot.bar()
<AxesSubplot:xlabel='Leading Cause'>
[Bar chart: top 5 leading causes of death by total deaths]
This code calculates the number of deaths for each year, broken down by gender, using the groupby() function with multiple grouping columns. It groups the DataFrame first by "Year" and then by "Gender", and for each combination it sums the values in the "Deaths" column. This results in a multi-level index (also called a hierarchical index) where each row represents the total deaths for a specific gender in a specific year. The final output provides a detailed breakdown of how death counts vary by gender across different years, which is helpful for identifying trends or disparities over time.
#Number of deaths every year by Gender.......
total_lc_gender = df1.groupby(["Year","Gender"])["Deaths"].sum()
print(total_lc_gender)
Year Gender
2007 F 27749.0
M 26247.0
2008 F 27816.0
M 26322.0
2009 F 26941.0
M 25879.0
2010 F 26675.0
M 25830.0
2011 F 27075.0
M 25651.0
2012 F 26766.0
M 25654.0
2013 F 27133.0
M 26254.0
2014 F 26916.0
M 26090.0
Name: Deaths, dtype: float64
This code groups the data by both "Year" and "Gender" and sums the "Deaths" for each combination. The .unstack('Gender') method pivots the "Gender" level of the grouped data into separate columns, typically one for "F" (Female) and one for "M" (Male), creating a table where each row represents a year and each column shows the total deaths by gender. Then, df2[['F','M']].plot(...) generates a stacked bar graph using these gender-based totals, with bars stacked vertically to show each year’s total deaths divided into male and female contributions. The figsize=(12,6) argument ensures the plot is wide and readable, and the title highlights the focus on gender distribution across years.
# Draw a stacked bar graph of each year showing Male/female proportion in total deaths
df2 = df1.groupby(['Year', 'Gender'])['Deaths'].sum().unstack('Gender')
df2[['F','M']].plot(kind='bar', stacked=True, title = "Gender Distribution in Deaths Each Year", figsize=(12,6))
<AxesSubplot:title={'center':'Gender Distribution in Deaths Each Year'}, xlabel='Year'>
[Stacked bar chart: Gender Distribution in Deaths Each Year]
This code creates a grouped bar chart to visually compare male and female death counts for each year. Using the df2 DataFrame (which contains total deaths grouped by "Year" and separated by "Gender"), the .plot.bar() function generates side-by-side bars for each year, with one bar for females and one for males. The width=0.55 parameter controls the thickness of the bars, while figsize=(12,5) sets the overall size of the plot for better visibility. The bars are colored green and red to distinguish between genders. The .legend(bbox_to_anchor=(1.2, 0.5)) call positions the legend outside the main plot area, making it clearer and less cluttered. This grouped bar chart makes it easy to compare gender-specific death counts across multiple years.
# Draw a GROUPED bar graph of each year showing the comparison of Male/female deaths each year
# width = .55 ------ sets the width of the bars
# legend = False --- Turns the Legend off
# figsize ---------- Sets size of plot in inches
# legend(bbox_to_anchor = (....)) --------- sets the position of the legend
df2.plot.bar(title = "Comparison of Male/female deaths Each Year", width=0.55, figsize=(12,5),color=["green","red"]).legend(bbox_to_anchor=(1.2, 0.5))
<matplotlib.legend.Legend at 0x7fe1f0abf820>
[Grouped bar chart: Comparison of Male/female deaths Each Year]
This code removes all rows from the DataFrame df1 that contain NaN (missing values) in any column using the dropna() method. After the rows are dropped, the length of the new DataFrame is printed to show how much data remains. Then, it calculates the total number of deaths from the cleaned dataset.
However, dropping rows with any missing values can be a risky or inappropriate approach, especially for this dataset, because it might remove rows where only non-critical fields (like a category label or metadata) are missing while still containing valid numeric death counts. This could lead to underreporting or misrepresentation of the actual data. In such cases, it is better to consider imputing missing values or selectively dropping only rows where essential data (like "Deaths") is missing, as sketched after the output below.
#dropping all rows with NaN in ANY column --- CAN BE A WRONG APPROACH FOR THIS DATASET
df1 = df1.dropna()
print("The length of df1 after dropping all rows that have NaN in ANY column is:", len(df1))
df1["Deaths"].sum()
The length of df1 after dropping all rows that have NaN in ANY column is: 708
418760.0
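The safer alternative suggested above is to drop rows only when the essential column itself is missing; a minimal sketch using the subset argument of dropna():
# drop rows only where "Deaths" is NaN, keeping rows whose missing values
# are confined to the rate columns
#df1 = df1.dropna(subset=["Deaths"])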
This code filters the dataset to extract only the rows where the "Year" column has the value 2010. The result is stored in a new DataFrame called df1_2010, which contains all data entries specifically from the year 2010. This subset can then be used for focused analysis, such as identifying leading causes of death or demographic patterns during that particular year, as sketched below.
#find the data related to 2010....
df1_2010 = df1[df1.Year == 2010]
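As one example of such focused analysis, a minimal sketch finding the top 5 causes of death within the 2010 subset:
# top 5 leading causes of death in 2010 among the cleaned rows
top_2010 = df1_2010.groupby("Leading Cause")["Deaths"].sum().sort_values(ascending=False).head()
print(top_2010)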
Acknowledgements#
The Python code used in this notebook was originally written by Vandana Srivastava, AI/Data Science Specialist, University Libraries, USC.
Explanatory text, annotations, and additional instructional content were added by Meara Cox, USC 2026 Graduate.