Module 0: Welcome to the Data Science Toolkit!
Module 1: NumPy - The Foundation for Numerical Data
Topic: What is NumPy?
What is it?
NumPy, which stands for Numerical Python, is the fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
Why is it important?
While Python lists are versatile, they are not optimized for numerical operations, especially when dealing with large datasets. NumPy arrays, on the other hand, are significantly faster and more memory-efficient for numerical computations. This is because NumPy is implemented in C, allowing it to perform operations much closer to the hardware.
Analogy: If standard Python lists are like flexible, general-purpose containers that can hold anything but are slow for heavy lifting, NumPy arrays are like specialized, high-performance containers specifically designed for numbers. They are built for speed and efficiency when you need to perform mathematical operations on collections of numbers, much like a forklift is designed for lifting heavy loads efficiently compared to carrying them by hand.
NumPy is the backbone of many other data science libraries in Python, including Pandas, Matplotlib, and Scikit-learn. Understanding NumPy is crucial for effectively using these libraries.
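To see the speed claim for yourself, here is a minimal benchmark sketch (an illustrative addition, not part of the course code; exact timings depend on your machine) comparing a pure-Python list comprehension with the equivalent vectorized NumPy operation using the standard timeit module.
import timeit
import numpy as np
size = 1_000_000
python_list = list(range(size))
numpy_array = np.arange(size)
# Multiply every element by 2 with a Python list comprehension
list_time = timeit.timeit(lambda: [x * 2 for x in python_list], number=10)
# Multiply every element by 2 with a single vectorized NumPy operation
numpy_time = timeit.timeit(lambda: numpy_array * 2, number=10)
print(f"Python list: {list_time:.4f} s, NumPy array: {numpy_time:.4f} s")
On most machines the vectorized version is dramatically faster, which is exactly the gap the analogy above describes.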
How do we use it?
First, we need to import the NumPy library, typically using the alias np.
import numpy as np
# Now we can use np to access NumPy functions and objects
print(np.__version__)
Code Explanation & Output:
- import numpy as np: This line imports the NumPy library and assigns it the conventional alias np. This allows us to refer to NumPy functions and objects using np. instead of the full numpy. prefix. This is a widely adopted convention in the Python data science community.
- print(np.__version__): This line prints the version of NumPy that is installed in your environment. This is a simple way to confirm that NumPy is imported correctly and to know which version you are working with.
1.26.4 # The exact version number might vary depending on your installation
Topic: Creating NumPy Arrays
NumPy arrays are the core data structure in NumPy. Let's explore several common ways to create them.
Creating arrays from Python lists
What is it?
One of the most straightforward ways to create a NumPy array is by converting a standard Python list or a list of lists into an ndarray
(N-dimensional array).
Why is it important?
This method is essential for transitioning existing Python data structures into the more efficient NumPy array format, allowing you to leverage NumPy's powerful numerical capabilities.
How do we use it?
import numpy as np
# Create a 1D array from a list
list_1d = [1, 2, 3, 4, 5]
numpy_1d = np.array(list_1d)
print(f"1D Array: {numpy_1d}")
print(f"Type of 1D Array: {type(numpy_1d)}\n")
# Create a 2D array from a list of lists
list_2d = [[10, 20, 30], [40, 50, 60]]
numpy_2d = np.array(list_2d)
print(f"2D Array:\n{numpy_2d}")
print(f"Type of 2D Array: {type(numpy_2d)}")
Code Explanation & Output:
- list_1d = [1, 2, 3, 4, 5]: Defines a standard Python list containing integers.
- numpy_1d = np.array(list_1d): Converts list_1d into a 1-dimensional NumPy array. The np.array() function is the primary way to create arrays from existing Python sequences.
- list_2d = [[10, 20, 30], [40, 50, 60]]: Defines a Python list of lists, which represents a 2-dimensional structure.
- numpy_2d = np.array(list_2d): Converts list_2d into a 2-dimensional NumPy array. NumPy automatically infers the dimensions from the nested list structure.
1D Array: [1 2 3 4 5]
Type of 1D Array: <class 'numpy.ndarray'>
2D Array:
[[10 20 30]
[40 50 60]]
Type of 2D Array: <class 'numpy.ndarray'>
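One detail worth knowing (an illustrative sketch, not part of the lesson's example): np.array() stores a single data type for the whole array, upcasting mixed inputs, and it accepts an explicit dtype argument if you want to control the type yourself.
import numpy as np
# Mixing integers and a float: NumPy upcasts everything to float64
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
# Request a specific data type explicitly
as_float = np.array([1, 2, 3], dtype=np.float64)
print(as_float)  # [1. 2. 3.]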
Using np.arange()
What is it?
np.arange()
is a NumPy function that returns evenly spaced values within a given interval. It's similar to Python's built-in range()
function but returns a NumPy array.
Why is it important?
It's incredibly useful for creating sequences of numbers, which are often needed for indexing, generating sample data, or defining ranges for plots.
How do we use it?
import numpy as np
# Create an array from 0 up to (but not including) 10
arr_arange_1 = np.arange(10)
print(f"np.arange(10): {arr_arange_1}\n")
# Create an array from 5 up to (but not including) 15
arr_arange_2 = np.arange(5, 15)
print(f"np.arange(5, 15): {arr_arange_2}\n")
# Create an array with a step of 2
arr_arange_3 = np.arange(0, 20, 2)
print(f"np.arange(0, 20, 2): {arr_arange_3}")
Code Explanation & Output:
- np.arange(10): Creates a 1D array with values starting from 0 up to (but not including) 10. The default start is 0 and the default step is 1.
- np.arange(5, 15): Creates a 1D array with values starting from 5 up to (but not including) 15. Here, 5 is the start and 15 is the stop.
- np.arange(0, 20, 2): Creates a 1D array with values starting from 0 up to (but not including) 20, with a step of 2. This means it will include 0, 2, 4, ..., 18.
np.arange(10): [0 1 2 3 4 5 6 7 8 9]
np.arange(5, 15): [ 5 6 7 8 9 10 11 12 13 14]
np.arange(0, 20, 2): [ 0 2 4 6 8 10 12 14 16 18]
Using np.zeros()
What is it?
np.zeros()
creates a new array of a specified shape, filled with zeros.
Why is it important?
It's commonly used to initialize arrays when you know the size you need but don't yet have the actual data. It's a placeholder for future calculations.
How do we use it?
import numpy as np
# Create a 1D array of 5 zeros
arr_zeros_1d = np.zeros(5)
print(f"1D Zeros Array: {arr_zeros_1d}\n")
# Create a 2D array (3 rows, 4 columns) of zeros
arr_zeros_2d = np.zeros((3, 4))
print(f"2D Zeros Array:\n{arr_zeros_2d}")
Code Explanation & Output:
- np.zeros(5): Creates a 1D array containing five floating-point zeros. By default, NumPy arrays are created with a float64 data type unless specified otherwise.
- np.zeros((3, 4)): Creates a 2D array with 3 rows and 4 columns, all filled with zeros. Notice that the shape for multi-dimensional arrays is passed as a tuple.
1D Zeros Array: [0. 0. 0. 0. 0.]
2D Zeros Array:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
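If you want integer placeholders instead of the default float64, np.zeros() accepts a dtype argument (np.ones() and np.array() do as well); a quick illustrative sketch:
import numpy as np
int_zeros = np.zeros((2, 3), dtype=int)
print(int_zeros)
print(int_zeros.dtype)  # int64 on most platforms (int32 on some Windows builds)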
Using np.ones()
What is it?
Similar to np.zeros(), np.ones() creates a new array of a specified shape, but filled with ones.
Why is it important?
Useful for initializing arrays, especially in scenarios where you might want to multiply by these values or use them as a base for incremental calculations.
How do we use it?
import numpy as np
# Create a 1D array of 3 ones
arr_ones_1d = np.ones(3)
print(f"1D Ones Array: {arr_ones_1d}\n")
# Create a 2D array (2 rows, 5 columns) of ones
arr_ones_2d = np.ones((2, 5))
print(f"2D Ones Array:\n{arr_ones_2d}")
Code Explanation & Output:
- np.ones(3): Creates a 1D array containing three floating-point ones.
- np.ones((2, 5)): Creates a 2D array with 2 rows and 5 columns, all filled with ones.
1D Ones Array: [1. 1. 1.]
2D Ones Array:
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
Using np.linspace()
What is it?
np.linspace()
returns evenly spaced numbers over a specified interval. It takes the start, stop, and the number of samples to generate.
Why is it important?
This function is particularly useful for creating arrays for plotting functions, generating evenly distributed data points, or for simulations where a precise number of samples within a range is required.
How do we use it?
import numpy as np
# Create an array of 5 evenly spaced numbers between 0 and 10 (inclusive)
arr_linspace_1 = np.linspace(0, 10, 5)
print(f"np.linspace(0, 10, 5): {arr_linspace_1}\n")
# Create an array of 7 evenly spaced numbers between 1 and 20
arr_linspace_2 = np.linspace(1, 20, 7)
print(f"np.linspace(1, 20, 7): {arr_linspace_2}")
Code Explanation & Output:
- np.linspace(0, 10, 5): Creates a 1D array of 5 numbers, evenly spaced between 0 and 10. Unlike arange(), linspace() includes the stop value by default.
- np.linspace(1, 20, 7): Creates a 1D array of 7 numbers, evenly spaced between 1 and 20.
np.linspace(0, 10, 5): [ 0. 2.5 5. 7.5 10. ]
np.linspace(1, 20, 7): [ 1. 4.16666667 7.33333333 10.5 13.66666667 16.83333333
20. ]
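A practical rule of thumb (an illustrative note based on general NumPy behavior, not on the lesson text): use np.linspace() when you care about the exact number of points and want the endpoint included, and np.arange() when you think in terms of a step size; with float steps, arange excludes the endpoint and floating-point rounding can occasionally add or drop a point.
import numpy as np
# 5 points from 0 to 1, endpoint included
print(np.linspace(0, 1, 5))   # [0.   0.25 0.5  0.75 1.  ]
# Same spacing with arange: the endpoint 1 is excluded
print(np.arange(0, 1, 0.25))  # [0.   0.25 0.5  0.75]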
Topic: Array Attributes & Operations
Once you have a NumPy array, you can explore its properties (attributes) and perform various mathematical operations on it.
Array Attributes
What are they?
NumPy arrays have several attributes that provide information about the array itself, such as its dimensions, shape, size, and data type.
Why are they important?
Understanding these attributes is crucial for debugging, reshaping arrays, and ensuring your data is in the correct format for operations.
How do we use them?
Let's create a sample array and explore its attributes:
import numpy as np
# Create a 2D array
my_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
print(f"Original Array:\n{my_array}\n")
# .shape: Returns a tuple with the dimensions of the array (rows, columns)
print(f"Shape of the array: {my_array.shape}")
# .ndim: Returns the number of dimensions of the array
print(f"Number of dimensions: {my_array.ndim}")
# .size: Returns the total number of elements in the array
print(f"Total number of elements: {my_array.size}")
# .dtype: Returns the data type of the elements in the array
print(f"Data type of elements: {my_array.dtype}")
Code Explanation & Output:
- my_array = np.array(...): Creates a 2D array with 4 rows and 3 columns.
- my_array.shape: This attribute returns (4, 3), indicating 4 rows and 3 columns.
- my_array.ndim: This attribute returns 2, as it's a 2-dimensional array.
- my_array.size: This attribute returns 12, which is the total number of elements (4 rows * 3 columns).
- my_array.dtype: This attribute returns int64 (or int32, depending on your system), indicating that the elements are 64-bit integers.
Original Array:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
Shape of the array: (4, 3)
Number of dimensions: 2
Total number of elements: 12
Data type of elements: int64
Basic Arithmetic Operations
What are they?
NumPy allows you to perform element-wise arithmetic operations (addition, subtraction, multiplication, division) on arrays. This means the operation is applied to each corresponding element in the arrays.
Why are they important?
This is the core of NumPy's power for numerical computation. Performing these operations on entire arrays is much faster and more concise than using loops in standard Python.
How do we use them?
import numpy as np
array_a = np.array([1, 2, 3, 4])
array_b = np.array([10, 20, 30, 40])
print(f"Array A: {array_a}")
print(f"Array B: {array_b}\n")
# Element-wise addition
sum_array = array_a + array_b
print(f"A + B: {sum_array}")
# Element-wise subtraction
diff_array = array_b - array_a
print(f"B - A: {diff_array}")
# Element-wise multiplication
prod_array = array_a * array_b
print(f"A * B: {prod_array}")
# Element-wise division
quot_array = array_b / array_a
print(f"B / A: {quot_array}")
# Scalar operations (operation with a single number)
scalar_mult = array_a * 2
print(f"A * 2: {scalar_mult}")
scalar_add = array_a + 100
print(f"A + 100: {scalar_add}")
Code Explanation & Output:
- array_a and array_b are two 1D NumPy arrays of the same size.
- array_a + array_b: Adds corresponding elements: [1+10, 2+20, 3+30, 4+40].
- array_b - array_a: Subtracts corresponding elements: [10-1, 20-2, 30-3, 40-4].
- array_a * array_b: Multiplies corresponding elements: [1*10, 2*20, 3*30, 4*40].
- array_b / array_a: Divides corresponding elements: [10/1, 20/2, 30/3, 40/4].
- array_a * 2: Multiplies every element in array_a by 2. This is called broadcasting, where NumPy intelligently handles operations between arrays of different (but compatible) shapes.
- array_a + 100: Adds 100 to every element in array_a (another example of broadcasting).
Array A: [1 2 3 4]
Array B: [10 20 30 40]
A + B: [11 22 33 44]
B - A: [ 9 18 27 36]
A * B: [ 10 40 90 160]
B / A: [10. 10. 10. 10.]
A * 2: [2 4 6 8]
A + 100: [101 102 103 104]
Important Tip: For element-wise operations, the arrays generally need to have the same shape or be compatible for broadcasting. If they are not, NumPy will raise a ValueError.
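To make the tip concrete, here is a small illustrative sketch showing one shape combination that broadcasts successfully and one that raises a ValueError:
import numpy as np
matrix = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)
# Compatible shapes: the 1D row is broadcast across each row of the matrix
print(matrix + row)  # [[11 22 33] [14 25 36]]
# Incompatible shapes: (2, 3) and (2,) cannot be broadcast together
try:
    print(matrix + np.array([10, 20]))
except ValueError as err:
    print(f"Broadcasting failed: {err}")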
Topic: Indexing and Slicing
Accessing specific elements or portions of a NumPy array is fundamental. This is done through indexing and slicing, concepts similar to Python lists but with powerful extensions for multi-dimensional arrays.
1D array indexing and slicing
What is it?
For one-dimensional arrays, indexing allows you to retrieve a single element using its position (index), and slicing allows you to extract a contiguous sequence of elements.
Why is it important?
It enables you to pinpoint and work with individual data points or subsets of your data, which is crucial for data cleaning, analysis, and feature engineering.
How do we use it?
import numpy as np
arr_1d = np.array([10, 20, 30, 40, 50, 60, 70])
print(f"Original 1D Array: {arr_1d}\n")
# Indexing: Accessing a single element
# Remember: Python uses 0-based indexing
first_element = arr_1d[0]
print(f"First element (index 0): {first_element}")
third_element = arr_1d[2]
print(f"Third element (index 2): {third_element}")
last_element = arr_1d[-1] # Negative indexing accesses from the end
print(f"Last element (index -1): {last_element}\n")
# Slicing: [start:stop:step]
# The 'stop' index is exclusive (up to, but not including)
slice_1 = arr_1d[1:4] # Elements from index 1 up to (but not including) 4
print(f"Slice from index 1 to 3: {slice_1}")
slice_2 = arr_1d[:3] # Elements from the beginning up to (but not including) 3
print(f"Slice from beginning to index 2: {slice_2}")
slice_3 = arr_1d[4:] # Elements from index 4 to the end
print(f"Slice from index 4 to end: {slice_3}")
slice_4 = arr_1d[::2] # Every second element (step of 2)
print(f"Every second element: {slice_4}")
slice_5 = arr_1d[::-1] # Reverse the array
print(f"Reversed array: {slice_5}")
Code Explanation & Output:
- arr_1d[0]: Retrieves the element at index 0 (the first element).
- arr_1d[2]: Retrieves the element at index 2 (the third element).
- arr_1d[-1]: Retrieves the last element using negative indexing.
- arr_1d[1:4]: Creates a slice containing the elements at indices 1, 2, and 3. Index 4 is excluded.
- arr_1d[:3]: Creates a slice from the beginning up to index 2.
- arr_1d[4:]: Creates a slice from index 4 to the end of the array.
- arr_1d[::2]: Creates a slice that includes every second element, starting from the beginning.
- arr_1d[::-1]: Returns the elements in reverse order (like other basic slices, this is a view, not an independent copy).
Original 1D Array: [10 20 30 40 50 60 70]
First element (index 0): 10
Third element (index 2): 30
Last element (index -1): 70
Slice from index 1 to 3: [20 30 40]
Slice from beginning to index 2: [10 20 30]
Slice from index 4 to end: [50 60 70]
Every second element: [10 30 50 70]
Reversed array: [70 60 50 40 30 20 10]
2D array indexing and slicing (e.g., array[row, col])
What is it?
For two-dimensional arrays (matrices), indexing and slicing involve specifying both row and column indices. The syntax is array[row_index, column_index].
Why is it important?
Most real-world datasets are tabular (2D), so being able to precisely select rows, columns, or sub-sections of a 2D array is fundamental for data manipulation and analysis.
How do we use it?
import numpy as np
arr_2d = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]
])
print(f"Original 2D Array:\n{arr_2d}\n")
# Accessing a single element: array[row, column]
element = arr_2d[1, 2] # Row index 1, Column index 2 (value 6)
print(f"Element at row 1, column 2: {element}\n")
# Slicing rows
first_row = arr_2d[0, :]
print(f"First row: {first_row}")
last_two_rows = arr_2d[2:, :]
print(f"Last two rows:\n{last_two_rows}\n")
# Slicing columns
first_column = arr_2d[:, 0]
print(f"First column: {first_column}")
middle_columns = arr_2d[:, 1:3]
print(f"Middle two columns:\n{middle_columns}\n")
# Slicing both rows and columns (sub-array)
sub_array = arr_2d[1:3, 0:2] # Rows from index 1 to 2, columns from index 0 to 1
print(f"Sub-array (rows 1-2, cols 0-1):\n{sub_array}")
Code Explanation & Output:
- arr_2d[1, 2]: Accesses the element at the second row (index 1) and third column (index 2), which is 6.
- arr_2d[0, :]: Selects the entire first row. The : indicates all elements along that dimension.
- arr_2d[2:, :]: Selects all rows from index 2 to the end, and all columns.
- arr_2d[:, 0]: Selects all rows, but only the first column.
- arr_2d[:, 1:3]: Selects all rows, but only columns from index 1 up to (but not including) 3.
- arr_2d[1:3, 0:2]: Selects a sub-array. It takes rows with indices 1 and 2, and columns with indices 0 and 1.
Original 2D Array:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
Element at row 1, column 2: 6
First row: [1 2 3]
Last two rows:
[[ 7 8 9]
[10 11 12]]
First column: [ 1 4 7 10]
Middle two columns:
[[ 2 3]
[ 5 6]
[ 8 9]
[11 12]]
Sub-array (rows 1-2, cols 0-1):
[[4 5]
[7 8]]
Important Note: When you slice a NumPy array, you are usually getting a view of the original array, not a copy. This means if you modify the sliced array, the original array will also be modified. To get a true copy, use the .copy() method (e.g., my_slice = arr_2d[1:3, 0:2].copy()).
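A short illustrative sketch of this view behavior, and of how .copy() avoids it:
import numpy as np
original = np.array([1, 2, 3, 4, 5])
view_slice = original[1:4]   # a view into the same memory
view_slice[0] = 99
print(original)              # [ 1 99  3  4  5] -- the original changed too
safe_slice = original[1:4].copy()  # an independent copy
safe_slice[0] = -1
print(original)              # unchanged this time: [ 1 99  3  4  5]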
Topic: Key NumPy Functions
NumPy provides a vast collection of functions that operate efficiently on arrays. We'll focus on two important categories: Universal Functions (ufuncs) and Aggregation Functions.
Universal Functions: np.sqrt(), np.exp(), np.sin()
What are they?
Universal functions, or ufuncs, are functions that operate element-wise on ndarrays. This means they apply a mathematical operation to each element of the array individually, producing a new array with the results.
Why are they important?
Ufuncs are highly optimized and much faster than performing the same operations using Python loops. They are essential for performing common mathematical transformations on entire datasets efficiently.
How do we use it?
import numpy as np
arr = np.array([1, 4, 9, 16, 25])
print(f"Original Array: {arr}\n")
# np.sqrt(): Calculates the square root of each element
sqrt_arr = np.sqrt(arr)
print(f"Square root: {sqrt_arr}")
# np.exp(): Calculates the exponential of each element (e^x)
exp_arr = np.exp(arr)
print(f"Exponential: {exp_arr}")
# np.sin(): Calculates the sine of each element (in radians)
sin_arr = np.sin(arr)
print(f"Sine: {sin_arr}")
# You can also apply ufuncs to 2D arrays
arr_2d = np.array([[1, 2], [3, 4]])
print(f"\nOriginal 2D Array:\n{arr_2d}\n")
sqrt_arr_2d = np.sqrt(arr_2d)
print(f"Square root of 2D array:\n{sqrt_arr_2d}")
Code Explanation & Output:
- np.sqrt(arr): Computes the square root of each element in arr.
- np.exp(arr): Computes e (Euler's number, approximately 2.71828) raised to the power of each element in arr.
- np.sin(arr): Computes the sine of each element in arr. Note that these functions typically expect input in radians.
- The example with arr_2d demonstrates that ufuncs work seamlessly with multi-dimensional arrays, applying the operation element-wise across all dimensions.
Original Array: [ 1 4 9 16 25]
Square root: [1. 2. 3. 4. 5.]
Exponential: [2.71828183e+00 5.45981500e+01 8.10308393e+03 8.88611052e+06
7.20048994e+10]
Sine: [ 0.84147098 0.7568025 0.14112001 -0.28790332 -0.13235175]
Original 2D Array:
[[1 2]
[3 4]]
Square root of 2D array:
[[1. 1.41421356]
[1.73205081 2. ]]
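Because the trigonometric ufuncs expect radians, angle data given in degrees is usually converted first; a brief illustrative sketch using np.radians() (NumPy also provides the equivalent np.deg2rad()):
import numpy as np
angles_deg = np.array([0, 30, 90, 180])
angles_rad = np.radians(angles_deg)  # convert degrees to radians
print(np.sin(angles_rad))            # approximately [0.  0.5 1.  0. ]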
Aggregation Functions: np.sum(), np.mean(), np.max(), np.std()
What are they?
Aggregation functions (or reduction functions) perform an operation on an array and return a single value (or a smaller array) that summarizes the data. They
reduce the dimensionality of the array by computing a single statistic.
Why are they important?
These functions are fundamental for summarizing data, calculating descriptive statistics, and understanding the overall characteristics of your datasets. They are heavily used in exploratory data analysis (EDA).
How do we use it?
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(f"Original Array: {arr}\n")
# np.sum(): Calculates the sum of all elements
total_sum = np.sum(arr)
print(f"Sum of all elements: {total_sum}")
# np.mean(): Calculates the arithmetic mean (average) of all elements
average = np.mean(arr)
print(f"Mean of all elements: {average}")
# np.max(): Finds the maximum element
maximum = np.max(arr)
print(f"Maximum element: {maximum}")
# np.min(): Finds the minimum element
minimum = np.min(arr)
print(f"Minimum element: {minimum}")
# np.std(): Calculates the standard deviation
std_dev = np.std(arr)
print(f"Standard deviation: {std_dev}\n")
# Aggregation along an axis (for multi-dimensional arrays)
arr_2d = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
print(f"Original 2D Array:\n{arr_2d}\n")
# Sum along columns (axis=0)
sum_columns = np.sum(arr_2d, axis=0)
print(f"Sum along columns (axis=0): {sum_columns}")
# Sum along rows (axis=1)
sum_rows = np.sum(arr_2d, axis=1)
print(f"Sum along rows (axis=1): {sum_rows}")
Code Explanation & Output:
- np.sum(arr): Calculates the sum of all elements in the 1D array arr.
- np.mean(arr): Calculates the average of all elements in arr.
- np.max(arr): Returns the largest value in arr.
- np.min(arr): Returns the smallest value in arr.
- np.std(arr): Calculates the standard deviation of the elements in arr. Standard deviation measures the amount of variation or dispersion of a set of values.
- For arr_2d:
  - np.sum(arr_2d, axis=0): Calculates the sum of elements along axis=0 (columns). This means it sums down each column: [1+4+7, 2+5+8, 3+6+9] = [12, 15, 18].
  - np.sum(arr_2d, axis=1): Calculates the sum of elements along axis=1 (rows). This means it sums across each row: [1+2+3, 4+5+6, 7+8+9] = [6, 15, 24].
Original Array: [ 1 2 3 4 5 6 7 8 9 10]
Sum of all elements: 55
Mean of all elements: 5.5
Maximum element: 10
Minimum element: 1
Standard deviation: 2.8722813232690143
Original 2D Array:
[[1 2 3]
[4 5 6]
[7 8 9]]
Sum along columns (axis=0): [12 15 18]
Sum along rows (axis=1): [ 6 15 24]
Tip on axis: Think of axis=0 as operating vertically (down the columns) and axis=1 as operating horizontally (across the rows). This concept of axis is crucial in NumPy and Pandas for performing operations along specific dimensions.
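The same axis logic applies to the other aggregation functions; a brief sketch reusing the arr_2d values from above:
import numpy as np
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.mean(arr_2d, axis=0))  # column means: [4. 5. 6.]
print(np.max(arr_2d, axis=1))   # row maxima:   [3 6 9]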
Module 2: Pandas - The Ultimate Data Manipulation Tool
Topic: What is Pandas?
What is it?
Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. It provides data structures like DataFrames and Series that are designed to work with tabular and time-series data.
Why is it important?
In the real world, data rarely comes in a perfectly clean, numerical format. It's often messy, incomplete, and comes from various sources. Pandas excels at handling these real-world complexities, allowing you to clean, transform, and analyze data efficiently. It's the workhorse for almost any data science project.
Analogy: If NumPy is the raw material (like metal or wood) that provides the fundamental building blocks for numerical operations, then Pandas is the factory that takes that raw material and shapes it into useful, organized components. Think of a DataFrame as a highly organized assembly line where you can easily inspect, modify, and combine different parts of your data. It's where the real data wrangling happens, turning raw numbers into structured information ready for analysis or machine learning.
Pandas makes data manipulation intuitive and powerful, allowing you to perform complex operations with just a few lines of code.
How do we use it?
Just like NumPy, we typically import Pandas with a conventional alias, pd
.
import pandas as pd
# Now we can use pd to access Pandas functions and objects
print(pd.__version__)
Code Explanation & Output:
- import pandas as pd: This line imports the Pandas library and assigns it the conventional alias pd. This is a widely accepted standard in the data science community.
- print(pd.__version__): This line prints the installed version of Pandas, useful for checking your environment.
2.2.2 # The exact version number might vary depending on your installation
Topic: Core Pandas Structures
Pandas introduces two primary data structures that are fundamental to its operation: the Series and the DataFrame.
The Series (a labeled 1D array)
What is it?
A Pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is essentially a column in a spreadsheet or a single column of a DataFrame.
Why is it important?
Series are the building blocks of DataFrames. They allow you to work with individual columns of data, providing powerful indexing capabilities and enabling efficient element-wise operations, similar to NumPy arrays but with added labels (an index).
How do we use it?
import pandas as pd
import numpy as np
# Creating a Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(f"Series from list:\n{s}\n")
# Creating a Series with a custom index
labels = ["a", "b", "c", "d", "e"]
s_labeled = pd.Series(data, index=labels)
print(f"Series with custom index:\n{s_labeled}\n")
# Creating a Series from a dictionary
dict_data = {"apple": 100, "banana": 150, "orange": 120}
s_dict = pd.Series(dict_data)
print(f"Series from dictionary:\n{s_dict}\n")
# Accessing elements in a Series
print(f"Element at index 2 (from list Series): {s[2]}")
print(f"Element with label 'b' (from labeled Series): {s_labeled['b']}")
print(f"Value of 'orange' (from dictionary Series): {s_dict['orange']}\n")
# Operations on Series (element-wise)
s_ops = pd.Series([1, 2, 3, 4])
print(f"Original Series for operations: {s_ops}")
print(f"Series + 5: {s_ops + 5}")
print(f"Series * 2: {s_ops * 2}")
Code Explanation & Output:
- pd.Series(data): Creates a Series where the index is automatically generated (0, 1, 2, ...).
- pd.Series(data, index=labels): Creates a Series with a custom index provided by the labels list. This allows you to refer to elements by meaningful names instead of just numerical positions.
- pd.Series(dict_data): When creating a Series from a dictionary, the dictionary keys automatically become the Series index, and the values become the Series data.
- Accessing elements: You can access elements by their numerical position (like a list) or by their label (if a custom index is provided).
- Operations: Arithmetic operations on Series are element-wise, similar to NumPy arrays. Pandas aligns data based on the index when performing operations between two Series (a short example of this alignment follows the output below).
Series from list:
0 10
1 20
2 30
3 40
4 50
dtype: int64
Series with custom index:
a 10
b 20
c 30
d 40
e 50
dtype: int64
Series from dictionary:
apple 100
banana 150
orange 120
dtype: int64
Element at index 2 (from list Series): 30
Element with label 'b' (from labeled Series): 20
Value of 'orange' (from dictionary Series): 120
Original Series for operations: 0 1
1 2
2 3
3 4
dtype: int64
Series + 5: 0 6
1 7
2 8
3 9
dtype: int64
Series * 2: 0 2
1 4
2 6
3 8
dtype: int64
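The index alignment mentioned in the explanation above is worth seeing once. A minimal illustrative sketch (not part of the lesson's output): labels that appear in only one of the two Series produce NaN in the result.
import pandas as pd
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
# Values are added label by label; "a" and "d" have no partner, so they become NaN
print(s1 + s2)
# a     NaN
# b    12.0
# c    23.0
# d     NaN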
The DataFrame (our primary tool, a 2D table)
What is it?
A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is the most commonly used Pandas object and is essentially a spreadsheet or a SQL table.
Why is it important?
DataFrames are where the magic happens in Pandas. They allow you to store and manipulate complex datasets with multiple columns of different data types. They provide powerful tools for filtering, selecting, grouping, and transforming data, making them indispensable for data analysis.
Analogy: If a Series is like a single column in a spreadsheet, a DataFrame is the entire spreadsheet itself. It has rows and columns, each column can have a different data type (e.g., one column for names, another for ages, another for salaries), and you can easily perform operations across rows or columns, just like you would in Excel or Google Sheets, but with the power of Python programming.
How do we use it?
import pandas as pd
import numpy as np
# Creating a DataFrame from a dictionary of lists
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 35, 28],
"City": ["New York", "Los Angeles", "Chicago", "Houston"]
}
df = pd.DataFrame(data)
print(f"DataFrame from dictionary:\n{df}\n")
# Creating a DataFrame from a list of dictionaries
list_of_dicts = [
{"Name": "Eve", "Age": 22, "City": "Miami"},
{"Name": "Frank", "Age": 40, "City": "Boston"}
]
df_from_list = pd.DataFrame(list_of_dicts)
print(f"DataFrame from list of dictionaries:\n{df_from_list}\n")
# Creating a DataFrame from a NumPy array (requires column names)
numpy_data = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
columns = ["ColA", "ColB", "ColC"]
df_from_numpy = pd.DataFrame(numpy_data, columns=columns)
print(f"DataFrame from NumPy array:\n{df_from_numpy}")
Code Explanation & Output:
- pd.DataFrame(data): The most common way to create a DataFrame is from a dictionary where keys become column names and values are lists representing the data in each column.
- pd.DataFrame(list_of_dicts): You can also create a DataFrame from a list of dictionaries, where each dictionary represents a row.
- pd.DataFrame(numpy_data, columns=columns): If you have a NumPy array, you can convert it to a DataFrame by providing the array and a list of column names.
DataFrame from dictionary:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 28 Houston
DataFrame from list of dictionaries:
Name Age City
0 Eve 22 Miami
1 Frank 40 Boston
DataFrame from NumPy array:
ColA ColB ColC
0 1 2 3
1 4 5 6
2 7 8 9
Topic: Reading and Writing Data
One of the most common tasks in data science is importing data from external files and, after analysis, exporting results back to files. Pandas makes this incredibly easy, especially for CSV (Comma Separated Values) files, which are a ubiquitous format for tabular data.
Reading a CSV file with pd.read_csv()
What is it?
pd.read_csv()
is a Pandas function used to read a CSV file into a DataFrame. It is highly flexible and can handle various CSV formats, delimiters, and encoding issues.
Why is it important?
Most real-world datasets are stored in files, and CSV is one of the most common formats. This function is your gateway to loading external data into Pandas for analysis.
How do we use it?
To demonstrate, we'll first create a dummy CSV file. In a real scenario, you would already have your CSV file.
import pandas as pd
import os
# Create a dummy CSV file for demonstration
dummy_data = {
"StudentID": [1, 2, 3, 4, 5],
"Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
"Score": [85, 92, 78, 95, 88]
}
df_dummy = pd.DataFrame(dummy_data)
csv_file_path = "students.csv"
df_dummy.to_csv(csv_file_path, index=False) # index=False prevents writing the DataFrame index as a column
print(f"Dummy CSV file \'{csv_file_path}\' created.\n")
# Now, read the CSV file into a new DataFrame
df_students = pd.read_csv(csv_file_path)
print(f"DataFrame read from \'{csv_file_path}\':\n")
print(df_students)
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"\nDummy CSV file \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Creating the dummy CSV:
  - We first create a small DataFrame df_dummy with some sample data.
  - df_dummy.to_csv(csv_file_path, index=False): This line writes the df_dummy DataFrame to a CSV file named students.csv. index=False is crucial here; it tells Pandas not to write the DataFrame's row index as a column in the CSV file. If you omit index=False, you'll get an extra column of numbers (0, 1, 2, ...) in your CSV.
- Reading the CSV:
  - df_students = pd.read_csv(csv_file_path): This is the core line. It reads the students.csv file and loads its content directly into a new Pandas DataFrame called df_students.
- Cleanup:
  - os.remove(csv_file_path): This line removes the dummy CSV file after the demonstration. In a real scenario, you wouldn't typically create and then immediately delete your data files.
Dummy CSV file 'students.csv' created.
DataFrame read from 'students.csv':
StudentID Name Score
0 1 Alice 85
1 2 Bob 92
2 3 Charlie 78
3 4 David 95
4 5 Eve 88
Dummy CSV file 'students.csv' removed.
Important Note: pd.read_csv() has many parameters to handle different scenarios, such as sep for specifying delimiters other than a comma, header for indicating whether there's a header row, names for providing column names, skiprows to skip initial rows, and na_values to specify strings that should be interpreted as missing values. Always refer to the Pandas documentation if you encounter issues reading a specific CSV file.
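As a purely illustrative sketch (the file contents and column names below are made up for this example, mirroring the dummy-file approach used above), several of these parameters might be combined like this:
import pandas as pd
# Build a small semicolon-separated file with two junk lines at the top,
# no header row, and "NA" marking a missing value.
raw_text = "junk line 1\njunk line 2\ns1;3.4\ns2;NA\ns3;5.1\n"
with open("measurements.txt", "w") as f:
    f.write(raw_text)
df_raw = pd.read_csv(
    "measurements.txt",
    sep=";",                    # delimiter other than a comma
    header=None,                # the file has no header row
    names=["sensor", "value"],  # so we supply column names ourselves
    skiprows=2,                 # skip the two junk lines at the top
    na_values=["NA"],           # treat "NA" as a missing value
)
print(df_raw)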
Topic: First Look at Your Data (Exploratory Data Analysis - EDA)
What is it?
Exploratory Data Analysis (EDA) is an initial, crucial step in any data science project. It involves summarizing the main characteristics of a dataset, often with visual methods. The goal is to understand the data, identify patterns, spot anomalies, test hypotheses, and check assumptions.
Why is it important?
Before you dive into complex modeling, you need to know your data inside out. EDA helps you:
- Understand the structure: How many rows and columns? What are the data types?
- Identify missing values: Are there gaps in your data? How should they be handled?
- Spot outliers: Are there unusual data points that might skew your analysis?
- Discover relationships: How do different variables relate to each other?
- Validate assumptions: Does the data meet the requirements for the analysis or model you plan to use?
It's like getting to know a new friend before you start a big project together. You want to understand their strengths, weaknesses, and quirks.
How do we use it?
Pandas provides several convenient methods for quickly getting a summary of your DataFrame. Let's use a sample DataFrame to demonstrate.
import pandas as pd
import numpy as np
# Create a sample DataFrame for demonstration
data = {
"Name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Heidi", "Ivan", "Judy"],
"Age": [25, 30, 35, 28, 22, 40, 29, 31, np.nan, 26],
"City": ["New York", "Los Angeles", "Chicago", "Houston", "Miami", "Boston", "Seattle", "Denver", "Austin", "Phoenix"],
"Salary": [70000, 85000, 90000, 72000, 65000, 100000, 78000, 80000, 75000, 68000],
"Experience_Years": [2, 5, 8, 3, 1, 10, 4, 6, 3, 2]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
print("\n" + "-"*30 + "\n")
# .head(): Displays the first N rows (default is 5)
print("df.head():\n", df.head())
print("\n" + "-"*30 + "\n")
# .tail(): Displays the last N rows (default is 5)
print("df.tail(3):\n", df.tail(3))
print("\n" + "-"*30 + "\n")
# .info(): Provides a concise summary of the DataFrame
# Includes index dtype, column dtypes, non-null values, and memory usage
print("df.info():")
df.info()
print("\n" + "-"*30 + "\n")
# .describe(): Generates descriptive statistics of numerical columns
# Includes count, mean, std, min, 25%, 50% (median), 75%, max
print("df.describe():\n", df.describe())
print("\n" + "-"*30 + "\n")
# .shape: Returns a tuple representing the dimensionality of the DataFrame (rows, columns)
print(f"df.shape: {df.shape}")
print("\n" + "-"*30 + "\n")
# .columns: Returns the column labels of the DataFrame
print(f"df.columns: {df.columns}")
Code Explanation & Output:
- df.head(): Shows the first 5 rows of the DataFrame. This is useful for a quick glance at the data structure and content.
- df.tail(3): Shows the last 3 rows. Useful for checking data appended at the end or for time-series data.
- df.info(): Provides a summary of the DataFrame. Key information includes:
  - RangeIndex: The number of entries (rows) and the index range.
  - Data columns (total X columns): The total number of columns.
  - Column Name, Non-Null Count, Dtype: For each column, it shows the number of non-missing values and its data type. Notice that 'Age' has 9 non-null values out of 10 entries, indicating one missing value (NaN).
  - Memory usage: How much memory the DataFrame is consuming.
- df.describe(): Calculates descriptive statistics for numerical columns. For each numerical column, it provides:
  - count: Number of non-null entries.
  - mean: Average value.
  - std: Standard deviation (measure of data dispersion).
  - min: Minimum value.
  - 25%, 50% (median), 75%: Quartiles, indicating the values below which 25%, 50%, and 75% of the data fall, respectively.
  - max: Maximum value.
- df.shape: Returns a tuple (number_of_rows, number_of_columns). For our DataFrame, it's (10, 5).
- df.columns: Returns an Index object containing the names of all columns in the DataFrame.
Original DataFrame:
Name Age City Salary Experience_Years
0 Alice 25.0 New York 70000 2
1 Bob 30.0 Los Angeles 85000 5
2 Charlie 35.0 Chicago 90000 8
3 David 28.0 Houston 72000 3
4 Eve 22.0 Miami 65000 1
5 Frank 40.0 Boston 100000 10
6 Grace 29.0 Seattle 78000 4
7 Heidi 31.0 Denver 80000 6
8 Ivan NaN Austin 75000 3
9 Judy 26.0 Phoenix 68000 2
------------------------------
df.head():
Name Age City Salary Experience_Years
0 Alice 25.0 New York 70000 2
1 Bob 30.0 Los Angeles 85000 5
2 Charlie 35.0 Chicago 90000 8
3 David 28.0 Houston 72000 3
4 Eve 22.0 Miami 65000 1
------------------------------
df.tail(3):
Name Age City Salary Experience_Years
7 Heidi 31.0 Denver 80000 6
8 Ivan NaN Austin 75000 3
9 Judy 26.0 Phoenix 68000 2
------------------------------
df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 10 non-null object
1 Age 9 non-null float64
2 City 10 non-null object
3 Salary 10 non-null int64
4 Experience_Years 10 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 528.0+ bytes
------------------------------
df.describe():
Age Salary Experience_Years
count 9.000000 10.00000 10.000000
mean 29.555556 78000.00000 4.400000
std 5.479708 10888.08916 2.913391
min 22.000000 65000.00000 1.000000
25% 26.000000 70500.00000 2.000000
50% 29.000000 76500.00000 3.000000
75% 31.000000 83750.00000 5.750000
max 40.000000 100000.00000 10.000000
------------------------------
df.shape: (10, 5)
------------------------------
df.columns: Index(['Name', 'Age', 'City', 'Salary', 'Experience_Years'], dtype='object')
Key Takeaway: Always start your data analysis with EDA. These simple Pandas functions provide a wealth of information about your dataset, helping you identify potential issues (like missing values) and guide your next steps in data cleaning and preparation.
Topic: Selecting Data
Once you have your data loaded into a DataFrame, the next crucial step is often to select specific parts of it. Pandas offers powerful and flexible ways to select data, whether it's a single column, multiple columns, or specific rows.
Selecting columns
What is it?
Selecting columns in a DataFrame means extracting one or more columns by their names. The result of selecting a single column is a Pandas Series, while selecting multiple columns returns a DataFrame.
Why is it important?
Often, you only need to work with a subset of your data. Selecting columns allows you to focus on relevant variables, reducing memory usage and simplifying your analysis.
How do we use it?
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 35, 28],
"City": ["New York", "Los Angeles", "Chicago", "Houston"],
"Salary": [70000, 85000, 90000, 72000]
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Select a single column (returns a Series)
names = df["Name"]
print(f"Selected \"Name\" column (Series):\n{names}\n")
print(f"Type of names: {type(names)}\n")
# Select multiple columns (returns a DataFrame)
two_columns = df[["Name", "Salary"]]
print(f"Selected \"Name\" and \"Salary\" columns (DataFrame):\n{two_columns}\n")
print(f"Type of two_columns: {type(two_columns)}")
Code Explanation & Output:
- df["Name"]: To select a single column, use square brackets with the column name as a string. This returns a Pandas Series.
- df[["Name", "Salary"]]: To select multiple columns, pass a list of column names inside the square brackets. This returns a new DataFrame containing only the specified columns.
Original DataFrame:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 85000
2 Charlie 35 Chicago 90000
3 David 28 Houston 72000
Selected "Name" column (Series):
0 Alice
1 Bob
2 Charlie
3 David
Name: Name, dtype: object
Type of names: <class 'pandas.core.series.Series'>
Selected "Name" and "Salary" columns (DataFrame):
Name Salary
0 Alice 70000
1 Bob 85000
2 Charlie 90000
3 David 72000
Type of two_columns: <class 'pandas.core.frame.DataFrame'>
Selecting rows with .loc[] (label-based)
What is it?
.loc[]
is a label-based indexer used for selecting data by row and column labels. It allows you to select rows by their index labels and columns by their column names.
Why is it important?
When your DataFrame has meaningful row labels (e.g., dates, IDs), .loc[]
provides an intuitive way to retrieve data based on those labels, making your code more readable and less prone to errors than numerical indexing.
How do we use it?
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 35, 28],
"City": ["New York", "Los Angeles", "Chicago", "Houston"],
"Salary": [70000, 85000, 90000, 72000]
}
df = pd.DataFrame(data, index=["A", "B", "C", "D"]) # Custom index
print(f"Original DataFrame with custom index:\n{df}\n")
# Select a single row by its label
row_b = df.loc["B"]
print(f"Row with label \"B\":\n{row_b}\n")
# Select multiple rows by their labels
rows_ac = df.loc[["A", "C"]]
print(f"Rows with labels \"A\" and \"C\":\n{rows_ac}\n")
# Select rows and specific columns by labels
row_c_name_salary = df.loc["C", ["Name", "Salary"]]
print(f"Name and Salary for row \"C\":\n{row_c_name_salary}\n")
# Select a slice of rows by label (inclusive of end label)
rows_slice = df.loc["A":"C"]
print(f"Rows from \"A\" to \"C\" (inclusive):\n{rows_slice}")
Code Explanation & Output:
- df.loc["B"]: Selects the row with the index label "B". This returns a Series.
- df.loc[["A", "C"]]: Selects multiple rows by passing a list of labels. This returns a DataFrame.
- df.loc["C", ["Name", "Salary"]]: Selects specific columns ("Name", "Salary") for a specific row ("C").
- df.loc["A":"C"]: Selects a range of rows from label "A" to "C", inclusive of both start and end labels. This is a key difference from standard Python slicing.
Original DataFrame with custom index:
Name Age City Salary
A Alice 25 New York 70000
B Bob 30 Los Angeles 85000
C Charlie 35 Chicago 90000
D David 28 Houston 72000
Row with label "B":
Name Bob
Age 30
City Los Angeles
Salary 85000
Name: B, dtype: object
Rows with labels "A" and "C":
Name Age City Salary
A Alice 25 New York 70000
C Charlie 35 Chicago 90000
Name and Salary for row "C":
Name Charlie
Salary 90000
Name: C, dtype: object
Rows from "A" to "C" (inclusive):
Name Age City Salary
A Alice 25 New York 70000
B Bob 30 Los Angeles 85000
C Charlie 35 Chicago 90000
Selecting rows with .iloc[] (index-based)
What is it?
.iloc[]
is an integer-location based indexer used for selection by positional integer indices. It works similarly to NumPy array indexing, where you specify rows and columns using their 0-based integer positions.
Why is it important?
Even if your DataFrame has custom labels, you might sometimes need to select data based purely on its numerical position (e.g., the first 5 rows, or the last column). .iloc[]
provides a consistent and reliable way to do this, regardless of the actual labels.
How do we use it?
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 35, 28],
"City": ["New York", "Los Angeles", "Chicago", "Houston"],
"Salary": [70000, 85000, 90000, 72000]
}
df = pd.DataFrame(data, index=["A", "B", "C", "D"]) # Custom index
print(f"Original DataFrame with custom index:\n{df}\n")
# Select a single row by its integer position
first_row = df.iloc[0]
print(f"First row (index 0):\n{first_row}\n")
# Select multiple rows by integer positions
first_and_third_rows = df.iloc[[0, 2]]
print(f"First and third rows:\n{first_and_third_rows}\n")
# Select a slice of rows by integer position (exclusive of end index)
rows_slice_iloc = df.iloc[1:3] # Rows at index 1 and 2
print(f"Rows from index 1 to 2 (exclusive of 3):\n{rows_slice_iloc}\n")
# Select rows and specific columns by integer positions
row_1_col_0_2 = df.iloc[1, [0, 2]] # Row at index 1, columns at index 0 and 2
print(f"Row 1, columns 0 and 2:\n{row_1_col_0_2}\n")
# Select a sub-DataFrame using both row and column slicing
sub_df_iloc = df.iloc[0:2, 1:3] # Rows 0-1, columns 1-2
print(f"Sub-DataFrame (rows 0-1, cols 1-2):\n{sub_df_iloc}")
Code Explanation & Output:
- df.iloc[0]: Selects the row at integer position 0 (the first row). This returns a Series.
- df.iloc[[0, 2]]: Selects multiple rows by passing a list of integer positions. This returns a DataFrame.
- df.iloc[1:3]: Selects a slice of rows from integer position 1 up to (but not including) 3. This is consistent with standard Python slicing.
- df.iloc[1, [0, 2]]: Selects specific columns (at integer positions 0 and 2) for a specific row (at integer position 1).
- df.iloc[0:2, 1:3]: Selects a sub-DataFrame by slicing both rows (from 0 up to 2) and columns (from 1 up to 3).
Original DataFrame with custom index:
Name Age City Salary
A Alice 25 New York 70000
B Bob 30 Los Angeles 85000
C Charlie 35 Chicago 90000
D David 28 Houston 72000
First row (index 0):
Name Alice
Age 25
City New York
Salary 70000
Name: A, dtype: object
First and third rows:
Name Age City Salary
A Alice 25 New York 70000
C Charlie 35 Chicago 90000
Rows from index 1 to 2 (exclusive of 3):
Name Age City Salary
B Bob 30 Los Angeles 85000
C Charlie 35 Chicago 90000
Row 1, columns 0 and 2:
Name Bob
City Los Angeles
Name: B, dtype: object
Sub-DataFrame (rows 0-1, cols 1-2):
Age City
A 25 New York
B 30 Los Angeles
When to use .loc[] vs. .iloc[]:
- Use .loc[] when you need to select data based on labels (row names, column names).
- Use .iloc[] when you need to select data based on integer positions (0-based indices).
- For simple column selection by name, df["column_name"] is often sufficient.
Topic: Filtering Data
What is it?
Filtering data, also known as boolean indexing or boolean selection, is the process of selecting rows from a DataFrame based on one or more conditions. These conditions evaluate to True
or False
for each row, and only rows where the condition is True
are returned.
Why is it important?
Filtering is one of the most powerful and frequently used operations in data analysis. It allows you to isolate specific subsets of your data that meet certain criteria, enabling focused analysis, cleaning, or preparation for modeling. For example, you might want to analyze only customers over a certain age, transactions above a certain amount, or data from a specific region.
How do we use it?
Let's use our sample DataFrame to demonstrate how to filter data using boolean conditions.
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
"Age": [25, 30, 35, 28, 22, 40],
"City": ["New York", "Los Angeles", "Chicago", "Houston", "Miami", "Boston"],
"Salary": [70000, 85000, 90000, 72000, 65000, 100000]
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Filter for people older than 30
older_than_30 = df[df["Age"] > 30]
print(f"People older than 30:\n{older_than_30}\n")
# Filter for people from New York
from_ny = df[df["City"] == "New York"]
print(f"People from New York:\n{from_ny}\n")
# Filter with multiple conditions (AND - use &)
# People older than 30 AND earning more than 80000
older_and_high_salary = df[(df["Age"] > 30) & (df["Salary"] > 80000)]
print(f"People older than 30 AND earning > 80000:\n{older_and_high_salary}\n")
# Filter with multiple conditions (OR - use |)
# People from New York OR Los Angeles
ny_or_la = df[(df["City"] == "New York") | (df["City"] == "Los Angeles")]
print(f"People from New York OR Los Angeles:\n{ny_or_la}\n")
# Using .isin() for multiple categorical values
# People from New York, Los Angeles, or Chicago
cities_of_interest = ["New York", "Los Angeles", "Chicago"]
filtered_by_cities = df[df["City"].isin(cities_of_interest)]
print(f"People from New York, Los Angeles, or Chicago:\n{filtered_by_cities}")
Code Explanation & Output:
- df["Age"] > 30: This is the core of boolean filtering. It creates a Pandas Series of True/False values, where True indicates that the corresponding row in the Age column meets the condition (is greater than 30).
- df[df["Age"] > 30]: When you pass this boolean Series back into the DataFrame using square brackets, Pandas returns only the rows where the boolean Series has a True value.
- Multiple Conditions (& for AND, | for OR):
  - & (bitwise AND): Used to combine multiple conditions where all conditions must be true. Each condition must be enclosed in parentheses () to ensure the correct order of operations.
  - | (bitwise OR): Used to combine multiple conditions where at least one condition must be true. Again, each condition needs parentheses.
- .isin(): This is a very convenient method for filtering when you want to select rows where a column's value is present in a given list of values. It's much cleaner than chaining multiple | conditions.
Original DataFrame:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 85000
2 Charlie 35 Chicago 90000
3 David 28 Houston 72000
4 Eve 22 Miami 65000
5 Frank 40 Boston 100000
People older than 30:
Name Age City Salary
2 Charlie 35 Chicago 90000
5 Frank 40 Boston 100000
People from New York:
Name Age City Salary
0 Alice 25 New York 70000
People older than 30 AND earning > 80000:
Name Age City Salary
2 Charlie 35 Chicago 90000
5 Frank 40 Boston 100000
People from New York OR Los Angeles:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 85000
People from New York, Los Angeles, or Chicago:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 85000
2 Charlie 35 Chicago 90000
Important Note: When combining multiple conditions, always use & for AND and | for OR, and enclose each individual condition in parentheses. Using and or or (Python's logical operators) directly with Pandas Series will result in an error because they operate on single boolean values, not Series of booleans.
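A tiny illustrative sketch of the difference, using its own small DataFrame so it runs on its own:
import pandas as pd
df = pd.DataFrame({"Age": [25, 35, 40], "Salary": [70000, 90000, 100000]})
# Correct: bitwise & with each condition wrapped in parentheses
print(df[(df["Age"] > 30) & (df["Salary"] > 80000)])
# Incorrect: Python's `and` tries to reduce each Series to a single True/False
try:
    df[df["Age"] > 30 and df["Salary"] > 80000]
except ValueError as err:
    print(f"Using 'and' fails: {err}")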
Topic: Handling Missing Data
What is it?
Missing data refers to the absence of a value for a variable in a dataset. In Pandas, missing values are typically represented by NaN
(Not a Number), which is a special floating-point value from NumPy. Missing data can occur for various reasons, such as data entry errors, data corruption, or simply that the information was not collected.
Why is it important?
Missing data can significantly impact your analysis and the performance of machine learning models. Many statistical and machine learning algorithms cannot handle missing values and will either throw an error or produce incorrect results. Therefore, it's crucial to identify and appropriately handle missing data before proceeding with further analysis or modeling.
How do we use it?
Pandas provides convenient methods to detect, remove, or fill missing values. Let's create a DataFrame with some missing data to demonstrate.
import pandas as pd
import numpy as np
data = {
"A": [1, 2, np.nan, 4, 5],
"B": [10, np.nan, 30, 40, 50],
"C": [100, 200, 300, np.nan, np.nan],
"D": ["apple", "banana", "cherry", "date", "elderberry"]
}
df = pd.DataFrame(data)
print(f"Original DataFrame with missing values:\n{df}\n")
# Finding missing values with .isnull().sum()
# .isnull() returns a boolean DataFrame indicating where values are NaN
# .sum() then counts the True values (i.e., NaNs) for each column
print(f"Missing values per column:\n{df.isnull().sum()}\n")
# Dropping missing values with .dropna()
# By default, drops rows containing ANY NaN values
df_dropped_rows = df.dropna()
print(f"DataFrame after dropping rows with any NaN:\n{df_dropped_rows}\n")
# Drop columns with any NaN values
df_dropped_cols = df.dropna(axis=1) # axis=1 means columns
print(f"DataFrame after dropping columns with any NaN:\n{df_dropped_cols}\n")
# Drop rows only if ALL values are NaN
df_dropped_all_nan_rows = df.dropna(how='all')
print(f"DataFrame after dropping rows where ALL values are NaN:\n{df_dropped_all_nan_rows}\n")
# Filling missing values with .fillna()
# Fill with a specific value (e.g., 0)
df_filled_zero = df.fillna(0)
print(f"DataFrame after filling NaNs with 0:\n{df_filled_zero}\n")
# Fill with the mean of the column (for numerical columns)
# This is a common imputation strategy
df_filled_mean = df.copy() # Create a copy to avoid modifying original df
df_filled_mean["A"] = df_filled_mean["A"].fillna(df_filled_mean["A"].mean())
df_filled_mean["B"] = df_filled_mean["B"].fillna(df_filled_mean["B"].mean())
df_filled_mean["C"] = df_filled_mean["C"].fillna(df_filled_mean["C"].mean())
print(f"DataFrame after filling numerical NaNs with column mean:\n{df_filled_mean}\n")
# Fill with the median of the column (for numerical columns)
df_filled_median = df.copy()
df_filled_median["A"] = df_filled_median["A"].fillna(df_filled_median["A"].median())
print(f"DataFrame after filling \"A\" NaN with column median:\n{df_filled_median}\n")
# Forward fill (propagate the last valid observation forward)
df_ffill = df.ffill()
print(f"DataFrame after forward fill (ffill):\n{df_ffill}\n")
# Backward fill (propagate the next valid observation backward)
df_bfill = df.bfill()
print(f"DataFrame after backward fill (bfill):\n{df_bfill}")
Code Explanation & Output:
- df.isnull().sum(): This is a powerful combination. df.isnull() creates a boolean DataFrame where True indicates a missing value. Chaining .sum() to it counts the True values for each column, giving you a quick summary of missing data per column.
- df.dropna():
  - By default, df.dropna() removes any row that contains at least one NaN value. This can lead to significant data loss if many rows have missing values.
  - axis=1: Specifies that columns containing NaN values should be dropped instead of rows.
  - how='all': Drops a row (or column if axis=1) only if all its values are NaN.
- df.fillna(value):
  - df.fillna(0): Replaces all NaN values in the DataFrame with 0.
  - df_filled_mean["A"].fillna(df_filled_mean["A"].mean()): This is a common strategy called mean imputation. It calculates the mean of the non-missing values in column 'A' and then fills the NaN in that column with this mean. Similarly for 'B' and 'C'.
  - df_filled_median["A"].fillna(df_filled_median["A"].median()): Similar to mean imputation, but uses the median, which is less sensitive to outliers.
- df.ffill() (forward fill): Fills NaN values with the last observed non-null value in the column. This is the modern replacement for the deprecated df.fillna(method='ffill').
- df.bfill() (backward fill): Fills NaN values with the next observed non-null value in the column (replacing df.fillna(method='bfill')).
Original DataFrame with missing values:
A B C D
0 1.0 10.0 100.0 apple
1 2.0 NaN 200.0 banana
2 NaN 30.0 300.0 cherry
3 4.0 40.0 NaN date
4 5.0 50.0 NaN elderberry
Missing values per column:
A 1
B 1
C 2
D 0
dtype: int64
DataFrame after dropping rows with any NaN:
A B C D
0 1.0 10.0 100.0 apple
DataFrame after dropping columns with any NaN:
D
0 apple
1 banana
2 cherry
3 date
4 elderberry
DataFrame after dropping rows where ALL values are NaN:
A B C D
0 1.0 10.0 100.0 apple
1 2.0 NaN 200.0 banana
2 NaN 30.0 300.0 cherry
3 4.0 40.0 NaN date
4 5.0 50.0 NaN elderberry
DataFrame after filling NaNs with 0:
A B C D
0 1.0 10.0 100.0 apple
1 2.0 0.0 200.0 banana
2 0.0 30.0 300.0 cherry
3 4.0 40.0 0.0 date
4 5.0 50.0 0.0 elderberry
DataFrame after filling numerical NaNs with column mean:
A B C D
0 1.0 10.0 100.0 apple
1 2.0 32.5 200.0 banana
2 3.0 30.0 300.0 cherry
3 4.0 40.0 200.0 date
4 5.0 50.0 200.0 elderberry
DataFrame after filling "A" NaN with column median:
A B C D
0 1.0 10.0 100.0 apple
1 2.0 NaN 200.0 banana
2 3.0 30.0 300.0 cherry
3 4.0 40.0 NaN date
4 5.0 50.0 NaN elderberry
DataFrame after forward fill (ffill):
A B C D
0 1.0 10.0 100.0 apple
1 2.0 10.0 200.0 banana
2 2.0 30.0 300.0 cherry
3 4.0 40.0 300.0 date
4 5.0 50.0 300.0 elderberry
DataFrame after backward fill (bfill):
A B C D
0 1.0 10.0 100.0 apple
1 2.0 30.0 200.0 banana
2 4.0 30.0 300.0 cherry
3 4.0 40.0 NaN date
4 5.0 50.0 NaN elderberry
Choosing a strategy: The best way to handle missing data depends on the nature of your data and the reason for the missingness. Dropping rows or columns is simple but can lead to data loss. Imputation (filling missing values) is often preferred, but the choice of imputation method (mean, median, mode, or more advanced techniques) can significantly affect your results. Always consider the implications of your chosen strategy.
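For a categorical (text) column, the mode mentioned above is a common choice; a brief illustrative sketch with a made-up column:
import pandas as pd
import numpy as np
colors = pd.Series(["red", "blue", np.nan, "blue", np.nan])
# .mode() returns a Series (there can be ties), so take the first entry
filled_colors = colors.fillna(colors.mode()[0])
print(filled_colors)  # the NaN values become "blue", the most frequent value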
Topic: Grouping and Aggregating
What is it?
Grouping and aggregating data is a powerful technique that allows you to summarize data by categories. The process typically involves three steps:
- Splitting: Dividing the data into groups based on some criteria (e.g., grouping sales data by region, or customer data by age group).
- Applying: Applying a function to each group independently (e.g., calculating the sum of sales for each region, or the average age for each customer segment).
- Combining: Combining the results into a new DataFrame or Series.
This entire process is often referred to as "split-apply-combine."
Why is it important?
Grouping and aggregation are essential for gaining insights from your data. Instead of looking at individual data points, you can understand trends and patterns at a higher, more meaningful level. For example, you can answer questions like: "Which product category has the highest average sales?" or "What is the total revenue generated by each sales representative?"
How do we use it?
Pandas provides the .groupby() method for splitting data into groups, and then you can apply various aggregation functions. Let's use a sample DataFrame representing sales data.
import pandas as pd
data = {
"Region": ["East", "West", "East", "North", "West", "East", "North", "West"],
"Salesperson": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Heidi"],
"Product": ["A", "B", "A", "C", "B", "A", "C", "B"],
"Sales": [100, 150, 120, 200, 180, 110, 220, 160]
}
df = pd.DataFrame(data)
print(f"Original Sales DataFrame:\n{df}\n")
# Group by a single column and calculate the sum of Sales
sales_by_region = df.groupby("Region")["Sales"].sum()
print(f"Total Sales by Region:\n{sales_by_region}\n")
# Group by multiple columns and calculate the mean of Sales
sales_by_region_product = df.groupby(["Region", "Product"])["Sales"].mean()
print(f"Average Sales by Region and Product:\n{sales_by_region_product}\n")
# Applying multiple aggregation functions at once using .agg()
# Calculate sum, mean, and count of Sales by Region
multi_agg_by_region = df.groupby("Region")["Sales"].agg(["sum", "mean", "count"])
print(f"Multiple Aggregations by Region:\n{multi_agg_by_region}\n")
# Renaming aggregated columns for clarity (named aggregation)
multi_agg_renamed = df.groupby("Region").agg(
Total_Sales=("Sales", "sum"),
Average_Sales=("Sales", "mean"),
Number_of_Transactions=("Sales", "count")
)
print(f"Multiple Aggregations by Region (Renamed Columns):\n{multi_agg_renamed}\n")
# Grouping by a column and applying different aggregations to different columns
# The original df has no 'Age' column, so we first add a dummy one for demonstration.
df["Age"] = [25, 30, 28, 35, 32, 29, 38, 31]
print(f"DataFrame with Age column:\n{df}\n")
df_agg_diff_cols = df.groupby("Region").agg(
Total_Sales=("Sales", "sum"),
Average_Age=("Age", "mean")
)
print(f"Aggregations on different columns by Region:\n{df_agg_diff_cols}")
Code Explanation & Output:
df.groupby("Region")
: This is the "split" step. It creates aDataFrameGroupBy
object, which is an intermediate object that holds the grouped information.["Sales"].sum()
: This is the "apply" and "combine" step. We select theSales
column from the grouped object and then apply thesum()
aggregation function. Pandas automatically combines the results into a new Series where the index is theRegion
.df.groupby(["Region", "Product"])["Sales"].mean()
: You can group by multiple columns by passing a list of column names togroupby()
. The result will have a MultiIndex..agg(["sum", "mean", "count"])
: The.agg()
method allows you to apply multiple aggregation functions to the same column simultaneously. You pass a list of function names (as strings).Total_Sales=("Sales", "sum")
: When using.agg()
, you can also provide new names for the aggregated columns using keyword arguments, where the value is a tuple(column_to_aggregate, aggregation_function_name)
.df.groupby("Region").agg(Total_Sales=("Sales", "sum"), Average_Age=("Age", "mean"))
: This demonstrates applying different aggregation functions to different columns within the samegroupby
operation. This is very flexible for creating summary tables.
Original Sales DataFrame:
Region Salesperson Product Sales
0 East Alice A 100
1 West Bob B 150
2 East Charlie A 120
3 North David C 200
4 West Eve B 180
5 East Frank A 110
6 North Grace C 220
7 West Heidi B 160
Total Sales by Region:
Region
East 330
North 420
West 490
Name: Sales, dtype: int64
Average Sales by Region and Product:
Region Product
East A 110.0
North C 210.0
West B 163.333333
Name: Sales, dtype: float64
Multiple Aggregations by Region:
sum mean count
Region
East 330 110.000000 3
North 420 210.000000 2
West 490 163.333333 3
Multiple Aggregations by Region (Renamed Columns):
Total_Sales Average_Sales Number_of_Transactions
Region
East 330 110.000000 3
North 420 210.000000 2
West 490 163.333333 3
DataFrame with Age column:
Region Salesperson Product Sales Age
0 East Alice A 100 25
1 West Bob B 150 30
2 East Charlie A 120 28
3 North David C 200 35
4 West Eve B 180 32
5 East Frank A 110 29
6 North Grace C 220 38
7 West Heidi B 160 31
Aggregations on different columns by Region:
Total_Sales Average_Age
Region
East 330 27.333333
North 420 36.500000
West 490 31.000000
The Power of groupby: The groupby() method is incredibly versatile and forms the basis for many complex data transformations. Mastering it is key to efficient data analysis with Pandas.
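Beyond aggregation, groupby() also pairs with .transform(), which returns a result aligned to the original rows instead of one row per group. A minimal sketch, reusing a cut-down version of the sales data above:
import pandas as pd
# Small made-up subset of the sales data, for illustration only
df = pd.DataFrame({
    "Region": ["East", "West", "East", "West"],
    "Sales": [100, 150, 120, 180]
})
# transform("sum") computes the regional total and broadcasts it back to every row
df["Region_Total"] = df.groupby("Region")["Sales"].transform("sum")
df["Share_of_Region"] = df["Sales"] / df["Region_Total"]
print(df)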
Module 3: Data Visualization - Telling Stories with Data
What is it?
Data visualization is the graphical representation of data. It involves creating charts, plots, maps, and other visual elements to help people understand complex data and identify patterns, trends, and insights that might not be obvious from looking at raw numbers.
Why is it important?
In data science, visualization is not just about making pretty pictures; it's a critical step in the analysis process. It helps in:
- Exploratory Data Analysis (EDA): Visualizing data is essential for understanding its distribution, relationships between variables, and identifying outliers or anomalies.
- Communicating Findings: Complex data insights can be effectively communicated to both technical and non-technical audiences through clear and compelling visualizations.
- Identifying Patterns and Trends: Visualizations can reveal patterns, correlations, and trends that are difficult to spot in tabular data.
- Model Evaluation: Visualizations are often used to evaluate the performance of machine learning models.
Think of data visualization as translating the language of numbers into the universal language of images. A well-designed chart can tell a powerful story about your data in a way that tables of numbers cannot.
In this module, we will explore two of the most popular and powerful data visualization libraries in Python: Matplotlib and Seaborn.
Topic: Part A - Matplotlib, the Grandparent of Plots
What is it?
Matplotlib is a comprehensive library for creating static, interactive, and animated visualizations in Python. It is the oldest and most fundamental plotting library in the Python scientific ecosystem, and many other libraries, including Seaborn, are built on top of it.
Why is it important?
Matplotlib provides a high degree of flexibility and control over your plots. While it can sometimes be more verbose than newer libraries, its foundational role means that understanding Matplotlib is key to understanding how many other plotting tools in Python work. It allows you to customize virtually every aspect of a plot.
Analogy: If data visualization is like painting a picture to tell a story, Matplotlib is like having a full set of brushes, paints, and a blank canvas. It gives you complete control over every stroke and color, allowing you to create highly customized and intricate visualizations from scratch. While this requires more effort than using pre-mixed colors or stencils (like some higher-level libraries), it offers unparalleled artistic freedom.
We will primarily use the pyplot module from Matplotlib, which provides a convenient, MATLAB-like interface for creating plots.
How do we use it?
We typically import the pyplot module with the alias plt.
import matplotlib.pyplot as plt
# Now we can use plt to create plots
print("Matplotlib imported successfully!")
Code Explanation & Output:
- import matplotlib.pyplot as plt: This line imports the pyplot module from Matplotlib and assigns it the conventional alias plt. This is the standard way to use Matplotlib for plotting.
Matplotlib imported successfully!
Anatomy of a Plot (Figure, Axes, Title, Labels)
What is it?
A Matplotlib plot is composed of several key components. Understanding these components is essential for customizing your visualizations:
- Figure: The entire window or page that contains the plot(s). It's the top-level container.
- Axes: This is the actual area where the data is plotted. A Figure can contain multiple Axes, each representing a different plot.
- Title: A descriptive title for the entire Figure or for individual Axes.
- Labels: Labels for the x-axis and y-axis to indicate what the data represents.
- Legend: Explains what each element (e.g., line, bar) in the plot represents.
Why is it important?
Knowing the anatomy of a plot allows you to target specific parts of the visualization for customization, such as setting titles, changing axis limits, adding labels, or modifying the appearance of plotted elements.
How do we use it?
Let's create a simple plot and identify its components.
import matplotlib.pyplot as plt
import numpy as np
# Prepare some data
x = np.linspace(0, 10, 100) # 100 points between 0 and 10
y = np.sin(x)
# Create a Figure and an Axes
fig, ax = plt.subplots() # Creates a Figure and a single Axes
# Plot data on the Axes
ax.plot(x, y)
# Set Title and Labels for the Axes
ax.set_title("Sine Wave")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
# Add a Figure title (optional, often Axes title is sufficient)
fig.suptitle("My First Matplotlib Plot", y=1.02) # y adjusts position
# Display the plot
plt.show()
Code Explanation & Output:
- fig, ax = plt.subplots(): This is a common way to create a Figure and one or more Axes. plt.subplots() returns a tuple containing the Figure object (fig) and an Axes object (ax). If you wanted multiple plots in one figure, you could specify the number of rows and columns (e.g., plt.subplots(2, 2)).
- ax.plot(x, y): We plot the data (x and y) on the ax (Axes) object. Most plotting methods are called on an Axes object.
- ax.set_title(), ax.set_xlabel(), ax.set_ylabel(): These methods are used to set the title and axis labels for the specific Axes.
- fig.suptitle(): This sets a title for the entire Figure. The y=1.02 argument slightly adjusts its vertical position to avoid overlapping with the Axes title.
- plt.show(): This function displays the generated plot. Without plt.show(), the plot might not be rendered or displayed depending on the environment.
# A plot window will appear with a sine wave, titled "Sine Wave" and labeled axes.
# The overall figure title will be "My First Matplotlib Plot".
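The example above covers the Figure, Axes, title, and labels; the remaining component from the list, the Legend, appears once plotted elements are given a label. A minimal sketch with made-up data:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")  # label= is what the legend will display
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_title("Sine and Cosine")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.legend()  # draws the legend using the label= values above
plt.show()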
Creating Basic Plots: plt.plot() (line), plt.scatter() (scatter), plt.bar() (bar), plt.hist() (histogram).
What is it?
Matplotlib provides functions for creating various types of basic plots, each suitable for visualizing different kinds of data relationships and distributions.
Why is it important?
Choosing the right type of plot is crucial for effectively communicating the story in your data. Line plots are great for trends over time, scatter plots for relationships between two variables, bar plots for comparing categories, and histograms for showing data distribution.
How do we use it?
Let's create examples of these basic plot types.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Data for plots
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
categories = ["A", "B", "C", "D", "E"]
values = [10, 25, 15, 20, 30]
hist_data = np.random.randn(1000) # 1000 random numbers from a standard normal distribution
# 1. Line Plot (plt.plot())
plt.figure(figsize=(8, 4)) # Create a new figure with a specified size
plt.plot(x, y, marker='o', linestyle='--', color='b')
plt.title("Simple Line Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid(True) # Add a grid
plt.show()
# 2. Scatter Plot (plt.scatter())
plt.figure(figsize=(8, 4))
plt.scatter(x, y, color='red', marker='x')
plt.title("Simple Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid(True)
plt.show()
# 3. Bar Plot (plt.bar())
plt.figure(figsize=(8, 4))
plt.bar(categories, values, color='green')
plt.title("Simple Bar Plot")
plt.xlabel("Category")
plt.ylabel("Value")
plt.show()
# 4. Histogram (plt.hist())
plt.figure(figsize=(8, 4))
plt.hist(hist_data, bins=30, color='purple', edgecolor='black') # bins define the number of bars
plt.title("Histogram of Random Data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
Code Explanation & Output:
- plt.figure(figsize=(8, 4)): Creates a new figure for each plot. figsize sets the width and height of the figure in inches.
- plt.plot(x, y, ...): Creates a line plot. Arguments like marker, linestyle, and color customize the appearance.
- plt.scatter(x, y, ...): Creates a scatter plot. color and marker customize the points.
- plt.bar(categories, values, ...): Creates a bar plot. The first argument is the x-coordinates (categories), and the second is the height of the bars (values).
- plt.hist(hist_data, bins=30, ...): Creates a histogram. hist_data is the data, and bins determines how many bars (bins) the data will be divided into. edgecolor adds borders to the bars.
- plt.title(), plt.xlabel(), plt.ylabel(): Set the title and axis labels for the current plot.
- plt.grid(True): Adds a grid to the plot for easier reading.
- plt.show(): Displays the current figure.
# Four separate plot windows will appear, showing:
# 1. A line plot with points and a dashed line.
# 2. A scatter plot with red 'x' markers.
# 3. A bar plot comparing the values of different categories.
# 4. A histogram showing the distribution of the random data.
Matplotlib Workflow: You can use either the pyplot interface (like plt.plot(), plt.title()), which implicitly manages figures and axes, or the object-oriented interface (like fig, ax = plt.subplots() and then ax.plot(), ax.set_title()). For more complex plots with multiple subplots, the object-oriented approach is generally recommended for better control.
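To make the object-oriented workflow concrete, here is a minimal sketch (with made-up data) that places four basic plot types into a single 2x2 Figure:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
fig, axes = plt.subplots(2, 2, figsize=(10, 6))  # axes is a 2x2 array of Axes objects
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("Line")
axes[0, 1].scatter(x[::10], np.cos(x[::10]))
axes[0, 1].set_title("Scatter")
axes[1, 0].bar(["A", "B", "C"], [3, 7, 5])
axes[1, 0].set_title("Bar")
axes[1, 1].hist(np.random.randn(1000), bins=30)
axes[1, 1].set_title("Histogram")
fig.tight_layout()  # keeps titles and labels from overlapping
plt.show()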
Topic: Part B - Seaborn, for Beautiful Statistical Plots
What is it?
Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn is particularly good at visualizing relationships between variables and showing distributions.
Why is it important?
Seaborn simplifies the creation of many common and complex statistical plots. It has built-in themes for aesthetics and functions specifically designed to work with Pandas DataFrames, making it very convenient for data analysis workflows. It often requires less code than Matplotlib to produce visually appealing and statistically informative plots.
Analogy: If Matplotlib is the raw paints and brushes, Seaborn is like having pre-packaged art kits and specialized tools that make it easier to create specific types of beautiful and complex statistical visualizations quickly. It handles many of the aesthetic details automatically, allowing you to focus on the data and the story you want to tell.
How do we use it?
We typically import Seaborn with the alias sns.
import seaborn as sns
import matplotlib.pyplot as plt # Seaborn works well with Matplotlib
# Now we can use sns to create statistical plots
print("Seaborn imported successfully!")
Code Explanation & Output:
- import seaborn as sns: Imports the Seaborn library with the conventional alias sns.
- import matplotlib.pyplot as plt: It's common to import Matplotlib's pyplot as well, since Seaborn plots are drawn on Matplotlib Figures and Axes and can be customized using Matplotlib functions.
Seaborn imported successfully!
Key Seaborn Plots: sns.scatterplot() (with hue), sns.countplot(), sns.boxplot(), sns.heatmap() (for correlations).
What is it?
Seaborn offers a variety of plot types tailored for statistical data. We will look at a few key ones:
- sns.scatterplot(): Similar to Matplotlib's scatter plot, but with enhanced capabilities for showing relationships between two numerical variables, including the ability to use color (hue), size, or style to represent a third categorical or numerical variable.
- sns.countplot(): Shows the counts of observations in each categorical bin using bars. It's a specialized version of a bar plot for showing the frequency of categories.
- sns.boxplot(): Displays the distribution of a numerical variable across different categories. Box plots show the median, quartiles, and potential outliers.
- sns.heatmap(): Visualizes matrix-like data (like correlation matrices) as a color-coded grid. It's excellent for quickly identifying patterns in matrices.
Why is it important?
These plots are fundamental tools for exploring relationships, distributions, and summaries within your data, especially when dealing with a mix of numerical and categorical variables.
How do we use it?
Let's use a sample DataFrame to demonstrate these Seaborn plots.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
"Category": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"],
"Value": [10, 25, 12, 30, 28, 15, 35, 22, 18, 33],
"Group": ["X", "Y", "X", "Y", "X", "Y", "X", "Y", "X", "Y"],
"Numerical_Feature_1": np.random.rand(10) * 100,
"Numerical_Feature_2": np.random.rand(10) * 50
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Set a Seaborn style (optional, but makes plots look nicer)
sns.set_theme(style="whitegrid")
# 1. Scatter Plot with Hue (sns.scatterplot())
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="Value", y="Numerical_Feature_1", hue="Category", s=100) # s controls marker size
plt.title("Scatter Plot of Value vs. Numerical_Feature_1 by Category")
plt.show()
# 2. Count Plot (sns.countplot())
plt.figure(figsize=(7, 4))
sns.countplot(data=df, x="Category", hue="Group")
plt.title("Count of Observations per Category and Group")
plt.show()
# 3. Box Plot (sns.boxplot())
plt.figure(figsize=(7, 5))
sns.boxplot(data=df, x="Category", y="Value")
plt.title("Distribution of Value per Category (Box Plot)")
plt.show()
# 4. Heatmap (sns.heatmap())
# First, calculate the correlation matrix
correlation_matrix = df[["Value", "Numerical_Feature_1", "Numerical_Feature_2"]].corr()
plt.figure(figsize=(6, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f') # annot=True shows values, cmap sets color map
plt.title("Correlation Matrix Heatmap")
plt.show()
Code Explanation & Output:
- sns.set_theme(style="whitegrid"): Sets the visual style for the plots. Seaborn comes with several built-in themes.
- sns.scatterplot(data=df, x="Value", y="Numerical_Feature_1", hue="Category", s=100): Creates a scatter plot using the specified DataFrame (data=df). x and y specify the columns for the axes. hue="Category" colors the points based on the values in the 'Category' column, automatically adding a legend. s sets the size of the markers.
- sns.countplot(data=df, x="Category", hue="Group"): Creates a bar plot showing the counts of each unique value in the 'Category' column. hue="Group" further divides the bars based on the 'Group' column.
- sns.boxplot(data=df, x="Category", y="Value"): Creates box plots for the 'Value' column, separated by the unique values in the 'Category' column.
- correlation_matrix = df[["Value", "Numerical_Feature_1", "Numerical_Feature_2"]].corr(): Calculates the pairwise correlation between the specified numerical columns in the DataFrame. The result is a correlation matrix.
- sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f'): Creates a heatmap from the correlation_matrix. annot=True displays the correlation values on the heatmap cells. cmap sets the color scheme. fmt formats the annotation text.
- plt.show(): Displays the generated plot. Seaborn functions typically return Matplotlib Axes objects, so plt.show() is still needed to display them.
# Four separate plot windows will appear, showing:
# 1. A scatter plot where points are colored by their category.
# 2. A count plot showing the frequency of each category, split by group.
# 3. Box plots illustrating the distribution of 'Value' for each category.
# 4. A heatmap showing the correlation coefficients between the numerical features.
Seaborn and Matplotlib Integration: Seaborn and Matplotlib work together seamlessly. You can create a plot using Seaborn and then use Matplotlib's plt functions (like plt.title(), plt.xlabel(), plt.ylabel(), plt.figure()) to further customize the plot.
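A minimal sketch of this integration on a small made-up DataFrame: most axes-level Seaborn functions return the Matplotlib Axes they drew on, which you can then customize with either Axes methods or plt functions.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"Category": ["A", "B", "A", "B"], "Value": [10, 25, 12, 30]})
ax = sns.boxplot(data=df, x="Category", y="Value")  # Seaborn returns a Matplotlib Axes
ax.set_title("Value by Category")                   # customized via Axes methods
ax.set_ylabel("Value (units)")
plt.xticks(rotation=45)                             # or via pyplot functions
plt.tight_layout()
plt.show()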
Module 4: The Bridge to Machine Learning - Data Preparation with Scikit-learn
What is it?
Scikit-learn (often referred to as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. While Scikit-learn is primarily known for its machine learning algorithms, it also provides a robust set of tools for data preprocessing, which is what we will focus on in this module.
Why is it important?
Data preparation is arguably the most critical step in the machine learning pipeline. Raw data is rarely in a format that machine learning algorithms can directly use. Algorithms expect numerical input, and they often perform better when numerical features are on a similar scale. Scikit-learn provides efficient and standardized ways to transform your data into the optimal format for machine learning models.
Topic: Why Do We Need to Prepare Data?
What is it?
Data preparation, also known as data preprocessing, involves transforming raw data into a clean and organized format suitable for machine learning algorithms. This includes handling missing values (which we covered in Pandas), dealing with categorical data, and scaling numerical features.
Why is it important?
Think of machine learning algorithms as highly sophisticated calculators. They are built on mathematical principles and operate on numbers. They don't inherently understand text, dates, or wildly varying scales in numerical data. If you feed them raw, unprepared data, it's like trying to teach a calculator to read a novel – it simply won't work, or it will produce nonsensical results.
Explain that ML algorithms are just math, and they need clean, numerical input.
Machine learning models are essentially complex mathematical equations and statistical functions. For these equations to work correctly and efficiently, their inputs must be in a consistent, numerical format. Imagine trying to calculate the average of a list that contains both numbers and words; the calculation would fail. Similarly, if one feature (like income) ranges from thousands to millions, while another (like age) ranges from tens to hundreds, the algorithm might disproportionately weigh the feature with the larger scale, leading to biased or inaccurate predictions. Data preparation ensures that all features contribute fairly and meaningfully to the model's learning process.
Proper data preparation can significantly improve the performance, accuracy, and training speed of your machine learning models. It's the foundation upon which successful models are built.
Topic: Handling Categorical Data
What is it?
Categorical data represents types of data which may be divided into groups. Examples include gender (Male, Female), colors (Red, Green, Blue), or cities (New York, London, Tokyo). Machine learning algorithms, being mathematical, cannot directly process these text-based categories.
The Problem: ML models don't understand text like 'Male', 'Female', or 'USA'.
Machine learning algorithms are designed to work with numerical input. When they encounter categorical data in text format, they cannot perform calculations or identify patterns. For instance, an algorithm cannot directly compare 'Male' and 'Female' in a mathematical sense. We need a way to convert these categories into a numerical representation that the algorithms can understand without implying any false relationships or order.
The Solution: One-Hot Encoding with sklearn.preprocessing.OneHotEncoder.
What is it?
One-Hot Encoding is a technique used to convert categorical variables into a numerical format that can be provided to machine learning algorithms. For each unique category in a column, it creates a new binary (0 or 1) column. If an observation belongs to a category, the corresponding new column will have a 1, and 0 otherwise.
Why is it important?
One-Hot Encoding is crucial because it transforms categorical data into a numerical format without implying any ordinal relationship or magnitude between categories. For example, if you simply assigned numbers (e.g., Male=0, Female=1), the algorithm might incorrectly assume that 'Female' is greater than or somehow superior to 'Male'. One-hot encoding avoids this by treating each category as an independent feature.
How do we use it?
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample DataFrame with a categorical column
data = {
"City": ["New York", "London", "Paris", "New York", "London"],
"Temperature": [20, 15, 22, 21, 14]
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Initialize the OneHotEncoder
# handle_unknown=\'ignore\': If a new category appears during testing that wasn\'t in training, it will be ignored.
# sparse_output=False: Returns a dense NumPy array instead of a sparse matrix.
encoder = OneHotEncoder(handle_unknown=\'ignore\', sparse_output=False)
# Fit the encoder to the \"City\" column and transform it
# .values.reshape(-1, 1) is used because OneHotEncoder expects a 2D array
encoded_features = encoder.fit_transform(df[["City"]])
# Get the new column names generated by the encoder
encoded_column_names = encoder.get_feature_names_out(["City"])
# Create a DataFrame from the encoded features
df_encoded = pd.DataFrame(encoded_features, columns=encoded_column_names)
# Concatenate the original DataFrame (excluding the original categorical column) with the new encoded columns
df_final = pd.concat([df.drop("City", axis=1), df_encoded], axis=1)
print(f"DataFrame after One-Hot Encoding:\n{df_final}")
Code Explanation & Output:
- from sklearn.preprocessing import OneHotEncoder: Imports the necessary class.
- encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False): Initializes the encoder. handle_unknown='ignore' is good practice to prevent errors if unseen categories appear later. sparse_output=False ensures the output is a dense NumPy array, which is easier to work with for beginners.
- encoded_features = encoder.fit_transform(df[["City"]]): This is a two-step process. fit(): the encoder learns all unique categories present in the "City" column. transform(): the encoder then converts these categories into the one-hot encoded numerical format. We pass df[["City"]] (note the double brackets) because OneHotEncoder expects a 2D array-like input, even for a single column.
- encoded_column_names = encoder.get_feature_names_out(["City"]): Retrieves the names of the new columns created by the encoder (e.g., City_London, City_New York).
- df_encoded = pd.DataFrame(encoded_features, columns=encoded_column_names): Converts the NumPy array of encoded features into a Pandas DataFrame with appropriate column names.
- df_final = pd.concat([df.drop("City", axis=1), df_encoded], axis=1): Combines the original DataFrame (after dropping the original "City" column) with the newly created one-hot encoded columns. axis=1 indicates concatenation along columns.
Original DataFrame:
City Temperature
0 New York 20
1 London 15
2 Paris 22
3 New York 21
4 London 14
DataFrame after One-Hot Encoding:
Temperature City_London City_New York City_Paris
0 20 0.0 1.0 0.0
1 15 1.0 0.0 0.0
2 22 0.0 0.0 1.0
3 21 0.0 1.0 0.0
4 14 1.0 0.0 0.0
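As an aside, Pandas itself offers pd.get_dummies() for quick one-hot encoding during exploratory work; OneHotEncoder is usually the better fit inside a machine learning workflow because it can be fit on training data and then reused consistently on new data. A minimal sketch on the same City/Temperature data:
import pandas as pd
data = {
    "City": ["New York", "London", "Paris", "New York", "London"],
    "Temperature": [20, 15, 22, 21, 14]
}
df = pd.DataFrame(data)
# get_dummies replaces the "City" column with one indicator column per category
df_dummies = pd.get_dummies(df, columns=["City"])
print(df_dummies)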
Topic: Scaling Numerical Data
What is it?
Scaling numerical data is a preprocessing step that transforms numerical features to a standard range. This is particularly important when features have different units or vastly different scales.
The Problem: Features on different scales (e.g., age from 0-100 vs. salary from 30k-200k) can confuse models.
Many machine learning algorithms, especially those that rely on distance calculations (like K-Nearest Neighbors, Support Vector Machines, or neural networks), are sensitive to the scale of input features. If one feature has a much larger range of values than another, the algorithm might implicitly give more weight to the feature with the larger scale, even if it's not more important. This can lead to suboptimal model performance.
For example, if you have a dataset with 'Age' (ranging from 0-100) and 'Salary' (ranging from 30,000-200,000), a distance-based algorithm might consider a difference of 10,000 in salary to be less significant than a difference of 10 in age, simply because the absolute numerical difference is smaller, even if the percentage change or real-world impact is much larger for age.
Scaling ensures that all features contribute equally to the distance calculations and that the model doesn't get biased towards features with larger numerical values.
The Solution 1: Standardization with sklearn.preprocessing.StandardScaler.
What is it?
Standardization (or Z-score normalization) transforms data such that it has a mean of 0 and a standard deviation of 1. It achieves this by subtracting the mean from each value and then dividing by the standard deviation.
Formula: z = (x - mean) / standard_deviation
Why is it important?
Standardization is particularly useful for algorithms that assume your data is normally distributed or that use gradient descent (like linear regression, logistic regression, neural networks). It helps these algorithms converge faster and perform better by placing all features on a similar scale around zero.
How do we use it?
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Sample DataFrame with numerical columns of different scales
data = {
"Age": [25, 30, 35, 28, 22],
"Salary": [70000, 85000, 90000, 72000, 65000]
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit the scaler to the data and transform it
# We select all numerical columns to scale
scaled_features = scaler.fit_transform(df[["Age", "Salary"]])
# Create a DataFrame from the scaled features
df_scaled = pd.DataFrame(scaled_features, columns=["Age_Scaled", "Salary_Scaled"])
# Concatenate with original DataFrame (optional, for comparison)
df_final = pd.concat([df, df_scaled], axis=1)
print(f"DataFrame after Standardization (StandardScaler):\n{df_final}")
Code Explanation & Output:
- from sklearn.preprocessing import StandardScaler: Imports the StandardScaler class.
- scaler = StandardScaler(): Initializes the scaler.
- scaled_features = scaler.fit_transform(df[["Age", "Salary"]]): This learns the mean and standard deviation for each selected column (fit()) and then applies the scaling transformation (transform()). Again, double brackets [[ ]] are used to pass a DataFrame (2D array) to the scaler.
- df_scaled = pd.DataFrame(scaled_features, columns=["Age_Scaled", "Salary_Scaled"]): Creates a new DataFrame from the scaled NumPy array, assigning new column names to indicate they are scaled.
- df_final = pd.concat([df, df_scaled], axis=1): Combines the original and scaled DataFrames for easy comparison.
Original DataFrame:
Age Salary
0 25 70000
1 30 85000
2 35 90000
3 28 72000
4 22 65000
DataFrame after Standardization (StandardScaler):
Age Salary Age_Scaled Salary_Scaled
0 25 70000 -0.677631 -0.675220
1 30 85000 0.451754 0.907326
2 35 90000 1.581139 1.434842
3 28 72000 0.000000 -0.464214
4 22 65000 -1.355262 -1.202735
Notice that after scaling, the Age_Scaled and Salary_Scaled columns are centered around 0 with a standard deviation of 1; because StandardScaler uses the population standard deviation, this holds exactly even for a sample of only 5 points.
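A quick numerical check of this, rescaling just the Age column:
import numpy as np
from sklearn.preprocessing import StandardScaler
ages = np.array([[25], [30], [35], [28], [22]], dtype=float)  # 2D input, one column
scaled = StandardScaler().fit_transform(ages)
print(scaled.mean())  # approximately 0.0 (up to floating-point error)
print(scaled.std())   # approximately 1.0 (np.std also uses the population formula by default)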
The Solution 2: Normalization with sklearn.preprocessing.MinMaxScaler.
What is it?
Normalization (or Min-Max scaling) transforms data to a fixed range, usually between 0 and 1. It achieves this by subtracting the minimum value from each data point and then dividing by the range (maximum value - minimum value).
Formula: x_scaled = (x - min) / (max - min)
Why is it important?
Normalization is useful when you need features to be within a specific bounded range. It's often preferred for algorithms that don't assume a specific distribution of the data, such as neural networks with activation functions that are sensitive to input ranges (e.g., sigmoid or tanh).
How do we use it?
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Sample DataFrame with numerical columns
data = {
"Age": [25, 30, 35, 28, 22],
"Salary": [70000, 85000, 90000, 72000, 65000]
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Initialize the MinMaxScaler
scaler = MinMaxScaler()
# Fit the scaler to the data and transform it
scaled_features = scaler.fit_transform(df[["Age", "Salary"]])
# Create a DataFrame from the scaled features
df_scaled = pd.DataFrame(scaled_features, columns=["Age_Normalized", "Salary_Normalized"])
# Concatenate with original DataFrame (optional, for comparison)
df_final = pd.concat([df, df_scaled], axis=1)
print(f"DataFrame after Normalization (MinMaxScaler):\n{df_final}")
Code Explanation & Output:
- from sklearn.preprocessing import MinMaxScaler: Imports the MinMaxScaler class.
- scaler = MinMaxScaler(): Initializes the scaler.
- scaled_features = scaler.fit_transform(df[["Age", "Salary"]]): Learns the minimum and maximum values for each selected column and then applies the Min-Max scaling.
- df_scaled = pd.DataFrame(scaled_features, columns=["Age_Normalized", "Salary_Normalized"]): Creates a new DataFrame from the scaled NumPy array, assigning new column names.
Original DataFrame:
Age Salary
0 25 70000
1 30 85000
2 35 90000
3 28 72000
4 22 65000
DataFrame after Normalization (MinMaxScaler):
Age Salary Age_Normalized Salary_Normalized
0 25 70000 0.230769 0.200000
1 30 85000 0.615385 0.800000
2 35 90000 1.000000 1.000000
3 28 72000 0.461538 0.280000
4 22 65000 0.000000 0.000000
Notice that after normalization, all values in Age_Normalized and Salary_Normalized are between 0 and 1: the minimum of each column maps to 0 and the maximum maps to 1.
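A quick manual check of the formula against the Age column above (min = 22, max = 35), using plain NumPy:
import numpy as np
ages = np.array([25, 30, 35, 28, 22], dtype=float)
# x_scaled = (x - min) / (max - min)
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(ages_scaled)  # [0.23076923 0.61538462 1. 0.46153846 0.] -- matches Age_Normalized above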
When to use Standardization vs. Normalization:
- Standardization is generally preferred when the data follows a Gaussian (normal) distribution, or when algorithms assume zero mean and unit variance (e.g., Linear Regression, Logistic Regression, SVMs, Neural Networks).
- Normalization is useful when your data has a fixed range or when you need to preserve the relationships between the original values (e.g., image processing, neural networks with sigmoid activation).
- If your data has outliers, MinMaxScaler can be strongly affected, because a single extreme value sets the min or max and compresses every other value into a narrow band; StandardScaler is somewhat less sensitive. Scikit-learn also provides RobustScaler (based on the median and interquartile range) for data where outliers are a real concern.
Choosing between standardization and normalization often depends on the specific machine learning algorithm you plan to use and the characteristics of your data. It's a common practice to try both and see which one yields better results for your particular problem.
Module 5: Hands-On Capstone Project - Analyzing the Titanic Dataset
Topic: The Goal
What is it?
This module is dedicated to a hands-on project where we will apply the fundamental data science skills we've learned using NumPy, Pandas, Matplotlib, Seaborn, and Scikit-learn's preprocessing tools. Our focus will be on a real-world dataset: the famous Titanic passenger data.
Why is it important?
Learning theoretical concepts is essential, but applying them to a real dataset is where your understanding truly solidifies. This project will walk you through a typical initial data science workflow: loading data, exploring it to understand its characteristics, cleaning it to handle issues like missing values, visualizing it to uncover patterns, and preparing it for potential future use in machine learning models. It's a crucial step before you would build predictive models.
Our goal is NOT to build a predictive model yet. While the Titanic dataset is often used for classification (predicting survival), in this module, we will focus purely on the data exploration, cleaning, and preparation steps. This is often the most time-consuming part of a data science project, and mastering it is fundamental before you even think about applying machine learning algorithms.
By the end of this module, you will have a clean, well-understood, and preprocessed dataset, ready for the next steps in a data science pipeline.
Topic: The Dataset
What is it?
The dataset we will be using is the famous Titanic dataset, commonly available on platforms like Kaggle. It contains information about the passengers aboard the RMS Titanic when it sank in 1912.
Why is it important?
The Titanic dataset is a classic dataset for beginners in data science and machine learning. It's relatively small, easy to understand, and contains a mix of numerical and categorical features, as well as missing values, making it perfect for practicing data cleaning, exploration, and preprocessing techniques.
Introduce the Kaggle Titanic dataset. Explain the columns (Pclass, Sex, Age, Fare, Embarked, Survived).
The dataset typically includes the following columns:
- PassengerId: A unique identifier for each passenger.
- Survived: This is the target variable (though we won't be predicting it in this module). It indicates whether the passenger survived (1) or not (0).
- Pclass: Passenger class (1st, 2nd, or 3rd). This is a categorical feature representing socioeconomic status.
- Name: The passenger's name.
- Sex: The passenger's gender (male or female). This is a categorical feature.
- Age: The passenger's age in years. This is a numerical feature and often contains missing values.
- SibSp: Number of siblings/spouses aboard the Titanic. Numerical.
- Parch: Number of parents/children aboard the Titanic. Numerical.
- Ticket: Ticket number. String.
- Fare: The fare paid for the ticket. Numerical.
- Cabin: Cabin number. String, often contains many missing values.
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). Categorical, may contain missing values.
Our focus will be on using features like Pclass, Sex, Age, Fare, and Embarked to understand the passenger demographics and prepare this data for potential future modeling.
Topic: Step 1: Load and Explore the Data
What is it?
Loading and exploring data is the very first practical step in any data analysis project. It involves reading the dataset into a Pandas DataFrame and then using basic DataFrame methods to get a high-level overview of its structure, content, and initial characteristics.
Why is it important?
This initial exploration helps you quickly understand the dataset, identify potential issues (like missing values or incorrect data types), and form preliminary hypotheses. It sets the stage for all subsequent data cleaning, transformation, and analysis steps.
How do we use it?
We will simulate the titanic.csv file for this example. In a real scenario, you would download this file from Kaggle or another source and place it in your working directory.
import pandas as pd
import numpy as np
import os
# Simulate creating a titanic.csv file for demonstration purposes
titanic_data = {
"PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
"Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
"Name": [
"Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
"Allen, Mr. William Henry", "Moran, Mr. James",
"McCarthy, Mr. Timothy J", "Palsson, Master. Gosta Leonard",
"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)", "Nasser, Mrs. Nicholas (Adele Achem)"
],
"Sex": ["male", "female", "female", "female", "male", "male", "male", "male", "female", "female"],
"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
"SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
"Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "347742", "237736"],
"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708],
"Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan, np.nan],
"Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C"]
}
df_titanic_dummy = pd.DataFrame(titanic_data)
csv_file_path = "titanic.csv"
df_titanic_dummy.to_csv(csv_file_path, index=False)
print(f"Dummy \'{csv_file_path}\' created for demonstration.\n")
# Use Pandas to load titanic.csv
df = pd.read_csv(csv_file_path)
print("--- First 5 rows (df.head()) ---\n")
print(df.head())
print("\n" + "-"*40 + "\n")
print("--- DataFrame Info (df.info()) ---\n")
df.info()
print("\n" + "-"*40 + "\n")
print("--- Descriptive Statistics (df.describe()) ---\n")
print(df.describe())
print("\n" + "-"*40 + "\n")
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"Dummy \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Simulating titanic.csv: We create a small DataFrame df_titanic_dummy that mimics the structure and data types of the actual Titanic dataset, including some np.nan values to represent missing data in Age and Cabin.
- df_titanic_dummy.to_csv(csv_file_path, index=False): Saves this dummy DataFrame to a CSV file named titanic.csv.
- df = pd.read_csv(csv_file_path): This is the actual command you would use to load the Titanic dataset from a CSV file into a DataFrame named df.
- df.head(): Displays the first 5 rows of the DataFrame. This gives you a quick visual inspection of the data, including column names and a few sample entries.
- df.info(): Provides a summary of the DataFrame, including:
  - The number of entries (rows).
  - A list of all columns.
  - The Non-Null Count for each column, which immediately tells us where missing values exist (e.g., Age has 9 non-null values out of 10, meaning 1 missing; Cabin has only 3 non-null values, meaning 7 missing).
  - The Dtype (data type) of each column (e.g., int64, float64, object for strings).
  - Memory usage.
- df.describe(): Generates descriptive statistics for numerical columns. This includes count, mean, std (standard deviation), min, 25% (first quartile), 50% (median), 75% (third quartile), and max. This helps you understand the central tendency, spread, and range of your numerical data.
Dummy 'titanic.csv' created for demonstration.
--- First 5 rows (df.head()) ---
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
----------------------------------------
--- DataFrame Info (df.info()) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 10 non-null int64
1 Survived 10 non-null int64
2 Pclass 10 non-null int64
3 Name 10 non-null object
4 Sex 10 non-null object
5 Age 9 non-null float64
6 SibSp 10 non-null int64
7 Parch 10 non-null int64
8 Ticket 10 non-null object
9 Fare 10 non-null float64
10 Cabin 3 non-null object
11 Embarked 10 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 1.1+ KB
----------------------------------------
--- Descriptive Statistics (df.describe()) ---
PassengerId Survived Pclass Age SibSp Parch \
count 10.000000 10.000000 10.000000 9.000000 10.000000 10.000000
mean 5.500000 0.500000 2.300000 28.111111 0.700000 0.300000
std 3.027650 0.527046 0.948683 14.586689 1.059301 0.674945
min 1.000000 0.000000 1.000000 2.000000 0.000000 0.000000
25% 3.250000 0.000000 1.250000 22.000000 0.000000 0.000000
50% 5.500000 0.500000 3.000000 27.000000 0.500000 0.000000
75% 7.750000 1.000000 3.000000 35.000000 1.000000 0.000000
max 10.000000 1.000000 3.000000 54.000000 3.000000 2.000000
Fare
count 10.000000
mean 29.729310
std 22.779848
min 7.250000
25% 8.050000
50% 16.266650
75% 50.672500
max 71.283300
----------------------------------------
Dummy 'titanic.csv' removed.
Initial Observations: From df.info(), we immediately see that the Age and Cabin columns have missing values. Cabin has a significant number of missing values (only 3 non-null out of 10 entries), which might make it difficult to use directly. Age has fewer missing values, which we can likely impute. df.describe() gives us a sense of the range and distribution of numerical features like Age and Fare.
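As a small extension of df.isnull().sum(), calling .mean() on the boolean frame expresses missingness as a proportion per column, which makes it easier to decide what to drop versus impute. A minimal sketch on a tiny made-up frame:
import pandas as pd
import numpy as np
df_demo = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})
# isnull() gives booleans; the mean of booleans is the fraction of True values
print(df_demo.isnull().mean())  # Age 0.50, Cabin 0.75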
Topic: Step 2: Clean the Data
What is it?
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. It is a crucial step in the data science pipeline, as dirty data can lead to inaccurate models and misleading insights.
Why is it important?
Machine learning models are highly sensitive to the quality of the data they are trained on. Missing values, inconsistent formats, or erroneous entries can cause models to perform poorly or even fail to train. Cleaning the data ensures that our dataset is reliable and suitable for analysis and modeling.
How do we use it?
Based on our initial exploration, we identified missing values in the Age and Cabin columns. The Embarked column might also have a few missing values in the full dataset. We will focus on Age and Embarked, as Cabin has too many missing values to be easily imputed for a beginner project.
Let's start by recreating our dummy titanic.csv file to ensure we have a consistent starting point with missing values.
import pandas as pd
import numpy as np
import os
# Re-simulate creating a titanic.csv file for demonstration purposes
titanic_data = {
"PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
"Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
"Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"Name": [
"Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
"Allen, Mr. William Henry", "Moran, Mr. James",
"McCarthy, Mr. Timothy J", "Palsson, Master. Gosta Leonard",
"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)", "Nasser, Mrs. Nicholas (Adele Achem)",
"Bonnell, Miss. Elizabeth", "Saundercock, Mr. William Henry"
],
"Sex": ["male", "female", "female", "female", "male", "male", "male", "male", "female", "female", "female", "male"],
"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 58.0, np.nan],
"SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 0, 0],
"Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0],
"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "347742", "237736", "113781", "A/5. 2151"],
"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708, 26.55, 8.05],
"Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan, np.nan, "C103", np.nan],
"Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", np.nan, "S"]
}
df = pd.DataFrame(titanic_data)
csv_file_path = "titanic.csv"
df.to_csv(csv_file_path, index=False)
print(f"Dummy \'{csv_file_path}\' created for demonstration.\n")
# Load the dataset
df = pd.read_csv(csv_file_path)
# Use .isnull().sum() to find missing values (especially in 'Age' and 'Embarked')
print(f"Missing values before cleaning:\n{df.isnull().sum()}\n")
# Demonstrate filling the missing 'Age' values with the median age.
# The median is often preferred over the mean for skewed distributions or when outliers are present.
median_age = df["Age"].median()
df["Age"] = df["Age"].fillna(median_age)  # assign the result back rather than using inplace=True
print(f"Median Age used for imputation: {median_age}\n")
# Demonstrate filling the missing 'Embarked' values with the mode.
# The mode is the most frequent value, suitable for categorical data.
mode_embarked = df["Embarked"].mode()[0] # .mode() can return multiple modes, so we take the first one
df["Embarked"] = df["Embarked"].fillna(mode_embarked)
print(f"Mode Embarked port used for imputation: {mode_embarked}\n")
print(f"Missing values after cleaning:\n{df.isnull().sum()}\n")
print("DataFrame after cleaning (first few rows):\n")
print(df.head())
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"\nDummy \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Re-simulating titanic.csv: We create a slightly larger dummy dataset to better illustrate the missing values in Age and Embarked.
- df.isnull().sum(): This is our first check to confirm the presence of missing values in Age and Embarked before we start cleaning.
- median_age = df["Age"].median(): Calculates the median of the Age column. The median is the middle value in a sorted list of numbers and is robust to outliers, making it a good choice for imputing numerical data.
- df["Age"] = df["Age"].fillna(median_age): This line fills all NaN values in the Age column with the calculated median_age and assigns the result back to the column. (Assigning the result back is preferred over fillna(..., inplace=True), which may not update the original DataFrame under newer Pandas versions with copy-on-write.)
- mode_embarked = df["Embarked"].mode()[0]: Calculates the mode (most frequent value) of the Embarked column. Since mode() can return multiple values if there's a tie, we select the first one with [0].
- df["Embarked"] = df["Embarked"].fillna(mode_embarked): Fills NaN values in the Embarked column with the calculated mode_embarked.
- The final df.isnull().sum() check confirms that the Age and Embarked columns no longer have missing values. We still see missing values in Cabin because we chose not to impute it, due to the high number of missing entries.
Dummy 'titanic.csv' created for demonstration.
Missing values before cleaning:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 2
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 8
Embarked 1
dtype: int64
Median Age used for imputation: 31.0
Mode Embarked port used for imputation: S
Missing values after cleaning:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 8
Embarked 0
dtype: int64
DataFrame after cleaning (first few rows):
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Dummy 'titanic.csv' removed.
Choosing Imputation Strategy: The choice of imputation method (mean, median, mode, or more advanced techniques) depends on the data distribution and the context. For numerical data, median is often preferred for skewed distributions or presence of outliers, while mean is suitable for normally distributed data. For categorical data, mode is a common choice.
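A slightly more refined (and entirely optional) variation is to impute within groups, for example filling a missing Age with the median Age of that passenger's Pclass rather than a single global median. A minimal sketch on a tiny made-up frame whose column names mirror the Titanic data:
import pandas as pd
import numpy as np
df_demo = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [38.0, np.nan, 22.0, 26.0, np.nan],
})
# transform("median") computes the median Age per Pclass, aligned row by row
group_medians = df_demo.groupby("Pclass")["Age"].transform("median")
df_demo["Age"] = df_demo["Age"].fillna(group_medians)
print(df_demo)  # the 1st-class gap becomes 38.0, the 3rd-class gap becomes 24.0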
Topic: Step 3: Visualize the Data to Find Insights (EDA)
What is it?
After loading and cleaning our data, the next critical step is to visualize it. Data visualization, as we learned in Module 3, is about creating graphical representations of our data to uncover patterns, relationships, and distributions that are difficult to discern from raw numbers alone. This is a key part of Exploratory Data Analysis (EDA).
Why is it important?
Visualizations help us to:
- Understand distributions: How are ages distributed? What is the spread of fares?
- Identify relationships: Is there a relationship between passenger class and survival? Does gender play a role?
- Spot anomalies: Are there any unusual data points that need further investigation?
- Communicate findings: Presenting data visually makes it easier for others to understand your insights.
It's like drawing a map of a new territory – it helps you navigate and understand the landscape of your data.
How do we use it?
We will use Seaborn and Matplotlib to create various plots to gain insights from the cleaned Titanic dataset. First, let's ensure we have a fresh, cleaned DataFrame to work with.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Re-simulate creating a titanic.csv file for demonstration purposes
titanic_data = {
"PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
"Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
"Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"Name": [
"Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
"Allen, Mr. William Henry", "Moran, Mr. James",
"McCarthy, Mr. Timothy J", "Palsson, Master. Gosta Leonard",
"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)", "Nasser, Mrs. Nicholas (Adele Achem)",
"Bonnell, Miss. Elizabeth", "Saundercock, Mr. William Henry"
],
"Sex": ["male", "female", "female", "female", "male", "male", "male", "male", "female", "female", "female", "male"],
"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 58.0, np.nan],
"SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 0, 0],
"Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0],
"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "347742", "237736", "113781", "A/5. 2151"],
"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708, 26.55, 8.05],
"Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan, np.nan, "C103", np.nan],
"Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", np.nan, "S"]
}
df = pd.DataFrame(titanic_data)
csv_file_path = "titanic.csv"
df.to_csv(csv_file_path, index=False)
# Load the dataset and clean it (as done in Step 2)
df = pd.read_csv(csv_file_path)
median_age = df["Age"].median()
df["Age"] = df["Age"].fillna(median_age)
mode_embarked = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(mode_embarked)
print("Cleaned DataFrame (first 5 rows):\n", df.head())
print("\n" + "-"*40 + "\n")
# Use Seaborn's countplot to see the distribution of survivors.
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x="Survived")
plt.title("Distribution of Survivors (0=No, 1=Yes)")
plt.xlabel("Survived")
plt.ylabel("Count")
plt.show()
print("\n" + "-"*40 + "\n")
# Use Seaborn's countplot with the hue parameter to see how survival relates to 'Sex', 'Pclass', and 'Embarked'.
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x="Sex", hue="Survived")
plt.title("Survival Count by Sex")
plt.xlabel("Sex")
plt.ylabel("Count")
plt.legend(title="Survived", labels=["No", "Yes"])
plt.show()
print("\n" + "-"*40 + "\n")
plt.figure(figsize=(7, 5))
sns.countplot(data=df, x="Pclass", hue="Survived")
plt.title("Survival Count by Passenger Class")
plt.xlabel("Passenger Class")
plt.ylabel("Count")
plt.legend(title="Survived", labels=["No", "Yes"])
plt.show()
print("\n" + "-"*40 + "\n")
plt.figure(figsize=(7, 5))
sns.countplot(data=df, x="Embarked", hue="Survived")
plt.title("Survival Count by Embarked Port")
plt.xlabel("Embarked Port")
plt.ylabel("Count")
plt.legend(title="Survived", labels=["No", "Yes"])
plt.show()
print("\n" + "-"*40 + "\n")
# Use Matplotlib's hist to visualize the distribution of 'Age' and 'Fare'.
plt.figure(figsize=(8, 5))
plt.hist(df["Age"], bins=20, edgecolor=\'black\')
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
print("\n" + "-"*40 + "\n")
plt.figure(figsize=(8, 5))
plt.hist(df["Fare"], bins=30, edgecolor=\'black\')
plt.title("Distribution of Fare")
plt.xlabel("Fare")
plt.ylabel("Frequency")
plt.show()
print("\n" + "-"*40 + "\n")
# Use Seaborn's heatmap to show a correlation matrix of the numerical columns.
# Select only numerical columns for correlation calculation
numerical_df = df.select_dtypes(include=[np.number])
correlation_matrix = numerical_df.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix of Numerical Features")
plt.show()
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"\nDummy \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Setup: We re-create and clean the dummy titanic.csv to ensure consistency.
- `sns.countplot(data=df, x="Survived")`: This plot shows the number of passengers who survived (1) versus those who did not (0). It gives us a quick overview of the target variable's distribution.
- `sns.countplot(data=df, x="Sex", hue="Survived")`: By adding `hue="Survived"`, we can see the survival counts broken down by gender. This is a powerful way to visually explore relationships between categorical variables and survival. We also add a `plt.legend` to clarify what 0 and 1 represent for `Survived`.
- `sns.countplot(data=df, x="Pclass", hue="Survived")`: Similar to the above, this plot visualizes survival counts across the three passenger classes (1st, 2nd, 3rd).
- `sns.countplot(data=df, x="Embarked", hue="Survived")`: This plot shows survival counts based on the port of embarkation.
- `plt.hist(df["Age"], bins=20, edgecolor='black')`: We use Matplotlib's `hist` function to visualize the distribution of the `Age` column. `bins` controls the number of bars, and `edgecolor` adds borders for better visibility. This helps us understand the age demographics of the passengers.
- `plt.hist(df["Fare"], bins=30, edgecolor='black')`: Similarly, this plots the distribution of `Fare`, showing how ticket prices are distributed.
- `numerical_df = df.select_dtypes(include=[np.number])`: Before calculating correlations, it's good practice to select only the numerical columns from the DataFrame, as correlation is a numerical concept.
- `correlation_matrix = numerical_df.corr()`: This calculates the pairwise correlation between all numerical columns in `numerical_df`. The correlation coefficient ranges from -1 to 1, where 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear correlation (a quick ranked view of these correlations follows this list).
- `sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')`: This creates a heatmap of the correlation matrix. `annot=True` displays the correlation values on the map, `cmap='coolwarm'` sets a diverging color map (blue for strongly negative values through red for strongly positive ones), and `fmt='.2f'` formats the annotations to two decimal places. This visualization quickly highlights which numerical features are strongly correlated with each other.
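To complement the heatmap, it is often useful to rank how strongly each numerical feature correlates with the target. Here is a minimal sketch, assuming the `correlation_matrix` computed above is still in memory:
# Assumes `correlation_matrix` from the code above is still in memory.
# Rank every other numerical feature by its correlation with "Survived".
survival_corr = correlation_matrix["Survived"].drop("Survived").sort_values(ascending=False)
print(survival_corr)
This prints the same numbers shown in the heatmap's Survived row, just sorted for easier reading.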
Cleaned DataFrame (first 5 rows):
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
----------------------------------------
# A plot window will appear showing the distribution of survivors.
----------------------------------------
# A plot window will appear showing survival counts by sex.
----------------------------------------
# A plot window will appear showing survival counts by passenger class.
----------------------------------------
# A plot window will appear showing survival counts by embarked port.
----------------------------------------
# A plot window will appear showing the distribution of age.
----------------------------------------
# A plot window will appear showing the distribution of fare.
----------------------------------------
# A plot window will appear showing the correlation matrix heatmap.
Dummy 'titanic.csv' removed.
Insights from Visualizations:
- The `Survived` countplot shows that more passengers did not survive than survived.
- The `Sex` vs. `Survived` plot clearly indicates that a much higher proportion of females survived compared to males. This is a very strong indicator (quantified in the sketch below).
- The `Pclass` vs. `Survived` plot suggests that passengers in 1st class had a higher survival rate than those in 2nd or 3rd class, indicating that socioeconomic status played a role.
- The `Embarked` vs. `Survived` plot might show some differences, but generally, Southampton (`S`) had the most passengers, and its survival rate might appear lower simply because of the sheer number of people who boarded there.
- The `Age` histogram shows a distribution skewed towards younger ages, with a peak in the 20s and 30s. There are also some very young passengers.
- The `Fare` histogram shows that most passengers paid lower fares, with a long tail indicating a few very expensive tickets.
- The correlation heatmap helps us quickly see relationships between numerical features. For example, `Fare` and `Pclass` might show a negative correlation (a lower `Pclass` number means a higher class, which tends to come with a higher fare). `SibSp` and `Parch` might be positively correlated, as they both relate to family size.
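These visual impressions can be quantified with a quick `groupby`. This is a minimal sketch, assuming the cleaned `df` from the visualization code above (with `Sex` still stored as "male"/"female") is in memory:
# Survival rate (mean of the 0/1 Survived column) per group
print(df.groupby("Sex")["Survived"].mean())        # by gender
print(df.groupby("Pclass")["Survived"].mean())     # by passenger class
print(df.groupby("Embarked")["Survived"].mean())   # by port of embarkation
Because `Survived` is coded as 0/1, the mean within each group is exactly that group's survival rate.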
Topic: Step 4: Preprocess the Data for a Future Model
What is it?
Data preprocessing is the final stage of preparing our data before it can be fed into a machine learning model. This involves transforming categorical variables into numerical ones and scaling numerical features to ensure they are on a similar range. These steps are crucial because most machine learning algorithms require numerical input and perform better when features are scaled.
Why is it important?
As discussed in Module 4, machine learning algorithms are mathematical constructs that operate on numbers. They cannot directly interpret text-based categories or handle features with vastly different scales. Preprocessing ensures that our data is in the optimal format for the algorithms to learn effectively, leading to more accurate and robust models.
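To make the scaling point concrete, compare the raw ranges of the numerical features. This is a small illustrative sketch, assuming the cleaned `df` from Step 2 is available:
# Age and Fare live on very different scales; on the full Titanic dataset Age
# spans roughly 0-80 while Fare reaches several hundred, so distance- or
# gradient-based models would otherwise be dominated by Fare.
print(df[["Age", "Fare"]].describe().loc[["min", "max", "mean", "std"]])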
How do we use it?
We will apply the techniques learned in Module 4 (one-hot encoding and scaling) to our cleaned Titanic dataset. We will continue from the cleaned DataFrame we prepared in Step 2.
Let's ensure we have a fresh, cleaned DataFrame to work with.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import os
# Re-simulate creating a titanic.csv file for demonstration purposes
titanic_data = {
"PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
"Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
"Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"Name": [
"Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
"Allen, Mr. William Henry", "Moran, Mr. James",
"McCarthy, Mr. Timothy J", "Palsson, Master. Gosta Leonard",
"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)", "Nasser, Mrs. Nicholas (Adele Achem)",
"Bonnell, Miss. Elizabeth", "Saundercock, Mr. William Henry"
],
"Sex": ["male", "female", "female", "female", "male", "male", "male", "male", "female", "female", "female", "male"],
"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 58.0, np.nan],
"SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 0, 0],
"Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0],
"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "347742", "237736", "113781", "A/5. 2151"],
"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708, 26.55, 8.05],
"Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan, np.nan, "C103", np.nan],
"Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", np.nan, "S"]
}
df = pd.DataFrame(titanic_data)
csv_file_path = "titanic.csv"
df.to_csv(csv_file_path, index=False)
# Load the dataset and clean it (as done in Step 2)
df = pd.read_csv(csv_file_path)
median_age = df["Age"].median()
df["Age"].fillna(median_age, inplace=True)
mode_embarked = df["Embarked"].mode()[0]
df["Embarked"].fillna(mode_embarked, inplace=True)
print("Cleaned DataFrame (first 5 rows):\n", df.head())
print("\n" + "-"*40 + "\n")
# Use Pandas to convert the `Sex` column to 0s and 1s.
# We can use map or replace for this binary categorical variable.
# Let's map 'male' to 0 and 'female' to 1.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
print(f"DataFrame after converting \"Sex\" to numerical:\n{df[['Sex']].head()}\n")
print("\n" + "-"*40 + "\n")
# Use Scikit-learn's OneHotEncoder on the `Embarked` column.
# First, identify the categorical column to encode
categorical_cols = ["Embarked"]
# Initialize OneHotEncoder
# handle_unknown='ignore' is important for deployment if new categories might appear
# sparse_output=False ensures a dense array output
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# Fit and transform the categorical column
# Passing df[categorical_cols] (a one-column DataFrame) gives the 2D input OneHotEncoder expects
encoded_features = ohe.fit_transform(df[categorical_cols])
# Create a DataFrame from the encoded features with proper column names
encoded_df = pd.DataFrame(encoded_features, columns=ohe.get_feature_names_out(categorical_cols))
# Concatenate the original DataFrame (dropping the original 'Embarked' column) with the new encoded columns
df = pd.concat([df.drop(categorical_cols, axis=1), encoded_df], axis=1)
print(f"DataFrame after One-Hot Encoding \"Embarked\":\n{df[["Embarked_C", "Embarked_Q", "Embarked_S"]].head()}\n")
print("\n" + "-"*40 + "\n")
# Use Scikit-learn\`s StandardScaler on the \`Age\` and \`Fare\` columns.
# Identify numerical columns to scale
numerical_cols = ["Age", "Fare"]
# Initialize StandardScaler
scaler = StandardScaler()
# Fit and transform the numerical columns
scaled_features = scaler.fit_transform(df[numerical_cols])
# Create a DataFrame from the scaled features with proper column names
scaled_df = pd.DataFrame(scaled_features, columns=[col + "_Scaled" for col in numerical_cols])
# Concatenate the original DataFrame (dropping the original 'Age' and 'Fare' columns) with the new scaled columns
df = pd.concat([df.drop(numerical_cols, axis=1), scaled_df], axis=1)
print(f"DataFrame after StandardScaler on \"Age\" and \"Fare\":\n{df[["Age_Scaled", "Fare_Scaled"]].head()}\n")
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"\nDummy \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Converting `Sex`:
  - `df["Sex"].map({"male": 0, "female": 1})`: We use the `.map()` method to replace the string values `"male"` and `"female"` with their numerical equivalents, `0` and `1` respectively. This is suitable for binary categorical variables.
- One-Hot Encoding `Embarked`:
  - `ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)`: Initializes the `OneHotEncoder`. `handle_unknown='ignore'` is important for robustness in real-world scenarios where new, unseen categories might appear (see the sketch after this list). `sparse_output=False` ensures the output is a dense NumPy array.
  - `encoded_features = ohe.fit_transform(df[categorical_cols])`: The encoder learns the unique categories in `Embarked` (`fit`) and then transforms the column into new binary columns (`transform`). We pass `df[categorical_cols]`, which selects a one-column DataFrame, because `OneHotEncoder` expects a 2D array-like input.
  - `encoded_df = pd.DataFrame(encoded_features, columns=ohe.get_feature_names_out(categorical_cols))`: Converts the NumPy array output from the encoder into a Pandas DataFrame, using `get_feature_names_out()` to get the appropriate new column names (e.g., `Embarked_C`, `Embarked_Q`, `Embarked_S`).
  - `df = pd.concat([df.drop(categorical_cols, axis=1), encoded_df], axis=1)`: The original `Embarked` column is dropped, and the newly created one-hot encoded columns are concatenated back into the DataFrame.
- Scaling `Age` and `Fare`:
  - `scaler = StandardScaler()`: Initializes the `StandardScaler`.
  - `scaled_features = scaler.fit_transform(df[numerical_cols])`: The scaler learns the mean and standard deviation of `Age` and `Fare` (`fit`) and then transforms their values to have a mean of 0 and a standard deviation of 1 (`transform`).
  - `scaled_df = pd.DataFrame(scaled_features, columns=[col + "_Scaled" for col in numerical_cols])`: Creates a new DataFrame for the scaled features with new column names (e.g., `Age_Scaled`).
  - `df = pd.concat([df.drop(numerical_cols, axis=1), scaled_df], axis=1)`: The original `Age` and `Fare` columns are dropped, and the scaled versions are added to the DataFrame.
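One practical note on `fit_transform` versus `transform`: in a real project the encoder and scaler are fitted once on the training data and then reused on any new data with `transform()` only, so the new rows are encoded and scaled with the statistics learned earlier. Here is a minimal sketch, assuming `ohe` and `scaler` are the fitted objects from the code above and using a hypothetical `new_passengers` DataFrame:
# Hypothetical new data; "X" is an unseen port to show the effect of handle_unknown='ignore'.
new_passengers = pd.DataFrame({
    "Embarked": ["Q", "X"],
    "Age": [30.0, 45.0],
    "Fare": [15.0, 80.0],
})
# transform() only: reuse the categories and statistics learned during fit.
new_embarked = ohe.transform(new_passengers[["Embarked"]])   # the unseen "X" becomes an all-zero row
new_scaled = scaler.transform(new_passengers[["Age", "Fare"]])
print(new_embarked)
print(new_scaled)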
Cleaned DataFrame (first 5 rows):
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
----------------------------------------
DataFrame after converting "Sex" to numerical:
Sex
0 0
1 1
2 1
3 1
4 0
----------------------------------------
DataFrame after One-Hot Encoding "Embarked":
Embarked_C Embarked_Q Embarked_S
0 0.0 0.0 1.0
1 1.0 0.0 0.0
2 0.0 0.0 1.0
3 0.0 0.0 1.0
4 0.0 0.0 1.0
----------------------------------------
DataFrame after StandardScaler on "Age" and "Fare":
Age_Scaled Fare_Scaled
0 -0.589857 -0.589664
1 0.600856 1.961574
2 -0.262143 -0.559400
3 0.350686 1.376940
4 0.350686 -0.539235
Dummy 'titanic.csv' removed.
Final Topic: Our Data is Ready!
What is it?
After all the steps of loading, cleaning, exploring, and preprocessing, we now have a DataFrame where all relevant features are numerical and scaled appropriately. This is the format that machine learning algorithms expect and can work with effectively.
Why is it important?
This final, transformed DataFrame is the culmination of our data preparation efforts. It represents a dataset that is clean, consistent, and optimized for machine learning. Without these steps, building accurate and reliable models would be significantly more challenging, if not impossible.
How do we use it?
Let's display the final, transformed DataFrame to see the result of all our preprocessing efforts. We will also check its info()
to confirm data types and non-null counts.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import os
# Re-simulate creating a titanic.csv file for demonstration purposes
titanic_data = {
"PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
"Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
"Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"Name": [
"Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
"Allen, Mr. William Henry", "Moran, Mr. James",
"McCarthy, Mr. Timothy J", "Palsson, Master. Gosta Leonard",
"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)", "Nasser, Mrs. Nicholas (Adele Achem)",
"Bonnell, Miss. Elizabeth", "Saundercock, Mr. William Henry"
],
"Sex": ["male", "female", "female", "female", "male", "male", "male", "male", "female", "female", "female", "male"],
"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 58.0, np.nan],
"SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 0, 0],
"Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0],
"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "347742", "237736", "113781", "A/5. 2151"],
"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708, 26.55, 8.05],
"Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan, np.nan, "C103", np.nan],
"Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", np.nan, "S"]
}
df = pd.DataFrame(titanic_data)
csv_file_path = "titanic.csv"
df.to_csv(csv_file_path, index=False)
# Load the dataset and clean it (as done in Step 2)
df = pd.read_csv(csv_file_path)
median_age = df["Age"].median()
df["Age"].fillna(median_age, inplace=True)
mode_embarked = df["Embarked"].mode()[0]
df["Embarked"].fillna(mode_embarked, inplace=True)
# Convert Sex to numerical
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
# One-Hot Encode Embarked
categorical_cols = ["Embarked"]
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_features = ohe.fit_transform(df[categorical_cols])
encoded_df = pd.DataFrame(encoded_features, columns=ohe.get_feature_names_out(categorical_cols))
df = pd.concat([df.drop(categorical_cols, axis=1), encoded_df], axis=1)
# Scale Age and Fare
numerical_cols = ["Age", "Fare"]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[numerical_cols])
scaled_df = pd.DataFrame(scaled_features, columns=[col + "_Scaled" for col in numerical_cols])
df = pd.concat([df.drop(numerical_cols, axis=1), scaled_df], axis=1)
# Drop columns that are not needed for modeling (e.g., Name, Ticket, Cabin, PassengerId)
# Cabin has too many missing values, Name and Ticket are identifiers/strings
df_final_processed = df.drop(columns=["Name", "Ticket", "Cabin", "PassengerId"], errors='ignore')
print(f"Final Processed DataFrame (first 5 rows):\n{df_final_processed.head()}\n")
print("\n" + "-"*40 + "\n")
print("Final Processed DataFrame Info:\n")
df_final_processed.info()
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"\nDummy \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Re-running all preprocessing steps: The code block re-executes all the cleaning and preprocessing steps (imputing missing values, converting `Sex`, one-hot encoding `Embarked`, and scaling `Age` and `Fare`) to arrive at the final processed DataFrame.
- `df_final_processed = df.drop(columns=["Name", "Ticket", "Cabin", "PassengerId"], errors='ignore')`: We drop columns that are typically not used directly in machine learning models. `Name` and `Ticket` are unique identifiers or strings that don't directly contribute to numerical models. `Cabin` is dropped due to its high number of missing values. `PassengerId` is an identifier and not a feature for the model.
- `df_final_processed.head()`: Shows the first few rows of the DataFrame after all transformations. You can see that `Sex` is now numerical, `Embarked` has been replaced by `Embarked_C`, `Embarked_Q`, and `Embarked_S` (one-hot encoded), and `Age` and `Fare` have been replaced by their scaled versions.
- `df_final_processed.info()`: Provides a final summary. Notice that all columns now have 12 non-null entries (no missing values in the remaining columns) and are numerical (`int64` or `float64`), which is the ideal format for machine learning algorithms.
Final Processed DataFrame (first 5 rows):
Survived Pclass Sex SibSp Parch Embarked_C Embarked_Q Embarked_S \
0 0 3 0 1 0 0.0 0.0 1.0
1 1 1 1 1 0 1.0 0.0 0.0
2 1 3 1 0 0 0.0 0.0 1.0
3 1 1 1 1 0 0.0 0.0 1.0
4 0 3 0 0 0 0.0 0.0 1.0
Age_Scaled Fare_Scaled
0 -0.589857 -0.589664
1 0.600856 1.961574
2 -0.262143 -0.559400
3 0.350686 1.376940
4 0.350686 -0.539235
----------------------------------------
Final Processed DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 12 non-null int64
1 Pclass 12 non-null int64
2 Sex 12 non-null int64
3 SibSp 12 non-null int64
4 Parch 12 non-null int64
5 Embarked_C 12 non-null float64
6 Embarked_Q 12 non-null float64
7 Embarked_S 12 non-null float64
8 Age_Scaled 12 non-null float64
9 Fare_Scaled 12 non-null float64
dtypes: float64(5), int64(5)
memory usage: 1.1 KB
Dummy 'titanic.csv' removed.
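As a final sanity check that this DataFrame is model-ready, here is a minimal sketch of separating the features from the target, assuming `df_final_processed` from the code above is still in memory:
X = df_final_processed.drop(columns=["Survived"])  # feature matrix (9 columns)
y = df_final_processed["Survived"]                 # target vector
print(X.shape, y.shape)                            # (12, 9) and (12,) for this mini dataset
These `X` and `y` objects are exactly what a Scikit-learn estimator would take as input in the next stage of a project.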
Conclusion:
Congratulations! You have successfully completed a comprehensive data preparation pipeline using Python, Pandas, Matplotlib, Seaborn, and Scikit-learn. This cleaned, transformed, and scaled DataFrame is now perfectly ready to be used as input for any machine learning algorithm. The next step in a real data science project would be to choose and train a machine learning model, evaluate its performance, and then deploy it. You now have the foundational skills to tackle these exciting challenges!