Module 0: Welcome to the Data Science Toolkit!
Module 1: NumPy - The Foundation for Numerical Data
Topic: What is NumPy?
What is it?
NumPy, which stands for Numerical Python, is the fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
Why is it important?
While Python lists are versatile, they are not optimized for numerical operations, especially when dealing with large datasets. NumPy arrays, on the other hand, are significantly faster and more memory-efficient for numerical computations. This is because NumPy is implemented in C, allowing it to perform operations much closer to the hardware.
Analogy: If standard Python lists are like flexible, general-purpose containers that can hold anything but are slow for heavy lifting, NumPy arrays are like specialized, high-performance containers specifically designed for numbers. They are built for speed and efficiency when you need to perform mathematical operations on collections of numbers, much like a forklift is designed for lifting heavy loads efficiently compared to carrying them by hand.
NumPy is the backbone of many other data science libraries in Python, including Pandas, Matplotlib, and Scikit-learn. Understanding NumPy is crucial for effectively using these libraries.
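To see the speed claim for yourself, here is a minimal benchmark sketch (an illustrative addition, not part of the course code; exact timings depend on your machine) comparing a pure-Python list comprehension with the equivalent vectorized NumPy operation using the standard timeit module.
import timeit
import numpy as np
size = 1_000_000
python_list = list(range(size))
numpy_array = np.arange(size)
# Multiply every element by 2 with a Python list comprehension
list_time = timeit.timeit(lambda: [x * 2 for x in python_list], number=10)
# Multiply every element by 2 with a single vectorized NumPy operation
numpy_time = timeit.timeit(lambda: numpy_array * 2, number=10)
print(f"Python list: {list_time:.4f} s, NumPy array: {numpy_time:.4f} s")
On most machines the vectorized version is dramatically faster, which is exactly the gap the analogy above describes.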
How do we use it?
First, we need to import the NumPy library, typically using the alias np.
import numpy as np
# Now we can use np to access NumPy functions and objects
print(np.__version__)
Code Explanation & Output:
- import numpy as np: This line imports the NumPy library and assigns it the conventional alias np. This allows us to refer to NumPy functions and objects using np. instead of the full numpy. prefix. This is a widely adopted convention in the Python data science community.
- print(np.__version__): This line prints the version of NumPy that is installed in your environment. This is a simple way to confirm that NumPy is imported correctly and to know which version you are working with.
1.26.4 # The exact version number might vary depending on your installation
Topic: Creating NumPy Arrays
NumPy arrays are the core data structure in NumPy. Let's explore several common ways to create them.
Creating arrays from Python lists
What is it?
One of the most straightforward ways to create a NumPy array is by converting a standard Python list or a list of lists into an ndarray
(N-dimensional array).
Why is it important?
This method is essential for transitioning existing Python data structures into the more efficient NumPy array format, allowing you to leverage NumPy's powerful numerical capabilities.
How do we use it?
import numpy as np
# Create a 1D array from a list
list_1d = [1, 2, 3, 4, 5]
numpy_1d = np.array(list_1d)
print(f"1D Array: {numpy_1d}")
print(f"Type of 1D Array: {type(numpy_1d)}\n")
# Create a 2D array from a list of lists
list_2d = [[10, 20, 30], [40, 50, 60]]
numpy_2d = np.array(list_2d)
print(f"2D Array:\n{numpy_2d}")
print(f"Type of 2D Array: {type(numpy_2d)}")
Code Explanation & Output:
- list_1d = [1, 2, 3, 4, 5]: Defines a standard Python list containing integers.
- numpy_1d = np.array(list_1d): Converts list_1d into a 1-dimensional NumPy array. The np.array() function is the primary way to create arrays from existing Python sequences.
- list_2d = [[10, 20, 30], [40, 50, 60]]: Defines a Python list of lists, which represents a 2-dimensional structure.
- numpy_2d = np.array(list_2d): Converts list_2d into a 2-dimensional NumPy array. NumPy automatically infers the dimensions from the nested list structure.
1D Array: [1 2 3 4 5]
Type of 1D Array: <class 'numpy.ndarray'>
2D Array:
[[10 20 30]
[40 50 60]]
Type of 2D Array: <class 'numpy.ndarray'>
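One detail worth knowing (an illustrative sketch, not part of the lesson's example): np.array() stores a single data type for the whole array, upcasting mixed inputs, and it accepts an explicit dtype argument if you want to control the type yourself.
import numpy as np
# Mixing integers and a float: NumPy upcasts everything to float64
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
# Request a specific data type explicitly
as_float = np.array([1, 2, 3], dtype=np.float64)
print(as_float)  # [1. 2. 3.]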
Using np.arange()
What is it?
np.arange()
is a NumPy function that returns evenly spaced values within a given interval. It's similar to Python's built-in range()
function but returns a NumPy array.
Why is it important?
It's incredibly useful for creating sequences of numbers, which are often needed for indexing, generating sample data, or defining ranges for plots.
How do we use it?
import numpy as np
# Create an array from 0 up to (but not including) 10
arr_arange_1 = np.arange(10)
print(f"np.arange(10): {arr_arange_1}\n")
# Create an array from 5 up to (but not including) 15
arr_arange_2 = np.arange(5, 15)
print(f"np.arange(5, 15): {arr_arange_2}\n")
# Create an array with a step of 2
arr_arange_3 = np.arange(0, 20, 2)
print(f"np.arange(0, 20, 2): {arr_arange_3}")
Code Explanation & Output:
- np.arange(10): Creates a 1D array with values starting from 0 up to (but not including) 10. The default start is 0 and the default step is 1.
- np.arange(5, 15): Creates a 1D array with values starting from 5 up to (but not including) 15. Here, 5 is the start and 15 is the stop.
- np.arange(0, 20, 2): Creates a 1D array with values starting from 0 up to (but not including) 20, with a step of 2. This means it will include 0, 2, 4, ..., 18.
np.arange(10): [0 1 2 3 4 5 6 7 8 9]
np.arange(5, 15): [ 5 6 7 8 9 10 11 12 13 14]
np.arange(0, 20, 2): [ 0 2 4 6 8 10 12 14 16 18]
Using np.zeros()
What is it?
np.zeros()
creates a new array of a specified shape, filled with zeros.
Why is it important?
It's commonly used to initialize arrays when you know the size you need but don't yet have the actual data. It's a placeholder for future calculations.
How do we use it?
import numpy as np
# Create a 1D array of 5 zeros
arr_zeros_1d = np.zeros(5)
print(f"1D Zeros Array: {arr_zeros_1d}\n")
# Create a 2D array (3 rows, 4 columns) of zeros
arr_zeros_2d = np.zeros((3, 4))
print(f"2D Zeros Array:\n{arr_zeros_2d}")
Code Explanation & Output:
- np.zeros(5): Creates a 1D array containing five floating-point zeros. By default, NumPy arrays are created with a float64 data type unless specified otherwise.
- np.zeros((3, 4)): Creates a 2D array with 3 rows and 4 columns, all filled with zeros. Notice that the shape for multi-dimensional arrays is passed as a tuple.
1D Zeros Array: [0. 0. 0. 0. 0.]
2D Zeros Array:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
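If you want integer placeholders instead of the default float64, np.zeros() accepts a dtype argument (np.ones() and np.array() do as well); a quick illustrative sketch:
import numpy as np
int_zeros = np.zeros((2, 3), dtype=int)
print(int_zeros)
print(int_zeros.dtype)  # int64 on most platforms (int32 on some Windows builds)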
Using np.ones()
What is it?
Similar to np.zeros(), np.ones() creates a new array of a specified shape, but filled with ones.
Why is it important?
Useful for initializing arrays, especially in scenarios where you might want to multiply by these values or use them as a base for incremental calculations.
How do we use it?
import numpy as np
# Create a 1D array of 3 ones
arr_ones_1d = np.ones(3)
print(f"1D Ones Array: {arr_ones_1d}\n")
# Create a 2D array (2 rows, 5 columns) of ones
arr_ones_2d = np.ones((2, 5))
print(f"2D Ones Array:\n{arr_ones_2d}")
Code Explanation & Output:
- np.ones(3): Creates a 1D array containing three floating-point ones.
- np.ones((2, 5)): Creates a 2D array with 2 rows and 5 columns, all filled with ones.
1D Ones Array: [1. 1. 1.]
2D Ones Array:
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
Using np.linspace()
What is it?
np.linspace()
returns evenly spaced numbers over a specified interval. It takes the start, stop, and the number of samples to generate.
Why is it important?
This function is particularly useful for creating arrays for plotting functions, generating evenly distributed data points, or for simulations where a precise number of samples within a range is required.
How do we use it?
import numpy as np
# Create an array of 5 evenly spaced numbers between 0 and 10 (inclusive)
arr_linspace_1 = np.linspace(0, 10, 5)
print(f"np.linspace(0, 10, 5): {arr_linspace_1}\n")
# Create an array of 7 evenly spaced numbers between 1 and 20
arr_linspace_2 = np.linspace(1, 20, 7)
print(f"np.linspace(1, 20, 7): {arr_linspace_2}")
Code Explanation & Output:
- np.linspace(0, 10, 5): Creates a 1D array of 5 numbers, evenly spaced between 0 and 10. Unlike arange(), linspace() includes the stop value by default.
- np.linspace(1, 20, 7): Creates a 1D array of 7 numbers, evenly spaced between 1 and 20.
np.linspace(0, 10, 5): [ 0. 2.5 5. 7.5 10. ]
np.linspace(1, 20, 7): [ 1. 4.16666667 7.33333333 10.5 13.66666667 16.83333333
20. ]
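A practical rule of thumb (an illustrative note based on general NumPy behavior, not on the lesson text): use np.linspace() when you care about the exact number of points and want the endpoint included, and np.arange() when you think in terms of a step size; with float steps, arange excludes the endpoint and floating-point rounding can occasionally add or drop a point.
import numpy as np
# 5 points from 0 to 1, endpoint included
print(np.linspace(0, 1, 5))   # [0.   0.25 0.5  0.75 1.  ]
# Same spacing with arange: the endpoint 1 is excluded
print(np.arange(0, 1, 0.25))  # [0.   0.25 0.5  0.75]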
Topic: Array Attributes & Operations
Once you have a NumPy array, you can explore its properties (attributes) and perform various mathematical operations on it.
Array Attributes
What are they?
NumPy arrays have several attributes that provide information about the array itself, such as its dimensions, shape, size, and data type.
Why are they important?
Understanding these attributes is crucial for debugging, reshaping arrays, and ensuring your data is in the correct format for operations.
How do we use them?
Let's create a sample array and explore its attributes:
import numpy as np
# Create a 2D array
my_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
print(f"Original Array:\n{my_array}\n")
# .shape: Returns a tuple with the dimensions of the array (rows, columns)
print(f"Shape of the array: {my_array.shape}")
# .ndim: Returns the number of dimensions of the array
print(f"Number of dimensions: {my_array.ndim}")
# .size: Returns the total number of elements in the array
print(f"Total number of elements: {my_array.size}")
# .dtype: Returns the data type of the elements in the array
print(f"Data type of elements: {my_array.dtype}")
Code Explanation & Output:
- my_array = np.array(...): Creates a 2D array with 4 rows and 3 columns.
- my_array.shape: This attribute returns (4, 3), indicating 4 rows and 3 columns.
- my_array.ndim: This attribute returns 2, as it's a 2-dimensional array.
- my_array.size: This attribute returns 12, which is the total number of elements (4 rows * 3 columns).
- my_array.dtype: This attribute returns int64 (or int32, depending on your system), indicating that the elements are 64-bit integers.
Original Array:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
Shape of the array: (4, 3)
Number of dimensions: 2
Total number of elements: 12
Data type of elements: int64
Basic Arithmetic Operations
What are they?
NumPy allows you to perform element-wise arithmetic operations (addition, subtraction, multiplication, division) on arrays. This means the operation is applied to each corresponding element in the arrays.
Why are they important?
This is the core of NumPy's power for numerical computation. Performing these operations on entire arrays is much faster and more concise than using loops in standard Python.
How do we use them?
import numpy as np
array_a = np.array([1, 2, 3, 4])
array_b = np.array([10, 20, 30, 40])
print(f"Array A: {array_a}")
print(f"Array B: {array_b}\n")
# Element-wise addition
sum_array = array_a + array_b
print(f"A + B: {sum_array}")
# Element-wise subtraction
diff_array = array_b - array_a
print(f"B - A: {diff_array}")
# Element-wise multiplication
prod_array = array_a * array_b
print(f"A * B: {prod_array}")
# Element-wise division
quot_array = array_b / array_a
print(f"B / A: {quot_array}")
# Scalar operations (operation with a single number)
scalar_mult = array_a * 2
print(f"A * 2: {scalar_mult}")
scalar_add = array_a + 100
print(f"A + 100: {scalar_add}")
Code Explanation & Output:
- array_a and array_b are two 1D NumPy arrays of the same size.
- array_a + array_b: Adds corresponding elements: [1+10, 2+20, 3+30, 4+40].
- array_b - array_a: Subtracts corresponding elements: [10-1, 20-2, 30-3, 40-4].
- array_a * array_b: Multiplies corresponding elements: [1*10, 2*20, 3*30, 4*40].
- array_b / array_a: Divides corresponding elements: [10/1, 20/2, 30/3, 40/4].
- array_a * 2: Multiplies every element in array_a by 2. This is called broadcasting, where NumPy intelligently handles operations between arrays of different (but compatible) shapes.
- array_a + 100: Adds 100 to every element in array_a (another example of broadcasting).
Array A: [1 2 3 4]
Array B: [10 20 30 40]
A + B: [11 22 33 44]
B - A: [ 9 18 27 36]
A * B: [ 10 40 90 160]
B / A: [10. 10. 10. 10.]
A * 2: [2 4 6 8]
A + 100: [101 102 103 104]
Important Tip: For element-wise operations, the arrays generally need to have the same shape or be compatible for broadcasting. If they are not, NumPy will raise a ValueError.
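To make the tip concrete, here is a small illustrative sketch showing one shape combination that broadcasts successfully and one that raises a ValueError:
import numpy as np
matrix = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)
# Compatible shapes: the 1D row is broadcast across each row of the matrix
print(matrix + row)  # [[11 22 33] [14 25 36]]
# Incompatible shapes: (2, 3) and (2,) cannot be broadcast together
try:
    print(matrix + np.array([10, 20]))
except ValueError as err:
    print(f"Broadcasting failed: {err}")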
Topic: Indexing and Slicing
Accessing specific elements or portions of a NumPy array is fundamental. This is done through indexing and slicing, concepts similar to Python lists but with powerful extensions for multi-dimensional arrays.
1D array indexing and slicing
What is it?
For one-dimensional arrays, indexing allows you to retrieve a single element using its position (index), and slicing allows you to extract a contiguous sequence of elements.
Why is it important?
It enables you to pinpoint and work with individual data points or subsets of your data, which is crucial for data cleaning, analysis, and feature engineering.
How do we use it?
import numpy as np
arr_1d = np.array([10, 20, 30, 40, 50, 60, 70])
print(f"Original 1D Array: {arr_1d}\n")
# Indexing: Accessing a single element
# Remember: Python uses 0-based indexing
first_element = arr_1d[0]
print(f"First element (index 0): {first_element}")
third_element = arr_1d[2]
print(f"Third element (index 2): {third_element}")
last_element = arr_1d[-1] # Negative indexing accesses from the end
print(f"Last element (index -1): {last_element}\n")
# Slicing: [start:stop:step]
# The 'stop' index is exclusive (up to, but not including)
slice_1 = arr_1d[1:4] # Elements from index 1 up to (but not including) 4
print(f"Slice from index 1 to 3: {slice_1}")
slice_2 = arr_1d[:3] # Elements from the beginning up to (but not including) 3
print(f"Slice from beginning to index 2: {slice_2}")
slice_3 = arr_1d[4:] # Elements from index 4 to the end
print(f"Slice from index 4 to end: {slice_3}")
slice_4 = arr_1d[::2] # Every second element (step of 2)
print(f"Every second element: {slice_4}")
slice_5 = arr_1d[::-1] # Reverse the array
print(f"Reversed array: {slice_5}")
Code Explanation & Output:
- arr_1d[0]: Retrieves the element at index 0 (the first element).
- arr_1d[2]: Retrieves the element at index 2 (the third element).
- arr_1d[-1]: Retrieves the last element using negative indexing.
- arr_1d[1:4]: Creates a slice containing the elements at indices 1, 2, and 3. Index 4 is excluded.
- arr_1d[:3]: Creates a slice from the beginning up to index 2.
- arr_1d[4:]: Creates a slice from index 4 to the end of the array.
- arr_1d[::2]: Creates a slice that includes every second element, starting from the beginning.
- arr_1d[::-1]: Returns the elements in reverse order (like other basic slices, this is a view, not an independent copy).
Original 1D Array: [10 20 30 40 50 60 70]
First element (index 0): 10
Third element (index 2): 30
Last element (index -1): 70
Slice from index 1 to 3: [20 30 40]
Slice from beginning to index 2: [10 20 30]
Slice from index 4 to end: [50 60 70]
Every second element: [10 30 50 70]
Reversed array: [70 60 50 40 30 20 10]
2D array indexing and slicing (e.g., array[row, col])
What is it?
For two-dimensional arrays (matrices), indexing and slicing involve specifying both row and column indices. The syntax is array[row_index, column_index].
Why is it important?
Most real-world datasets are tabular (2D), so being able to precisely select rows, columns, or sub-sections of a 2D array is fundamental for data manipulation and analysis.
How do we use it?
import numpy as np
arr_2d = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]
])
print(f"Original 2D Array:\n{arr_2d}\n")
# Accessing a single element: array[row, column]
element = arr_2d[1, 2] # Row index 1, Column index 2 (value 6)
print(f"Element at row 1, column 2: {element}\n")
# Slicing rows
first_row = arr_2d[0, :]
print(f"First row: {first_row}")
last_two_rows = arr_2d[2:, :]
print(f"Last two rows:\n{last_two_rows}\n")
# Slicing columns
first_column = arr_2d[:, 0]
print(f"First column: {first_column}")
middle_columns = arr_2d[:, 1:3]
print(f"Middle two columns:\n{middle_columns}\n")
# Slicing both rows and columns (sub-array)
sub_array = arr_2d[1:3, 0:2] # Rows from index 1 to 2, columns from index 0 to 1
print(f"Sub-array (rows 1-2, cols 0-1):\n{sub_array}")
Code Explanation & Output:
- arr_2d[1, 2]: Accesses the element at the second row (index 1) and third column (index 2), which is 6.
- arr_2d[0, :]: Selects the entire first row. The : indicates all elements along that dimension.
- arr_2d[2:, :]: Selects all rows from index 2 to the end, and all columns.
- arr_2d[:, 0]: Selects all rows, but only the first column.
- arr_2d[:, 1:3]: Selects all rows, but only columns from index 1 up to (but not including) 3.
- arr_2d[1:3, 0:2]: Selects a sub-array. It takes rows with indices 1 and 2, and columns with indices 0 and 1.
Original 2D Array:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
Element at row 1, column 2: 6
First row: [1 2 3]
Last two rows:
[[ 7 8 9]
[10 11 12]]
First column: [ 1 4 7 10]
Middle two columns:
[[ 2 3]
[ 5 6]
[ 8 9]
[11 12]]
Sub-array (rows 1-2, cols 0-1):
[[4 5]
[7 8]]
Important Note: When you slice a NumPy array, you are usually getting a view of the original array, not a copy. This means if you modify the sliced array, the original array will also be modified. To get a true copy, use the .copy() method (e.g., my_slice = arr_2d[1:3, 0:2].copy()).
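A short illustrative sketch of this view behavior, and of how .copy() avoids it:
import numpy as np
original = np.array([1, 2, 3, 4, 5])
view_slice = original[1:4]   # a view into the same memory
view_slice[0] = 99
print(original)              # [ 1 99  3  4  5] -- the original changed too
safe_slice = original[1:4].copy()  # an independent copy
safe_slice[0] = -1
print(original)              # unchanged this time: [ 1 99  3  4  5]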
Topic: Key NumPy Functions
NumPy provides a vast collection of functions that operate efficiently on arrays. We'll focus on two important categories: Universal Functions (ufuncs) and Aggregation Functions.
Universal Functions: np.sqrt(), np.exp(), np.sin()
What are they?
Universal functions, or ufuncs, are functions that operate element-wise on ndarrays. This means they apply a mathematical operation to each element of the array individually, producing a new array with the results.
Why are they important?
Ufuncs are highly optimized and much faster than performing the same operations using Python loops. They are essential for performing common mathematical transformations on entire datasets efficiently.
How do we use it?
import numpy as np
arr = np.array([1, 4, 9, 16, 25])
print(f"Original Array: {arr}\n")
# np.sqrt(): Calculates the square root of each element
sqrt_arr = np.sqrt(arr)
print(f"Square root: {sqrt_arr}")
# np.exp(): Calculates the exponential of each element (e^x)
exp_arr = np.exp(arr)
print(f"Exponential: {exp_arr}")
# np.sin(): Calculates the sine of each element (in radians)
sin_arr = np.sin(arr)
print(f"Sine: {sin_arr}")
# You can also apply ufuncs to 2D arrays
arr_2d = np.array([[1, 2], [3, 4]])
print(f"\nOriginal 2D Array:\n{arr_2d}\n")
sqrt_arr_2d = np.sqrt(arr_2d)
print(f"Square root of 2D array:\n{sqrt_arr_2d}")
Code Explanation & Output:
- np.sqrt(arr): Computes the square root of each element in arr.
- np.exp(arr): Computes e (Euler's number, approximately 2.71828) raised to the power of each element in arr.
- np.sin(arr): Computes the sine of each element in arr. Note that these functions typically expect input in radians.
- The example with arr_2d demonstrates that ufuncs work seamlessly with multi-dimensional arrays, applying the operation element-wise across all dimensions.
Original Array: [ 1 4 9 16 25]
Square root: [1. 2. 3. 4. 5.]
Exponential: [2.71828183e+00 5.45981500e+01 8.10308393e+03 8.88611052e+06
7.20048994e+10]
Sine: [ 0.84147098 0.7568025 0.14112001 -0.28790332 -0.13235175]
Original 2D Array:
[[1 2]
[3 4]]
Square root of 2D array:
[[1. 1.41421356]
[1.73205081 2. ]]
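Because the trigonometric ufuncs expect radians, angle data given in degrees is usually converted first; a brief illustrative sketch using np.radians() (NumPy also provides the equivalent np.deg2rad()):
import numpy as np
angles_deg = np.array([0, 30, 90, 180])
angles_rad = np.radians(angles_deg)  # convert degrees to radians
print(np.sin(angles_rad))            # approximately [0.  0.5 1.  0. ]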
Aggregation Functions: np.sum(), np.mean(), np.max(), np.std()
What are they?
Aggregation functions (or reduction functions) perform an operation on an array and return a single value (or a smaller array) that summarizes the data. They
reduce the dimensionality of the array by computing a single statistic.
Why are they important?
These functions are fundamental for summarizing data, calculating descriptive statistics, and understanding the overall characteristics of your datasets. They are heavily used in exploratory data analysis (EDA).
How do we use it?
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(f"Original Array: {arr}\n")
# np.sum(): Calculates the sum of all elements
total_sum = np.sum(arr)
print(f"Sum of all elements: {total_sum}")
# np.mean(): Calculates the arithmetic mean (average) of all elements
average = np.mean(arr)
print(f"Mean of all elements: {average}")
# np.max(): Finds the maximum element
maximum = np.max(arr)
print(f"Maximum element: {maximum}")
# np.min(): Finds the minimum element
minimum = np.min(arr)
print(f"Minimum element: {minimum}")
# np.std(): Calculates the standard deviation
std_dev = np.std(arr)
print(f"Standard deviation: {std_dev}\n")
# Aggregation along an axis (for multi-dimensional arrays)
arr_2d = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
print(f"Original 2D Array:\n{arr_2d}\n")
# Sum along columns (axis=0)
sum_columns = np.sum(arr_2d, axis=0)
print(f"Sum along columns (axis=0): {sum_columns}")
# Sum along rows (axis=1)
sum_rows = np.sum(arr_2d, axis=1)
print(f"Sum along rows (axis=1): {sum_rows}")
Code Explanation & Output:
- np.sum(arr): Calculates the sum of all elements in the 1D array arr.
- np.mean(arr): Calculates the average of all elements in arr.
- np.max(arr): Returns the largest value in arr.
- np.min(arr): Returns the smallest value in arr.
- np.std(arr): Calculates the standard deviation of the elements in arr. Standard deviation measures the amount of variation or dispersion of a set of values.
- For arr_2d:
  - np.sum(arr_2d, axis=0): Calculates the sum of elements along axis=0 (columns). This means it sums down each column: [1+4+7, 2+5+8, 3+6+9] = [12, 15, 18].
  - np.sum(arr_2d, axis=1): Calculates the sum of elements along axis=1 (rows). This means it sums across each row: [1+2+3, 4+5+6, 7+8+9] = [6, 15, 24].
Original Array: [ 1 2 3 4 5 6 7 8 9 10]
Sum of all elements: 55
Mean of all elements: 5.5
Maximum element: 10
Minimum element: 1
Standard deviation: 2.8722813232690143
Original 2D Array:
[[1 2 3]
[4 5 6]
[7 8 9]]
Sum along columns (axis=0): [12 15 18]
Sum along rows (axis=1): [ 6 15 24]
Tip on axis: Think of axis=0 as operating vertically (down the columns) and axis=1 as operating horizontally (across the rows). This concept of axis is crucial in NumPy and Pandas for performing operations along specific dimensions.
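The same axis logic applies to the other aggregation functions; a brief sketch reusing the arr_2d values from above:
import numpy as np
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.mean(arr_2d, axis=0))  # column means: [4. 5. 6.]
print(np.max(arr_2d, axis=1))   # row maxima:   [3 6 9]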
Module 2: Pandas - The Ultimate Data Manipulation Tool
Topic: What is Pandas?
What is it?
Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. It provides data structures like DataFrames and Series that are designed to work with tabular and time-series data.
Why is it important?
In the real world, data rarely comes in a perfectly clean, numerical format. It's often messy, incomplete, and comes from various sources. Pandas excels at handling these real-world complexities, allowing you to clean, transform, and analyze data efficiently. It's the workhorse for almost any data science project.
Analogy: If NumPy is the raw material (like metal or wood) that provides the fundamental building blocks for numerical operations, then Pandas is the factory that takes that raw material and shapes it into useful, organized components. Think of a DataFrame as a highly organized assembly line where you can easily inspect, modify, and combine different parts of your data. It's where the real data wrangling happens, turning raw numbers into structured information ready for analysis or machine learning.
Pandas makes data manipulation intuitive and powerful, allowing you to perform complex operations with just a few lines of code.
How do we use it?
Just like NumPy, we typically import Pandas with a conventional alias, pd
.
import pandas as pd
# Now we can use pd to access Pandas functions and objects
print(pd.__version__)
Code Explanation & Output:
- import pandas as pd: This line imports the Pandas library and assigns it the conventional alias pd. This is a widely accepted standard in the data science community.
- print(pd.__version__): This line prints the installed version of Pandas, useful for checking your environment.
2.2.2 # The exact version number might vary depending on your installation
Topic: Core Pandas Structures
Pandas introduces two primary data structures that are fundamental to its operation: the Series and the DataFrame.
The Series (a labeled 1D array)
What is it?
A Pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is essentially a column in a spreadsheet or a single column of a DataFrame.
Why is it important?
Series are the building blocks of DataFrames. They allow you to work with individual columns of data, providing powerful indexing capabilities and enabling efficient element-wise operations, similar to NumPy arrays but with added labels (an index).
How do we use it?
import pandas as pd
import numpy as np
# Creating a Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(f"Series from list:\n{s}\n")
# Creating a Series with a custom index
labels = ["a", "b", "c", "d", "e"]
s_labeled = pd.Series(data, index=labels)
print(f"Series with custom index:\n{s_labeled}\n")
# Creating a Series from a dictionary
dict_data = {"apple": 100, "banana": 150, "orange": 120}
s_dict = pd.Series(dict_data)
print(f"Series from dictionary:\n{s_dict}\n")
# Accessing elements in a Series
print(f"Element at index 2 (from list Series): {s[2]}")
print(f"Element with label 'b' (from labeled Series): {s_labeled['b']}")
print(f"Value of 'orange' (from dictionary Series): {s_dict['orange']}\n")
# Operations on Series (element-wise)
s_ops = pd.Series([1, 2, 3, 4])
print(f"Original Series for operations: {s_ops}")
print(f"Series + 5: {s_ops + 5}")
print(f"Series * 2: {s_ops * 2}")
Code Explanation & Output:
- pd.Series(data): Creates a Series where the index is automatically generated (0, 1, 2, ...).
- pd.Series(data, index=labels): Creates a Series with a custom index provided by the labels list. This allows you to refer to elements by meaningful names instead of just numerical positions.
- pd.Series(dict_data): When creating a Series from a dictionary, the dictionary keys automatically become the Series index, and the values become the Series data.
- Accessing elements: You can access elements by their numerical position (like a list) or by their label (if a custom index is provided).
- Operations: Arithmetic operations on Series are element-wise, similar to NumPy arrays. Pandas aligns data based on the index when performing operations between two Series (a short example of this alignment follows the output below).
Series from list:
0 10
1 20
2 30
3 40
4 50
dtype: int64
Series with custom index:
a 10
b 20
c 30
d 40
e 50
dtype: int64
Series from dictionary:
apple 100
banana 150
orange 120
dtype: int64
Element at index 2 (from list Series): 30
Element with label 'b' (from labeled Series): 20
Value of 'orange' (from dictionary Series): 120
Original Series for operations: 0 1
1 2
2 3
3 4
dtype: int64
Series + 5: 0 6
1 7
2 8
3 9
dtype: int64
Series * 2: 0 2
1 4
2 6
3 8
dtype: int64
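The index alignment mentioned in the explanation above is worth seeing once. A minimal illustrative sketch (not part of the lesson's output): labels that appear in only one of the two Series produce NaN in the result.
import pandas as pd
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
# Values are added label by label; "a" and "d" have no partner, so they become NaN
print(s1 + s2)
# a     NaN
# b    12.0
# c    23.0
# d     NaN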
The DataFrame (our primary tool, a 2D table)
What is it?
A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is the most commonly used Pandas object and is essentially a spreadsheet or a SQL table.
Why is it important?
DataFrames are where the magic happens in Pandas. They allow you to store and manipulate complex datasets with multiple columns of different data types. They provide powerful tools for filtering, selecting, grouping, and transforming data, making them indispensable for data analysis.
Analogy: If a Series is like a single column in a spreadsheet, a DataFrame is the entire spreadsheet itself. It has rows and columns, each column can have a different data type (e.g., one column for names, another for ages, another for salaries), and you can easily perform operations across rows or columns, just like you would in Excel or Google Sheets, but with the power of Python programming.
How do we use it?
import pandas as pd
import numpy as np
# Creating a DataFrame from a dictionary of lists
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 35, 28],
"City": ["New York", "Los Angeles", "Chicago", "Houston"]
}
df = pd.DataFrame(data)
print(f"DataFrame from dictionary:\n{df}\n")
# Creating a DataFrame from a list of dictionaries
list_of_dicts = [
{"Name": "Eve", "Age": 22, "City": "Miami"},
{"Name": "Frank", "Age": 40, "City": "Boston"}
]
df_from_list = pd.DataFrame(list_of_dicts)
print(f"DataFrame from list of dictionaries:\n{df_from_list}\n")
# Creating a DataFrame from a NumPy array (requires column names)
numpy_data = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
columns = ["ColA", "ColB", "ColC"]
df_from_numpy = pd.DataFrame(numpy_data, columns=columns)
print(f"DataFrame from NumPy array:\n{df_from_numpy}")
Code Explanation & Output:
- pd.DataFrame(data): The most common way to create a DataFrame is from a dictionary where keys become column names and values are lists representing the data in each column.
- pd.DataFrame(list_of_dicts): You can also create a DataFrame from a list of dictionaries, where each dictionary represents a row.
- pd.DataFrame(numpy_data, columns=columns): If you have a NumPy array, you can convert it to a DataFrame by providing the array and a list of column names.
DataFrame from dictionary:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 28 Houston
DataFrame from list of dictionaries:
Name Age City
0 Eve 22 Miami
1 Frank 40 Boston
DataFrame from NumPy array:
ColA ColB ColC
0 1 2 3
1 4 5 6
2 7 8 9
Topic: Reading and Writing Data
One of the most common tasks in data science is importing data from external files and, after analysis, exporting results back to files. Pandas makes this incredibly easy, especially for CSV (Comma Separated Values) files, which are a ubiquitous format for tabular data.
Reading a CSV file with pd.read_csv()
What is it?
pd.read_csv()
is a Pandas function used to read a CSV file into a DataFrame. It is highly flexible and can handle various CSV formats, delimiters, and encoding issues.
Why is it important?
Most real-world datasets are stored in files, and CSV is one of the most common formats. This function is your gateway to loading external data into Pandas for analysis.
How do we use it?
To demonstrate, we'll first create a dummy CSV file. In a real scenario, you would already have your CSV file.
import pandas as pd
import os
# Create a dummy CSV file for demonstration
dummy_data = {
"StudentID": [1, 2, 3, 4, 5],
"Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
"Score": [85, 92, 78, 95, 88]
}
df_dummy = pd.DataFrame(dummy_data)
csv_file_path = "students.csv"
df_dummy.to_csv(csv_file_path, index=False) # index=False prevents writing the DataFrame index as a column
print(f"Dummy CSV file \'{csv_file_path}\' created.\n")
# Now, read the CSV file into a new DataFrame
df_students = pd.read_csv(csv_file_path)
print(f"DataFrame read from \'{csv_file_path}\':\n")
print(df_students)
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"\nDummy CSV file \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Creating the dummy CSV:
  - We first create a small DataFrame df_dummy with some sample data.
  - df_dummy.to_csv(csv_file_path, index=False): This line writes the df_dummy DataFrame to a CSV file named students.csv. index=False is crucial here; it tells Pandas not to write the DataFrame's row index as a column in the CSV file. If you omit index=False, you'll get an extra column of numbers (0, 1, 2, ...) in your CSV.
- Reading the CSV:
  - df_students = pd.read_csv(csv_file_path): This is the core line. It reads the students.csv file and loads its content directly into a new Pandas DataFrame called df_students.
- Cleanup:
  - os.remove(csv_file_path): This line removes the dummy CSV file after the demonstration. In a real scenario, you wouldn't typically create and then immediately delete your data files.
Dummy CSV file 'students.csv' created.
DataFrame read from 'students.csv':
StudentID Name Score
0 1 Alice 85
1 2 Bob 92
2 3 Charlie 78
3 4 David 95
4 5 Eve 88
Dummy CSV file 'students.csv' removed.
Important Note: pd.read_csv() has many parameters to handle different scenarios, such as sep for specifying delimiters other than a comma, header for indicating whether there's a header row, names for providing column names, skiprows to skip initial rows, and na_values to specify strings that should be interpreted as missing values. Always refer to the Pandas documentation if you encounter issues reading a specific CSV file.
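As a purely illustrative sketch (the file contents and column names below are made up for this example, mirroring the dummy-file approach used above), several of these parameters might be combined like this:
import pandas as pd
# Build a small semicolon-separated file with two junk lines at the top,
# no header row, and "NA" marking a missing value.
raw_text = "junk line 1\njunk line 2\ns1;3.4\ns2;NA\ns3;5.1\n"
with open("measurements.txt", "w") as f:
    f.write(raw_text)
df_raw = pd.read_csv(
    "measurements.txt",
    sep=";",                    # delimiter other than a comma
    header=None,                # the file has no header row
    names=["sensor", "value"],  # so we supply column names ourselves
    skiprows=2,                 # skip the two junk lines at the top
    na_values=["NA"],           # treat "NA" as a missing value
)
print(df_raw)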
Topic: First Look at Your Data (Exploratory Data Analysis - EDA)
What is it?
Exploratory Data Analysis (EDA) is an initial, crucial step in any data science project. It involves summarizing the main characteristics of a dataset, often with visual methods. The goal is to understand the data, identify patterns, spot anomalies, test hypotheses, and check assumptions.
Why is it important?
Before you dive into complex modeling, you need to know your data inside out. EDA helps you:
- Understand the structure: How many rows and columns? What are the data types?
- Identify missing values: Are there gaps in your data? How should they be handled?
- Spot outliers: Are there unusual data points that might skew your analysis?
- Discover relationships: How do different variables relate to each other?
- Validate assumptions: Does the data meet the requirements for the analysis or model you plan to use?
It's like getting to know a new friend before you start a big project together. You want to understand their strengths, weaknesses, and quirks.
How do we use it?
Pandas provides several convenient methods for quickly getting a summary of your DataFrame. Let's use a sample DataFrame to demonstrate.
import pandas as pd
import numpy as np
# Create a sample DataFrame for demonstration
data = {
"Name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Heidi", "Ivan", "Judy"],
"Age": [25, 30, 35, 28, 22, 40, 29, 31, np.nan, 26],
"City": ["New York", "Los Angeles", "Chicago", "Houston", "Miami", "Boston", "Seattle", "Denver", "Austin", "Phoenix"],
"Salary": [70000, 85000, 90000, 72000, 65000, 100000, 78000, 80000, 75000, 68000],
"Experience_Years": [2, 5, 8, 3, 1, 10, 4, 6, 3, 2]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
print("\n" + "-"*30 + "\n")
# .head(): Displays the first N rows (default is 5)
print("df.head():\n", df.head())
print("\n" + "-"*30 + "\n")
# .tail(): Displays the last N rows (default is 5)
print("df.tail(3):\n", df.tail(3))
print("\n" + "-"*30 + "\n")
# .info(): Provides a concise summary of the DataFrame
# Includes index dtype, column dtypes, non-null values, and memory usage
print("df.info():")
df.info()
print("\n" + "-"*30 + "\n")
# .describe(): Generates descriptive statistics of numerical columns
# Includes count, mean, std, min, 25%, 50% (median), 75%, max
print("df.describe():\n", df.describe())
print("\n" + "-"*30 + "\n")
# .shape: Returns a tuple representing the dimensionality of the DataFrame (rows, columns)
print(f"df.shape: {df.shape}")
print("\n" + "-"*30 + "\n")
# .columns: Returns the column labels of the DataFrame
print(f"df.columns: {df.columns}")
Code Explanation & Output:
- df.head(): Shows the first 5 rows of the DataFrame. This is useful for a quick glance at the data structure and content.
- df.tail(3): Shows the last 3 rows. Useful for checking data appended at the end or for time-series data.
- df.info(): Provides a summary of the DataFrame. Key information includes:
  - RangeIndex: The number of entries (rows) and the index range.
  - Data columns (total X columns): The total number of columns.
  - Column Name, Non-Null Count, Dtype: For each column, it shows the number of non-missing values and its data type. Notice that 'Age' has 9 non-null values out of 10 entries, indicating one missing value (NaN).
  - Memory usage: How much memory the DataFrame is consuming.
- df.describe(): Calculates descriptive statistics for numerical columns. For each numerical column, it provides:
  - count: Number of non-null entries.
  - mean: Average value.
  - std: Standard deviation (measure of data dispersion).
  - min: Minimum value.
  - 25%, 50% (median), 75%: Quartiles, indicating the values below which 25%, 50%, and 75% of the data fall, respectively.
  - max: Maximum value.
- df.shape: Returns a tuple (number_of_rows, number_of_columns). For our DataFrame, it's (10, 5).
- df.columns: Returns an Index object containing the names of all columns in the DataFrame.
Original DataFrame:
Name Age City Salary Experience_Years
0 Alice 25.0 New York 70000 2
1 Bob 30.0 Los Angeles 85000 5
2 Charlie 35.0 Chicago 90000 8
3 David 28.0 Houston 72000 3
4 Eve 22.0 Miami 65000 1
5 Frank 40.0 Boston 100000 10
6 Grace 29.0 Seattle 78000 4
7 Heidi 31.0 Denver 80000 6
8 Ivan NaN Austin 75000 3
9 Judy 26.0 Phoenix 68000 2
------------------------------
df.head():
Name Age City Salary Experience_Years
0 Alice 25.0 New York 70000 2
1 Bob 30.0 Los Angeles 85000 5
2 Charlie 35.0 Chicago 90000 8
3 David 28.0 Houston 72000 3
4 Eve 22.0 Miami 65000 1
------------------------------
df.tail(3):
Name Age City Salary Experience_Years
7 Heidi 31.0 Denver 80000 6
8 Ivan NaN Austin 75000 3
9 Judy 26.0 Phoenix 68000 2
------------------------------
df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 10 non-null object
1 Age 9 non-null float64
2 City 10 non-null object
3 Salary 10 non-null int64
4 Experience_Years 10 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 528.0+ bytes
------------------------------
df.describe():
Age Salary Experience_Years
count 9.000000 10.00000 10.000000
mean 29.555556 78000.00000 4.400000
std 5.479708 10888.08916 2.913391
min 22.000000 65000.00000 1.000000
25% 26.000000 70500.00000 2.000000
50% 29.000000 76500.00000 3.000000
75% 31.000000 83750.00000 5.750000
max 40.000000 100000.00000 10.000000
------------------------------
df.shape: (10, 5)
------------------------------
df.columns: Index(['Name', 'Age', 'City', 'Salary', 'Experience_Years'], dtype='object')
Key Takeaway: Always start your data analysis with EDA. These simple Pandas functions provide a wealth of information about your dataset, helping you identify potential issues (like missing values) and guide your next steps in data cleaning and preparation.
Topic: Selecting Data
Once you have your data loaded into a DataFrame, the next crucial step is often to select specific parts of it. Pandas offers powerful and flexible ways to select data, whether it's a single column, multiple columns, or specific rows.
Selecting columns
What is it?
Selecting columns in a DataFrame means extracting one or more columns by their names. The result of selecting a single column is a Pandas Series, while selecting multiple columns returns a DataFrame.
Why is it important?
Often, you only need to work with a subset of your data. Selecting columns allows you to focus on relevant variables, reducing memory usage and simplifying your analysis.
How do we use it?
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 35, 28],
"City": ["New York", "Los Angeles", "Chicago", "Houston"],
"Salary": [70000, 85000, 90000, 72000]
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Select a single column (returns a Series)
names = df["Name"]
print(f"Selected \"Name\" column (Series):\n{names}\n")
print(f"Type of names: {type(names)}\n")
# Select multiple columns (returns a DataFrame)
two_columns = df[["Name", "Salary"]]
print(f"Selected \"Name\" and \"Salary\" columns (DataFrame):\n{two_columns}\n")
print(f"Type of two_columns: {type(two_columns)}")
Code Explanation & Output:
- df["Name"]: To select a single column, use square brackets with the column name as a string. This returns a Pandas Series.
- df[["Name", "Salary"]]: To select multiple columns, pass a list of column names inside the square brackets. This returns a new DataFrame containing only the specified columns.
Original DataFrame:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 85000
2 Charlie 35 Chicago 90000
3 David 28 Houston 72000
Selected "Name" column (Series):
0 Alice
1 Bob
2 Charlie
3 David
Name: Name, dtype: object
Type of names: <class 'pandas.core.series.Series'>
Selected "Name" and "Salary" columns (DataFrame):
Name Salary
0 Alice 70000
1 Bob 85000
2 Charlie 90000
3 David 72000
Type of two_columns: <class 'pandas.core.frame.DataFrame'>
Selecting rows with .loc[] (label-based)
What is it?
.loc[]
is a label-based indexer used for selecting data by row and column labels. It allows you to select rows by their index labels and columns by their column names.
Why is it important?
When your DataFrame has meaningful row labels (e.g., dates, IDs), .loc[]
provides an intuitive way to retrieve data based on those labels, making your code more readable and less prone to errors than numerical indexing.
How do we use it?
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 35, 28],
"City": ["New York", "Los Angeles", "Chicago", "Houston"],
"Salary": [70000, 85000, 90000, 72000]
}
df = pd.DataFrame(data, index=["A", "B", "C", "D"]) # Custom index
print(f"Original DataFrame with custom index:\n{df}\n")
# Select a single row by its label
row_b = df.loc["B"]
print(f"Row with label \"B\":\n{row_b}\n")
# Select multiple rows by their labels
rows_ac = df.loc[["A", "C"]]
print(f"Rows with labels \"A\" and \"C\":\n{rows_ac}\n")
# Select rows and specific columns by labels
row_c_name_salary = df.loc["C", ["Name", "Salary"]]
print(f"Name and Salary for row \"C\":\n{row_c_name_salary}\n")
# Select a slice of rows by label (inclusive of end label)
rows_slice = df.loc["A":"C"]
print(f"Rows from \"A\" to \"C\" (inclusive):\n{rows_slice}")
Code Explanation & Output:
- df.loc["B"]: Selects the row with the index label "B". This returns a Series.
- df.loc[["A", "C"]]: Selects multiple rows by passing a list of labels. This returns a DataFrame.
- df.loc["C", ["Name", "Salary"]]: Selects specific columns ("Name", "Salary") for a specific row ("C").
- df.loc["A":"C"]: Selects a range of rows from label "A" to "C", inclusive of both start and end labels. This is a key difference from standard Python slicing.
Original DataFrame with custom index:
Name Age City Salary
A Alice 25 New York 70000
B Bob 30 Los Angeles 85000
C Charlie 35 Chicago 90000
D David 28 Houston 72000
Row with label "B":
Name Bob
Age 30
City Los Angeles
Salary 85000
Name: B, dtype: object
Rows with labels "A" and "C":
Name Age City Salary
A Alice 25 New York 70000
C Charlie 35 Chicago 90000
Name and Salary for row "C":
Name Charlie
Salary 90000
Name: C, dtype: object
Rows from "A" to "C" (inclusive):
Name Age City Salary
A Alice 25 New York 70000
B Bob 30 Los Angeles 85000
C Charlie 35 Chicago 90000
Selecting rows with .iloc[] (index-based)
What is it?
.iloc[]
is an integer-location based indexer used for selection by positional integer indices. It works similarly to NumPy array indexing, where you specify rows and columns using their 0-based integer positions.
Why is it important?
Even if your DataFrame has custom labels, you might sometimes need to select data based purely on its numerical position (e.g., the first 5 rows, or the last column). .iloc[]
provides a consistent and reliable way to do this, regardless of the actual labels.
How do we use it?
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David"],
"Age": [25, 30, 35, 28],
"City": ["New York", "Los Angeles", "Chicago", "Houston"],
"Salary": [70000, 85000, 90000, 72000]
}
df = pd.DataFrame(data, index=["A", "B", "C", "D"]) # Custom index
print(f"Original DataFrame with custom index:\n{df}\n")
# Select a single row by its integer position
first_row = df.iloc[0]
print(f"First row (index 0):\n{first_row}\n")
# Select multiple rows by integer positions
first_and_third_rows = df.iloc[[0, 2]]
print(f"First and third rows:\n{first_and_third_rows}\n")
# Select a slice of rows by integer position (exclusive of end index)
rows_slice_iloc = df.iloc[1:3] # Rows at index 1 and 2
print(f"Rows from index 1 to 2 (exclusive of 3):\n{rows_slice_iloc}\n")
# Select rows and specific columns by integer positions
row_1_col_0_2 = df.iloc[1, [0, 2]] # Row at index 1, columns at index 0 and 2
print(f"Row 1, columns 0 and 2:\n{row_1_col_0_2}\n")
# Select a sub-DataFrame using both row and column slicing
sub_df_iloc = df.iloc[0:2, 1:3] # Rows 0-1, columns 1-2
print(f"Sub-DataFrame (rows 0-1, cols 1-2):\n{sub_df_iloc}")
Code Explanation & Output:
- df.iloc[0]: Selects the row at integer position 0 (the first row). This returns a Series.
- df.iloc[[0, 2]]: Selects multiple rows by passing a list of integer positions. This returns a DataFrame.
- df.iloc[1:3]: Selects a slice of rows from integer position 1 up to (but not including) 3. This is consistent with standard Python slicing.
- df.iloc[1, [0, 2]]: Selects specific columns (at integer positions 0 and 2) for a specific row (at integer position 1).
- df.iloc[0:2, 1:3]: Selects a sub-DataFrame by slicing both rows (from 0 up to 2) and columns (from 1 up to 3).
Original DataFrame with custom index:
Name Age City Salary
A Alice 25 New York 70000
B Bob 30 Los Angeles 85000
C Charlie 35 Chicago 90000
D David 28 Houston 72000
First row (index 0):
Name Alice
Age 25
City New York
Salary 70000
Name: A, dtype: object
First and third rows:
Name Age City Salary
A Alice 25 New York 70000
C Charlie 35 Chicago 90000
Rows from index 1 to 2 (exclusive of 3):
Name Age City Salary
B Bob 30 Los Angeles 85000
C Charlie 35 Chicago 90000
Row 1, columns 0 and 2:
Name Bob
City Los Angeles
Name: B, dtype: object
Sub-DataFrame (rows 0-1, cols 1-2):
Age City
A 25 New York
B 30 Los Angeles
When to use .loc[] vs. .iloc[]:
- Use .loc[] when you need to select data based on labels (row names, column names).
- Use .iloc[] when you need to select data based on integer positions (0-based indices).
- For simple column selection by name, df["column_name"] is often sufficient.
Topic: Filtering Data
What is it?
Filtering data, also known as boolean indexing or boolean selection, is the process of selecting rows from a DataFrame based on one or more conditions. These conditions evaluate to True
or False
for each row, and only rows where the condition is True
are returned.
Why is it important?
Filtering is one of the most powerful and frequently used operations in data analysis. It allows you to isolate specific subsets of your data that meet certain criteria, enabling focused analysis, cleaning, or preparation for modeling. For example, you might want to analyze only customers over a certain age, transactions above a certain amount, or data from a specific region.
How do we use it?
Let's use our sample DataFrame to demonstrate how to filter data using boolean conditions.
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
"Age": [25, 30, 35, 28, 22, 40],
"City": ["New York", "Los Angeles", "Chicago", "Houston", "Miami", "Boston"],
"Salary": [70000, 85000, 90000, 72000, 65000, 100000]
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Filter for people older than 30
older_than_30 = df[df["Age"] > 30]
print(f"People older than 30:\n{older_than_30}\n")
# Filter for people from New York
from_ny = df[df["City"] == "New York"]
print(f"People from New York:\n{from_ny}\n")
# Filter with multiple conditions (AND - use &)
# People older than 30 AND earning more than 80000
older_and_high_salary = df[(df["Age"] > 30) & (df["Salary"] > 80000)]
print(f"People older than 30 AND earning > 80000:\n{older_and_high_salary}\n")
# Filter with multiple conditions (OR - use |)
# People from New York OR Los Angeles
ny_or_la = df[(df["City"] == "New York") | (df["City"] == "Los Angeles")]
print(f"People from New York OR Los Angeles:\n{ny_or_la}\n")
# Using .isin() for multiple categorical values
# People from New York, Los Angeles, or Chicago
cities_of_interest = ["New York", "Los Angeles", "Chicago"]
filtered_by_cities = df[df["City"].isin(cities_of_interest)]
print(f"People from New York, Los Angeles, or Chicago:\n{filtered_by_cities}")
Code Explanation & Output:
- df["Age"] > 30: This is the core of boolean filtering. It creates a Pandas Series of True/False values, where True indicates that the corresponding row in the Age column meets the condition (is greater than 30).
- df[df["Age"] > 30]: When you pass this boolean Series back into the DataFrame using square brackets, Pandas returns only the rows where the boolean Series has a True value.
- Multiple Conditions (& for AND, | for OR):
  - & (bitwise AND): Used to combine multiple conditions where all conditions must be true. Each condition must be enclosed in parentheses () to ensure the correct order of operations.
  - | (bitwise OR): Used to combine multiple conditions where at least one condition must be true. Again, each condition needs parentheses.
- .isin(): This is a very convenient method for filtering when you want to select rows where a column's value is present in a given list of values. It's much cleaner than chaining multiple | conditions.
Original DataFrame:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 85000
2 Charlie 35 Chicago 90000
3 David 28 Houston 72000
4 Eve 22 Miami 65000
5 Frank 40 Boston 100000
People older than 30:
Name Age City Salary
2 Charlie 35 Chicago 90000
5 Frank 40 Boston 100000
People from New York:
Name Age City Salary
0 Alice 25 New York 70000
People older than 30 AND earning > 80000:
Name Age City Salary
2 Charlie 35 Chicago 90000
5 Frank 40 Boston 100000
People from New York OR Los Angeles:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 85000
People from New York, Los Angeles, or Chicago:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 85000
2 Charlie 35 Chicago 90000
Important Note: When combining multiple conditions, always use & for AND and | for OR, and enclose each individual condition in parentheses. Using and or or (Python's logical operators) directly with Pandas Series will result in an error because they operate on single boolean values, not Series of booleans.
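A tiny illustrative sketch of the difference, using its own small DataFrame so it runs on its own:
import pandas as pd
df = pd.DataFrame({"Age": [25, 35, 40], "Salary": [70000, 90000, 100000]})
# Correct: bitwise & with each condition wrapped in parentheses
print(df[(df["Age"] > 30) & (df["Salary"] > 80000)])
# Incorrect: Python's `and` tries to reduce each Series to a single True/False
try:
    df[df["Age"] > 30 and df["Salary"] > 80000]
except ValueError as err:
    print(f"Using 'and' fails: {err}")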
Topic: Handling Missing Data
What is it?
Missing data refers to the absence of a value for a variable in a dataset. In Pandas, missing values are typically represented by NaN
(Not a Number), which is a special floating-point value from NumPy. Missing data can occur for various reasons, such as data entry errors, data corruption, or simply that the information was not collected.
Why is it important?
Missing data can significantly impact your analysis and the performance of machine learning models. Many statistical and machine learning algorithms cannot handle missing values and will either throw an error or produce incorrect results. Therefore, it's crucial to identify and appropriately handle missing data before proceeding with further analysis or modeling.
How do we use it?
Pandas provides convenient methods to detect, remove, or fill missing values. Let's create a DataFrame with some missing data to demonstrate.
import pandas as pd
import numpy as np
data = {
"A": [1, 2, np.nan, 4, 5],
"B": [10, np.nan, 30, 40, 50],
"C": [100, 200, 300, np.nan, np.nan],
"D": ["apple", "banana", "cherry", "date", "elderberry"]
}
df = pd.DataFrame(data)
print(f"Original DataFrame with missing values:\n{df}\n")
# Finding missing values with .isnull().sum()
# .isnull() returns a boolean DataFrame indicating where values are NaN
# .sum() then counts the True values (i.e., NaNs) for each column
print(f"Missing values per column:\n{df.isnull().sum()}\n")
# Dropping missing values with .dropna()
# By default, drops rows containing ANY NaN values
df_dropped_rows = df.dropna()
print(f"DataFrame after dropping rows with any NaN:\n{df_dropped_rows}\n")
# Drop columns with any NaN values
df_dropped_cols = df.dropna(axis=1) # axis=1 means columns
print(f"DataFrame after dropping columns with any NaN:\n{df_dropped_cols}\n")
# Drop rows only if ALL values are NaN
df_dropped_all_nan_rows = df.dropna(how='all')
print(f"DataFrame after dropping rows where ALL values are NaN:\n{df_dropped_all_nan_rows}\n")
# Filling missing values with .fillna()
# Fill with a specific value (e.g., 0)
df_filled_zero = df.fillna(0)
print(f"DataFrame after filling NaNs with 0:\n{df_filled_zero}\n")
# Fill with the mean of the column (for numerical columns)
# This is a common imputation strategy
df_filled_mean = df.copy() # Create a copy to avoid modifying original df
df_filled_mean["A"] = df_filled_mean["A"].fillna(df_filled_mean["A"].mean())
df_filled_mean["B"] = df_filled_mean["B"].fillna(df_filled_mean["B"].mean())
df_filled_mean["C"] = df_filled_mean["C"].fillna(df_filled_mean["C"].mean())
print(f"DataFrame after filling numerical NaNs with column mean:\n{df_filled_mean}\n")
# Fill with the median of the column (for numerical columns)
df_filled_median = df.copy()
df_filled_median["A"] = df_filled_median["A"].fillna(df_filled_median["A"].median())
print(f"DataFrame after filling \"A\" NaN with column median:\n{df_filled_median}\n")
# Forward fill (propagate the last valid observation forward)
df_ffill = df.ffill()
print(f"DataFrame after forward fill (ffill):\n{df_ffill}\n")
# Backward fill (propagate the next valid observation backward)
df_bfill = df.bfill()
print(f"DataFrame after backward fill (bfill):\n{df_bfill}")
Code Explanation & Output:
- df.isnull().sum(): This is a powerful combination. df.isnull() creates a boolean DataFrame where True indicates a missing value. Chaining .sum() to it counts the True values for each column, giving you a quick summary of missing data per column.
- df.dropna():
  - By default, df.dropna() removes any row that contains at least one NaN value. This can lead to significant data loss if many rows have missing values.
  - axis=1: Specifies that columns containing NaN values should be dropped instead of rows.
  - how='all': Drops a row (or column if axis=1) only if all its values are NaN.
- df.fillna(value):
  - df.fillna(0): Replaces all NaN values in the DataFrame with 0.
  - df_filled_mean["A"].fillna(df_filled_mean["A"].mean()): This is a common strategy called mean imputation. It calculates the mean of the non-missing values in column 'A' and then fills the NaN in that column with this mean. Similarly for 'B' and 'C'.
  - df_filled_median["A"].fillna(df_filled_median["A"].median()): Similar to mean imputation, but uses the median, which is less sensitive to outliers.
- df.ffill() (forward fill): Fills NaN values with the last observed non-null value in the column. This is the modern replacement for the deprecated df.fillna(method='ffill').
- df.bfill() (backward fill): Fills NaN values with the next observed non-null value in the column (replacing df.fillna(method='bfill')).
Original DataFrame with missing values:
A B C D
0 1.0 10.0 100.0 apple
1 2.0 NaN 200.0 banana
2 NaN 30.0 300.0 cherry
3 4.0 40.0 NaN date
4 5.0 50.0 NaN elderberry
Missing values per column:
A 1
B 1
C 2
D 0
dtype: int64
DataFrame after dropping rows with any NaN:
A B C D
0 1.0 10.0 100.0 apple
DataFrame after dropping columns with any NaN:
D
0 apple
1 banana
2 cherry
3 date
4 elderberry
DataFrame after dropping rows where ALL values are NaN:
A B C D
0 1.0 10.0 100.0 apple
1 2.0 NaN 200.0 banana
2 NaN 30.0 300.0 cherry
3 4.0 40.0 NaN date
4 5.0 50.0 NaN elderberry
DataFrame after filling NaNs with 0:
A B C D
0 1.0 10.0 100.0 apple
1 2.0 0.0 200.0 banana
2 0.0 30.0 300.0 cherry
3 4.0 40.0 0.0 date
4 5.0 50.0 0.0 elderberry
DataFrame after filling numerical NaNs with column mean:
A B C D
0 1.0 10.0 100.0 apple
1 2.0 32.5 200.0 banana
2 3.0 30.0 300.0 cherry
3 4.0 40.0 200.0 date
4 5.0 50.0 200.0 elderberry
DataFrame after filling "A" NaN with column median:
A B C D
0 1.0 10.0 100.0 apple
1 2.0 NaN 200.0 banana
2 3.0 30.0 300.0 cherry
3 4.0 40.0 NaN date
4 5.0 50.0 NaN elderberry
DataFrame after forward fill (ffill):
A B C D
0 1.0 10.0 100.0 apple
1 2.0 10.0 200.0 banana
2 2.0 30.0 300.0 cherry
3 4.0 40.0 300.0 date
4 5.0 50.0 300.0 elderberry
DataFrame after backward fill (bfill):
A B C D
0 1.0 10.0 100.0 apple
1 2.0 30.0 200.0 banana
2 4.0 30.0 300.0 cherry
3 4.0 40.0 NaN date
4 5.0 50.0 NaN elderberry
Choosing a strategy: The best way to handle missing data depends on the nature of your data and the reason for the missingness. Dropping rows or columns is simple but can lead to data loss. Imputation (filling missing values) is often preferred, but the choice of imputation method (mean, median, mode, or more advanced techniques) can significantly affect your results. Always consider the implications of your chosen strategy.
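For a categorical (text) column, the mode mentioned above is a common choice; a brief illustrative sketch with a made-up column:
import pandas as pd
import numpy as np
colors = pd.Series(["red", "blue", np.nan, "blue", np.nan])
# .mode() returns a Series (there can be ties), so take the first entry
filled_colors = colors.fillna(colors.mode()[0])
print(filled_colors)  # the NaN values become "blue", the most frequent value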
Topic: Grouping and Aggregating
What is it?
Grouping and aggregating data is a powerful technique that allows you to summarize data by categories. The process typically involves three steps:
- Splitting: Dividing the data into groups based on some criteria (e.g., grouping sales data by region, or customer data by age group).
- Applying: Applying a function to each group independently (e.g., calculating the sum of sales for each region, or the average age for each customer segment).
- Combining: Combining the results into a new DataFrame or Series.
This entire process is often referred to as "split-apply-combine."
Why is it important?
Grouping and aggregation are essential for gaining insights from your data. Instead of looking at individual data points, you can understand trends and patterns at a higher, more meaningful level. For example, you can answer questions like: "Which product category has the highest average sales?" or "What is the total revenue generated by each sales representative?"
How do we use it?
Pandas provides the .groupby() method for splitting data into groups, and then you can apply various aggregation functions. Let's use a sample DataFrame representing sales data.
import pandas as pd
data = {
"Region": ["East", "West", "East", "North", "West", "East", "North", "West"],
"Salesperson": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Heidi"],
"Product": ["A", "B", "A", "C", "B", "A", "C", "B"],
"Sales": [100, 150, 120, 200, 180, 110, 220, 160]
}
df = pd.DataFrame(data)
print(f"Original Sales DataFrame:\n{df}\n")
# Group by a single column and calculate the sum of Sales
sales_by_region = df.groupby("Region")["Sales"].sum()
print(f"Total Sales by Region:\n{sales_by_region}\n")
# Group by multiple columns and calculate the mean of Sales
sales_by_region_product = df.groupby(["Region", "Product"])["Sales"].mean()
print(f"Average Sales by Region and Product:\n{sales_by_region_product}\n")
# Applying multiple aggregation functions at once using .agg()
# Calculate sum, mean, and count of Sales by Region
multi_agg_by_region = df.groupby("Region")["Sales"].agg(["sum", "mean", "count"])
print(f"Multiple Aggregations by Region:\n{multi_agg_by_region}\n")
# Renaming aggregated columns for clarity (named aggregation)
multi_agg_renamed = df.groupby("Region").agg(
Total_Sales=("Sales", "sum"),
Average_Sales=("Sales", "mean"),
Number_of_Transactions=("Sales", "count")
)
print(f"Multiple Aggregations by Region (Renamed Columns):\n{multi_agg_renamed}\n")
# Grouping by a column and applying different aggregations to different columns
# The original df has no 'Age' column, so we first add a dummy one for demonstration.
df["Age"] = [25, 30, 28, 35, 32, 29, 38, 31]
print(f"DataFrame with Age column:\n{df}\n")
df_agg_diff_cols = df.groupby("Region").agg(
Total_Sales=("Sales", "sum"),
Average_Age=("Age", "mean")
)
print(f"Aggregations on different columns by Region:\n{df_agg_diff_cols}")
Code Explanation & Output:
df.groupby("Region")
: This is the "split" step. It creates aDataFrameGroupBy
object, which is an intermediate object that holds the grouped information.["Sales"].sum()
: This is the "apply" and "combine" step. We select theSales
column from the grouped object and then apply thesum()
aggregation function. Pandas automatically combines the results into a new Series where the index is theRegion
.df.groupby(["Region", "Product"])["Sales"].mean()
: You can group by multiple columns by passing a list of column names togroupby()
. The result will have a MultiIndex..agg(["sum", "mean", "count"])
: The.agg()
method allows you to apply multiple aggregation functions to the same column simultaneously. You pass a list of function names (as strings).Total_Sales=("Sales", "sum")
: When using.agg()
, you can also provide new names for the aggregated columns using keyword arguments, where the value is a tuple(column_to_aggregate, aggregation_function_name)
.df.groupby("Region").agg(Total_Sales=("Sales", "sum"), Average_Age=("Age", "mean"))
: This demonstrates applying different aggregation functions to different columns within the samegroupby
operation. This is very flexible for creating summary tables.
Original Sales DataFrame:
Region Salesperson Product Sales
0 East Alice A 100
1 West Bob B 150
2 East Charlie A 120
3 North David C 200
4 West Eve B 180
5 East Frank A 110
6 North Grace C 220
7 West Heidi B 160
Total Sales by Region:
Region
East 330
North 420
West 490
Name: Sales, dtype: int64
Average Sales by Region and Product:
Region Product
East A 110.0
North C 210.0
West B 163.333333
Name: Sales, dtype: float64
Multiple Aggregations by Region:
sum mean count
Region
East 330 110.000000 3
North 420 210.000000 2
West 490 163.333333 3
Multiple Aggregations by Region (Renamed Columns):
Total_Sales Average_Sales Number_of_Transactions
Region
East 330 110.000000 3
North 420 210.000000 2
West 490 163.333333 3
DataFrame with Age column:
Region Salesperson Product Sales Age
0 East Alice A 100 25
1 West Bob B 150 30
2 East Charlie A 120 28
3 North David C 200 35
4 West Eve B 180 32
5 East Frank A 110 29
6 North Grace C 220 38
7 West Heidi B 160 31
Aggregations on different columns by Region:
Total_Sales Average_Age
Region
East 330 27.333333
North 420 36.500000
West 490 31.000000
The Power of groupby: The groupby() method is incredibly versatile and forms the basis for many complex data transformations. Mastering it is key to efficient data analysis with Pandas.
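Beyond aggregation, groupby() also pairs with .transform(), which returns a result aligned to the original rows instead of one row per group. A minimal sketch, reusing a cut-down version of the sales data above:
import pandas as pd
# Small made-up subset of the sales data, for illustration only
df = pd.DataFrame({
    "Region": ["East", "West", "East", "West"],
    "Sales": [100, 150, 120, 180]
})
# transform("sum") computes the regional total and broadcasts it back to every row
df["Region_Total"] = df.groupby("Region")["Sales"].transform("sum")
df["Share_of_Region"] = df["Sales"] / df["Region_Total"]
print(df)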
Module 3: Data Visualization - Telling Stories with Data
What is it?
Data visualization is the graphical representation of data. It involves creating charts, plots, maps, and other visual elements to help people understand complex data and identify patterns, trends, and insights that might not be obvious from looking at raw numbers.
Why is it important?
In data science, visualization is not just about making pretty pictures; it's a critical step in the analysis process. It helps in:
- Exploratory Data Analysis (EDA): Visualizing data is essential for understanding its distribution, relationships between variables, and identifying outliers or anomalies.
- Communicating Findings: Complex data insights can be effectively communicated to both technical and non-technical audiences through clear and compelling visualizations.
- Identifying Patterns and Trends: Visualizations can reveal patterns, correlations, and trends that are difficult to spot in tabular data.
- Model Evaluation: Visualizations are often used to evaluate the performance of machine learning models.
Think of data visualization as translating the language of numbers into the universal language of images. A well-designed chart can tell a powerful story about your data in a way that tables of numbers cannot.
In this module, we will explore two of the most popular and powerful data visualization libraries in Python: Matplotlib and Seaborn.
Topic: Part A - Matplotlib, the Grandparent of Plots
What is it?
Matplotlib is a comprehensive library for creating static, interactive, and animated visualizations in Python. It is the oldest and most fundamental plotting library in the Python scientific ecosystem, and many other libraries, including Seaborn, are built on top of it.
Why is it important?
Matplotlib provides a high degree of flexibility and control over your plots. While it can sometimes be more verbose than newer libraries, its foundational role means that understanding Matplotlib is key to understanding how many other plotting tools in Python work. It allows you to customize virtually every aspect of a plot.
Analogy: If data visualization is like painting a picture to tell a story, Matplotlib is like having a full set of brushes, paints, and a blank canvas. It gives you complete control over every stroke and color, allowing you to create highly customized and intricate visualizations from scratch. While this requires more effort than using pre-mixed colors or stencils (like some higher-level libraries), it offers unparalleled artistic freedom.
We will primarily use the pyplot module from Matplotlib, which provides a convenient, MATLAB-like interface for creating plots.
How do we use it?
We typically import the pyplot module with the alias plt.
import matplotlib.pyplot as plt
# Now we can use plt to create plots
print("Matplotlib imported successfully!")
Code Explanation & Output:
- import matplotlib.pyplot as plt: This line imports the pyplot module from Matplotlib and assigns it the conventional alias plt. This is the standard way to use Matplotlib for plotting.
Matplotlib imported successfully!
Anatomy of a Plot (Figure, Axes, Title, Labels)
What is it?
A Matplotlib plot is composed of several key components. Understanding these components is essential for customizing your visualizations:
- Figure: The entire window or page that contains the plot(s). It's the top-level container.
- Axes: This is the actual area where the data is plotted. A Figure can contain multiple Axes, each representing a different plot.
- Title: A descriptive title for the entire Figure or for individual Axes.
- Labels: Labels for the x-axis and y-axis to indicate what the data represents.
- Legend: Explains what each element (e.g., line, bar) in the plot represents.
Why is it important?
Knowing the anatomy of a plot allows you to target specific parts of the visualization for customization, such as setting titles, changing axis limits, adding labels, or modifying the appearance of plotted elements.
How do we use it?
Let's create a simple plot and identify its components.
import matplotlib.pyplot as plt
import numpy as np
# Prepare some data
x = np.linspace(0, 10, 100) # 100 points between 0 and 10
y = np.sin(x)
# Create a Figure and an Axes
fig, ax = plt.subplots() # Creates a Figure and a single Axes
# Plot data on the Axes
ax.plot(x, y)
# Set Title and Labels for the Axes
ax.set_title("Sine Wave")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
# Add a Figure title (optional, often Axes title is sufficient)
fig.suptitle("My First Matplotlib Plot", y=1.02) # y adjusts position
# Display the plot
plt.show()
Code Explanation & Output:
- fig, ax = plt.subplots(): This is a common way to create a Figure and one or more Axes. plt.subplots() returns a tuple containing the Figure object (fig) and an Axes object (ax). If you wanted multiple plots in one figure, you could specify the number of rows and columns (e.g., plt.subplots(2, 2)).
- ax.plot(x, y): We plot the data (x and y) on the ax (Axes) object. Most plotting methods are called on an Axes object.
- ax.set_title(), ax.set_xlabel(), ax.set_ylabel(): These methods are used to set the title and axis labels for the specific Axes.
- fig.suptitle(): This sets a title for the entire Figure. The y=1.02 argument slightly adjusts its vertical position to avoid overlapping with the Axes title.
- plt.show(): This function displays the generated plot. Without plt.show(), the plot might not be rendered or displayed depending on the environment.
# A plot window will appear with a sine wave, titled "Sine Wave" and labeled axes.
# The overall figure title will be "My First Matplotlib Plot".
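The example above covers the Figure, Axes, title, and labels; the remaining component from the list, the Legend, appears once plotted elements are given a label. A minimal sketch with made-up data:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")  # label= is what the legend will display
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_title("Sine and Cosine")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.legend()  # draws the legend using the label= values above
plt.show()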
Creating Basic Plots: plt.plot() (line), plt.scatter() (scatter), plt.bar() (bar), plt.hist() (histogram).
What is it?
Matplotlib provides functions for creating various types of basic plots, each suitable for visualizing different kinds of data relationships and distributions.
Why is it important?
Choosing the right type of plot is crucial for effectively communicating the story in your data. Line plots are great for trends over time, scatter plots for relationships between two variables, bar plots for comparing categories, and histograms for showing data distribution.
How do we use it?
Let's create examples of these basic plot types.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Data for plots
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
categories = ["A", "B", "C", "D", "E"]
values = [10, 25, 15, 20, 30]
hist_data = np.random.randn(1000) # 1000 random numbers from a standard normal distribution
# 1. Line Plot (plt.plot())
plt.figure(figsize=(8, 4)) # Create a new figure with a specified size
plt.plot(x, y, marker='o', linestyle='--', color='b')
plt.title("Simple Line Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid(True) # Add a grid
plt.show()
# 2. Scatter Plot (plt.scatter())
plt.figure(figsize=(8, 4))
plt.scatter(x, y, color='red', marker='x')
plt.title("Simple Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid(True)
plt.show()
# 3. Bar Plot (plt.bar())
plt.figure(figsize=(8, 4))
plt.bar(categories, values, color='green')
plt.title("Simple Bar Plot")
plt.xlabel("Category")
plt.ylabel("Value")
plt.show()
# 4. Histogram (plt.hist())
plt.figure(figsize=(8, 4))
plt.hist(hist_data, bins=30, color='purple', edgecolor='black') # bins define the number of bars
plt.title("Histogram of Random Data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
Code Explanation & Output:
- plt.figure(figsize=(8, 4)): Creates a new figure for each plot. figsize sets the width and height of the figure in inches.
- plt.plot(x, y, ...): Creates a line plot. Arguments like marker, linestyle, and color customize the appearance.
- plt.scatter(x, y, ...): Creates a scatter plot. color and marker customize the points.
- plt.bar(categories, values, ...): Creates a bar plot. The first argument is the x-coordinates (categories), and the second is the height of the bars (values).
- plt.hist(hist_data, bins=30, ...): Creates a histogram. hist_data is the data, and bins determines how many bars (bins) the data will be divided into. edgecolor adds borders to the bars.
- plt.title(), plt.xlabel(), plt.ylabel(): Set the title and axis labels for the current plot.
- plt.grid(True): Adds a grid to the plot for easier reading.
- plt.show(): Displays the current figure.
# Four separate plot windows will appear, showing:
# 1. A line plot with points and a dashed line.
# 2. A scatter plot with red 'x' markers.
# 3. A bar plot comparing the values of different categories.
# 4. A histogram showing the distribution of the random data.
Matplotlib Workflow: You can use either the pyplot interface (like plt.plot(), plt.title()), which implicitly manages figures and axes, or the object-oriented interface (like fig, ax = plt.subplots() and then ax.plot(), ax.set_title()). For more complex plots with multiple subplots, the object-oriented approach is generally recommended for better control.
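To make the object-oriented workflow concrete, here is a minimal sketch (with made-up data) that places four basic plot types into a single 2x2 Figure:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
fig, axes = plt.subplots(2, 2, figsize=(10, 6))  # axes is a 2x2 array of Axes objects
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("Line")
axes[0, 1].scatter(x[::10], np.cos(x[::10]))
axes[0, 1].set_title("Scatter")
axes[1, 0].bar(["A", "B", "C"], [3, 7, 5])
axes[1, 0].set_title("Bar")
axes[1, 1].hist(np.random.randn(1000), bins=30)
axes[1, 1].set_title("Histogram")
fig.tight_layout()  # keeps titles and labels from overlapping
plt.show()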
Topic: Part B - Seaborn, for Beautiful Statistical Plots
What is it?
Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn is particularly good at visualizing relationships between variables and showing distributions.
Why is it important?
Seaborn simplifies the creation of many common and complex statistical plots. It has built-in themes for aesthetics and functions specifically designed to work with Pandas DataFrames, making it very convenient for data analysis workflows. It often requires less code than Matplotlib to produce visually appealing and statistically informative plots.
Analogy: If Matplotlib is the raw paints and brushes, Seaborn is like having pre-packaged art kits and specialized tools that make it easier to create specific types of beautiful and complex statistical visualizations quickly. It handles many of the aesthetic details automatically, allowing you to focus on the data and the story you want to tell.
How do we use it?
We typically import Seaborn with the alias sns.
import seaborn as sns
import matplotlib.pyplot as plt # Seaborn works well with Matplotlib
# Now we can use sns to create statistical plots
print("Seaborn imported successfully!")
Code Explanation & Output:
- import seaborn as sns: Imports the Seaborn library with the conventional alias sns.
- import matplotlib.pyplot as plt: It's common to import Matplotlib's pyplot as well, since Seaborn plots are drawn on Matplotlib Figures and Axes and can be customized using Matplotlib functions.
Seaborn imported successfully!
Key Seaborn Plots: sns.scatterplot() (with hue), sns.countplot(), sns.boxplot(), sns.heatmap() (for correlations).
What is it?
Seaborn offers a variety of plot types tailored for statistical data. We will look at a few key ones:
- sns.scatterplot(): Similar to Matplotlib's scatter plot, but with enhanced capabilities for showing relationships between two numerical variables, including the ability to use color (hue), size, or style to represent a third categorical or numerical variable.
- sns.countplot(): Shows the counts of observations in each categorical bin using bars. It's a specialized version of a bar plot for showing the frequency of categories.
- sns.boxplot(): Displays the distribution of a numerical variable across different categories. Box plots show the median, quartiles, and potential outliers.
- sns.heatmap(): Visualizes matrix-like data (like correlation matrices) as a color-coded grid. It's excellent for quickly identifying patterns in matrices.
Why is it important?
These plots are fundamental tools for exploring relationships, distributions, and summaries within your data, especially when dealing with a mix of numerical and categorical variables.
How do we use it?
Let's use a sample DataFrame to demonstrate these Seaborn plots.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
"Category": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"],
"Value": [10, 25, 12, 30, 28, 15, 35, 22, 18, 33],
"Group": ["X", "Y", "X", "Y", "X", "Y", "X", "Y", "X", "Y"],
"Numerical_Feature_1": np.random.rand(10) * 100,
"Numerical_Feature_2": np.random.rand(10) * 50
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Set a Seaborn style (optional, but makes plots look nicer)
sns.set_theme(style="whitegrid")
# 1. Scatter Plot with Hue (sns.scatterplot())
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="Value", y="Numerical_Feature_1", hue="Category", s=100) # s controls marker size
plt.title("Scatter Plot of Value vs. Numerical_Feature_1 by Category")
plt.show()
# 2. Count Plot (sns.countplot())
plt.figure(figsize=(7, 4))
sns.countplot(data=df, x="Category", hue="Group")
plt.title("Count of Observations per Category and Group")
plt.show()
# 3. Box Plot (sns.boxplot())
plt.figure(figsize=(7, 5))
sns.boxplot(data=df, x="Category", y="Value")
plt.title("Distribution of Value per Category (Box Plot)")
plt.show()
# 4. Heatmap (sns.heatmap())
# First, calculate the correlation matrix
correlation_matrix = df[["Value", "Numerical_Feature_1", "Numerical_Feature_2"]].corr()
plt.figure(figsize=(6, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f') # annot=True shows values, cmap sets color map
plt.title("Correlation Matrix Heatmap")
plt.show()
Code Explanation & Output:
- sns.set_theme(style="whitegrid"): Sets the visual style for the plots. Seaborn comes with several built-in themes.
- sns.scatterplot(data=df, x="Value", y="Numerical_Feature_1", hue="Category", s=100): Creates a scatter plot using the specified DataFrame (data=df). x and y specify the columns for the axes. hue="Category" colors the points based on the values in the 'Category' column, automatically adding a legend. s sets the size of the markers.
- sns.countplot(data=df, x="Category", hue="Group"): Creates a bar plot showing the counts of each unique value in the 'Category' column. hue="Group" further divides the bars based on the 'Group' column.
- sns.boxplot(data=df, x="Category", y="Value"): Creates box plots for the 'Value' column, separated by the unique values in the 'Category' column.
- correlation_matrix = df[["Value", "Numerical_Feature_1", "Numerical_Feature_2"]].corr(): Calculates the pairwise correlation between the specified numerical columns in the DataFrame. The result is a correlation matrix.
- sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f'): Creates a heatmap from the correlation_matrix. annot=True displays the correlation values on the heatmap cells. cmap sets the color scheme. fmt formats the annotation text.
- plt.show(): Displays the generated plot. Seaborn functions typically return Matplotlib Axes objects, so plt.show() is still needed to display them.
# Four separate plot windows will appear, showing:
# 1. A scatter plot where points are colored by their category.
# 2. A count plot showing the frequency of each category, split by group.
# 3. Box plots illustrating the distribution of 'Value' for each category.
# 4. A heatmap showing the correlation coefficients between the numerical features.
Seaborn and Matplotlib Integration: Seaborn and Matplotlib work together seamlessly. You can create a plot using Seaborn and then use Matplotlib's plt functions (like plt.title(), plt.xlabel(), plt.ylabel(), plt.figure()) to further customize the plot.
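A minimal sketch of this integration on a small made-up DataFrame: most axes-level Seaborn functions return the Matplotlib Axes they drew on, which you can then customize with either Axes methods or plt functions.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"Category": ["A", "B", "A", "B"], "Value": [10, 25, 12, 30]})
ax = sns.boxplot(data=df, x="Category", y="Value")  # Seaborn returns a Matplotlib Axes
ax.set_title("Value by Category")                   # customized via Axes methods
ax.set_ylabel("Value (units)")
plt.xticks(rotation=45)                             # or via pyplot functions
plt.tight_layout()
plt.show()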
Module 4: The Bridge to Machine Learning - Data Preparation with Scikit-learn
What is it?
Scikit-learn (often referred to as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. While Scikit-learn is primarily known for its machine learning algorithms, it also provides a robust set of tools for data preprocessing, which is what we will focus on in this module.
Why is it important?
Data preparation is arguably the most critical step in the machine learning pipeline. Raw data is rarely in a format that machine learning algorithms can directly use. Algorithms expect numerical input, and they often perform better when numerical features are on a similar scale. Scikit-learn provides efficient and standardized ways to transform your data into the optimal format for machine learning models.
Topic: Why Do We Need to Prepare Data?
What is it?
Data preparation, also known as data preprocessing, involves transforming raw data into a clean and organized format suitable for machine learning algorithms. This includes handling missing values (which we covered in Pandas), dealing with categorical data, and scaling numerical features.
Why is it important?
Think of machine learning algorithms as highly sophisticated calculators. They are built on mathematical principles and operate on numbers. They don't inherently understand text, dates, or wildly varying scales in numerical data. If you feed them raw, unprepared data, it's like trying to teach a calculator to read a novel – it simply won't work, or it will produce nonsensical results.
Explain that ML algorithms are just math, and they need clean, numerical input.
Machine learning models are essentially complex mathematical equations and statistical functions. For these equations to work correctly and efficiently, their inputs must be in a consistent, numerical format. Imagine trying to calculate the average of a list that contains both numbers and words; the calculation would fail. Similarly, if one feature (like income) ranges from thousands to millions, while another (like age) ranges from tens to hundreds, the algorithm might disproportionately weigh the feature with the larger scale, leading to biased or inaccurate predictions. Data preparation ensures that all features contribute fairly and meaningfully to the model's learning process.
Proper data preparation can significantly improve the performance, accuracy, and training speed of your machine learning models. It's the foundation upon which successful models are built.
Topic: Handling Categorical Data
What is it?
Categorical data represents types of data which may be divided into groups. Examples include gender (Male, Female), colors (Red, Green, Blue), or cities (New York, London, Tokyo). Machine learning algorithms, being mathematical, cannot directly process these text-based categories.
The Problem: ML models don't understand text like 'Male', 'Female', or 'USA'.
Machine learning algorithms are designed to work with numerical input. When they encounter categorical data in text format, they cannot perform calculations or identify patterns. For instance, an algorithm cannot directly compare 'Male' and 'Female' in a mathematical sense. We need a way to convert these categories into a numerical representation that the algorithms can understand without implying any false relationships or order.
The Solution: One-Hot Encoding with sklearn.preprocessing.OneHotEncoder.
What is it?
One-Hot Encoding is a technique used to convert categorical variables into a numerical format that can be provided to machine learning algorithms. For each unique category in a column, it creates a new binary (0 or 1) column. If an observation belongs to a category, the corresponding new column will have a 1, and 0 otherwise.
Why is it important?
One-Hot Encoding is crucial because it transforms categorical data into a numerical format without implying any ordinal relationship or magnitude between categories. For example, if you simply assigned numbers (e.g., Male=0, Female=1), the algorithm might incorrectly assume that 'Female' is greater than or somehow superior to 'Male'. One-hot encoding avoids this by treating each category as an independent feature.
How do we use it?
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample DataFrame with a categorical column
data = {
"City": ["New York", "London", "Paris", "New York", "London"],
"Temperature": [20, 15, 22, 21, 14]
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Initialize the OneHotEncoder
# handle_unknown=\'ignore\': If a new category appears during testing that wasn\'t in training, it will be ignored.
# sparse_output=False: Returns a dense NumPy array instead of a sparse matrix.
encoder = OneHotEncoder(handle_unknown=\'ignore\', sparse_output=False)
# Fit the encoder to the \"City\" column and transform it
# .values.reshape(-1, 1) is used because OneHotEncoder expects a 2D array
encoded_features = encoder.fit_transform(df[["City"]])
# Get the new column names generated by the encoder
encoded_column_names = encoder.get_feature_names_out(["City"])
# Create a DataFrame from the encoded features
df_encoded = pd.DataFrame(encoded_features, columns=encoded_column_names)
# Concatenate the original DataFrame (excluding the original categorical column) with the new encoded columns
df_final = pd.concat([df.drop("City", axis=1), df_encoded], axis=1)
print(f"DataFrame after One-Hot Encoding:\n{df_final}")
Code Explanation & Output:
- from sklearn.preprocessing import OneHotEncoder: Imports the necessary class.
- encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False): Initializes the encoder. handle_unknown='ignore' is good practice to prevent errors if unseen categories appear later. sparse_output=False ensures the output is a dense NumPy array, which is easier to work with for beginners.
- encoded_features = encoder.fit_transform(df[["City"]]): This is a two-step process. fit(): the encoder learns all unique categories present in the "City" column. transform(): the encoder then converts these categories into the one-hot encoded numerical format. We pass df[["City"]] (note the double brackets) because OneHotEncoder expects a 2D array-like input, even for a single column.
- encoded_column_names = encoder.get_feature_names_out(["City"]): Retrieves the names of the new columns created by the encoder (e.g., City_London, City_New York).
- df_encoded = pd.DataFrame(encoded_features, columns=encoded_column_names): Converts the NumPy array of encoded features into a Pandas DataFrame with appropriate column names.
- df_final = pd.concat([df.drop("City", axis=1), df_encoded], axis=1): Combines the original DataFrame (after dropping the original "City" column) with the newly created one-hot encoded columns. axis=1 indicates concatenation along columns.
Original DataFrame:
City Temperature
0 New York 20
1 London 15
2 Paris 22
3 New York 21
4 London 14
DataFrame after One-Hot Encoding:
Temperature City_London City_New York City_Paris
0 20 0.0 1.0 0.0
1 15 1.0 0.0 0.0
2 22 0.0 0.0 1.0
3 21 0.0 1.0 0.0
4 14 1.0 0.0 0.0
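As an aside, Pandas itself offers pd.get_dummies() for quick one-hot encoding during exploratory work; OneHotEncoder is usually the better fit inside a machine learning workflow because it can be fit on training data and then reused consistently on new data. A minimal sketch on the same City/Temperature data:
import pandas as pd
data = {
    "City": ["New York", "London", "Paris", "New York", "London"],
    "Temperature": [20, 15, 22, 21, 14]
}
df = pd.DataFrame(data)
# get_dummies replaces the "City" column with one indicator column per category
df_dummies = pd.get_dummies(df, columns=["City"])
print(df_dummies)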
Topic: Scaling Numerical Data
What is it?
Scaling numerical data is a preprocessing step that transforms numerical features to a standard range. This is particularly important when features have different units or vastly different scales.
The Problem: Features on different scales (e.g., age from 0-100 vs. salary from 30k-200k) can confuse models.
Many machine learning algorithms, especially those that rely on distance calculations (like K-Nearest Neighbors, Support Vector Machines, or neural networks), are sensitive to the scale of input features. If one feature has a much larger range of values than another, the algorithm might implicitly give more weight to the feature with the larger scale, even if it's not more important. This can lead to suboptimal model performance.
For example, if you have a dataset with 'Age' (ranging from 0-100) and 'Salary' (ranging from 30,000-200,000), a distance-based algorithm might consider a difference of 10,000 in salary to be less significant than a difference of 10 in age, simply because the absolute numerical difference is smaller, even if the percentage change or real-world impact is much larger for age.
Scaling ensures that all features contribute equally to the distance calculations and that the model doesn't get biased towards features with larger numerical values.
The Solution 1: Standardization with sklearn.preprocessing.StandardScaler.
What is it?
Standardization (or Z-score normalization) transforms data such that it has a mean of 0 and a standard deviation of 1. It achieves this by subtracting the mean from each value and then dividing by the standard deviation.
Formula: z = (x - mean) / standard_deviation
Why is it important?
Standardization is particularly useful for algorithms that assume your data is normally distributed or that use gradient descent (like linear regression, logistic regression, neural networks). It helps these algorithms converge faster and perform better by placing all features on a similar scale around zero.
How do we use it?
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Sample DataFrame with numerical columns of different scales
data = {
"Age": [25, 30, 35, 28, 22],
"Salary": [70000, 85000, 90000, 72000, 65000]
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit the scaler to the data and transform it
# We select all numerical columns to scale
scaled_features = scaler.fit_transform(df[["Age", "Salary"]])
# Create a DataFrame from the scaled features
df_scaled = pd.DataFrame(scaled_features, columns=["Age_Scaled", "Salary_Scaled"])
# Concatenate with original DataFrame (optional, for comparison)
df_final = pd.concat([df, df_scaled], axis=1)
print(f"DataFrame after Standardization (StandardScaler):\n{df_final}")
Code Explanation & Output:
- from sklearn.preprocessing import StandardScaler: Imports the StandardScaler class.
- scaler = StandardScaler(): Initializes the scaler.
- scaled_features = scaler.fit_transform(df[["Age", "Salary"]]): This learns the mean and standard deviation for each selected column (fit()) and then applies the scaling transformation (transform()). Again, double brackets [[ ]] are used to pass a DataFrame (2D array) to the scaler.
- df_scaled = pd.DataFrame(scaled_features, columns=["Age_Scaled", "Salary_Scaled"]): Creates a new DataFrame from the scaled NumPy array, assigning new column names to indicate they are scaled.
- df_final = pd.concat([df, df_scaled], axis=1): Combines the original and scaled DataFrames for easy comparison.
Original DataFrame:
Age Salary
0 25 70000
1 30 85000
2 35 90000
3 28 72000
4 22 65000
DataFrame after Standardization (StandardScaler):
Age Salary Age_Scaled Salary_Scaled
0 25 70000 -0.677631 -0.675220
1 30 85000 0.451754 0.907326
2 35 90000 1.581139 1.434842
3 28 72000 0.000000 -0.464214
4 22 65000 -1.355262 -1.202735
Notice that after scaling, the Age_Scaled and Salary_Scaled columns are centered around 0 with a standard deviation of 1; because StandardScaler uses the population standard deviation, this holds exactly even for a sample of only 5 points.
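A quick numerical check of this, rescaling just the Age column:
import numpy as np
from sklearn.preprocessing import StandardScaler
ages = np.array([[25], [30], [35], [28], [22]], dtype=float)  # 2D input, one column
scaled = StandardScaler().fit_transform(ages)
print(scaled.mean())  # approximately 0.0 (up to floating-point error)
print(scaled.std())   # approximately 1.0 (np.std also uses the population formula by default)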
The Solution 2: Normalization with sklearn.preprocessing.MinMaxScaler.
What is it?
Normalization (or Min-Max scaling) transforms data to a fixed range, usually between 0 and 1. It achieves this by subtracting the minimum value from each data point and then dividing by the range (maximum value - minimum value).
Formula: x_scaled = (x - min) / (max - min)
Why is it important?
Normalization is useful when you need features to be within a specific bounded range. It's often preferred for algorithms that don't assume a specific distribution of the data, such as neural networks with activation functions that are sensitive to input ranges (e.g., sigmoid or tanh).
How do we use it?
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Sample DataFrame with numerical columns
data = {
"Age": [25, 30, 35, 28, 22],
"Salary": [70000, 85000, 90000, 72000, 65000]
}
df = pd.DataFrame(data)
print(f"Original DataFrame:\n{df}\n")
# Initialize the MinMaxScaler
scaler = MinMaxScaler()
# Fit the scaler to the data and transform it
scaled_features = scaler.fit_transform(df[["Age", "Salary"]])
# Create a DataFrame from the scaled features
df_scaled = pd.DataFrame(scaled_features, columns=["Age_Normalized", "Salary_Normalized"])
# Concatenate with original DataFrame (optional, for comparison)
df_final = pd.concat([df, df_scaled], axis=1)
print(f"DataFrame after Normalization (MinMaxScaler):\n{df_final}")
Code Explanation & Output:
- from sklearn.preprocessing import MinMaxScaler: Imports the MinMaxScaler class.
- scaler = MinMaxScaler(): Initializes the scaler.
- scaled_features = scaler.fit_transform(df[["Age", "Salary"]]): Learns the minimum and maximum values for each selected column and then applies the Min-Max scaling.
- df_scaled = pd.DataFrame(scaled_features, columns=["Age_Normalized", "Salary_Normalized"]): Creates a new DataFrame from the scaled NumPy array, assigning new column names.
Original DataFrame:
Age Salary
0 25 70000
1 30 85000
2 35 90000
3 28 72000
4 22 65000
DataFrame after Normalization (MinMaxScaler):
Age Salary Age_Normalized Salary_Normalized
0 25 70000 0.230769 0.200000
1 30 85000 0.615385 0.800000
2 35 90000 1.000000 1.000000
3 28 72000 0.461538 0.280000
4 22 65000 0.000000 0.000000
Notice that after normalization, all values in Age_Normalized and Salary_Normalized are between 0 and 1: the minimum of each column maps to 0 and the maximum maps to 1.
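A quick manual check of the formula against the Age column above (min = 22, max = 35), using plain NumPy:
import numpy as np
ages = np.array([25, 30, 35, 28, 22], dtype=float)
# x_scaled = (x - min) / (max - min)
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(ages_scaled)  # [0.23076923 0.61538462 1. 0.46153846 0.] -- matches Age_Normalized above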
When to use Standardization vs. Normalization:
- Standardization is generally preferred when the data follows a Gaussian (normal) distribution, or when algorithms assume zero mean and unit variance (e.g., Linear Regression, Logistic Regression, SVMs, Neural Networks).
- Normalization is useful when your data has a fixed range or when you need to preserve the relationships between the original values (e.g., image processing, neural networks with sigmoid activation).
- If your data has outliers, MinMaxScaler can be strongly affected, because a single extreme value sets the min or max and compresses every other value into a narrow band; StandardScaler is somewhat less sensitive. Scikit-learn also provides RobustScaler (based on the median and interquartile range) for data where outliers are a real concern.
Choosing between standardization and normalization often depends on the specific machine learning algorithm you plan to use and the characteristics of your data. It's a common practice to try both and see which one yields better results for your particular problem.
Module 5: Hands-On Capstone Project - Analyzing the Titanic Dataset
Topic: The Goal
What is it?
This module is dedicated to a hands-on project where we will apply the fundamental data science skills we've learned using NumPy, Pandas, Matplotlib, Seaborn, and Scikit-learn's preprocessing tools. Our focus will be on a real-world dataset: the famous Titanic passenger data.
Why is it important?
Learning theoretical concepts is essential, but applying them to a real dataset is where your understanding truly solidifies. This project will walk you through a typical initial data science workflow: loading data, exploring it to understand its characteristics, cleaning it to handle issues like missing values, visualizing it to uncover patterns, and preparing it for potential future use in machine learning models. It's a crucial step before you would build predictive models.
Our goal is NOT to build a predictive model yet. While the Titanic dataset is often used for classification (predicting survival), in this module, we will focus purely on the data exploration, cleaning, and preparation steps. This is often the most time-consuming part of a data science project, and mastering it is fundamental before you even think about applying machine learning algorithms.
By the end of this module, you will have a clean, well-understood, and preprocessed dataset, ready for the next steps in a data science pipeline.
Topic: The Dataset
What is it?
The dataset we will be using is the famous Titanic dataset, commonly available on platforms like Kaggle. It contains information about the passengers aboard the RMS Titanic when it sank in 1912.
Why is it important?
The Titanic dataset is a classic dataset for beginners in data science and machine learning. It's relatively small, easy to understand, and contains a mix of numerical and categorical features, as well as missing values, making it perfect for practicing data cleaning, exploration, and preprocessing techniques.
Introduce the Kaggle Titanic dataset. Explain the columns (Pclass, Sex, Age, Fare, Embarked, Survived).
The dataset typically includes the following columns:
- PassengerId: A unique identifier for each passenger.
- Survived: This is the target variable (though we won't be predicting it in this module). It indicates whether the passenger survived (1) or not (0).
- Pclass: Passenger class (1st, 2nd, or 3rd). This is a categorical feature representing socioeconomic status.
- Name: The passenger's name.
- Sex: The passenger's gender (male or female). This is a categorical feature.
- Age: The passenger's age in years. This is a numerical feature and often contains missing values.
- SibSp: Number of siblings/spouses aboard the Titanic. Numerical.
- Parch: Number of parents/children aboard the Titanic. Numerical.
- Ticket: Ticket number. String.
- Fare: The fare paid for the ticket. Numerical.
- Cabin: Cabin number. String, often contains many missing values.
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). Categorical, may contain missing values.
Our focus will be on using features like Pclass, Sex, Age, Fare, and Embarked to understand the passenger demographics and prepare this data for potential future modeling.
Topic: Step 1: Load and Explore the Data
What is it?
Loading and exploring data is the very first practical step in any data analysis project. It involves reading the dataset into a Pandas DataFrame and then using basic DataFrame methods to get a high-level overview of its structure, content, and initial characteristics.
Why is it important?
This initial exploration helps you quickly understand the dataset, identify potential issues (like missing values or incorrect data types), and form preliminary hypotheses. It sets the stage for all subsequent data cleaning, transformation, and analysis steps.
How do we use it?
We will simulate the titanic.csv file for this example. In a real scenario, you would download this file from Kaggle or another source and place it in your working directory.
import pandas as pd
import numpy as np
import os
# Simulate creating a titanic.csv file for demonstration purposes
titanic_data = {
"PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
"Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
"Name": [
"Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
"Allen, Mr. William Henry", "Moran, Mr. James",
"McCarthy, Mr. Timothy J", "Palsson, Master. Gosta Leonard",
"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)", "Nasser, Mrs. Nicholas (Adele Achem)"
],
"Sex": ["male", "female", "female", "female", "male", "male", "male", "male", "female", "female"],
"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
"SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
"Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "347742", "237736"],
"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708],
"Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan, np.nan],
"Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C"]
}
df_titanic_dummy = pd.DataFrame(titanic_data)
csv_file_path = "titanic.csv"
df_titanic_dummy.to_csv(csv_file_path, index=False)
print(f"Dummy \'{csv_file_path}\' created for demonstration.\n")
# Use Pandas to load titanic.csv
df = pd.read_csv(csv_file_path)
print("--- First 5 rows (df.head()) ---\n")
print(df.head())
print("\n" + "-"*40 + "\n")
print("--- DataFrame Info (df.info()) ---\n")
df.info()
print("\n" + "-"*40 + "\n")
print("--- Descriptive Statistics (df.describe()) ---\n")
print(df.describe())
print("\n" + "-"*40 + "\n")
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"Dummy \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Simulating titanic.csv: We create a small DataFrame df_titanic_dummy that mimics the structure and data types of the actual Titanic dataset, including some np.nan values to represent missing data in Age and Cabin.
- df_titanic_dummy.to_csv(csv_file_path, index=False): Saves this dummy DataFrame to a CSV file named titanic.csv.
- df = pd.read_csv(csv_file_path): This is the actual command you would use to load the Titanic dataset from a CSV file into a DataFrame named df.
- df.head(): Displays the first 5 rows of the DataFrame. This gives you a quick visual inspection of the data, including column names and a few sample entries.
- df.info(): Provides a summary of the DataFrame, including:
  - The number of entries (rows).
  - A list of all columns.
  - The Non-Null Count for each column, which immediately tells us where missing values exist (e.g., Age has 9 non-null values out of 10, meaning 1 missing; Cabin has only 3 non-null values, meaning 7 missing).
  - The Dtype (data type) of each column (e.g., int64, float64, object for strings).
  - Memory usage.
- df.describe(): Generates descriptive statistics for numerical columns. This includes count, mean, std (standard deviation), min, 25% (first quartile), 50% (median), 75% (third quartile), and max. This helps you understand the central tendency, spread, and range of your numerical data.
Dummy 'titanic.csv' created for demonstration.
--- First 5 rows (df.head()) ---
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
----------------------------------------
--- DataFrame Info (df.info()) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 10 non-null int64
1 Survived 10 non-null int64
2 Pclass 10 non-null int64
3 Name 10 non-null object
4 Sex 10 non-null object
5 Age 9 non-null float64
6 SibSp 10 non-null int64
7 Parch 10 non-null int64
8 Ticket 10 non-null object
9 Fare 10 non-null float64
10 Cabin 3 non-null object
11 Embarked 10 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 1.1+ KB
----------------------------------------
--- Descriptive Statistics (df.describe()) ---
PassengerId Survived Pclass Age SibSp Parch \
count 10.000000 10.000000 10.000000 9.000000 10.000000 10.000000
mean 5.500000 0.500000 2.300000 28.111111 0.700000 0.300000
std 3.027650 0.527046 0.948683 14.586689 1.059301 0.674945
min 1.000000 0.000000 1.000000 2.000000 0.000000 0.000000
25% 3.250000 0.000000 1.250000 22.000000 0.000000 0.000000
50% 5.500000 0.500000 3.000000 27.000000 0.500000 0.000000
75% 7.750000 1.000000 3.000000 35.000000 1.000000 0.000000
max 10.000000 1.000000 3.000000 54.000000 3.000000 2.000000
Fare
count 10.000000
mean 29.729310
std 22.779848
min 7.250000
25% 8.050000
50% 16.266650
75% 50.672500
max 71.283300
----------------------------------------
Dummy 'titanic.csv' removed.
Initial Observations: From df.info(), we immediately see that the Age and Cabin columns have missing values. Cabin has a significant number of missing values (only 3 non-null out of 10 entries), which might make it difficult to use directly. Age has fewer missing values, which we can likely impute. df.describe() gives us a sense of the range and distribution of numerical features like Age and Fare.
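As a small extension of df.isnull().sum(), calling .mean() on the boolean frame expresses missingness as a proportion per column, which makes it easier to decide what to drop versus impute. A minimal sketch on a tiny made-up frame:
import pandas as pd
import numpy as np
df_demo = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})
# isnull() gives booleans; the mean of booleans is the fraction of True values
print(df_demo.isnull().mean())  # Age 0.50, Cabin 0.75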
Topic: Step 2: Clean the Data
What is it?
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. It is a crucial step in the data science pipeline, as dirty data can lead to inaccurate models and misleading insights.
Why is it important?
Machine learning models are highly sensitive to the quality of the data they are trained on. Missing values, inconsistent formats, or erroneous entries can cause models to perform poorly or even fail to train. Cleaning the data ensures that our dataset is reliable and suitable for analysis and modeling.
How do we use it?
Based on our initial exploration, we identified missing values in the Age and Cabin columns. The Embarked column might also have a few missing values in the full dataset. We will focus on Age and Embarked, as Cabin has too many missing values to be easily imputed for a beginner project.
Let's start by recreating our dummy titanic.csv file to ensure we have a consistent starting point with missing values.
import pandas as pd
import numpy as np
import os
# Re-simulate creating a titanic.csv file for demonstration purposes
titanic_data = {
"PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
"Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
"Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"Name": [
"Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
"Allen, Mr. William Henry", "Moran, Mr. James",
"McCarthy, Mr. Timothy J", "Palsson, Master. Gosta Leonard",
"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)", "Nasser, Mrs. Nicholas (Adele Achem)",
"Bonnell, Miss. Elizabeth", "Saundercock, Mr. William Henry"
],
"Sex": ["male", "female", "female", "female", "male", "male", "male", "male", "female", "female", "female", "male"],
"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 58.0, np.nan],
"SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 0, 0],
"Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0],
"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "347742", "237736", "113781", "A/5. 2151"],
"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708, 26.55, 8.05],
"Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan, np.nan, "C103", np.nan],
"Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", np.nan, "S"]
}
df = pd.DataFrame(titanic_data)
csv_file_path = "titanic.csv"
df.to_csv(csv_file_path, index=False)
print(f"Dummy \'{csv_file_path}\' created for demonstration.\n")
# Load the dataset
df = pd.read_csv(csv_file_path)
# Use .isnull().sum() to find missing values (especially in 'Age' and 'Embarked')
print(f"Missing values before cleaning:\n{df.isnull().sum()}\n")
# Demonstrate filling the missing 'Age' values with the median age.
# The median is often preferred over the mean for skewed distributions or when outliers are present.
median_age = df["Age"].median()
df["Age"] = df["Age"].fillna(median_age)  # assign the result back rather than using inplace=True
print(f"Median Age used for imputation: {median_age}\n")
# Demonstrate filling the missing 'Embarked' values with the mode.
# The mode is the most frequent value, suitable for categorical data.
mode_embarked = df["Embarked"].mode()[0] # .mode() can return multiple modes, so we take the first one
df["Embarked"] = df["Embarked"].fillna(mode_embarked)
print(f"Mode Embarked port used for imputation: {mode_embarked}\n")
print(f"Missing values after cleaning:\n{df.isnull().sum()}\n")
print("DataFrame after cleaning (first few rows):\n")
print(df.head())
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"\nDummy \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Re-simulating titanic.csv: We create a slightly larger dummy dataset to better illustrate the missing values in Age and Embarked.
- df.isnull().sum(): This is our first check to confirm the presence of missing values in Age and Embarked before we start cleaning.
- median_age = df["Age"].median(): Calculates the median of the Age column. The median is the middle value in a sorted list of numbers and is robust to outliers, making it a good choice for imputing numerical data.
- df["Age"] = df["Age"].fillna(median_age): This line fills all NaN values in the Age column with the calculated median_age and assigns the result back to the column. (Assigning the result back is preferred over fillna(..., inplace=True), which may not update the original DataFrame under newer Pandas versions with copy-on-write.)
- mode_embarked = df["Embarked"].mode()[0]: Calculates the mode (most frequent value) of the Embarked column. Since mode() can return multiple values if there's a tie, we select the first one with [0].
- df["Embarked"] = df["Embarked"].fillna(mode_embarked): Fills NaN values in the Embarked column with the calculated mode_embarked.
- The final df.isnull().sum() check confirms that the Age and Embarked columns no longer have missing values. We still see missing values in Cabin because we chose not to impute it, due to the high number of missing entries.
Dummy 'titanic.csv' created for demonstration.
Missing values before cleaning:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 2
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 8
Embarked 1
dtype: int64
Median Age used for imputation: 31.0
Mode Embarked port used for imputation: S
Missing values after cleaning:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 8
Embarked 0
dtype: int64
DataFrame after cleaning (first few rows):
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Dummy 'titanic.csv' removed.
Choosing Imputation Strategy: The choice of imputation method (mean, median, mode, or more advanced techniques) depends on the data distribution and the context. For numerical data, median is often preferred for skewed distributions or presence of outliers, while mean is suitable for normally distributed data. For categorical data, mode is a common choice.
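A slightly more refined (and entirely optional) variation is to impute within groups, for example filling a missing Age with the median Age of that passenger's Pclass rather than a single global median. A minimal sketch on a tiny made-up frame whose column names mirror the Titanic data:
import pandas as pd
import numpy as np
df_demo = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [38.0, np.nan, 22.0, 26.0, np.nan],
})
# transform("median") computes the median Age per Pclass, aligned row by row
group_medians = df_demo.groupby("Pclass")["Age"].transform("median")
df_demo["Age"] = df_demo["Age"].fillna(group_medians)
print(df_demo)  # the 1st-class gap becomes 38.0, the 3rd-class gap becomes 24.0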
Topic: Step 3: Visualize the Data to Find Insights (EDA)
What is it?
After loading and cleaning our data, the next critical step is to visualize it. Data visualization, as we learned in Module 3, is about creating graphical representations of our data to uncover patterns, relationships, and distributions that are difficult to discern from raw numbers alone. This is a key part of Exploratory Data Analysis (EDA).
Why is it important?
Visualizations help us to:
- Understand distributions: How are ages distributed? What is the spread of fares?
- Identify relationships: Is there a relationship between passenger class and survival? Does gender play a role?
- Spot anomalies: Are there any unusual data points that need further investigation?
- Communicate findings: Presenting data visually makes it easier for others to understand your insights.
It's like drawing a map of a new territory – it helps you navigate and understand the landscape of your data.
How do we use it?
We will use Seaborn and Matplotlib to create various plots to gain insights from the cleaned Titanic dataset. First, let's ensure we have a fresh, cleaned DataFrame to work with.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Re-simulate creating a titanic.csv file for demonstration purposes
titanic_data = {
"PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
"Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
"Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"Name": [
"Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
"Allen, Mr. William Henry", "Moran, Mr. James",
"McCarthy, Mr. Timothy J", "Palsson, Master. Gosta Leonard",
"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)", "Nasser, Mrs. Nicholas (Adele Achem)",
"Bonnell, Miss. Elizabeth", "Saundercock, Mr. William Henry"
],
"Sex": ["male", "female", "female", "female", "male", "male", "male", "male", "female", "female", "female", "male"],
"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 58.0, np.nan],
"SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 0, 0],
"Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0],
"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "347742", "237736", "113781", "A/5. 2151"],
"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708, 26.55, 8.05],
"Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan, np.nan, "C103", np.nan],
"Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", np.nan, "S"]
}
df = pd.DataFrame(titanic_data)
csv_file_path = "titanic.csv"
df.to_csv(csv_file_path, index=False)
# Load the dataset and clean it (as done in Step 2)
df = pd.read_csv(csv_file_path)
median_age = df["Age"].median()
df["Age"] = df["Age"].fillna(median_age)
mode_embarked = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(mode_embarked)
print("Cleaned DataFrame (first 5 rows):\n", df.head())
print("\n" + "-"*40 + "\n")
# Use Seaborn's countplot to see the distribution of survivors.
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x="Survived")
plt.title("Distribution of Survivors (0=No, 1=Yes)")
plt.xlabel("Survived")
plt.ylabel("Count")
plt.show()
print("\n" + "-"*40 + "\n")
# Use Seaborn's countplot with the hue parameter to see how survival relates to 'Sex', 'Pclass', and 'Embarked'.
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x="Sex", hue="Survived")
plt.title("Survival Count by Sex")
plt.xlabel("Sex")
plt.ylabel("Count")
plt.legend(title="Survived", labels=["No", "Yes"])
plt.show()
print("\n" + "-"*40 + "\n")
plt.figure(figsize=(7, 5))
sns.countplot(data=df, x="Pclass", hue="Survived")
plt.title("Survival Count by Passenger Class")
plt.xlabel("Passenger Class")
plt.ylabel("Count")
plt.legend(title="Survived", labels=["No", "Yes"])
plt.show()
print("\n" + "-"*40 + "\n")
plt.figure(figsize=(7, 5))
sns.countplot(data=df, x="Embarked", hue="Survived")
plt.title("Survival Count by Embarked Port")
plt.xlabel("Embarked Port")
plt.ylabel("Count")
plt.legend(title="Survived", labels=["No", "Yes"])
plt.show()
print("\n" + "-"*40 + "\n")
# Use Matplotlib's hist to visualize the distribution of 'Age' and 'Fare'.
plt.figure(figsize=(8, 5))
plt.hist(df["Age"], bins=20, edgecolor=\'black\')
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
print("\n" + "-"*40 + "\n")
plt.figure(figsize=(8, 5))
plt.hist(df["Fare"], bins=30, edgecolor=\'black\')
plt.title("Distribution of Fare")
plt.xlabel("Fare")
plt.ylabel("Frequency")
plt.show()
print("\n" + "-"*40 + "\n")
# Use Seaborn's heatmap to show a correlation matrix of the numerical columns.
# Select only numerical columns for correlation calculation
numerical_df = df.select_dtypes(include=[np.number])
correlation_matrix = numerical_df.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix of Numerical Features")
plt.show()
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"\nDummy \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Setup: We re-create and clean the dummy titanic.csv to ensure consistency.
- `sns.countplot(data=df, x="Survived")`: This plot shows the number of passengers who survived (1) versus those who did not (0). It gives us a quick overview of the target variable's distribution.
- `sns.countplot(data=df, x="Sex", hue="Survived")`: By adding `hue="Survived"`, we can see the survival counts broken down by gender. This is a powerful way to visually explore relationships between categorical variables and survival. We also add a `plt.legend` to clarify what 0 and 1 represent for `Survived`.
- `sns.countplot(data=df, x="Pclass", hue="Survived")`: Similar to the above, this plot visualizes survival counts across the three passenger classes (1st, 2nd, 3rd).
- `sns.countplot(data=df, x="Embarked", hue="Survived")`: This plot shows survival counts based on the port of embarkation.
- `plt.hist(df["Age"], bins=20, edgecolor='black')`: We use Matplotlib's `hist` function to visualize the distribution of the `Age` column. `bins` controls the number of bars, and `edgecolor` adds borders for better visibility. This helps us understand the age demographics of the passengers.
- `plt.hist(df["Fare"], bins=30, edgecolor='black')`: Similarly, this plots the distribution of `Fare`, showing how ticket prices are distributed.
- `numerical_df = df.select_dtypes(include=[np.number])`: Before calculating correlations, it's good practice to select only the numerical columns from the DataFrame, as correlation is a numerical concept.
- `correlation_matrix = numerical_df.corr()`: This calculates the pairwise correlation between all numerical columns in `numerical_df`. The correlation coefficient ranges from -1 to 1, where 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear correlation (a quick ranked view of these correlations follows this list).
- `sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')`: This creates a heatmap of the correlation matrix. `annot=True` displays the correlation values on the map, `cmap='coolwarm'` sets a diverging color map (blue for strongly negative values through red for strongly positive ones), and `fmt='.2f'` formats the annotations to two decimal places. This visualization quickly highlights which numerical features are strongly correlated with each other.
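To complement the heatmap, it is often useful to rank how strongly each numerical feature correlates with the target. Here is a minimal sketch, assuming the `correlation_matrix` computed above is still in memory:
# Assumes `correlation_matrix` from the code above is still in memory.
# Rank every other numerical feature by its correlation with "Survived".
survival_corr = correlation_matrix["Survived"].drop("Survived").sort_values(ascending=False)
print(survival_corr)
This prints the same numbers shown in the heatmap's Survived row, just sorted for easier reading.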
Cleaned DataFrame (first 5 rows):
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
----------------------------------------
# A plot window will appear showing the distribution of survivors.
----------------------------------------
# A plot window will appear showing survival counts by sex.
----------------------------------------
# A plot window will appear showing survival counts by passenger class.
----------------------------------------
# A plot window will appear showing survival counts by embarked port.
----------------------------------------
# A plot window will appear showing the distribution of age.
----------------------------------------
# A plot window will appear showing the distribution of fare.
----------------------------------------
# A plot window will appear showing the correlation matrix heatmap.
Dummy 'titanic.csv' removed.
Insights from Visualizations:
- The `Survived` countplot shows that more passengers did not survive than survived.
- The `Sex` vs. `Survived` plot clearly indicates that a much higher proportion of females survived compared to males. This is a very strong indicator (quantified in the sketch below).
- The `Pclass` vs. `Survived` plot suggests that passengers in 1st class had a higher survival rate than those in 2nd or 3rd class, indicating that socioeconomic status played a role.
- The `Embarked` vs. `Survived` plot might show some differences, but generally, Southampton (`S`) had the most passengers, and its survival rate might appear lower simply because of the sheer number of people who boarded there.
- The `Age` histogram shows a distribution skewed towards younger ages, with a peak in the 20s and 30s. There are also some very young passengers.
- The `Fare` histogram shows that most passengers paid lower fares, with a long tail indicating a few very expensive tickets.
- The correlation heatmap helps us quickly see relationships between numerical features. For example, `Fare` and `Pclass` might show a negative correlation (a lower `Pclass` number means a higher class, which tends to come with a higher fare). `SibSp` and `Parch` might be positively correlated, as they both relate to family size.
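These visual impressions can be quantified with a quick `groupby`. This is a minimal sketch, assuming the cleaned `df` from the visualization code above (with `Sex` still stored as "male"/"female") is in memory:
# Survival rate (mean of the 0/1 Survived column) per group
print(df.groupby("Sex")["Survived"].mean())        # by gender
print(df.groupby("Pclass")["Survived"].mean())     # by passenger class
print(df.groupby("Embarked")["Survived"].mean())   # by port of embarkation
Because `Survived` is coded as 0/1, the mean within each group is exactly that group's survival rate.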
Topic: Step 4: Preprocess the Data for a Future Model
What is it?
Data preprocessing is the final stage of preparing our data before it can be fed into a machine learning model. This involves transforming categorical variables into numerical ones and scaling numerical features to ensure they are on a similar range. These steps are crucial because most machine learning algorithms require numerical input and perform better when features are scaled.
Why is it important?
As discussed in Module 4, machine learning algorithms are mathematical constructs that operate on numbers. They cannot directly interpret text-based categories or handle features with vastly different scales. Preprocessing ensures that our data is in the optimal format for the algorithms to learn effectively, leading to more accurate and robust models.
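To make the scaling point concrete, compare the raw ranges of the numerical features. This is a small illustrative sketch, assuming the cleaned `df` from Step 2 is available:
# Age and Fare live on very different scales; on the full Titanic dataset Age
# spans roughly 0-80 while Fare reaches several hundred, so distance- or
# gradient-based models would otherwise be dominated by Fare.
print(df[["Age", "Fare"]].describe().loc[["min", "max", "mean", "std"]])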
How do we use it?
We will apply the techniques learned in Module 4 (one-hot encoding and scaling) to our cleaned Titanic dataset. We will continue from the cleaned DataFrame we prepared in Step 2.
Let's ensure we have a fresh, cleaned DataFrame to work with.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import os
# Re-simulate creating a titanic.csv file for demonstration purposes
titanic_data = {
"PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
"Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
"Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"Name": [
"Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
"Allen, Mr. William Henry", "Moran, Mr. James",
"McCarthy, Mr. Timothy J", "Palsson, Master. Gosta Leonard",
"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)", "Nasser, Mrs. Nicholas (Adele Achem)",
"Bonnell, Miss. Elizabeth", "Saundercock, Mr. William Henry"
],
"Sex": ["male", "female", "female", "female", "male", "male", "male", "male", "female", "female", "female", "male"],
"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 58.0, np.nan],
"SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 0, 0],
"Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0],
"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "347742", "237736", "113781", "A/5. 2151"],
"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708, 26.55, 8.05],
"Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan, np.nan, "C103", np.nan],
"Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", np.nan, "S"]
}
df = pd.DataFrame(titanic_data)
csv_file_path = "titanic.csv"
df.to_csv(csv_file_path, index=False)
# Load the dataset and clean it (as done in Step 2)
df = pd.read_csv(csv_file_path)
median_age = df["Age"].median()
df["Age"].fillna(median_age, inplace=True)
mode_embarked = df["Embarked"].mode()[0]
df["Embarked"].fillna(mode_embarked, inplace=True)
print("Cleaned DataFrame (first 5 rows):\n", df.head())
print("\n" + "-"*40 + "\n")
# Use Pandas to convert the `Sex` column to 0s and 1s.
# We can use map or replace for this binary categorical variable.
# Let's map 'male' to 0 and 'female' to 1.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
print(f"DataFrame after converting \"Sex\" to numerical:\n{df[['Sex']].head()}\n")
print("\n" + "-"*40 + "\n")
# Use Scikit-learn's OneHotEncoder on the `Embarked` column.
# First, identify the categorical column to encode
categorical_cols = ["Embarked"]
# Initialize OneHotEncoder
# handle_unknown='ignore' is important for deployment if new categories might appear
# sparse_output=False ensures a dense array output
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# Fit and transform the categorical column
# Passing df[categorical_cols] (a one-column DataFrame) gives the 2D input OneHotEncoder expects
encoded_features = ohe.fit_transform(df[categorical_cols])
# Create a DataFrame from the encoded features with proper column names
encoded_df = pd.DataFrame(encoded_features, columns=ohe.get_feature_names_out(categorical_cols))
# Concatenate the original DataFrame (dropping the original 'Embarked' column) with the new encoded columns
df = pd.concat([df.drop(categorical_cols, axis=1), encoded_df], axis=1)
print(f"DataFrame after One-Hot Encoding \"Embarked\":\n{df[["Embarked_C", "Embarked_Q", "Embarked_S"]].head()}\n")
print("\n" + "-"*40 + "\n")
# Use Scikit-learn\`s StandardScaler on the \`Age\` and \`Fare\` columns.
# Identify numerical columns to scale
numerical_cols = ["Age", "Fare"]
# Initialize StandardScaler
scaler = StandardScaler()
# Fit and transform the numerical columns
scaled_features = scaler.fit_transform(df[numerical_cols])
# Create a DataFrame from the scaled features with proper column names
scaled_df = pd.DataFrame(scaled_features, columns=[col + "_Scaled" for col in numerical_cols])
# Concatenate the original DataFrame (dropping the original 'Age' and 'Fare' columns) with the new scaled columns
df = pd.concat([df.drop(numerical_cols, axis=1), scaled_df], axis=1)
print(f"DataFrame after StandardScaler on \"Age\" and \"Fare\":\n{df[["Age_Scaled", "Fare_Scaled"]].head()}\n")
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"\nDummy \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Converting `Sex`:
  - `df["Sex"].map({"male": 0, "female": 1})`: We use the `.map()` method to replace the string values `"male"` and `"female"` with their numerical equivalents, `0` and `1` respectively. This is suitable for binary categorical variables.
- One-Hot Encoding `Embarked`:
  - `ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)`: Initializes the `OneHotEncoder`. `handle_unknown='ignore'` is important for robustness in real-world scenarios where new, unseen categories might appear (see the sketch after this list). `sparse_output=False` ensures the output is a dense NumPy array.
  - `encoded_features = ohe.fit_transform(df[categorical_cols])`: The encoder learns the unique categories in `Embarked` (`fit`) and then transforms the column into new binary columns (`transform`). We pass `df[categorical_cols]`, which selects a one-column DataFrame, because `OneHotEncoder` expects a 2D array-like input.
  - `encoded_df = pd.DataFrame(encoded_features, columns=ohe.get_feature_names_out(categorical_cols))`: Converts the NumPy array output from the encoder into a Pandas DataFrame, using `get_feature_names_out()` to get the appropriate new column names (e.g., `Embarked_C`, `Embarked_Q`, `Embarked_S`).
  - `df = pd.concat([df.drop(categorical_cols, axis=1), encoded_df], axis=1)`: The original `Embarked` column is dropped, and the newly created one-hot encoded columns are concatenated back into the DataFrame.
- Scaling `Age` and `Fare`:
  - `scaler = StandardScaler()`: Initializes the `StandardScaler`.
  - `scaled_features = scaler.fit_transform(df[numerical_cols])`: The scaler learns the mean and standard deviation of `Age` and `Fare` (`fit`) and then transforms their values to have a mean of 0 and a standard deviation of 1 (`transform`).
  - `scaled_df = pd.DataFrame(scaled_features, columns=[col + "_Scaled" for col in numerical_cols])`: Creates a new DataFrame for the scaled features with new column names (e.g., `Age_Scaled`).
  - `df = pd.concat([df.drop(numerical_cols, axis=1), scaled_df], axis=1)`: The original `Age` and `Fare` columns are dropped, and the scaled versions are added to the DataFrame.
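One practical note on `fit_transform` versus `transform`: in a real project the encoder and scaler are fitted once on the training data and then reused on any new data with `transform()` only, so the new rows are encoded and scaled with the statistics learned earlier. Here is a minimal sketch, assuming `ohe` and `scaler` are the fitted objects from the code above and using a hypothetical `new_passengers` DataFrame:
# Hypothetical new data; "X" is an unseen port to show the effect of handle_unknown='ignore'.
new_passengers = pd.DataFrame({
    "Embarked": ["Q", "X"],
    "Age": [30.0, 45.0],
    "Fare": [15.0, 80.0],
})
# transform() only: reuse the categories and statistics learned during fit.
new_embarked = ohe.transform(new_passengers[["Embarked"]])   # the unseen "X" becomes an all-zero row
new_scaled = scaler.transform(new_passengers[["Age", "Fare"]])
print(new_embarked)
print(new_scaled)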
Cleaned DataFrame (first 5 rows):
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
----------------------------------------
DataFrame after converting "Sex" to numerical:
Sex
0 0
1 1
2 1
3 1
4 0
----------------------------------------
DataFrame after One-Hot Encoding "Embarked":
Embarked_C Embarked_Q Embarked_S
0 0.0 0.0 1.0
1 1.0 0.0 0.0
2 0.0 0.0 1.0
3 0.0 0.0 1.0
4 0.0 0.0 1.0
----------------------------------------
DataFrame after StandardScaler on "Age" and "Fare":
Age_Scaled Fare_Scaled
0 -0.589857 -0.589664
1 0.600856 1.961574
2 -0.262143 -0.559400
3 0.350686 1.376940
4 0.350686 -0.539235
Dummy 'titanic.csv' removed.
Final Topic: Our Data is Ready!
What is it?
After all the steps of loading, cleaning, exploring, and preprocessing, we now have a DataFrame where all relevant features are numerical and scaled appropriately. This is the format that machine learning algorithms expect and can work with effectively.
Why is it important?
This final, transformed DataFrame is the culmination of our data preparation efforts. It represents a dataset that is clean, consistent, and optimized for machine learning. Without these steps, building accurate and reliable models would be significantly more challenging, if not impossible.
How do we use it?
Let's display the final, transformed DataFrame to see the result of all our preprocessing efforts. We will also check its info()
to confirm data types and non-null counts.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import os
# Re-simulate creating a titanic.csv file for demonstration purposes
titanic_data = {
"PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
"Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1],
"Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
"Name": [
"Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
"Allen, Mr. William Henry", "Moran, Mr. James",
"McCarthy, Mr. Timothy J", "Palsson, Master. Gosta Leonard",
"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)", "Nasser, Mrs. Nicholas (Adele Achem)",
"Bonnell, Miss. Elizabeth", "Saundercock, Mr. William Henry"
],
"Sex": ["male", "female", "female", "female", "male", "male", "male", "male", "female", "female", "female", "male"],
"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 58.0, np.nan],
"SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 0, 0],
"Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0],
"Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "347742", "237736", "113781", "A/5. 2151"],
"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708, 26.55, 8.05],
"Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46", np.nan, np.nan, np.nan, "C103", np.nan],
"Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", np.nan, "S"]
}
df = pd.DataFrame(titanic_data)
csv_file_path = "titanic.csv"
df.to_csv(csv_file_path, index=False)
# Load the dataset and clean it (as done in Step 2)
df = pd.read_csv(csv_file_path)
median_age = df["Age"].median()
df["Age"].fillna(median_age, inplace=True)
mode_embarked = df["Embarked"].mode()[0]
df["Embarked"].fillna(mode_embarked, inplace=True)
# Convert Sex to numerical
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
# One-Hot Encode Embarked
categorical_cols = ["Embarked"]
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_features = ohe.fit_transform(df[categorical_cols])
encoded_df = pd.DataFrame(encoded_features, columns=ohe.get_feature_names_out(categorical_cols))
df = pd.concat([df.drop(categorical_cols, axis=1), encoded_df], axis=1)
# Scale Age and Fare
numerical_cols = ["Age", "Fare"]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[numerical_cols])
scaled_df = pd.DataFrame(scaled_features, columns=[col + "_Scaled" for col in numerical_cols])
df = pd.concat([df.drop(numerical_cols, axis=1), scaled_df], axis=1)
# Drop columns that are not needed for modeling (e.g., Name, Ticket, Cabin, PassengerId)
# Cabin has too many missing values, Name and Ticket are identifiers/strings
df_final_processed = df.drop(columns=["Name", "Ticket", "Cabin", "PassengerId"], errors='ignore')
print(f"Final Processed DataFrame (first 5 rows):\n{df_final_processed.head()}\n")
print("\n" + "-"*40 + "\n")
print("Final Processed DataFrame Info:\n")
df_final_processed.info()
# Clean up the dummy file (optional)
os.remove(csv_file_path)
print(f"\nDummy \'{csv_file_path}\' removed.")
Code Explanation & Output:
- Re-running all preprocessing steps: The code block re-executes all the cleaning and preprocessing steps (imputing missing values, converting `Sex`, one-hot encoding `Embarked`, and scaling `Age` and `Fare`) to arrive at the final processed DataFrame.
- `df_final_processed = df.drop(columns=["Name", "Ticket", "Cabin", "PassengerId"], errors='ignore')`: We drop columns that are typically not used directly in machine learning models. `Name` and `Ticket` are unique identifiers or strings that don't directly contribute to numerical models. `Cabin` is dropped due to its high number of missing values. `PassengerId` is an identifier and not a feature for the model.
- `df_final_processed.head()`: Shows the first few rows of the DataFrame after all transformations. You can see that `Sex` is now numerical, `Embarked` has been replaced by `Embarked_C`, `Embarked_Q`, and `Embarked_S` (one-hot encoded), and `Age` and `Fare` have been replaced by their scaled versions.
- `df_final_processed.info()`: Provides a final summary. Notice that all columns now have 12 non-null entries (no missing values in the remaining columns) and are numerical (`int64` or `float64`), which is the ideal format for machine learning algorithms.
Final Processed DataFrame (first 5 rows):
Survived Pclass Sex SibSp Parch Embarked_C Embarked_Q Embarked_S \
0 0 3 0 1 0 0.0 0.0 1.0
1 1 1 1 1 0 1.0 0.0 0.0
2 1 3 1 0 0 0.0 0.0 1.0
3 1 1 1 1 0 0.0 0.0 1.0
4 0 3 0 0 0 0.0 0.0 1.0
Age_Scaled Fare_Scaled
0 -0.589857 -0.589664
1 0.600856 1.961574
2 -0.262143 -0.559400
3 0.350686 1.376940
4 0.350686 -0.539235
----------------------------------------
Final Processed DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 12 non-null int64
1 Pclass 12 non-null int64
2 Sex 12 non-null int64
3 SibSp 12 non-null int64
4 Parch 12 non-null int64
5 Embarked_C 12 non-null float64
6 Embarked_Q 12 non-null float64
7 Embarked_S 12 non-null float64
8 Age_Scaled 12 non-null float64
9 Fare_Scaled 12 non-null float64
dtypes: float64(5), int64(5)
memory usage: 1.1 KB
Dummy 'titanic.csv' removed.
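As a final sanity check that this DataFrame is model-ready, here is a minimal sketch of separating the features from the target, assuming `df_final_processed` from the code above is still in memory:
X = df_final_processed.drop(columns=["Survived"])  # feature matrix (9 columns)
y = df_final_processed["Survived"]                 # target vector
print(X.shape, y.shape)                            # (12, 9) and (12,) for this mini dataset
These `X` and `y` objects are exactly what a Scikit-learn estimator would take as input in the next stage of a project.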
Conclusion:
Congratulations! You have successfully completed a comprehensive data preparation pipeline using Python, Pandas, Matplotlib, Seaborn, and Scikit-learn. This cleaned, transformed, and scaled DataFrame is now perfectly ready to be used as input for any machine learning algorithm. The next step in a real data science project would be to choose and train a machine learning model, evaluate its performance, and then deploy it. You now have the foundational skills to tackle these exciting challenges!