Pandas Lab Assignment: Data Manipulation and Analysis

Objective: To solve a series of problems using the pandas library, testing your ability to manipulate, aggregate, and analyze data.

Part 1: General Pandas Problems

Question 1: Time Series Resampling

Given the time-series DataFrame `ts_df`, resample the data to find the mean of the 'Value' column for each 3-hour interval.

import pandas as pd
import numpy as np

# Data Setup
data = {
    'DateTime': pd.to_datetime([
        '2025-08-15 00:00:00', '2025-08-15 01:00:00', '2025-08-15 02:30:00',
        '2025-08-15 03:30:00', '2025-08-15 04:00:00', '2025-08-15 07:00:00'
    ]),
    'Value': [10, 12, 8, 15, 18, 20]
}
ts_df = pd.DataFrame(data)

# Your code here:

Expected Output:

DateTime
2025-08-15 00:00:00    10.0
2025-08-15 03:00:00    16.5
2025-08-15 06:00:00    20.0
Name: Value, dtype: float64
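
One possible approach, sketched as a self-contained snippet that rebuilds the setup data: set 'DateTime' as the index so the data becomes a time series, then resample at a 3-hour frequency (the lowercase '3h' alias avoids the 'H' spelling deprecated in newer pandas):

```python
import pandas as pd

# Rebuild the setup data so the sketch runs on its own
data = {
    'DateTime': pd.to_datetime([
        '2025-08-15 00:00:00', '2025-08-15 01:00:00', '2025-08-15 02:30:00',
        '2025-08-15 03:30:00', '2025-08-15 04:00:00', '2025-08-15 07:00:00'
    ]),
    'Value': [10, 12, 8, 15, 18, 20]
}
ts_df = pd.DataFrame(data)

# Index by time, then average each 3-hour bucket
result = ts_df.set_index('DateTime')['Value'].resample('3h').mean()
print(result)
```

Each bucket spans a half-open interval such as [00:00, 03:00), so the rows at 00:00, 01:00, and 02:30 average together to 10.0.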

Question 2: Using `pipe()` for Chainable Operations

Create two functions: `add_revenue_col(df)` and `apply_discount(df, discount=0.10)`. Then, use `df.pipe()` to apply them sequentially to `pipe_df`.

# Data Setup
data = {'Price': [100, 150, 200], 'Quantity': [5, 10, 4]}
pipe_df = pd.DataFrame(data)

# Your function definitions here:

# Your code here using .pipe():

Expected Output:

   Price  Quantity  Revenue
0   90.0         5      500
1  135.0        10     1500
2  180.0         4      800
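
One way to structure this, shown as a self-contained sketch: each function takes a DataFrame and returns a DataFrame, which is exactly what makes them chainable with `.pipe()`. Note that revenue must be computed before the discount is applied to match the expected output:

```python
import pandas as pd

data = {'Price': [100, 150, 200], 'Quantity': [5, 10, 4]}
pipe_df = pd.DataFrame(data)

def add_revenue_col(df):
    # Revenue is computed on the pre-discount price
    df = df.copy()
    df['Revenue'] = df['Price'] * df['Quantity']
    return df

def apply_discount(df, discount=0.10):
    df = df.copy()
    df['Price'] = df['Price'] * (1 - discount)
    return df

# Each pipe() call feeds the previous result into the next function
result = pipe_df.pipe(add_revenue_col).pipe(apply_discount)
print(result)
```

Copying inside each function keeps `pipe_df` unmodified, which is the usual convention for pipeable helpers.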

Question 3: Group-wise Percentage with `transform()`

In `transform_df`, create a new column `'Pct_of_Category_Value'` that contains the percentage of each 'Value' relative to the sum of values within its own 'Category'.

# Data Setup
data = {'Category': ['A', 'A', 'B', 'B'], 'Value': [10, 30, 20, 80]}
transform_df = pd.DataFrame(data)

# Your code here:

Expected Output:

  Category  Value  Pct_of_Category_Value
0        A     10                   0.25
1        A     30                   0.75
2        B     20                   0.20
3        B     80                   0.80
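
A sketch of the `transform()` idiom: `groupby(...).transform('sum')` returns a Series aligned to the original rows, so it can be used directly as a denominator:

```python
import pandas as pd

data = {'Category': ['A', 'A', 'B', 'B'], 'Value': [10, 30, 20, 80]}
transform_df = pd.DataFrame(data)

# transform('sum') broadcasts each category total back to its member rows
transform_df['Pct_of_Category_Value'] = (
    transform_df['Value'] / transform_df.groupby('Category')['Value'].transform('sum')
)
print(transform_df)
```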

Question 4: Creating a MultiIndex

Convert `multi_df` to use a `MultiIndex` composed of the 'Year' and 'Quarter' columns.

# Data Setup
data = {
    'Year': [2024, 2024, 2025, 2025],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'Sales': [1000, 1200, 1100, 1300]
}
multi_df = pd.DataFrame(data)

# Your code here:

Expected Output:

              Sales
Year Quarter
2024 Q1        1000
     Q2        1200
2025 Q1        1100
     Q2        1300
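
One possible solution: pass a list of column names to `set_index()`, which builds the `MultiIndex` levels in the order given:

```python
import pandas as pd

data = {
    'Year': [2024, 2024, 2025, 2025],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'Sales': [1000, 1200, 1100, 1300]
}
multi_df = pd.DataFrame(data)

# A list of columns produces a two-level index: Year (outer), Quarter (inner)
multi_df = multi_df.set_index(['Year', 'Quarter'])
print(multi_df)
```

Rows can then be selected with tuples, e.g. `multi_df.loc[(2024, 'Q2')]`.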

Question 5: `agg()` vs. `transform()`

a) Create `avg_value_df` showing the single average 'Value' for each 'Group'.
b) Add a `'Group_Avg_Value'` column to `agg_trans_df` showing the group average for each row.

# Data Setup
data = {'Group': ['X', 'X', 'Y', 'Y'], 'Value': [15, 25, 30, 40]}
agg_trans_df = pd.DataFrame(data)

# a) Your code for the aggregated DataFrame:

# b) Your code to add the transformed column:

print("--- Aggregated DataFrame (a) ---")
# print(avg_value_df)
print("\n--- Original DataFrame with Transformed Column (b) ---")
# print(agg_trans_df)

Expected Output:

--- Aggregated DataFrame (a) ---
       Value
Group
X       20.0
Y       35.0

--- Original DataFrame with Transformed Column (b) ---
  Group  Value  Group_Avg_Value
0     X     15             20.0
1     X     25             20.0
2     Y     30             35.0
3     Y     40             35.0
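
A sketch contrasting the two operations: `agg()` collapses each group to a single row, while `transform()` broadcasts the group statistic back to the original shape:

```python
import pandas as pd

data = {'Group': ['X', 'X', 'Y', 'Y'], 'Value': [15, 25, 30, 40]}
agg_trans_df = pd.DataFrame(data)

# a) agg() returns one row per group (2 rows here)
avg_value_df = agg_trans_df.groupby('Group').agg({'Value': 'mean'})

# b) transform() returns one value per original row (4 rows here)
agg_trans_df['Group_Avg_Value'] = agg_trans_df.groupby('Group')['Value'].transform('mean')

print(avg_value_df)
print(agg_trans_df)
```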

Part 2: Dataset-Based Problems

Use the DataFrame `sales_df` created by the setup cell below for all questions in this part.

# --- DATASET SETUP ---
# Execute this cell to create the DataFrame for Part 2
np.random.seed(42)
dates = pd.date_range(start='2025-01-01', end='2025-06-30', freq='D')
regions = ['North', 'South', 'East', 'West']
products = ['Electronics', 'Clothing', 'Books', 'Home Goods']
data = {
    'Date': np.random.choice(dates, 500),
    'Region': np.random.choice(regions, 500),
    'Product': np.random.choice(products, 500),
    'Units Sold': np.random.randint(10, 100, 500),
    'Price': np.round(np.random.uniform(20, 500, 500), 2)
}
sales_df = pd.DataFrame(data).sort_values(by='Date').reset_index(drop=True)
print("Sales DataFrame created. Here are the first 5 rows:")
print(sales_df.head())

Question 6: Filtered Aggregation

Calculate the total revenue (Units Sold * Price) for each region, but only for the 'Electronics' product category.

Expected Output:

Region
East     307409.52
North    188659.88
South    252758.55
West     277568.10
dtype: float64

Question 7: Time-Series Growth Calculation

First, create a 'Revenue' column. Then, for the 'North' region only, find the month with the highest percentage sales growth over the previous month.

Expected Output:

Timestamp('2025-03-31 00:00:00', freq='M')
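
One possible approach (self-contained; it re-runs the setup): sum the North region's revenue per calendar month, then chain `pct_change()` and `idxmax()`. Grouping by `Date.dt.to_period('M')` sidesteps the resample alias that changed from 'M' to 'ME' in recent pandas; the result is the `Period` for March 2025, the same month as the month-end Timestamp shown above:

```python
import pandas as pd
import numpy as np

# Rebuild the Part 2 dataset with the same seed as the setup cell
np.random.seed(42)
dates = pd.date_range(start='2025-01-01', end='2025-06-30', freq='D')
regions = ['North', 'South', 'East', 'West']
products = ['Electronics', 'Clothing', 'Books', 'Home Goods']
data = {
    'Date': np.random.choice(dates, 500),
    'Region': np.random.choice(regions, 500),
    'Product': np.random.choice(products, 500),
    'Units Sold': np.random.randint(10, 100, 500),
    'Price': np.round(np.random.uniform(20, 500, 500), 2)
}
sales_df = pd.DataFrame(data).sort_values(by='Date').reset_index(drop=True)

# Create the Revenue column, then restrict to the North region
sales_df['Revenue'] = sales_df['Units Sold'] * sales_df['Price']
north = sales_df[sales_df['Region'] == 'North']

# Monthly revenue totals, then month-over-month percentage growth
monthly = north.groupby(north['Date'].dt.to_period('M'))['Revenue'].sum()
best_month = monthly.pct_change().idxmax()
print(best_month)
```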

Question 8: Pivot Table Creation

Create a pivot table showing the average 'Units Sold' for each 'Product' (rows) across each 'Region' (columns).

Expected Output:

Region             East      North      South       West
Product
Books         53.250000  54.677419  54.333333  53.238095
Clothing      59.000000  51.687500  52.896552  49.500000
Electronics   54.403509  56.029412  55.511111  52.921569
Home Goods    52.483871  49.300000  51.583333  58.916667

Question 9: Group-wise N-Largest

Find the top 2 transactions (rows) with the highest 'Units Sold' within each region.

Expected Output:

          Date Region      Product  Units Sold   Price
102 2025-02-12   East     Clothing          99  131.06
368 2025-05-18   East        Books          98  331.07
28  2025-01-14  North   Home Goods          99  357.11
365 2025-05-16  North        Books          98  472.07
39  2025-01-18  South        Books          99  107.03
453 2025-06-15  South     Clothing          99  424.36
238 2025-04-05   West        Books          99  136.26
195 2025-03-22   West  Electronics          98  422.37
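
One possible approach (self-contained; it re-runs the setup): sort by 'Units Sold' descending, then keep the first two rows of each region with `groupby(...).head(2)`. When several rows tie on 'Units Sold', which two survive can differ from the listing above; calling `nlargest(2, 'Units Sold')` per group is an equivalent alternative:

```python
import pandas as pd
import numpy as np

# Rebuild the Part 2 dataset with the same seed as the setup cell
np.random.seed(42)
dates = pd.date_range(start='2025-01-01', end='2025-06-30', freq='D')
regions = ['North', 'South', 'East', 'West']
products = ['Electronics', 'Clothing', 'Books', 'Home Goods']
data = {
    'Date': np.random.choice(dates, 500),
    'Region': np.random.choice(regions, 500),
    'Product': np.random.choice(products, 500),
    'Units Sold': np.random.randint(10, 100, 500),
    'Price': np.round(np.random.uniform(20, 500, 500), 2)
}
sales_df = pd.DataFrame(data).sort_values(by='Date').reset_index(drop=True)

# Sort high-to-low, take the top 2 rows per region, then order by region
top2 = (
    sales_df.sort_values('Units Sold', ascending=False, kind='stable')
    .groupby('Region')
    .head(2)
    .sort_values('Region', kind='stable')
)
print(top2)
```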

Question 10: Filtering with a Transformed Group Value

Find all transactions where 'Units Sold' was greater than the average 'Units Sold' for that specific product category. Display the shape of the resulting DataFrame.

Expected Output:

(253, 5)

Question 11: Missing Data Imputation

Load the `aps_failure_sample.csv` dataset from the provided URL. The file uses 'na' as the missing value marker. First, calculate and display the total number of missing values. Then, create a new DataFrame `df_imputed` where all missing values are filled with the mean of their respective columns.

Download Dataset File
# Steps to follow:
# 1. Load the dataset
# 2. Calculate and print total missing values
# 3. Create df_imputed and fill missing values with column means

Expected Output:

Total missing values: 71649
... (The imputed DataFrame will be too large to display, but your code should create it)
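
The dataset URL is not reproduced here, so this sketch stands in a small in-memory CSV with the same 'na' marker; the column names are placeholders, but the `na_values` / `isna` / `fillna` workflow is what the question asks for (for the real file, the read call would be `pd.read_csv('aps_failure_sample.csv', na_values='na')`):

```python
import io
import pandas as pd

# Placeholder stand-in for aps_failure_sample.csv: 'na' marks missing values
csv_text = "class,sensor_1,sensor_2\npos,10,na\nneg,na,30\nneg,20,50\n"

# 1. Load the dataset, telling pandas to treat 'na' as NaN
df = pd.read_csv(io.StringIO(csv_text), na_values='na')

# 2. Total missing values across the whole DataFrame
total_missing = df.isna().sum().sum()
print(f"Total missing values: {total_missing}")

# 3. Fill each numeric column's NaNs with that column's mean
df_imputed = df.fillna(df.mean(numeric_only=True))
print(df_imputed)
```

`df.mean(numeric_only=True)` returns a Series indexed by column name, and `fillna` with a Series fills each column from its matching entry, leaving non-numeric columns untouched.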