Part 1: General Pandas Problems
Question 1: Time Series Resampling
Given the time-series DataFrame `ts_df`, resample the data to find the mean of the 'Value' column for each 3-hour interval.
import pandas as pd
import numpy as np
# Data Setup
data = {
'DateTime': pd.to_datetime([
'2025-08-15 00:00:00', '2025-08-15 01:00:00', '2025-08-15 02:30:00',
'2025-08-15 03:30:00', '2025-08-15 04:00:00', '2025-08-15 07:00:00'
]),
'Value': [10, 12, 8, 15, 18, 20]
}
ts_df = pd.DataFrame(data)
# Your code here:
Expected Output:
Question 2: Using `pipe()` for Chainable Operations
Create two functions: `add_revenue_col(df)` and `apply_discount(df, discount=0.10)`. Then, use `df.pipe()` to apply them sequentially to `pipe_df`.
# Data Setup
data = {'Price': [100, 150, 200], 'Quantity': [5, 10, 4]}
pipe_df = pd.DataFrame(data)
# Your function definitions here:
# Your code here using .pipe():
Expected Output:
Question 3: Group-wise Percentage with `transform()`
In `transform_df`, create a new column `'Pct_of_Category_Value'` that contains the percentage of each 'Value' relative to the sum of values within its own 'Category'.
# Data Setup
data = {'Category': ['A', 'A', 'B', 'B'], 'Value': [10, 30, 20, 80]}
transform_df = pd.DataFrame(data)
# Your code here:
Expected Output:
Question 4: Creating a MultiIndex
Convert `multi_df` to use a `MultiIndex` composed of the 'Year' and 'Quarter' columns.
# Data Setup
data = {
'Year': [2024, 2024, 2025, 2025],
'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
'Sales': [1000, 1200, 1100, 1300]
}
multi_df = pd.DataFrame(data)
# Your code here:
Expected Output:
Question 5: `agg()` vs. `transform()`
a) Create `avg_value_df` showing the single average 'Value' for each 'Group'.
b) Add a `'Group_Avg_Value'` column to `agg_trans_df` showing the group average for each row.
# Data Setup
data = {'Group': ['X', 'X', 'Y', 'Y'], 'Value': [15, 25, 30, 40]}
agg_trans_df = pd.DataFrame(data)
# a) Your code for the aggregated DataFrame:
# b) Your code to add the transformed column:
print("--- Aggregated DataFrame (a) ---")
# print(avg_value_df)
print("\n--- Original DataFrame with Transformed Column (b) ---")
# print(agg_trans_df)
Expected Output:
Part 2: Dataset-Based Problems
Use the DataFrame `sales_df` created by the setup cell below for all questions in this part.
# --- DATASET SETUP ---
# Execute this cell to create the DataFrame for Part 2
np.random.seed(42)
dates = pd.date_range(start='2025-01-01', end='2025-06-30', freq='D')
regions = ['North', 'South', 'East', 'West']
products = ['Electronics', 'Clothing', 'Books', 'Home Goods']
data = {
'Date': np.random.choice(dates, 500),
'Region': np.random.choice(regions, 500),
'Product': np.random.choice(products, 500),
'Units Sold': np.random.randint(10, 100, 500),
'Price': np.round(np.random.uniform(20, 500, 500), 2)
}
sales_df = pd.DataFrame(data).sort_values(by='Date').reset_index(drop=True)
print("Sales DataFrame created. Here are the first 5 rows:")
print(sales_df.head())
Question 6: Filtered Aggregation
Calculate the total revenue (Units Sold * Price) for each region, but only for the 'Electronics' product category.
Expected Output:
Question 7: Time-Series Growth Calculation
First, create a 'Revenue' column. Then, for the 'North' region only, find the month with the highest percentage sales growth over the previous month.
Expected Output:
Question 8: Pivot Table Creation
Create a pivot table showing the average 'Units Sold' for each 'Product' (rows) across each 'Region' (columns).
Expected Output:
Question 9: Group-wise N-Largest
Find the top 2 transactions (rows) with the highest 'Units Sold' within each region.
Expected Output:
Question 10: Filtering with a Transformed Group Value
Find all transactions where 'Units Sold' was greater than the average 'Units Sold' for that specific product category. Display the shape of the resulting DataFrame.
Expected Output:
Question 11: Missing Data Imputation
Load the `aps_failure_sample.csv` dataset from the provided URL. The file uses 'na' as the missing value marker. First, calculate and display the total number of missing values. Then, create a new DataFrame `df_imputed` where all missing values are filled with the mean of their respective columns.
Download Dataset File
# Steps to follow:
# 1. Load the dataset
# 2. Calculate and print total missing values
# 3. Create df_imputed and fill missing values with column means
Expected Output:
... (The imputed DataFrame will be too large to display, but your code should create it)