Pandas Lab Assignment: Data Manipulation and Analysis

Objective: To solve a series of problems using the pandas library, testing your ability to manipulate, aggregate, and analyze data.

Part 1: General Pandas Problems

Question 1: Time Series Resampling

Given the time-series DataFrame `ts_df`, resample the data to find the mean of the 'Value' column for each 3-hour interval.

import pandas as pd
import numpy as np

# Data Setup
data = {
    'DateTime': pd.to_datetime([
        '2025-08-15 00:00:00', '2025-08-15 01:00:00', '2025-08-15 02:30:00',
        '2025-08-15 03:30:00', '2025-08-15 04:00:00', '2025-08-15 07:00:00'
    ]),
    'Value': [10, 12, 8, 15, 18, 20]
}
ts_df = pd.DataFrame(data)

# Your code here:

Expected Output:

DateTime
2025-08-15 00:00:00    10.0
2025-08-15 03:00:00    16.5
2025-08-15 06:00:00    20.0
Name: Value, dtype: float64
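
One possible approach, sketched as a self-contained snippet that rebuilds the setup data: set 'DateTime' as the index so the data becomes a time series, then resample at a 3-hour frequency (the lowercase '3h' alias avoids the 'H' spelling deprecated in newer pandas):

```python
import pandas as pd

# Rebuild the setup data so the sketch runs on its own
data = {
    'DateTime': pd.to_datetime([
        '2025-08-15 00:00:00', '2025-08-15 01:00:00', '2025-08-15 02:30:00',
        '2025-08-15 03:30:00', '2025-08-15 04:00:00', '2025-08-15 07:00:00'
    ]),
    'Value': [10, 12, 8, 15, 18, 20]
}
ts_df = pd.DataFrame(data)

# Index by time, then average each 3-hour bucket
result = ts_df.set_index('DateTime')['Value'].resample('3h').mean()
print(result)
```

Each bucket spans a half-open interval such as [00:00, 03:00), so the rows at 00:00, 01:00, and 02:30 average together to 10.0.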

Question 2: Using `pipe()` for Chainable Operations

Create two functions: `add_revenue_col(df)` and `apply_discount(df, discount=0.10)`. Then, use `df.pipe()` to apply them sequentially to `pipe_df`.

# Data Setup
data = {'Price': [100, 150, 200], 'Quantity': [5, 10, 4]}
pipe_df = pd.DataFrame(data)

# Your function definitions here:

# Your code here using .pipe():

Expected Output:

   Price  Quantity  Revenue
0   90.0         5      500
1  135.0        10     1500
2  180.0         4      800
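
One way to structure this, shown as a self-contained sketch: each function takes a DataFrame and returns a DataFrame, which is exactly what makes them chainable with `.pipe()`. Note that revenue must be computed before the discount is applied to match the expected output:

```python
import pandas as pd

data = {'Price': [100, 150, 200], 'Quantity': [5, 10, 4]}
pipe_df = pd.DataFrame(data)

def add_revenue_col(df):
    # Revenue is computed on the pre-discount price
    df = df.copy()
    df['Revenue'] = df['Price'] * df['Quantity']
    return df

def apply_discount(df, discount=0.10):
    df = df.copy()
    df['Price'] = df['Price'] * (1 - discount)
    return df

# Each pipe() call feeds the previous result into the next function
result = pipe_df.pipe(add_revenue_col).pipe(apply_discount)
print(result)
```

Copying inside each function keeps `pipe_df` unmodified, which is the usual convention for pipeable helpers.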

Question 3: Group-wise Percentage with `transform()`

In `transform_df`, create a new column `'Pct_of_Category_Value'` that contains the percentage of each 'Value' relative to the sum of values within its own 'Category'.

# Data Setup
data = {'Category': ['A', 'A', 'B', 'B'], 'Value': [10, 30, 20, 80]}
transform_df = pd.DataFrame(data)

# Your code here:

Expected Output:

  Category  Value  Pct_of_Category_Value
0        A     10                   0.25
1        A     30                   0.75
2        B     20                   0.20
3        B     80                   0.80
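
A sketch of the `transform()` idiom: `groupby(...).transform('sum')` returns a Series aligned to the original rows, so it can be used directly as a denominator:

```python
import pandas as pd

data = {'Category': ['A', 'A', 'B', 'B'], 'Value': [10, 30, 20, 80]}
transform_df = pd.DataFrame(data)

# transform('sum') broadcasts each category total back to its member rows
transform_df['Pct_of_Category_Value'] = (
    transform_df['Value'] / transform_df.groupby('Category')['Value'].transform('sum')
)
print(transform_df)
```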

Question 4: Creating a MultiIndex

Convert `multi_df` to use a `MultiIndex` composed of the 'Year' and 'Quarter' columns.

# Data Setup
data = {
    'Year': [2024, 2024, 2025, 2025],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'Sales': [1000, 1200, 1100, 1300]
}
multi_df = pd.DataFrame(data)

# Your code here:

Expected Output:

              Sales
Year Quarter
2024 Q1        1000
     Q2        1200
2025 Q1        1100
     Q2        1300
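
One possible solution: pass a list of column names to `set_index()`, which builds the `MultiIndex` levels in the order given:

```python
import pandas as pd

data = {
    'Year': [2024, 2024, 2025, 2025],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'Sales': [1000, 1200, 1100, 1300]
}
multi_df = pd.DataFrame(data)

# A list of columns produces a two-level index: Year (outer), Quarter (inner)
multi_df = multi_df.set_index(['Year', 'Quarter'])
print(multi_df)
```

Rows can then be selected with tuples, e.g. `multi_df.loc[(2024, 'Q2')]`.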

Question 5: `agg()` vs. `transform()`

a) Create `avg_value_df` showing the single average 'Value' for each 'Group'.
b) Add a `'Group_Avg_Value'` column to `agg_trans_df` showing the group average for each row.

# Data Setup
data = {'Group': ['X', 'X', 'Y', 'Y'], 'Value': [15, 25, 30, 40]}
agg_trans_df = pd.DataFrame(data)

# a) Your code for the aggregated DataFrame:

# b) Your code to add the transformed column:

print("--- Aggregated DataFrame (a) ---")
# print(avg_value_df)
print("\n--- Original DataFrame with Transformed Column (b) ---")
# print(agg_trans_df)

Expected Output:

--- Aggregated DataFrame (a) ---
       Value
Group
X       20.0
Y       35.0

--- Original DataFrame with Transformed Column (b) ---
  Group  Value  Group_Avg_Value
0     X     15             20.0
1     X     25             20.0
2     Y     30             35.0
3     Y     40             35.0
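
A sketch contrasting the two operations: `agg()` collapses each group to a single row, while `transform()` broadcasts the group statistic back to the original shape:

```python
import pandas as pd

data = {'Group': ['X', 'X', 'Y', 'Y'], 'Value': [15, 25, 30, 40]}
agg_trans_df = pd.DataFrame(data)

# a) agg() returns one row per group (2 rows here)
avg_value_df = agg_trans_df.groupby('Group').agg({'Value': 'mean'})

# b) transform() returns one value per original row (4 rows here)
agg_trans_df['Group_Avg_Value'] = agg_trans_df.groupby('Group')['Value'].transform('mean')

print(avg_value_df)
print(agg_trans_df)
```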

Part 2: Dataset-Based Problems

Use the DataFrame `sales_df` created by the setup cell below for all questions in this part.

# --- DATASET SETUP ---
# Execute this cell to create the DataFrame for Part 2
np.random.seed(42)
dates = pd.date_range(start='2025-01-01', end='2025-06-30', freq='D')
regions = ['North', 'South', 'East', 'West']
products = ['Electronics', 'Clothing', 'Books', 'Home Goods']
data = {
    'Date': np.random.choice(dates, 500),
    'Region': np.random.choice(regions, 500),
    'Product': np.random.choice(products, 500),
    'Units Sold': np.random.randint(10, 100, 500),
    'Price': np.round(np.random.uniform(20, 500, 500), 2)
}
sales_df = pd.DataFrame(data).sort_values(by='Date').reset_index(drop=True)
print("Sales DataFrame created. Here are the first 5 rows:")
print(sales_df.head())

Question 6: Filtered Aggregation

Calculate the total revenue (Units Sold * Price) for each region, but only for the 'Electronics' product category.

Expected Output:

Region
East     307409.52
North    188659.88
South    252758.55
West     277568.10
dtype: float64

Question 7: Time-Series Growth Calculation

First, create a 'Revenue' column. Then, for the 'North' region only, find the month with the highest percentage sales growth over the previous month.

Expected Output:

Timestamp('2025-03-31 00:00:00', freq='M')
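
One possible approach (self-contained; it re-runs the setup): sum the North region's revenue per calendar month, then chain `pct_change()` and `idxmax()`. Grouping by `Date.dt.to_period('M')` sidesteps the resample alias that changed from 'M' to 'ME' in recent pandas; the result is the `Period` for March 2025, the same month as the month-end Timestamp shown above:

```python
import pandas as pd
import numpy as np

# Rebuild the Part 2 dataset with the same seed as the setup cell
np.random.seed(42)
dates = pd.date_range(start='2025-01-01', end='2025-06-30', freq='D')
regions = ['North', 'South', 'East', 'West']
products = ['Electronics', 'Clothing', 'Books', 'Home Goods']
data = {
    'Date': np.random.choice(dates, 500),
    'Region': np.random.choice(regions, 500),
    'Product': np.random.choice(products, 500),
    'Units Sold': np.random.randint(10, 100, 500),
    'Price': np.round(np.random.uniform(20, 500, 500), 2)
}
sales_df = pd.DataFrame(data).sort_values(by='Date').reset_index(drop=True)

# Create the Revenue column, then restrict to the North region
sales_df['Revenue'] = sales_df['Units Sold'] * sales_df['Price']
north = sales_df[sales_df['Region'] == 'North']

# Monthly revenue totals, then month-over-month percentage growth
monthly = north.groupby(north['Date'].dt.to_period('M'))['Revenue'].sum()
best_month = monthly.pct_change().idxmax()
print(best_month)
```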

Question 8: Pivot Table Creation

Create a pivot table showing the average 'Units Sold' for each 'Product' (rows) across each 'Region' (columns).

Expected Output:

Region             East      North      South       West
Product
Books         53.250000  54.677419  54.333333  53.238095
Clothing      59.000000  51.687500  52.896552  49.500000
Electronics   54.403509  56.029412  55.511111  52.921569
Home Goods    52.483871  49.300000  51.583333  58.916667

Question 9: Group-wise N-Largest

Find the top 2 transactions (rows) with the highest 'Units Sold' within each region.

Expected Output:

          Date Region      Product  Units Sold   Price
102 2025-02-12   East     Clothing          99  131.06
368 2025-05-18   East        Books          98  331.07
28  2025-01-14  North   Home Goods          99  357.11
365 2025-05-16  North        Books          98  472.07
39  2025-01-18  South        Books          99  107.03
453 2025-06-15  South     Clothing          99  424.36
238 2025-04-05   West        Books          99  136.26
195 2025-03-22   West  Electronics          98  422.37
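
One possible approach (self-contained; it re-runs the setup): sort by 'Units Sold' descending, then keep the first two rows of each region with `groupby(...).head(2)`. When several rows tie on 'Units Sold', which two survive can differ from the listing above; calling `nlargest(2, 'Units Sold')` per group is an equivalent alternative:

```python
import pandas as pd
import numpy as np

# Rebuild the Part 2 dataset with the same seed as the setup cell
np.random.seed(42)
dates = pd.date_range(start='2025-01-01', end='2025-06-30', freq='D')
regions = ['North', 'South', 'East', 'West']
products = ['Electronics', 'Clothing', 'Books', 'Home Goods']
data = {
    'Date': np.random.choice(dates, 500),
    'Region': np.random.choice(regions, 500),
    'Product': np.random.choice(products, 500),
    'Units Sold': np.random.randint(10, 100, 500),
    'Price': np.round(np.random.uniform(20, 500, 500), 2)
}
sales_df = pd.DataFrame(data).sort_values(by='Date').reset_index(drop=True)

# Sort high-to-low, take the top 2 rows per region, then order by region
top2 = (
    sales_df.sort_values('Units Sold', ascending=False, kind='stable')
    .groupby('Region')
    .head(2)
    .sort_values('Region', kind='stable')
)
print(top2)
```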

Question 10: Filtering with a Transformed Group Value

Find all transactions where 'Units Sold' was greater than the average 'Units Sold' for that specific product category. Display the shape of the resulting DataFrame.

Expected Output:

(253, 5)

Question 11: Missing Data Imputation

Load the `aps_failure_sample.csv` dataset from the provided URL. The file uses 'na' as the missing value marker. First, calculate and display the total number of missing values. Then, create a new DataFrame `df_imputed` where all missing values are filled with the mean of their respective columns.

Download Dataset File
# Steps to follow:
# 1. Load the dataset
# 2. Calculate and print total missing values
# 3. Create df_imputed and fill missing values with column means

Expected Output:

Total missing values: 71649
... (The imputed DataFrame will be too large to display, but your code should create it)
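
The dataset URL is not reproduced here, so this sketch stands in a small in-memory CSV with the same 'na' marker; the column names are placeholders, but the `na_values` / `isna` / `fillna` workflow is what the question asks for (for the real file, the read call would be `pd.read_csv('aps_failure_sample.csv', na_values='na')`):

```python
import io
import pandas as pd

# Placeholder stand-in for aps_failure_sample.csv: 'na' marks missing values
csv_text = "class,sensor_1,sensor_2\npos,10,na\nneg,na,30\nneg,20,50\n"

# 1. Load the dataset, telling pandas to treat 'na' as NaN
df = pd.read_csv(io.StringIO(csv_text), na_values='na')

# 2. Total missing values across the whole DataFrame
total_missing = df.isna().sum().sum()
print(f"Total missing values: {total_missing}")

# 3. Fill each numeric column's NaNs with that column's mean
df_imputed = df.fillna(df.mean(numeric_only=True))
print(df_imputed)
```

`df.mean(numeric_only=True)` returns a Series indexed by column name, and `fillna` with a Series fills each column from its matching entry, leaving non-numeric columns untouched.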