Skip to content

Pandas Basics — the core DataFrame operations

A guided tour of the pandas operations you reach for every day, split into one cell per concept. Running cell N re-uses the artifact from cell N-1, so edits stay fast and the DAG stays honest.

What it shows

  • A linear chain where each cell reads the previous cell's output.
  • Cache-hit behavior: re-running a cell after its upstream hasn't changed finishes in a few milliseconds.
  • How Strata's staleness propagation works — edit cell 2 and cells 3-7 turn yellow automatically.

Cells

Cell What it does
create_data Builds a small sales DataFrame as the root of the chain.
select_filter Column selection + boolean indexing.
add_columns Derived columns (e.g. total = price * quantity).
groupby groupby + aggregate.
pivot Pivot from long to wide.
merge Join two DataFrames.
summary Describe and basic stats.

Running

From the project root:

uv run strata-server --host 127.0.0.1 --port 8765

Then open examples/pandas_basics from the Strata home page.

Try this

  1. Run all cells top-to-bottom.
  2. Edit create_data (for example, change a price).
  3. Watch cells 2-7 turn stale automatically.
  4. Run cell 7. Strata re-executes only the cells that need it — you should see cache hits reported for any intermediate cell whose inputs didn't actually change.

Create sample sales dataset

kind python

# @name Create sample sales dataset
import numpy as np
import pandas as pd

np.random.seed(42)
n = 200

sales = pd.DataFrame(
    {
        "date": pd.date_range("2025-01-01", periods=n, freq="D"),
        "region": np.random.choice(["North", "South", "East", "West"], n),
        "product": np.random.choice(["Widget", "Gadget", "Doohickey"], n),
        "units": np.random.randint(1, 50, n),
        "price": np.round(np.random.uniform(5, 100, n), 2),
    }
)

print(f"Created {len(sales)} rows")
sales.head(10)

Filter high-value orders

kind python

# @name Filter high-value orders
# Filter to high-value orders (units > 20 and price > 30).
high_value = sales[(sales["units"] > 20) & (sales["price"] > 30)].copy()

print(
    f"High-value orders: {len(high_value)} / {len(sales)} ({100 * len(high_value) / len(sales):.1f}%)"
)
high_value.head()

Add computed columns

kind python

# @name Add computed columns
sales["revenue"] = sales["units"] * sales["price"]
sales["month"] = sales["date"].dt.to_period("M").astype(str)

print(f"Total revenue: ${sales['revenue'].sum():,.2f}")
sales[["date", "region", "product", "units", "price", "revenue", "month"]].head()

Revenue by region and product

kind python

# @name Revenue by region and product
region_product = (
    sales.groupby(["region", "product"])["revenue"]
    .agg(["sum", "mean", "count"])
    .round(2)
    .sort_values("sum", ascending=False)
)

print(region_product)

Monthly revenue pivot

kind python

# @name Monthly revenue pivot
monthly_pivot = sales.pivot_table(
    values="revenue",
    index="month",
    columns="region",
    aggfunc="sum",
).round(2)

print(monthly_pivot)

Merge with region metadata

kind python

# @name Merge with region metadata
region_info = pd.DataFrame(
    {
        "region": ["North", "South", "East", "West"],
        "manager": ["Alice", "Bob", "Carol", "Dave"],
        "target_revenue": [50000, 45000, 55000, 40000],
    }
)

region_summary = (
    sales.groupby("region")["revenue"].sum().reset_index().merge(region_info, on="region")
)

region_summary["pct_of_target"] = (
    100 * region_summary["revenue"] / region_summary["target_revenue"]
).round(1)

print(region_summary)

Top product per region

kind python

# @name Top product per region
# Top product by revenue in each region.
top_products = (
    sales.groupby(["region", "product"])["revenue"]
    .sum()
    .reset_index()
    .sort_values("revenue", ascending=False)
    .drop_duplicates(subset="region")
    .sort_values("region")
)

print("Top product by revenue per region:")
print(top_products.to_string(index=False))