Pandas Basics — the core DataFrame operations¶
A guided tour of the pandas operations you reach for every day, split into one cell per concept. Running cell N re-uses the artifact from cell N-1, so edits stay fast and the DAG stays honest.
What it shows¶
- A linear chain where each cell reads the previous cell's output.
- Cache-hit behavior: re-running a cell whose upstream hasn't changed finishes in a few milliseconds.
- How Strata's staleness propagation works — edit cell 2 and cells 3-7 turn yellow automatically.
Cells¶
| Cell | What it does |
|---|---|
| `create_data` | Builds a small sales DataFrame as the root of the chain. |
| `select_filter` | Column selection + boolean indexing. |
| `add_columns` | Derived columns (e.g. `total = price * quantity`). |
| `groupby` | `groupby` + aggregate. |
| `pivot` | Pivot from long to wide. |
| `merge` | Join two DataFrames. |
| `summary` | `describe` and basic stats. |
Running¶
From the project root:
Then open examples/pandas_basics from the Strata home page.
Try this¶
- Run all cells top-to-bottom.
- Edit `create_data` (for example, change a price).
- Watch cells 2-7 turn stale automatically.
- Run cell 7. Strata re-executes only the cells that need it — you should see cache hits reported for any intermediate cell whose inputs didn't actually change.
Create sample sales dataset¶
kind python
# @name Create sample sales dataset
import numpy as np
import pandas as pd
np.random.seed(42)
n = 200
sales = pd.DataFrame(
{
"date": pd.date_range("2025-01-01", periods=n, freq="D"),
"region": np.random.choice(["North", "South", "East", "West"], n),
"product": np.random.choice(["Widget", "Gadget", "Doohickey"], n),
"units": np.random.randint(1, 50, n),
"price": np.round(np.random.uniform(5, 100, n), 2),
}
)
print(f"Created {len(sales)} rows")
sales.head(10)
Filter high-value orders¶
kind python
# @name Filter high-value orders
# Filter to high-value orders (units > 20 and price > 30).
high_value = sales[(sales["units"] > 20) & (sales["price"] > 30)].copy()
print(
f"High-value orders: {len(high_value)} / {len(sales)} ({100 * len(high_value) / len(sales):.1f}%)"
)
high_value.head()
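As a side note (not part of the notebook), the same boolean-mask filter can be written with `DataFrame.query`, which some find easier to read for compound conditions. A minimal sketch on a synthetic frame:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 200
sales = pd.DataFrame(
    {
        "units": np.random.randint(1, 50, n),
        "price": np.round(np.random.uniform(5, 100, n), 2),
    }
)

# Boolean-mask filter, as in the cell above.
mask_result = sales[(sales["units"] > 20) & (sales["price"] > 30)]

# Equivalent filter expressed as a query string.
query_result = sales.query("units > 20 and price > 30")

assert mask_result.equals(query_result)
```

Both forms select the same rows; the mask form composes better with precomputed boolean Series, while `query` avoids repeating the frame name.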
Add computed columns¶
kind python
# @name Add computed columns
sales["revenue"] = sales["units"] * sales["price"]
sales["month"] = sales["date"].dt.to_period("M").astype(str)
print(f"Total revenue: ${sales['revenue'].sum():,.2f}")
sales[["date", "region", "product", "units", "price", "revenue", "month"]].head()
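The cell above mutates `sales` in place. An alternative worth knowing (not used in the notebook) is `DataFrame.assign`, which returns a new frame with the derived column, keeping method chains side-effect free. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"units": [2, 5], "price": [10.0, 4.0]})

# assign creates the derived column on a copy, leaving df untouched,
# which keeps long method chains readable and repeatable.
out = df.assign(revenue=lambda d: d["units"] * d["price"])
print(out)
```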
Revenue by region and product¶
kind python
# @name Revenue by region and product
region_product = (
sales.groupby(["region", "product"])["revenue"]
.agg(["sum", "mean", "count"])
.round(2)
.sort_values("sum", ascending=False)
)
print(region_product)
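The `.agg(["sum", "mean", "count"])` form above produces generically named columns. Pandas also supports named aggregation, which lets you pick the output column names directly; a minimal sketch on synthetic data (not part of the notebook chain):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(
    {
        "region": np.random.choice(["North", "South"], 50),
        "revenue": np.random.uniform(10, 100, 50),
    }
)

# Named aggregation: each keyword becomes an output column,
# mapped to a (source column, aggregation function) pair.
summary = df.groupby("region").agg(
    total=("revenue", "sum"),
    avg=("revenue", "mean"),
    orders=("revenue", "count"),
)
print(summary)
```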
Monthly revenue pivot¶
kind python
# @name Monthly revenue pivot
monthly_pivot = sales.pivot_table(
values="revenue",
index="month",
columns="region",
aggfunc="sum",
).round(2)
print(monthly_pivot)
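Two `pivot_table` options that often come up with this shape of data, shown as a hedged sketch on toy data (not part of the notebook): `fill_value` replaces missing month/region combinations, and `margins=True` appends an "All" row and column of grand totals.

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "month": ["2025-01", "2025-01", "2025-02"],
        "region": ["North", "South", "North"],
        "revenue": [100.0, 200.0, 150.0],
    }
)

# fill_value=0 replaces the missing South/2025-02 cell;
# margins=True adds row and column totals under the label "All".
wide = long_df.pivot_table(
    values="revenue",
    index="month",
    columns="region",
    aggfunc="sum",
    fill_value=0,
    margins=True,
)
print(wide)
```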
Merge with region metadata¶
kind python
# @name Merge with region metadata
region_info = pd.DataFrame(
{
"region": ["North", "South", "East", "West"],
"manager": ["Alice", "Bob", "Carol", "Dave"],
"target_revenue": [50000, 45000, 55000, 40000],
}
)
region_summary = (
sales.groupby("region")["revenue"].sum().reset_index().merge(region_info, on="region")
)
region_summary["pct_of_target"] = (
100 * region_summary["revenue"] / region_summary["target_revenue"]
).round(1)
print(region_summary)
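The merge above is an inner join on `region`, which works here because both frames cover the same four regions. When the key sets can differ, `how` and `indicator` make the behavior explicit; a minimal sketch on toy frames (not part of the notebook):

```python
import pandas as pd

left = pd.DataFrame({"region": ["North", "South"], "revenue": [100, 200]})
right = pd.DataFrame({"region": ["North", "East"], "manager": ["Alice", "Carol"]})

# how="left" keeps every row of `left` even without a match;
# indicator=True adds a _merge column recording each row's origin.
merged = left.merge(right, on="region", how="left", indicator=True)
print(merged)
```

Here South has no manager (NaN) and `_merge` reads `left_only` for it, which is a quick way to audit join coverage.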
Top product per region¶
kind python
# @name Top product per region
# Top product by revenue in each region.
top_products = (
sales.groupby(["region", "product"])["revenue"]
.sum()
.reset_index()
.sort_values("revenue", ascending=False)
.drop_duplicates(subset="region")
.sort_values("region")
)
print("Top product by revenue per region:")
print(top_products.to_string(index=False))
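The sort-then-`drop_duplicates` pattern above is one way to take the top row per group. An equivalent approach (not used in the notebook) is `idxmax` on the grouped Series; a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South"],
        "product": ["Widget", "Gadget", "Widget", "Gadget"],
        "revenue": [100.0, 250.0, 300.0, 50.0],
    }
)

totals = df.groupby(["region", "product"])["revenue"].sum()
# idxmax per region yields the (region, product) key of the
# highest-revenue entry; .loc pulls those rows back out.
idx = totals.groupby("region").idxmax()
top = totals.loc[idx.tolist()].reset_index()
print(top)
```

`idxmax` avoids the full sort and reads as "the argmax within each group", though either form is fine at this data size.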