Pandas Basics — the core DataFrame operations¶
A guided tour of the pandas operations you reach for every day, split into one cell per concept. Running cell N re-uses the artifact from cell N-1, so edits stay fast and the DAG stays honest.
What it shows¶
- A linear chain where each cell reads the previous cell's output.
- Cache-hit behavior: re-running a cell whose upstream hasn't changed finishes in a few milliseconds.
- How Strata's staleness propagation works — edit cell 2 and cells 3-7 turn yellow automatically.
Cells¶
| Cell | What it does |
|---|---|
| `create_data` | Builds a small sales DataFrame as the root of the chain. |
| `select_filter` | Column selection + boolean indexing. |
| `add_columns` | Derived columns (e.g. `total = price * quantity`). |
| `groupby` | `groupby` + aggregate. |
| `pivot` | Pivot from long to wide. |
| `merge` | Join two DataFrames. |
| `summary` | `describe` and basic stats. |
Running¶
From the project root:
Then open examples/pandas_basics from the Strata home page.
Try this¶
- Run all cells top-to-bottom.
- Edit `create_data` (for example, change a price).
- Watch cells 2-7 turn stale automatically.
- Run cell 7. Strata re-executes only the cells that need it — you should see cache hits reported for any intermediate cell whose inputs didn't actually change.
Create sample sales dataset¶
kind python
# @name Create sample sales dataset
import numpy as np
import pandas as pd
np.random.seed(42)
n = 200
sales = pd.DataFrame(
{
"date": pd.date_range("2025-01-01", periods=n, freq="D"),
"region": np.random.choice(["North", "South", "East", "West"], n),
"product": np.random.choice(["Widget", "Gadget", "Doohickey"], n),
"units": np.random.randint(1, 50, n),
"price": np.round(np.random.uniform(5, 100, n), 2),
}
)
print(f"Created {len(sales)} rows")
sales.head(10)
Filter high-value orders¶
kind python
# @name Filter high-value orders
# Filter to high-value orders (units > 20 and price > 30).
high_value = sales[(sales["units"] > 20) & (sales["price"] > 30)].copy()
print(
f"High-value orders: {len(high_value)} / {len(sales)} ({100 * len(high_value) / len(sales):.1f}%)"
)
high_value.head()
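As a side note (not part of the notebook), the same boolean-mask filter can be written with `DataFrame.query`, which some find easier to read for compound conditions. A minimal sketch on a synthetic frame:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 200
sales = pd.DataFrame(
    {
        "units": np.random.randint(1, 50, n),
        "price": np.round(np.random.uniform(5, 100, n), 2),
    }
)

# Boolean-mask filter, as in the cell above.
mask_result = sales[(sales["units"] > 20) & (sales["price"] > 30)]

# Equivalent filter expressed as a query string.
query_result = sales.query("units > 20 and price > 30")

assert mask_result.equals(query_result)
```

Both forms select the same rows; the mask form composes better with precomputed boolean Series, while `query` avoids repeating the frame name.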
Add computed columns¶
kind python
# @name Add computed columns
sales["revenue"] = sales["units"] * sales["price"]
sales["month"] = sales["date"].dt.to_period("M").astype(str)
print(f"Total revenue: ${sales['revenue'].sum():,.2f}")
sales[["date", "region", "product", "units", "price", "revenue", "month"]].head()
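The cell above mutates `sales` in place. An alternative worth knowing (not used in the notebook) is `DataFrame.assign`, which returns a new frame with the derived column, keeping method chains side-effect free. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"units": [2, 5], "price": [10.0, 4.0]})

# assign creates the derived column on a copy, leaving df untouched,
# which keeps long method chains readable and repeatable.
out = df.assign(revenue=lambda d: d["units"] * d["price"])
print(out)
```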
Revenue by region and product¶
kind python
# @name Revenue by region and product
region_product = (
sales.groupby(["region", "product"])["revenue"]
.agg(["sum", "mean", "count"])
.round(2)
.sort_values("sum", ascending=False)
)
print(region_product)
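The `.agg(["sum", "mean", "count"])` form above produces generically named columns. Pandas also supports named aggregation, which lets you pick the output column names directly; a minimal sketch on synthetic data (not part of the notebook chain):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(
    {
        "region": np.random.choice(["North", "South"], 50),
        "revenue": np.random.uniform(10, 100, 50),
    }
)

# Named aggregation: each keyword becomes an output column,
# mapped to a (source column, aggregation function) pair.
summary = df.groupby("region").agg(
    total=("revenue", "sum"),
    avg=("revenue", "mean"),
    orders=("revenue", "count"),
)
print(summary)
```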
Monthly revenue pivot¶
kind python
# @name Monthly revenue pivot
monthly_pivot = sales.pivot_table(
values="revenue",
index="month",
columns="region",
aggfunc="sum",
).round(2)
print(monthly_pivot)
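Two `pivot_table` options that often come up with this shape of data, shown as a hedged sketch on toy data (not part of the notebook): `fill_value` replaces missing month/region combinations, and `margins=True` appends an "All" row and column of grand totals.

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "month": ["2025-01", "2025-01", "2025-02"],
        "region": ["North", "South", "North"],
        "revenue": [100.0, 200.0, 150.0],
    }
)

# fill_value=0 replaces the missing South/2025-02 cell;
# margins=True adds row and column totals under the label "All".
wide = long_df.pivot_table(
    values="revenue",
    index="month",
    columns="region",
    aggfunc="sum",
    fill_value=0,
    margins=True,
)
print(wide)
```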
Merge with region metadata¶
kind python
# @name Merge with region metadata
region_info = pd.DataFrame(
{
"region": ["North", "South", "East", "West"],
"manager": ["Alice", "Bob", "Carol", "Dave"],
"target_revenue": [50000, 45000, 55000, 40000],
}
)
region_summary = (
sales.groupby("region")["revenue"].sum().reset_index().merge(region_info, on="region")
)
region_summary["pct_of_target"] = (
100 * region_summary["revenue"] / region_summary["target_revenue"]
).round(1)
print(region_summary)
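The merge above is an inner join on `region`, which works here because both frames cover the same four regions. When the key sets can differ, `how` and `indicator` make the behavior explicit; a minimal sketch on toy frames (not part of the notebook):

```python
import pandas as pd

left = pd.DataFrame({"region": ["North", "South"], "revenue": [100, 200]})
right = pd.DataFrame({"region": ["North", "East"], "manager": ["Alice", "Carol"]})

# how="left" keeps every row of `left` even without a match;
# indicator=True adds a _merge column recording each row's origin.
merged = left.merge(right, on="region", how="left", indicator=True)
print(merged)
```

Here South has no manager (NaN) and `_merge` reads `left_only` for it, which is a quick way to audit join coverage.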
Top product per region¶
kind python
# @name Top product per region
# Top product by revenue in each region.
top_products = (
sales.groupby(["region", "product"])["revenue"]
.sum()
.reset_index()
.sort_values("revenue", ascending=False)
.drop_duplicates(subset="region")
.sort_values("region")
)
print("Top product by revenue per region:")
print(top_products.to_string(index=False))
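The sort-then-`drop_duplicates` pattern above is one way to take the top row per group. An equivalent approach (not used in the notebook) is `idxmax` on the grouped Series; a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South"],
        "product": ["Widget", "Gadget", "Widget", "Gadget"],
        "revenue": [100.0, 250.0, 300.0, 50.0],
    }
)

totals = df.groupby(["region", "product"])["revenue"].sum()
# idxmax per region yields the (region, product) key of the
# highest-revenue entry; .loc pulls those rows back out.
idx = totals.groupby("region").idxmax()
top = totals.loc[idx.tolist()].reset_index()
print(top)
```

`idxmax` avoids the full sort and reads as "the argmax within each group", though either form is fine at this data size.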