Titanic — feature engineering + model comparison

A canonical "first Kaggle notebook" rewritten for Strata. Shows a feature-engineering stage that fans out into two model trainers whose metrics get compared in a final cell.

What it shows

  • Shared upstream with two downstream branches — features feeds both train_model and a second branch (implicit via compare), so editing features invalidates both branches.
  • Ordered display outputs — explore produces multiple charts side-by-side, each rendered as a separate display output.
  • Typed primitives round-trip — accuracy floats, confusion matrices, feature importances all flow through the artifact store without pickle.

Cells

Cell         What it does
load_data    Loads the Titanic CSV via seaborn's load_dataset.
explore      Survival rate by sex, class, age bucket.
features     Engineered features (family size, title from name, age bins).
train_model  Fits a random forest.
evaluate     Accuracy, precision, recall, confusion matrix.
compare      Side-by-side metrics for the trained model vs a baseline.

Running

From the project root:

uv run strata-server --host 127.0.0.1 --port 8765

Then open examples/titanic_ml from the Strata home page.

Try this

  • Add a feature in features (e.g. fare_per_person); train_model and evaluate go stale, but the exploration charts stay ready. A sketch of this tweak follows this list.
  • Run evaluate alone. Strata re-runs only the part of the chain that has changed.
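
A minimal sketch of that first tweak, building on the features cell further down (the fare_per_person name and its divisor are illustrative, not part of the example):

# Hypothetical addition to the features cell, before X = clean[feature_cols] is built.
clean["fare_per_person"] = clean["fare"] / (clean["sibsp"] + clean["parch"] + 1)
feature_cols = feature_cols + ["fare_per_person"]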

Load Titanic dataset

kind python

# @name Load Titanic dataset
# Loaded through seaborn's dataset helper (downloaded on first use, then cached locally).
import seaborn as sns

df = sns.load_dataset("titanic")

print(f"Loaded {len(df)} passengers")
print(f"Survival rate: {df['survived'].mean():.1%}")
print(f"\nMissing values:\n{df.isnull().sum()[df.isnull().sum() > 0]}")
df.head()

Survival rates by class and sex

kind python

# @name Survival rates by class and sex
survival_rates = df.groupby(["pclass", "sex"])["survived"].mean().round(3)
print(survival_rates)
print(f"\nOverall: {df['survived'].mean():.3f}")

Feature engineering

kind python

# @name Feature engineering
from sklearn.model_selection import train_test_split

feature_cols = ["pclass", "sex", "age", "sibsp", "parch", "fare"]
clean = df[feature_cols + ["survived"]].dropna().copy()

# Encode sex as numeric
clean["sex"] = clean["sex"].map({"male": 0, "female": 1})

X = clean[feature_cols]
y = clean["survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"Features: {feature_cols}")
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
print(f"Dropped {len(df) - len(clean)} rows with missing values")

Model the survivors

We've engineered features from the raw passenger data. Now train three classifiers (logistic regression, random forest, gradient boosting), evaluate the best one, and inspect its feature importances.

Each cell runs once and its outputs get cached. Tweaking the features cell invalidates all three downstream cells: the provenance hash on the training cell changes because its X_train input changed, and the invalidation ripples on to evaluation and the importance plot.
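
Strata's hashing scheme isn't spelled out in this example, but as a toy illustration of the idea (not Strata's implementation; every name below is made up), a cell's provenance hash can digest the cell's own source together with the hashes of its inputs, so an upstream edit ripples into every downstream hash:

import hashlib

def provenance_hash(cell_source: str, input_hashes: list[str]) -> str:
    # Digest the cell's code plus the digests of everything it consumes.
    h = hashlib.sha256(cell_source.encode())
    for ih in sorted(input_hashes):
        h.update(ih.encode())
    return h.hexdigest()[:12]

features_hash = provenance_hash("engineer features ...", [])
train_hash = provenance_hash("fit classifiers ...", [features_hash])
# Editing the features source changes features_hash, and with it train_hash,
# which is what marks the training cell stale.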

Train classifiers

kind python

# @name Train classifiers
# Train logistic regression, random forest, and gradient boosting; keep all three.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[name] = {"train": train_acc, "test": test_acc, "model": model}
    print(f"{name:25s}  train={train_acc:.3f}  test={test_acc:.3f}")

Evaluate best model

kind python

# @name Evaluate best model
# Detailed evaluation of the best model.
from sklearn.metrics import classification_report

best_name = max(results, key=lambda k: results[k]["test"])
best_model = results[best_name]["model"]
y_pred = best_model.predict(X_test)

print(f"Best model: {best_name} (test accuracy: {results[best_name]['test']:.3f})\n")
print(classification_report(y_test, y_pred, target_names=["Did not survive", "Survived"]))
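
The Cells table also lists a confusion matrix for the evaluate step. A minimal follow-on using the same predictions:

import pandas as pd
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(pd.DataFrame(cm,
                   index=["actual: died", "actual: survived"],
                   columns=["pred: died", "pred: survived"]))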

Feature importance from the best model

kind python

# @name Feature importance from the best model
import matplotlib
import pandas as pd

# Select the non-interactive Agg backend before pyplot is imported.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

if hasattr(best_model, "feature_importances_"):
    importance = pd.Series(best_model.feature_importances_, index=feature_cols).sort_values(
        ascending=True
    )

    fig, ax = plt.subplots(figsize=(8, 4))
    importance.plot.barh(ax=ax, color="#89b4fa")
    ax.set_title(f"Feature Importance ({best_name})")
    ax.set_xlabel("Importance")
    plt.tight_layout()
    plt.savefig("/tmp/titanic_importance.png", dpi=100)
    print("Saved feature importance to /tmp/titanic_importance.png")
    print(importance.sort_values(ascending=False))
else:
    print(f"{best_name} does not expose feature_importances_")