Titanic — feature engineering + model comparison

A canonical "first Kaggle notebook" rewritten for Strata. Shows a feature-engineering stage that fans out into two model trainers whose metrics get compared in a final cell.

What it shows

  • Shared upstream with two downstream branches — features feeds both train_model and a second branch (implicit via compare), so editing features invalidates both branches.
  • Ordered display outputs — explore produces multiple charts side-by-side, each rendered as a separate display output.
  • Typed primitives round-trip — accuracy floats, confusion matrices, feature importances all flow through the artifact store without pickle.

Cells

Cell         What it does
load_data    Loads the Titanic CSV via seaborn's load_dataset.
explore      Survival rate by sex, class, age bucket.
features     Engineered features (family size, title from name, age bins).
train_model  Fits a random forest.
evaluate     Accuracy, precision, recall, confusion matrix.
compare      Side-by-side metrics for the trained model vs a baseline.

Running

From the project root:

uv run strata-server --host 127.0.0.1 --port 8765

Then open examples/titanic_ml from the Strata home page.

Try this

  • Add a feature in features (e.g. fare_per_person); train_model and evaluate go stale, but the exploration charts stay ready. A sketch of this tweak follows this list.
  • Run evaluate alone. Strata re-runs only the part of the chain that has changed.
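
A minimal sketch of that first tweak, building on the features cell further down (the fare_per_person name and its divisor are illustrative, not part of the example):

# Hypothetical addition to the features cell, before X = clean[feature_cols] is built.
clean["fare_per_person"] = clean["fare"] / (clean["sibsp"] + clean["parch"] + 1)
feature_cols = feature_cols + ["fare_per_person"]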

Load Titanic dataset

kind python

# @name Load Titanic dataset
# Loaded through seaborn's dataset helper (downloaded on first use, then cached locally).
import seaborn as sns

df = sns.load_dataset("titanic")

print(f"Loaded {len(df)} passengers")
print(f"Survival rate: {df['survived'].mean():.1%}")
print(f"\nMissing values:\n{df.isnull().sum()[df.isnull().sum() > 0]}")
df.head()

Survival rates by class and sex

kind python

# @name Survival rates by class and sex
survival_rates = df.groupby(["pclass", "sex"])["survived"].mean().round(3)
print(survival_rates)
print(f"\nOverall: {df['survived'].mean():.3f}")

Feature engineering

kind python

# @name Feature engineering
from sklearn.model_selection import train_test_split

feature_cols = ["pclass", "sex", "age", "sibsp", "parch", "fare"]
clean = df[feature_cols + ["survived"]].dropna().copy()

# Encode sex as numeric
clean["sex"] = clean["sex"].map({"male": 0, "female": 1})

X = clean[feature_cols]
y = clean["survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"Features: {feature_cols}")
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
print(f"Dropped {len(df) - len(clean)} rows with missing values")

Model the survivors

We've engineered features from the raw passenger data. Now train three classifiers (logistic regression, random forest, gradient boosting), evaluate the best one, and inspect its feature importances.

Each cell runs once and its outputs get cached. Tweaking the features cell invalidates all three downstream cells: the provenance hash on the training cell changes because its X_train input changed, and the invalidation ripples on to evaluation and the importance plot.
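
Strata's hashing scheme isn't spelled out in this example, but as a toy illustration of the idea (not Strata's implementation; every name below is made up), a cell's provenance hash can digest the cell's own source together with the hashes of its inputs, so an upstream edit ripples into every downstream hash:

import hashlib

def provenance_hash(cell_source: str, input_hashes: list[str]) -> str:
    # Digest the cell's code plus the digests of everything it consumes.
    h = hashlib.sha256(cell_source.encode())
    for ih in sorted(input_hashes):
        h.update(ih.encode())
    return h.hexdigest()[:12]

features_hash = provenance_hash("engineer features ...", [])
train_hash = provenance_hash("fit classifiers ...", [features_hash])
# Editing the features source changes features_hash, and with it train_hash,
# which is what marks the training cell stale.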

Train classifiers

kind python

# @name Train classifiers
# Train logistic regression, random forest, and gradient boosting; keep all three.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[name] = {"train": train_acc, "test": test_acc, "model": model}
    print(f"{name:25s}  train={train_acc:.3f}  test={test_acc:.3f}")

Evaluate best model

kind python

# @name Evaluate best model
# Detailed evaluation of the best model.
from sklearn.metrics import classification_report

best_name = max(results, key=lambda k: results[k]["test"])
best_model = results[best_name]["model"]
y_pred = best_model.predict(X_test)

print(f"Best model: {best_name} (test accuracy: {results[best_name]['test']:.3f})\n")
print(classification_report(y_test, y_pred, target_names=["Did not survive", "Survived"]))
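
The Cells table also lists a confusion matrix for the evaluate step. A minimal follow-on using the same predictions:

import pandas as pd
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(pd.DataFrame(cm,
                   index=["actual: died", "actual: survived"],
                   columns=["pred: died", "pred: survived"]))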

Feature importance from the best model

kind python

# @name Feature importance from the best model
import matplotlib
import pandas as pd

# Select the non-interactive Agg backend before pyplot is imported.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

if hasattr(best_model, "feature_importances_"):
    importance = pd.Series(best_model.feature_importances_, index=feature_cols).sort_values(
        ascending=True
    )

    fig, ax = plt.subplots(figsize=(8, 4))
    importance.plot.barh(ax=ax, color="#89b4fa")
    ax.set_title(f"Feature Importance ({best_name})")
    ax.set_xlabel("Importance")
    plt.tight_layout()
    plt.savefig("/tmp/titanic_importance.png", dpi=100)
    print("Saved feature importance to /tmp/titanic_importance.png")
    print(importance.sort_values(ascending=False))
else:
    print(f"{best_name} does not expose feature_importances_")