# Titanic — feature engineering + model comparison
A canonical "first Kaggle notebook" rewritten for Strata. Shows a feature-engineering stage that fans out into two model trainers whose metrics get compared in a final cell.
## What it shows
- Shared upstream with two downstream branches — `features` feeds both `train_model` and a second branch (implicit via `compare`), so editing `features` invalidates both branches.
- Ordered display outputs — `explore` produces multiple charts side-by-side, each rendered as a separate display output.
- Typed primitives round-trip — accuracy floats, confusion matrices, and feature importances all flow through the artifact store without pickle.
## Cells
| Cell | What it does |
|---|---|
| `load_data` | Loads the Titanic CSV (seaborn built-in). |
| `explore` | Survival rate by sex, class, and age bucket. |
| `features` | Engineered features (family size, title from name, age bins). |
| `train_model` | Fits a random forest. |
| `evaluate` | Accuracy, precision, recall, confusion matrix. |
| `compare` | Side-by-side metrics for the trained model vs a baseline. |
## Running
From the project root:
Then open `examples/titanic_ml` from the Strata home page.
## Try this
- Add a feature in `features` (e.g. `fare_per_person`). `train_model` and `evaluate` go stale but the exploration charts stay ready.
- Run `evaluate` alone. Strata re-runs only the chain that's changed.
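The suggested `fare_per_person` feature could be sketched as below. This is a hypothetical illustration on a tiny synthetic frame, not one of the example's cells; the column names follow seaborn's titanic dataset.

```python
import pandas as pd

# Hypothetical fare_per_person feature: ticket fare split across the
# travelling party (sibsp + parch + self). Rows here are synthetic.
df = pd.DataFrame({
    "fare": [7.25, 71.28, 30.0],
    "sibsp": [1, 1, 0],
    "parch": [0, 0, 2],
})
df["fare_per_person"] = df["fare"] / (df["sibsp"] + df["parch"] + 1)
print(df[["fare", "fare_per_person"]])
```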
## Load Titanic dataset
kind python
# @name Load Titanic dataset
# Fetched via seaborn's load_dataset (downloaded once, then cached locally).
import seaborn as sns
df = sns.load_dataset("titanic")
print(f"Loaded {len(df)} passengers")
print(f"Survival rate: {df['survived'].mean():.1%}")
print(f"\nMissing values:\n{df.isnull().sum()[df.isnull().sum() > 0]}")
df.head()
## Survival rates by class and sex
kind python
# @name Survival rates by class and sex
survival_rates = df.groupby(["pclass", "sex"])["survived"].mean().round(3)
print(survival_rates)
print(f"\nOverall: {df['survived'].mean():.3f}")
## Feature engineering
kind python
# @name Feature engineering
from sklearn.model_selection import train_test_split
feature_cols = ["pclass", "sex", "age", "sibsp", "parch", "fare"]
clean = df[feature_cols + ["survived"]].dropna().copy()
# Encode sex as numeric
clean["sex"] = clean["sex"].map({"male": 0, "female": 1})
X = clean[feature_cols]
y = clean["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(f"Features: {feature_cols}")
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
print(f"Dropped {len(df) - len(clean)} rows with missing values")
## Model the survivors
We've engineered features from the raw passenger data. Now train three classifiers (logistic regression, random forest, gradient boosting), evaluate the best one, and compare feature importance.
Each classifier runs once and gets cached. Tweaking the `features` cell invalidates all three downstream cells; the provenance hash on each training cell changes because its `X_train` input changed.
## Train classifiers
kind python
# @name Train classifiers
# Train logistic regression, random forest, and gradient boosting; keep all three.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
models = {
"Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
"Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
"Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[name] = {"train": train_acc, "test": test_acc, "model": model}
    print(f"{name:25s} train={train_acc:.3f} test={test_acc:.3f}")
## Evaluate best model
kind python
# @name Evaluate best model
# Detailed evaluation of the best model.
from sklearn.metrics import classification_report
best_name = max(results, key=lambda k: results[k]["test"])
best_model = results[best_name]["model"]
y_pred = best_model.predict(X_test)
print(f"Best model: {best_name} (test accuracy: {results[best_name]['test']:.3f})\n")
print(classification_report(y_test, y_pred, target_names=["Did not survive", "Survived"]))
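The `evaluate` row in the cell table also mentions a confusion matrix. A minimal sketch with `sklearn.metrics.confusion_matrix`, using toy labels standing in for `y_test` / `y_pred`:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels in place of y_test / y_pred.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)
# Wrap in a labeled frame so rows = truth, columns = prediction.
labeled = pd.DataFrame(
    cm,
    index=["true: died", "true: survived"],
    columns=["pred: died", "pred: survived"],
)
print(labeled)
```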
## Feature importance from the best model
kind python
# @name Feature importance from the best model
import matplotlib
import pandas as pd
matplotlib.use("Agg")
import matplotlib.pyplot as plt
if hasattr(best_model, "feature_importances_"):
    importance = pd.Series(
        best_model.feature_importances_, index=feature_cols
    ).sort_values(ascending=True)
    fig, ax = plt.subplots(figsize=(8, 4))
    importance.plot.barh(ax=ax, color="#89b4fa")
    ax.set_title(f"Feature Importance ({best_name})")
    ax.set_xlabel("Importance")
    plt.tight_layout()
    plt.savefig("/tmp/titanic_importance.png", dpi=100)
    print("Saved feature importance to /tmp/titanic_importance.png")
    print(importance.sort_values(ascending=False))
else:
    print(f"{best_name} does not expose feature_importances_")