Import¶
`strata import` converts a Jupyter `.ipynb` file into a Strata
notebook directory. The result is a normal Strata notebook (same
DAG analysis, same artifact store, same execution model) that you
can open in the UI, run headlessly, edit, or export.
When to use it¶
- You arrived from Jupyter and want to try Strata on a notebook you already have. One command and you're working in Strata's model (content-addressed caching, explicit DAG, distributed workers, prompt cells, …) without rewriting from scratch.
- You're picking up someone else's analysis that lives as an `.ipynb` on GitHub or Kaggle. Import it once, iterate normally.
- You're stress-testing Strata against real-world notebooks: the import + corpus-runner combo is the validation harness Strata itself uses pre-release.
The import is one-shot, not a live sync. The `.ipynb` is treated
as input; the resulting notebook directory is the source of truth
from then on. There's no "save back to ipynb"; that's a separate
planned feature.
Usage¶
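A minimal synopsis, reconstructed from the options and examples below: `strata import <notebook.ipynb> [--out <dir>]`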
Options¶
| Flag | Description |
|---|---|
| `--out <dir>` | Target notebook directory. Defaults to a sibling directory named after the `.ipynb` stem. |
Examples¶
```bash
# Side-by-side default: ./my_analysis.ipynb → ./my_analysis/
strata import my_analysis.ipynb

# Place the result elsewhere
strata import ~/Downloads/kaggle_titanic.ipynb --out ~/work/titanic

# Then open or run normally
strata-server --notebook-storage-dir ~/work   # UI
strata run ~/work/titanic                     # headless
```
REST¶
| Field | Type | Description |
|---|---|---|
| `file` | file (required) | The `.ipynb` upload. Hard cap of 50 MB per request. |
| `name` | string | Override the notebook name. Defaults to the upload's filename stem. |
| `parent_path` | string | Override the storage location. Must lie inside the configured storage root. |
Returns the same notebook state shape as `POST /v1/notebooks/create`,
plus an `import_report` field with the conversion details. Malformed
input (invalid JSON, non-nbformat structure, path traversal in
`name`, collision with an existing notebook) gets a clean 400 or 409
with `detail` set.
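As a sketch, the upload with Python's `requests`; the route and port here are hypothetical placeholders, since this page doesn't name the endpoint:

```python
import requests

# Hypothetical route/port for illustration only; check the API
# reference for the actual import endpoint.
with open("my_analysis.ipynb", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/notebooks/import",
        files={"file": f},                       # required, hard cap 50 MB
        data={"name": "titanic",                 # optional overrides
              "parent_path": "experiments"},
    )
resp.raise_for_status()
print(resp.json()["import_report"]["report_text"])
```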
What the converter does¶
Cell-by-cell, in source order¶
| Jupyter cell | Strata cell |
|---|---|
| Markdown | Markdown cell with the source verbatim |
| Code | Python cell (after magic translation, see below) |
| Raw | Skipped; counted in the report |
Variable rebinding (`df = df.dropna()`, `df = df[df.col > 0]`, …)
is a first-class pattern. Strata's DAG analyser handles read-before-
write semantics correctly: the cell appears as both a producer
and a consumer of `df`, so the upstream edge is drawn and downstream
cells see the post-mutation view.
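A self-contained sketch of the pattern (a stand-in upstream cell supplies `df`):

```python
import pandas as pd

df = pd.DataFrame({"col": [1.0, -2.0, None, 3.0]})  # stand-in upstream cell

# Imported rebinding cell: consumes the upstream `df` and produces the
# post-mutation `df` that downstream cells see.
df = df.dropna()
df = df[df.col > 0]
```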
;-display-suppression preserved¶
`df;` in Jupyter evaluates `df` but suppresses the auto-displayed
value. Strata's harness auto-displays the last bare expression too,
so the converter detects the trailing `;` (with or without an
adjacent comment) and appends a `pass` so the harness skips display.
The cell still runs.
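A before/after sketch of that transformation (comment wording is ours, not necessarily the converter's):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# The Jupyter cell ended with `df;` to suppress auto-display.
# The converter keeps the line and appends a `pass`:
df;   # trailing ';' detected (an adjacent comment is fine too)
pass  # last statement is no longer a bare expression, so nothing is displayed
```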
Magic translation table¶
The table is a single dict in `strata.notebook.jupyter_import`;
adding a row extends support. The current rows:
| Magic | Action |
|---|---|
| `%matplotlib inline`, `%matplotlib notebook` | Dropped. Strata captures figures via the display protocol. |
| `%load_ext`, `%reload_ext`, `%autoreload`, `%config`, `%colors`, `%rerun` | Dropped. |
| `%capture`, `%xmode`, `%pdb`, `%debug`, `%tb` | Dropped. |
| `%who`, `%who_ls`, `%whos`, `%lsmagic`, `%magic`, `%history`, `%alias`, `%alias_magic` | Dropped (interactive REPL inspection has no Strata equivalent). |
| `%timeit`, `%time` (line form) | Magic stripped, body kept. |
| `%pip install <pkgs>`, `!pip install <pkgs>`, `%conda install <pkgs>` | Packages captured into `pyproject.toml`. |
| `%env KEY=VAL`, `%set_env KEY=VAL` | Translated to a `# @env KEY=VAL` cell annotation. |
| `%run script.py` | Translated to an `exec(Path("script.py").read_text())` (with a self-contained `Path` import). |
| `%%bash`, `%%sh`, `%%script` | Body wrapped in `subprocess.run(..., shell=True)`. |
| `%%writefile <path>`, `%%file <path>` | Translated to `Path(<path>).write_text(<body>)`. |
| `%%timeit`, `%%time`, `%%capture` | Recurse on body as plain code. |
| `%%javascript`, `%%js`, `%%html`, `%%latex`, `%%svg`, `%%markdown` | Dropped. Strata has no equivalent renderer. |
| `%%R`, `%%ruby`, `%%perl`, `%%cython`, `%%fortran`, `%%sql` | Dropped (non-Python languages and embedded SQL aren't auto-translatable; use Strata's SQL cell type for `%%sql` content). |
| Anything unrecognized | Dropped with a `# strata: unsupported magic '<name>' dropped` comment at the original location. |
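A hypothetical sketch of how such a dict could be shaped; the names and the `(args, body)` signature are assumptions, not the actual `strata.notebook.jupyter_import` internals:

```python
# Assumed shape: magic name -> callable(args, body) returning replacement
# source, or None to drop the magic entirely.
MAGIC_TRANSLATIONS = {
    "matplotlib": lambda args, body: None,               # dropped; figures via display protocol
    "timeit":     lambda args, body: body,               # magic stripped, body kept
    "env":        lambda args, body: f"# @env {args}",   # cell annotation
    "run":        lambda args, body: (
        "from pathlib import Path\n"
        f'exec(Path("{args}").read_text())'
    ),
}
```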
!shell commands¶
Auto-running shell from an untrusted notebook is a real hazard (stress-testing public Kaggle notebooks is a primary use case), so the converter is conservative:
- `!pip install pkg1 pkg2`: packages captured to deps, source line removed.
- `var = !cmd` (assignment-form shell escape): command dropped, `var` stub-bound to `[]` so downstream code still parses. Restore manually with `subprocess.run(...)` if the shell escape matters (see the sketch below).
- Other `!cmd`: dropped with a marker comment. Surfaces in the import report's "Shell commands dropped" section.
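A sketch of what the assignment-form case might leave behind (the marker wording is our assumption):

```python
# Original cell:
#     files = !ls data/
#     !wget https://example.com/data.csv
# Converted cell (sketch):
files = []  # stub binding: shell escape `files = !ls data/` dropped
# strata: shell command dropped: !wget https://example.com/data.csv
```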
Dependencies¶
Four sources, in priority order; earlier sources shadow later ones,
so a version-pinned spec from `requirements.txt` wins over a bare
inferred-from-imports entry:
1. Sibling `requirements.txt` next to the `.ipynb`.
2. Sibling `pyproject.toml` next to the `.ipynb`.
3. `%pip install` / `!pip install` lines extracted from cells.
4. Bare imports in cell source. AST-walk each cell, collect top-level import names, filter stdlib (via `sys.stdlib_module_names`) and local sibling modules (anything that resolves to a `*.py` or `*/__init__.py` next to the notebook). Map import names to PyPI names via a small hand-maintained dict for common mismatches: `cv2 → opencv-python`, `sklearn → scikit-learn`, `PIL → Pillow`, `bs4 → beautifulsoup4`, `yaml → PyYAML`, etc. Anything not in the dict is assumed to use the same name on PyPI (right ~95% of the time). A sketch of this step follows the list.
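A minimal sketch of step 4, assuming nothing beyond the description above (the local-sibling-module filter is omitted for brevity):

```python
import ast
import sys

# Subset of the hand-maintained import-name -> PyPI-name dict described above.
PYPI_ALIASES = {
    "cv2": "opencv-python",
    "sklearn": "scikit-learn",
    "PIL": "Pillow",
    "bs4": "beautifulsoup4",
    "yaml": "PyYAML",
}

def inferred_packages(cell_sources: list[str]) -> set[str]:
    """Collect top-level import names, drop stdlib, map to PyPI names."""
    names: set[str] = set()
    for src in cell_sources:
        try:
            tree = ast.parse(src)
        except SyntaxError:
            continue  # an unparseable cell contributes nothing
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names.update(a.name.split(".")[0] for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                names.add(node.module.split(".")[0])
    names -= set(sys.stdlib_module_names)  # filter the standard library
    return {PYPI_ALIASES.get(n, n) for n in names}

print(inferred_packages(["import cv2, os", "from sklearn import svm"]))
# -> {'opencv-python', 'scikit-learn'}
```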
The combined set is deduped with PEP 503-normalized package names
(so `scikit-learn` and `scikit_learn` collapse to one entry) and
filtered to PEP 508 specifiers only: `pyproject.toml`'s
`dependencies` won't accept editable installs (`-e .`), bare URLs
(`git+https://…`), or local paths. Skipped specs land in the import
report so you can address them by hand.
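For reference, PEP 503 normalization is small enough to show inline; this is the standard rule, not Strata-specific code:

```python
import re

def normalize(name: str) -> str:
    """PEP 503: lowercase; collapse runs of '-', '_', '.' to a single '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()

assert normalize("scikit_learn") == normalize("scikit-learn") == "scikit-learn"
```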
The deps are written to the new notebook's `pyproject.toml`. The
first `uv sync` (which runs automatically when you open the notebook
in the UI, or when you invoke `strata run`) resolves them. The
importer doesn't call `uv add` itself: that would be slow, networked,
and partial-failure-prone.
The import report¶
Every import writes `<notebook_dir>/import_report.md`, and the same
content is returned inline on the REST response (`import_report.report_text`).
Sections appear only when relevant; a clean notebook with no magics, no shell, and no deps produces a short report with just the counts. Sections that surface when applicable:
- Counts: markdown / code cells, `;`-suppression instances, skipped cell types.
- Magics translated: list of every magic the converter rewrote or absorbed.
- Magics dropped: magics with no Strata equivalent; the source carries a marker comment where each one used to live.
- Shell commands dropped: `!cmd` lines (except `!pip install`) that the converter removed.
- Dependencies captured: what landed in `pyproject.toml`, ready for `uv sync` to resolve.
- Warnings: anything noteworthy; e.g. pip-only specs skipped for not being valid PEP 508.
Limitations by design¶
- No output preservation. Imported cells start blank. Running the notebook through Strata produces fresh outputs from a known environment, which is exactly the signal the corpus-runner stress test is built to validate. Pre-loaded author outputs would mask execution incompatibilities.
- No widgets. `application/vnd.jupyter.widget-view+json` outputs and `%%javascript` cells are dropped with a warning. Strata's display protocol doesn't model interactive widgets.
- No round-trip back to `.ipynb`. Tracked as a follow-up; out of scope for the import path itself.
- Kaggle hard-coded paths. Notebooks that reference `/kaggle/input/...` directly will fail on first run outside Kaggle. The path lives unchanged in the imported source; it's surfaced in the report, so rewrite it by hand.
Stress-testing with the corpus runner¶
The same `import_notebook` machinery the CLI and REST endpoint use
backs the corpus runner under `tests/notebook/test_jupyter_corpus.py`.
Five hand-crafted smoke fixtures get scored through a five-stage rubric:
- `parse`: can we read the `.ipynb` at all?
- `convert`: did `jupyter_import` produce a valid Strata notebook dir?
- `dag`: does the DAG build with no cycles / unbound references?
- `run`: does `strata run` complete with no exceptions?
- `artifact`: do leaf cells produce non-empty artifacts?
The fast tier (parse + convert + dag) runs on every PR. The full
tier (+ run + artifact) is opt-in via `STRATA_CORPUS_RUN=1`, the
same switch used to verify new corpus entries below. The runner
does a `uv sync` per fixture and a real `strata run`, so it's slow
(network-bound) and intended for nightly schedules or manual
pre-release verification.
Extended corpus + nightly CI¶
`tests/notebook/jupyter_corpus/extended.yaml` lists public `.ipynb`
URLs pinned to specific commit SHAs. The `jupyter-corpus` GitHub
Actions workflow fetches each one nightly (04:00 UTC) and runs it
through the same rubric. Failures are report-only: the workflow
uploads a JUnit XML artifact and doesn't gate other CI. The cache is
keyed by the manifest's content hash, so once a URL has been
fetched it's served from disk on subsequent runs.
Anyone can trigger the workflow from the Actions tab via
`workflow_dispatch`. Adding a new entry: pick a stable repo, pin to
a commit SHA, run `STRATA_CORPUS_RUN=1 pytest -k <name>` locally to
verify the entry clears the bar you set, then commit the manifest line.
Round-trip back to .ipynb¶
Out of scope today. Tracked as a follow-up because preserving the
import-lossy parts (Jupyter outputs, widget metadata, original
magic text) would compromise the import itself. See
`docs/internal/design-jupyter-import.md` if you're picking this up.