Memory and Performance ====================== This page is about choosing between serial and parallel mode, and how to avoid running out of memory when processing big batches. If you're not sure what those words mean, the short answer is: **Leave parallel mode OFF.** It's the default. Read on if you want to know when it's safe to turn it on, or if you've hit memory problems. What "serial" and "parallel" mean here -------------------------------------- exo2micro processes one ``(sample, dye)`` combination at a time as a "task". When you have many samples and many dyes, you have many tasks. There are two ways to run them: - **Serial mode** (``parallel=False``, the default): exo2micro processes one task, then the next, then the next. Each task fully finishes before the next one starts. - **Parallel mode** (``parallel=True``): exo2micro launches multiple worker processes that each process one task at a time. If you set ``n_workers=4``, four tasks run at the same time. Parallel mode can be a lot faster when you have many tasks. But it comes with a real memory cost. Why parallel mode uses more memory ---------------------------------- Each parallel worker is a separate Python process. Each one holds its own copy of the current sample's images, padded canvases, alignment buffers, and so on. If a single sample uses 12 GB of RAM at peak, **four parallel workers use 48 GB**. Worse: when you exceed your computer's physical RAM, the operating system starts using disk as "virtual memory" (called *swapping* on Linux/Mac, *paging* on Windows). Disk is hundreds of times slower than RAM, so a run that's swapping will be vastly slower than the same run in serial mode — and may also crash the Python process entirely if it runs out completely. When to leave parallel mode OFF ------------------------------- **Leave it off if any of these are true:** - You have 16 GB of RAM or less. - You're not sure how much RAM you have. - Your images are large (typical exo2micro images are 30,000 × 25,000 pixels; one of those is about 2.3 GB on disk and even larger in memory). - You're going to use the computer for other things while the batch runs. - You only have a handful of tasks (3 or fewer). The overhead of starting worker processes usually makes serial just as fast. A specific anti-pattern: **don't** set ``parallel=True, n_workers=1`` to "try parallel mode safely". That gives you the worst of both worlds — the spawn overhead of parallel mode without any of the speed benefit, and serial mode's explicit memory cleanup doesn't run between tasks the way it does in actual serial mode. If you have RAM for only one worker, just use ``parallel=False``. When to turn parallel mode ON ----------------------------- Turn it on when **all** of these are true: - You have **5 or more** tasks to process. - You have **enough RAM** to hold multiple full-resolution images at once (see the table below). - You're not planning to use the computer for anything else during the run. How many workers ~~~~~~~~~~~~~~~~ Start conservative. Rough rule of thumb based on your total RAM: ============== ====================== Total RAM Recommended max workers ============== ====================== 8 GB or less 1 (use ``parallel=False``) 16 GB 1 (use ``parallel=False``) 32 GB 2 64 GB 4 128 GB 8 ============== ====================== Also never exceed **(your CPU core count) − 1**, so your computer stays responsive for system tasks. Checking your RAM ----------------- - **macOS**: Apple menu → About This Mac → Memory shows installed RAM. Applications → Utilities → Activity Monitor → Memory tab shows live usage. - **Windows**: Settings → System → About → Device specifications → Installed RAM. Ctrl+Shift+Esc → Task Manager → Performance tab → Memory shows live usage. Watching memory during a run ---------------------------- The first time you run a large batch in parallel, open your system's task/activity monitor and watch memory usage. If RAM usage climbs past 90%, or if your computer becomes sluggish: 1. Click the **Abort** button in the GUI (or interrupt the kernel in JupyterLab: Kernel → Interrupt Kernel). 2. Reduce ``n_workers`` and try again. 3. If even ``n_workers=2`` runs out of memory, switch to ``parallel=False``. Already-completed tasks are preserved when you abort. The pipeline saves checkpoints after each stage of each task, so re-running will pick up where you left off rather than starting over. Pre-flight resource checks (new in 2.4) --------------------------------------- Starting in 2.4, both :func:`~exo2micro.run_batch` and :meth:`SampleDye.run` run a quick resource check before any task starts. The check reads only the raw TIFF *headers* (no pixel data, fast even on networked drives), estimates the peak RAM and total disk output your batch will produce, and compares each to the available headroom on the machine. Three severity levels per resource: - **≤ 80% of available — silent.** Run proceeds normally. - **80%-100% — warning, run proceeds.** A "⚠️ HIGH" line is printed; you should consider closing other applications or reducing ``n_workers`` but the run continues. - **> 100% — hard fail.** :class:`MemoryError` (for RAM) or :class:`OSError` (for disk) is raised before any task runs. The error message includes a remediation list with concrete suggestions (reduce ``n_workers``, reduce ``pad``, switch ``checkpoint_format`` to one format only, free disk, etc.) with your current values inline. A typical successful check looks like this:: === Pre-flight resource check === RAM: estimated peak 2.8 GB vs 16.0 GB available (17%) ✓ Disk: estimated total 4.1 GB vs 412 GB free (1%) ✓ ================================= This catches the case that previously caused most "kernel dies mid-batch" reports: starting an 8-worker batch on a 32 GB machine that needs 6 GB per task. Before 2.4, you'd see the kernel die with no useful diagnostic. In 2.4, the same configuration raises ``MemoryError`` immediately with a message telling you exactly how many workers your machine can handle and why. Overriding the check ~~~~~~~~~~~~~~~~~~~~ If you know the estimate is conservative for your specific data — for example you've cleared other applications since the estimate was computed, or your samples are unusually compressible — pass ``force_run=True`` to downgrade the hard fail to a warning:: results = e2m.run_batch( samples=['CD070', 'CD063'], dyes=['SybrGld'], n_workers=8, force_run=True, ) This is not recommended for normal use. If a run that's flagged ``❌ EXCEEDS AVAILABLE`` actually does OOM-kill the Python process mid-batch, you may end up with corrupted checkpoint files (a half-written TIFF that the next run can't read), so the default behavior is to refuse the run rather than risk that. The 6× factor ~~~~~~~~~~~~~ The RAM estimate is:: peak per task ≈ (H + 2·pad) × (W + 2·pad) × 4 bytes × 6 The 6× factor reflects how many full-resolution float32 image copies coexist at the worst point of a single task (stage 2 or stage 3, where padded post + padded pre + downsampled working copies + warp output buffer + SIFT internals all live in memory simultaneously). It's a conservative estimate. If you find the check is consistently refusing batches that actually fit on your machine, the constant ``PEAK_FACTOR_PER_TASK`` at the top of the memory-diagnostics section in ``exo2micro/utils.py`` can be tuned. We expect most users won't need to touch it. Subprocess mode for low-RAM machines (new in 2.4) ------------------------------------------------- Even in serial mode, some memory can accumulate across tasks that ``gc.collect()`` between tasks can't fully reclaim: matplotlib figure state held by the pyplot module, Jupyter ``Out[]`` cell references, cv2/tifffile internal caches. If you're seeing your collaborator's kernel die partway through a serial batch even though the pre-flight check passed, the cause is likely one of these slow accumulating leaks. The fix is **subprocess mode**: run each task in a fresh Python subprocess, exited and reclaimed by the OS between tasks. :: results = e2m.run_batch( samples=['CD070', 'CD063'], dyes=['SybrGld', 'DAPI'], parallel='subprocess', ) This is a third value for the ``parallel`` argument, alongside ``False`` (serial in-process, the default) and ``True`` (multiprocessing pool). Each task runs in a fresh process. Tasks run one at a time (not concurrently — for that, use ``parallel=True``). When to use subprocess mode: - Your pre-flight check passes (per-task RAM fits) but the kernel still dies after a few tasks complete successfully. - The :class:`~exo2micro.MemoryTracker` summary (below) shows RSS climbing monotonically across tasks. - You want overnight unattended batches to be robust to wedged tasks (see ``timeout_per_task`` below). Important: subprocess mode is **not** the same as ``parallel=True, n_workers=1``. That uses :class:`multiprocessing.Pool`, which keeps a single worker process alive across every task, so leaks accumulate in it just as they do in serial mode. Subprocess mode spawns a new process *per task* and tears it down after. Subprocess mode adds ~1-2 seconds of process-spawn overhead per task. For typical exo2micro tasks that take minutes to align, this is invisible. Timeouts and OOM detection ~~~~~~~~~~~~~~~~~~~~~~~~~~ In subprocess mode you can also set ``timeout_per_task`` to abort any task that runs too long:: results = e2m.run_batch( samples=samples, dyes=dyes, parallel='subprocess', timeout_per_task=1800, # 30 minutes per task ) Recommended for unattended overnight batches so a wedged task doesn't block the rest. If a subprocess gets killed by the OS (most often SIGKILL from the kernel's OOM killer), the parent detects this and records the task as ``'error: subprocess killed (likely OOM)'`` rather than crashing the batch. The remaining tasks continue normally. Diagnosing memory issues (new in 2.4) ------------------------------------- If you've hit a memory problem and you're not sure whether it's a per-task peak overrun or an accumulating leak, the :class:`~exo2micro.MemoryTracker` class can tell you. Pass ``memory_debug=True`` to :func:`run_batch`:: results = e2m.run_batch( samples=['CD070', 'CD063'], dyes=['SybrGld', 'DAPI'], memory_debug=True, ) This prints RSS (resident set size) snapshots before and after each task, with an explicit ``gc.collect()`` pass in between:: [mem] 2.34 GB batch start [mem] 2.34 GB before CD070/SybrGld [mem] 8.91 GB after gc CD070/SybrGld [mem] 8.91 GB before CD070/DAPI [mem] 14.22 GB after gc CD070/DAPI [mem] 14.22 GB before CD063/SybrGld ... [mem] === memory summary === [mem] baseline: 2.34 GB [mem] peak: 14.22 GB (+11.88 GB) [mem] final: 14.22 GB (+11.88 GB) [mem] WARNING: final RSS is >0.5 GB above baseline. ... The pattern of those numbers tells you which problem you have: - **RSS climbs monotonically and never returns to baseline** → real leak. ``gc.collect()`` isn't recovering memory between tasks. Use subprocess mode (above) — that's the only reliable cure. - **RSS spikes during each task but returns to baseline between them** → no leak. Per-task peak just exceeds your RAM. Reduce ``n_workers``, reduce ``pad``, or close other applications. The pre-flight check tries to predict the second case before any task runs, but the tracker is what you want when you've gotten past pre-flight and still have problems. Requires the optional ``psutil`` dependency:: pip install psutil Without psutil, ``memory_debug=True`` no-ops with a one-time warning. What exo2micro does on its own to manage memory ----------------------------------------------- A few things happen automatically that you don't need to think about: - **In serial mode**, the pipeline explicitly closes all matplotlib figures and runs Python's garbage collector between tasks. This is more aggressive than relying on Python's default cleanup and is the main reason serial mode is the right choice on low-RAM machines. - **Within a task**, intermediate image data is released as soon as each pipeline stage finishes. Stage 2's alignment debug data (downsampled images used for the diagnostic plots) is dropped as soon as those plots are saved. Stage 3's warp matrices are dropped at the end of stage 4. Only the small scalar scale estimates survive into the returned result. - **All intermediate images are float32 on disk** (4 bytes per pixel) rather than float64 (8 bytes), which halves the working memory footprint without sacrificing visible precision. You shouldn't normally need to do anything to make these happen. They're built into the pipeline. They just mean that for the same hardware, exo2micro can usually process larger batches than a naive implementation could.