Utils Module

utils.py

Shared utility functions used across the exo2micro sub-modules.

Includes:

Image I/O helpers (TIFF / FITS reading and writing with metadata)
Preprocessing (median subtraction, normalisation, padding, trimming)
Masking helpers (tissue masks, joint masks, fill holes)
Gaussian smoothing with NaN conservation
Intensity equalisation for image pairs
Display helpers (robust vmax, RGB overlays)

class exo2micro.utils.MemoryTracker(enabled=False)

Bases: object

Track resident-set-size (RSS) across pipeline tasks.

Use this to confirm whether memory is actually being released between tasks in a batch. If RSS climbs monotonically across tasks, there is a leak somewhere (matplotlib figures, Jupyter Out[] history, retained widget state, unfreed numpy temporaries). If RSS returns to baseline after the explicit gc.collect() in each task footer, then the per-task peak is simply exceeding available RAM, and the remedy is to reduce pad, use a smaller working resolution, or switch to subprocess-per-task mode (see exo2micro.parallel.run_batch_subprocess()).

When enabled=False (the default) all methods are cheap no-ops, so leaving the calls in production code costs essentially nothing.

Requires psutil to actually do anything. If psutil is missing the tracker prints a one-time warning and no-ops.

Example

>>> from exo2micro.utils import MemoryTracker
>>> tracker = MemoryTracker(enabled=True)
>>> tracker.snapshot('start')
>>> for sample, dye in tasks:
...     tracker.snapshot(f'before {sample}/{dye}')
...     SampleDye(sample, dye).run()
...     tracker.collect_and_snapshot(f'after gc {sample}/{dye}')
>>> tracker.summary()

collect_and_snapshot(label): Run gc.collect() twice then snapshot. Use between tasks.

snapshot(label): Record current RSS with a label and print it.

summary(): Print a summary table at end of batch.

class exo2micro.utils.MemoryWatchdog(min_available_gb=0.5, poll_interval_sec=5.0, verbose=False)

Bases: object

Background thread that polls available RAM and signals when it falls below a threshold.

Usage

>>> from exo2micro.utils import MemoryWatchdog
>>> wd = MemoryWatchdog(min_available_gb=0.5)
>>> wd.start()
>>> try:
...     for sample, dye in tasks:
...         wd.check_or_raise(f"before {sample}/{dye}")
...         SampleDye(sample, dye).run()
... finally:
...     wd.stop()

The watchdog itself doesn’t interrupt anything — it just sets a flag. The caller must poll via check_or_raise() at safe abort points (typically stage boundaries) to actually halt.
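The flag-and-poll pattern described above can be sketched with threading.Event. This is a minimal illustration under stated assumptions, not the exo2micro implementation: FlagWatchdog and its probe argument are hypothetical names, and the probe callable stands in for psutil.virtual_memory().available so that the sketch runs without psutil.

```python
import threading


class FlagWatchdog:
    """Hypothetical stand-in for MemoryWatchdog: polls a probe and sets a flag."""

    def __init__(self, probe, min_available_bytes, poll_interval_sec=0.01):
        self._probe = probe                    # callable returning available bytes
        self._threshold = min_available_bytes
        self._interval = poll_interval_sec
        self._tripped = threading.Event()      # set once the threshold is crossed
        self._stop = threading.Event()         # asks the polling thread to exit
        self._thread = None

    def _poll(self):
        # Event.wait doubles as an interruptible sleep between polls.
        while not self._stop.is_set():
            if self._probe() < self._threshold:
                self._tripped.set()
            self._stop.wait(self._interval)

    def start(self):
        self._stop.clear()
        self._thread = threading.Thread(target=self._poll, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def is_tripped(self):
        return self._tripped.is_set()

    def check_or_raise(self, label=""):
        # The polling thread never interrupts work; the caller picks the abort point.
        if self._tripped.is_set():
            raise MemoryError(f"low memory detected ({label})")
```

Because the thread only sets a flag, long-running stages are never interrupted halfway through; the cost is that a trip is only noticed at the next check_or_raise() call.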

Parameters:

min_available_gb (float) – Threshold in gibibytes. When psutil.virtual_memory().available drops below this, the watchdog trips. Default 0.5.
poll_interval_sec (float) – How often to poll RAM, in seconds. Default 5.0. Polling too fast wastes CPU; polling too slowly misses fast-growing leaks.
verbose (bool) – If True, prints a warning line each time RAM drops below the threshold. Default False (silent until tripped).

check_or_raise(label='')

Raise MemoryError if the watchdog is tripped.

Call at safe abort points (e.g. between pipeline stages). label is included in the error message to identify where the abort happened.

is_tripped(): Return True if the watchdog has crossed its threshold.

reset(): Clear the tripped flag (e.g. after handling a low-RAM event).

start(): Start the background polling thread.

stop(): Stop the background polling thread.

class exo2micro.utils.TeeStdout(log_path)

Bases: object

File-like object that writes to both stdout and a log file.

Used as a context manager during pipeline runs to capture every line of pipeline output (including text from inside library functions that use print() directly) into the persistent run log, without disturbing the normal stdout flow that the GUI’s widgets.Output context manager captures.

Usage:

with TeeStdout(log_path):
    with widget_output:
        run.run()  # all prints go to widget AND log file

Failures writing to the file are silently swallowed; the underlying stdout writes always succeed.
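A tee of this kind can be sketched in a few lines. The class below is a hypothetical illustration (TeeSketch is not part of exo2micro) and assumes the simplest design: remember whatever sys.stdout was on entry, forward every write there first, and treat the log file as best-effort.

```python
import sys


class TeeSketch:
    """Hypothetical tee: forwards writes to the current stdout and a log file."""

    def __init__(self, log_path):
        self._log_path = log_path
        self._stdout = None
        self._log = None

    def __enter__(self):
        self._stdout = sys.stdout
        try:
            self._log = open(self._log_path, "a", encoding="utf-8")
        except OSError:
            self._log = None              # logging is best-effort
        sys.stdout = self
        return self

    def __exit__(self, exc_type, exc, tb):
        sys.stdout = self._stdout         # always restore the previous stream
        if self._log is not None:
            self._log.close()
        return False                      # do not swallow exceptions from the body

    def write(self, data):
        n = self._stdout.write(data)      # the stdout write happens first
        if self._log is not None:
            try:
                self._log.write(data)
            except (OSError, ValueError):
                pass                      # file problems are swallowed
        return n

    def flush(self):
        self._stdout.flush()
        if self._log is not None:
            try:
                self._log.flush()
            except (OSError, ValueError):
                pass
```

Writing to stdout before touching the file means a failing disk can never suppress console output, which matches the guarantee stated above.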

flush()

write(data)

exo2micro.utils.append_to_run_log(output_dir, message)

Append a line to the persistent run log.

Failures (e.g. permission errors, missing directory) are silently ignored — the log is a best-effort persistence aid, not a critical path.

exo2micro.utils.build_clean_tissue_mask(post, pre)

Build a clean tissue mask using binary_fill_holes only (no dilation).

This is the mask used for residual histograms and as the base for the signal-only fitting mask.

Parameters:

post (ndarray) – Post-stain image (float).
pre (ndarray) – Pre-stain aligned image (float).

Returns:

tissue_mask

Return type:

ndarray of bool

exo2micro.utils.build_tissue_mask(post_im, pre_im, signal_threshold=0, dilation_iters=50)

Build a joint tissue mask from post-stain and pre-stain images.

Each image is independently thresholded, dilated, and hole-filled, then the two masks are intersected.

Parameters:

post_im (ndarray) – Post-stain image (2D).
pre_im (ndarray) – Pre-stain aligned image (2D).
signal_threshold (float) – Pixels <= this are excluded (default 0).
dilation_iters (int) – Morphological dilation iterations (default 50).

Returns:

joint_mask – Intersection of post and pre tissue masks.

Return type:

ndarray of bool

exo2micro.utils.checkpoint_exists(filepath)

Check whether a checkpoint file exists.

Parameters:

filepath (str) – Base filepath WITHOUT extension.

Return type:

bool

exo2micro.utils.classify_raw_files(sample_dir)

Classify TIFF files in a sample directory by stain type and dye.

Filename rules

A valid raw image filename must:

End with .tif or .tiff (case-insensitive).
Contain pre or post (case-insensitive) somewhere in the basename, marking it as a pre-stain or post-stain image.
End with _<DyeName>.tif (or .tiff), where <DyeName> is the dye identifier and contains no underscores. The dye name is the substring between the last underscore and the extension.

Examples of valid filenames:

Sample001_PreStain_SybrGld.tif        -> pre,  dye=SybrGld
Sample001_PostStain_SybrGld.tiff      -> post, dye=SybrGld
my_2024_pre_run3_DAPI.tif             -> pre,  dye=DAPI
whatever_post_Cy5.tiff                -> post, dye=Cy5

Examples of invalid filenames (will be flagged in warnings):

Sample001_PreStain_SybrGld_microbe.tif – dye name contains an underscore. Would be parsed as dye microbe. Rename SybrGld_microbe to SybrGldmicrobe or similar.
Sample001_pre_post_SybrGld.tif – contains both pre and post. Ambiguous, skipped.
Sample001_SybrGld.tif – contains neither pre nor post. Cannot be classified, skipped.

This function is non-raising: it returns whatever it could parse plus a list of human-readable warnings about anything it couldn’t. Callers that need to fail hard on missing or duplicate pairs (e.g. load_image_pair()) should check the returned structures themselves.
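The filename rules above can be expressed as a small parser for a single name. This is a sketch, not the library's code: classify_name is a hypothetical helper, and it assumes that pre and post are searched for anywhere in the basename (including the dye part), which is one reading of the rules.

```python
import os


def classify_name(filename):
    """Return (stain, dye) for a valid name, or (None, reason) otherwise."""
    stem, ext = os.path.splitext(os.path.basename(filename))
    if ext.lower() not in (".tif", ".tiff"):
        return None, "not a TIFF file"
    if "_" not in stem:
        return None, "no underscore before the dye name"
    dye = stem.rsplit("_", 1)[1]          # text after the last underscore
    if not dye:
        return None, "empty dye name"
    lower = stem.lower()
    has_pre, has_post = "pre" in lower, "post" in lower
    if has_pre and has_post:
        return None, "contains both pre and post"
    if not (has_pre or has_post):
        return None, "contains neither pre nor post"
    return ("pre" if has_pre else "post"), dye


for name in ("Sample001_PreStain_SybrGld.tif",
             "Sample001_pre_post_SybrGld.tif",
             "Sample001_SybrGld.tif"):
    print(name, "->", classify_name(name))
```

Note that a name such as Sample001_PreStain_SybrGld_microbe.tif is not rejected by this sketch; it parses with dye microbe, which is exactly the silent mis-parse the rules warn about.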

Parameters:

sample_dir (str) – Path to a single sample’s directory.

Returns:

pairs (dict) – Maps each detected dye name to a dict of candidate file paths:

{
    'SybrGld': {'pre': ['/.../...PreStain_SybrGld.tif'],
                'post': ['/.../...PostStain_SybrGld.tif']},
    'DAPI':    {'pre': ['/.../...Pre_DAPI.tiff'],
                'post': ['/.../...Post_DAPI.tiff']},
}

Each list may have 0, 1, or many entries. Callers decide what to do about duplicates and missing sides.

warnings (list of str) – Human-readable problem descriptions for individual files that couldn’t be classified. One entry per problematic file.

exo2micro.utils.clear_run_log(output_dir): Delete the persistent run log file, if it exists.

exo2micro.utils.diagnose_raw_layout(raw_dir='raw')

Diagnose the layout of a raw image directory and return a structured report.

Catches the common “I don’t see any images” failure modes before the pipeline gets a chance to fail confusingly downstream:

raw_dir doesn’t exist at all.
raw_dir exists but is empty.
raw_dir contains TIFF files directly (no per-sample folders). This is the most common mistake — users dump all their files in one place instead of separating by sample.
raw_dir contains subdirectories but none of them contain any TIFF files.

When the layout looks correct, returns ok=True and a short informational summary. When something is wrong, returns ok=False and a multi-line message explaining what’s wrong and how the directory should be structured.

Parameters:

raw_dir (str) – Path to the raw image directory (default 'raw').

Returns:

report – Keys:

ok (bool): True if the layout looks usable.
message (str): Human-readable multi-line message. Empty string when ok=True and there’s nothing to report.
raw_dir (str): The directory that was inspected.
exists (bool): Whether raw_dir itself exists.
subdirs (list of str): Subdirectory names found (sorted).
loose_tiffs (list of str): TIFF filenames found directly in raw_dir (sorted). Non-empty implies the layout is wrong even if subdirs is also non-empty.
empty_subdirs (list of str): Subdirectory names that contain no TIFF files (sorted). Informational only.

Return type:

dict
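To make the report structure concrete, the sketch below rebuilds the same keys with os.listdir. It is an illustration only: diagnose_layout_sketch is a hypothetical name, the messages are invented placeholders, and the real function's wording and edge-case handling may differ.

```python
import os


def _is_tiff(name):
    return name.lower().endswith((".tif", ".tiff"))


def diagnose_layout_sketch(raw_dir="raw"):
    """Hypothetical re-creation of the report structure described above."""
    report = {"ok": False, "message": "", "raw_dir": raw_dir, "exists": False,
              "subdirs": [], "loose_tiffs": [], "empty_subdirs": []}
    if not os.path.isdir(raw_dir):
        report["message"] = f"{raw_dir!r} does not exist."
        return report
    report["exists"] = True
    for name in sorted(os.listdir(raw_dir)):
        path = os.path.join(raw_dir, name)
        if os.path.isdir(path):
            report["subdirs"].append(name)
            if not any(_is_tiff(f) for f in os.listdir(path)):
                report["empty_subdirs"].append(name)
        elif _is_tiff(name):
            report["loose_tiffs"].append(name)
    if report["loose_tiffs"]:
        # Loose TIFFs make the layout wrong even when subdirectories exist.
        report["message"] = "TIFF files found directly in raw_dir; move them into per-sample folders."
    elif not report["subdirs"]:
        report["message"] = f"{raw_dir!r} contains no sample subdirectories."
    elif len(report["empty_subdirs"]) == len(report["subdirs"]):
        report["message"] = "No subdirectory contains any TIFF files."
    else:
        report["ok"] = True
    return report
```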

exo2micro.utils.discover_tasks(samples, dyes, raw_dir='raw')

Resolve a (samples, dyes) request into the actual list of tasks present on disk.

Given a list of sample names and a list of dye names the user wants to process, walk each sample directory and return:

present: list of (sample, dye) tuples that have both a pre-stain and a post-stain file. These are runnable.
skipped: list of (sample, dye, reason) tuples that were requested but can’t run, with a short human-readable reason.
warnings: list of (sample, warning_str) tuples for filename problems encountered along the way (ambiguous, no underscore, etc. — same wording as classify_raw_files() returns).

This is the canonical “what tasks should we actually run?” helper. Both the batch processor (exo2micro.parallel.build_task_list()) and the GUI use it so they share one source of truth.

Parameters:

samples (list of str) – Sample names requested by the user.
dyes (list of str) – Dye names requested by the user.
raw_dir (str) – Root directory containing per-sample subdirectories (default 'raw').

Returns:

result – Keys: present (list of (sample, dye)), skipped (list of (sample, dye, reason)), warnings (list of (sample, warning_str)), and layout_ok (bool — False if diagnose_raw_layout() flagged a fatal layout problem; in that case present and skipped will both be empty and a single layout warning is added to warnings).

Return type:

dict

exo2micro.utils.equalize_pair(post, pre)

Intensity-equalize a pair of images for registration.

Histogram-matches the pre-stain image’s intensity distribution to the post-stain’s, then jointly normalises both to [0, 1] using the shared 99th percentile of post-stain nonzero pixels.

Parameters:

post (ndarray) – Post-stain image as float32 (2D).
pre (ndarray) – Pre-stain image as float32 (2D).

Returns:

post_eq (ndarray) – Post-stain normalised to [0, 1].
pre_eq (ndarray) – Pre-stain histogram-matched and normalised.

exo2micro.utils.estimate_gauss_sigma(im, down_scale, sparse_threshold=0.1, sparse_sigma=5, dense_sigma=0)

Estimate an appropriate Gaussian pre-smoothing sigma for ECC registration based on image density.

Parameters:

im (ndarray) – Full-resolution image (2D).
down_scale (float) – Downsample factor that will be applied before ECC.
sparse_threshold (float) – Nonzero pixel fraction below which the image is sparse (default 0.1).
sparse_sigma (float) – Sigma for sparse images at downsampled resolution (default 5).
dense_sigma (float) – Sigma for dense images; 0 disables smoothing (default 0).

Returns:

Recommended gauss_sigma value.

Return type:

float

exo2micro.utils.estimate_pipeline_memory(sample_dye_pairs, raw_dir='raw', pad=2000, n_workers=1)

Estimate peak RAM required to process a list of (sample, dye) pairs.

Reads only TIFF headers (no pixel data) to get raw image dimensions, inflates by the padding, multiplies by float32 bytes/pixel and by PEAK_FACTOR_PER_TASK to account for the several full-resolution arrays that coexist at peak. For parallel runs, multiplies by n_workers.

The estimate is the worst-case peak across tasks, not the sum, because tasks run sequentially in serial mode (only one task in memory at a time) and concurrently in parallel mode (n_workers tasks at a time, but each could be the worst one).
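The arithmetic described above can be written out as follows. This is a sketch under assumptions: the value of PEAK_FACTOR below is a placeholder (the real PEAK_FACTOR_PER_TASK is not stated here), image shapes are passed in directly rather than read from TIFF headers, and padding is taken as pad pixels on every side.

```python
FLOAT32_BYTES = 4
PEAK_FACTOR = 6   # ASSUMPTION: placeholder, not the real PEAK_FACTOR_PER_TASK


def task_peak_bytes(height, width, pad, peak_factor=PEAK_FACTOR):
    """Peak bytes for one task: padded pixels x 4 bytes x coexisting arrays."""
    padded_pixels = (height + 2 * pad) * (width + 2 * pad)
    return padded_pixels * FLOAT32_BYTES * peak_factor


def estimate_peak(shapes, pad=2000, n_workers=1):
    """shapes is a list of (height, width) tuples, one per (sample, dye) pair."""
    per_task = [task_peak_bytes(h, w, pad) for h, w in shapes]
    peak = max(per_task, default=0)            # worst case, not the sum
    return {"per_task_bytes": per_task,
            "peak_bytes": peak,
            "concurrent_peak_bytes": peak * n_workers}


print(estimate_peak([(1000, 1000), (2000, 1000)], pad=100, n_workers=2))
```

The padding term is why pad matters so much: a 1000 x 1000 image padded by 2000 on each side becomes 5000 x 5000, a 25-fold increase in pixel count.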

Parameters:

sample_dye_pairs (list of tuple) – (sample, dye) combinations.
raw_dir (str) – Root raw image directory.
pad (int) – Padding parameter (default 2000). Larger padding inflates the memory estimate substantially.
n_workers (int) – Number of concurrent worker processes. 1 for serial.

Returns:

estimate – Keys: peak_bytes (int, worst-case single-task peak), concurrent_peak_bytes (int, that times n_workers), per_task_bytes (list of int, one per pair), warnings (list of str), and n_resolvable (int).

Return type:

dict

exo2micro.utils.estimate_pipeline_output_size(sample_dye_pairs, raw_dir='raw', pad=2000, save_all_intermediates=False, n_scale_methods=1, checkpoint_format='tiff')

Estimate the on-disk footprint of a pipeline run.

Returns a best-effort estimate of how much disk space the pipeline will consume if run on the given (sample, dye) combinations with the given parameters. Used by the GUI to pre-warn users when the estimate would exceed available disk space.

The estimate is based on the raw TIFF dimensions: exo2micro pads each raw image by pad pixels on every side, converts to float32 (4 bytes per pixel), and saves intermediates at each pipeline stage. Approximate breakdown per (sample, dye):

Stage 1: padded post + padded pre (2 files, float32)
Stage 2: ICP-aligned pre (1 file); +coarse-aligned pre if save_all_intermediates=True
Stage 3: interior-aligned pre (1 file)
Stage 4: difference image (n_scale_methods files, one per active scale method: Moffat-only = 1, Moffat+percentile = 2, Moffat+manual = 2, all three = 3)

Each intermediate can be written as TIFF, FITS, or both depending on checkpoint_format. TIFF-only and FITS-only runs use roughly half the disk space of 'both'. Diagnostic PNG plots add a small fixed overhead (~10 MB per (sample, dye) regardless).
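The per-stage breakdown above amounts to counting files and multiplying by the padded float32 image size. The sketch below follows that breakdown; the function names are hypothetical, the 10 MB PNG overhead is taken as 10,000,000 bytes for simplicity, and compression or header sizes are ignored.

```python
FLOAT32_BYTES = 4
PNG_OVERHEAD_BYTES = 10_000_000   # rough fixed overhead for diagnostic plots


def files_per_task(n_scale_methods=1, save_all_intermediates=False):
    stage1 = 2                                    # padded post + padded pre
    stage2 = 2 if save_all_intermediates else 1   # ICP-aligned (+ coarse-aligned)
    stage3 = 1                                    # interior-aligned pre
    stage4 = n_scale_methods                      # one difference image per method
    return stage1 + stage2 + stage3 + stage4


def task_output_bytes(height, width, pad=2000, n_scale_methods=1,
                      save_all_intermediates=False, checkpoint_format="tiff"):
    if checkpoint_format not in ("tiff", "fits", "both"):
        raise ValueError(f"unknown checkpoint_format: {checkpoint_format!r}")
    image_bytes = (height + 2 * pad) * (width + 2 * pad) * FLOAT32_BYTES
    copies = 2 if checkpoint_format == "both" else 1   # 'both' doubles every file
    n_files = files_per_task(n_scale_methods, save_all_intermediates)
    return image_bytes * n_files * copies + PNG_OVERHEAD_BYTES
```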

Parameters:

sample_dye_pairs (list of tuple) – List of (sample, dye) combinations to estimate.
raw_dir (str) – Root raw image directory (default 'raw').
pad (int) – Padding value (default 2000).
save_all_intermediates (bool) – If True, adds the stage-2 coarse intermediate to the estimate.
n_scale_methods (int) – How many difference images stage 4 will produce (1-3).
checkpoint_format ({'tiff', 'fits', 'both'}) – Which file format(s) each checkpoint gets written as. TIFF and FITS are roughly the same size on disk; 'both' doubles the per-file footprint.

Returns:

estimate –

{'bytes_per_task': [list], 'total_bytes': int, 'n_tasks': int, 'n_resolvable': int, 'warnings': [list]}

Return type:

dict

exo2micro.utils.filter_nan_gaussian_conserving(arr, sigma)

Apply a Gaussian smooth to an array that may contain NaNs, conserving total intensity. NaN positions remain NaN in the output.

Parameters:

arr (ndarray) – Input 2D array, may contain NaNs.
sigma (float) – Gaussian smoothing sigma in pixels.

Returns:

Smoothed array with NaNs preserved.

Return type:

ndarray

exo2micro.utils.format_bytes(n): Format a byte count as a human-readable string.

exo2micro.utils.get_available_memory(): Return available RAM in bytes, or None if psutil unavailable.

exo2micro.utils.get_free_disk_space(path): Return free disk space at path in bytes.

exo2micro.utils.get_run_log_path(output_dir)

Return the path to the persistent run log file.

The log lives at {output_dir}/.exo2micro_run_log.txt. The leading dot keeps it out of casual file listings since it’s mostly for recovery/debugging, not regular browsing.
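As an illustration of the best-effort logging described here and in append_to_run_log(), the sketch below builds the log path and appends a line while ignoring file errors. The helper names are hypothetical; only the log file name is taken from the text above.

```python
import os

LOG_NAME = ".exo2micro_run_log.txt"   # file name as documented above


def run_log_path(output_dir):
    return os.path.join(output_dir, LOG_NAME)


def append_line(output_dir, message):
    """Best-effort append: permission errors or a missing directory are ignored."""
    try:
        with open(run_log_path(output_dir), "a", encoding="utf-8") as fh:
            fh.write(message.rstrip("\n") + "\n")
    except OSError:
        pass   # the log is an aid, not a critical path
```

Catching OSError covers both PermissionError and FileNotFoundError, since both are subclasses of it.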

exo2micro.utils.load_checkpoint(filepath)

Load a checkpoint image from TIFF.

Parameters:

filepath (str) – Base filepath WITHOUT extension (same as passed to save_checkpoint).

Returns:

image – The loaded image, or None if not found.

Return type:

ndarray or None

exo2micro.utils.load_image_pair(sample, dye, raw_dir='raw')

Load a pre-stain and post-stain image pair for a given sample and dye.

Automatically detects which RGB channel carries the fluorescence signal and extracts it at full 8-bit precision, rather than using PIL.Image.convert() which loses ~41% of the dynamic range.

Filename convention

Each sample directory must contain exactly one pre-stain file and exactly one post-stain file per dye, named so that:

The filename ends with .tif or .tiff (case-insensitive).
The basename contains pre or post (case-insensitive) to mark the stain type.
The basename ends with _<dye>.<ext>, where <dye> matches the dye argument and contains no underscores.

See classify_raw_files() for full details and examples.

Behaviour on problems

This function is strict: it raises rather than returning placeholder values when anything goes wrong. The exception message is multi-line and tells the user exactly what to fix.

Missing sample directory -> FileNotFoundError
No file matches the requested dye -> FileNotFoundError
Only one side of the pair found -> FileNotFoundError
Multiple pre-stain or post-stain files for the same dye -> ValueError

When other dyes in the same directory are misnamed (ambiguous, no underscore, etc.), warnings about them are printed but do not block loading the requested dye.

Parameters:

sample (str) – Sample name, e.g. 'CD070'. Must match the name of a subdirectory under raw_dir.
dye (str) – Dye name, e.g. 'SybrGld' or 'DAPI'. Must match the substring after the last underscore in the raw filenames.
raw_dir (str) – Base directory containing sample subdirectories (default 'raw').

Returns:

post_im (ndarray) – Post-stain image as a 2-D numpy array.
pre_im (ndarray) – Pre-stain image as a 2-D numpy array.
post_path (str) – Path to the post-stain file.
pre_path (str) – Path to the pre-stain file.

Raises:

FileNotFoundError – If the sample directory is missing, the requested dye has no matching files, or only one side of the pair exists.
ValueError – If the requested dye matches more than one pre-stain or post-stain file in the directory.

exo2micro.utils.make_rgb_overlay(post, pre, post_edges=None, pre_edges=None)

Build a 3-channel RGB overlay for alignment assessment.

Post-stain in Red, pre-stain in Green. Overlap appears yellow. Optional boundary edges drawn in cyan (post) and magenta (pre).

Parameters:

post (ndarray) – Post-stain image (float32, 2D).
pre (ndarray) – Pre-stain image (float32, 2D).
post_edges (ndarray or None) – Post-stain boundary ring.
pre_edges (ndarray or None) – Pre-stain boundary ring.

Returns:

rgb – uint8 array of shape (H, W, 3).

Return type:

ndarray

exo2micro.utils.normalize_image(image, norm_percentile=None)

Normalize an image to its maximum or to a specified percentile value.

Parameters:

image (ndarray) – 2D image array.
norm_percentile (float or None) – If None, normalize by the image maximum. Otherwise normalize by this percentile value.

Returns:

Normalized image.

Return type:

ndarray

exo2micro.utils.pad_images(post_im, pre_im, pad=50)

Pad two images with zeros onto a common canvas plus a border.

The extra border gives the registration algorithm room to shift the pre-stain image without it falling off the canvas edge.

Parameters:

post_im (ndarray) – Post-stain image (2D).
pre_im (ndarray) – Pre-stain image (2D).
pad (int) – Number of zero-padding pixels on each side (default 50).

Returns:

post_im_pad (ndarray) – Zero-padded post-stain image.
pre_im_pad (ndarray) – Zero-padded pre-stain image on the same canvas.
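A common-canvas zero pad can be sketched with NumPy as below. The placement of each image on the canvas is an assumption: this sketch anchors both images at the top-left corner of the shared interior, whereas the real pad_images() may centre them or place them differently. pad_pair is a hypothetical name.

```python
import numpy as np


def pad_pair(post_im, pre_im, pad=50):
    """Zero-pad two 2D images onto one canvas with a border of `pad` pixels."""
    rows = max(post_im.shape[0], pre_im.shape[0]) + 2 * pad
    cols = max(post_im.shape[1], pre_im.shape[1]) + 2 * pad
    padded = []
    for im in (post_im, pre_im):
        canvas = np.zeros((rows, cols), dtype=im.dtype)
        # ASSUMPTION: anchor each image at the top-left of the shared interior.
        canvas[pad:pad + im.shape[0], pad:pad + im.shape[1]] = im
        padded.append(canvas)
    return padded[0], padded[1]


post_pad, pre_pad = pad_pair(np.ones((4, 6)), np.ones((5, 3)), pad=2)
print(post_pad.shape, pre_pad.shape)
```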

exo2micro.utils.preflight_check(sample_dye_pairs, output_dir='processed', raw_dir='raw', pad=2000, n_workers=1, checkpoint_format='tiff', n_scale_methods=1, save_all_intermediates=False, force_run=False)

Combined RAM + disk pre-flight check before a batch or single run.

Estimates both peak RAM (across all concurrent tasks) and total disk output. Compares each against the available headroom on the system and either warns or raises MemoryError / OSError depending on severity.

Severity bands (each resource checked independently):

estimate ≤ 80% of available — silent.
80%-100% — print a warning, proceed.
> 100% — raise MemoryError (RAM) or OSError (disk).

Callers can pass force_run=True to downgrade the hard fail to a warning. Useful when the estimate is known to be conservative or when the user has already cleared other processes.
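The severity bands and the force_run override can be sketched as a small classifier. The function names and message text are hypothetical; only the 80% and 100% thresholds and the exception types come from the description above.

```python
def severity(estimate_bytes, available_bytes):
    """Classify one resource as 'ok', 'warn', or 'fail'."""
    if available_bytes <= 0:
        return "fail"
    ratio = estimate_bytes / available_bytes
    if ratio <= 0.8:
        return "ok"        # at most 80% of available: silent
    if ratio <= 1.0:
        return "warn"      # 80%-100%: warn and proceed
    return "fail"          # above 100%: hard fail unless forced


def check_resource(name, estimate_bytes, available_bytes, error_type,
                   force_run=False):
    level = severity(estimate_bytes, available_bytes)
    if level == "fail" and not force_run:
        raise error_type(
            f"{name}: estimated {estimate_bytes} bytes, only {available_bytes} available")
    if level != "ok":
        print(f"WARNING: {name} estimate is {estimate_bytes} of {available_bytes} bytes")
    return level


print(check_resource("RAM", 90, 100, MemoryError))                   # warns, proceeds
print(check_resource("disk", 150, 100, OSError, force_run=True))     # downgraded
```

Each resource is passed through the same classifier with its own exception type, which keeps the RAM and disk checks independent as described.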

Parameters:

sample_dye_pairs (list of tuple) – Tasks to check.
output_dir (str) – Where checkpoints will be written. Free disk space is measured at this path.
raw_dir (str) – Source raw images, needed to read TIFF dimensions.
pad (int) – Padding parameter.
n_workers (int) – Concurrent workers. 1 for serial.
checkpoint_format – Passed through to estimate_pipeline_output_size().
n_scale_methods – Passed through to estimate_pipeline_output_size().
save_all_intermediates – Passed through to estimate_pipeline_output_size().
force_run (bool) – If True, hard-fail conditions are downgraded to warnings.

Raises:

MemoryError – When the RAM estimate exceeds 100% of available and force_run=False.
OSError – When the disk estimate exceeds 100% of free space and force_run=False.

exo2micro.utils.read_run_log_tail(output_dir, max_lines=500)

Read the tail of the persistent run log.

Parameters:

output_dir (str) – Directory containing .exo2micro_run_log.txt.
max_lines (int) – Maximum number of lines to return (most recent). Reading the whole file into memory is fine for typical log sizes (~megabytes), but we cap it defensively.

Returns:

text – The last max_lines lines of the file, joined into a single string, or None if the file doesn’t exist.

Return type:

str or None
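One way to return only the last max_lines lines is a bounded collections.deque. Note that this is a swapped-in technique for illustration: the description above implies the real function reads the whole file, whereas this hypothetical read_tail streams it and keeps only the tail. It also takes the file path directly rather than output_dir.

```python
from collections import deque


def read_tail(path, max_lines=500):
    """Return the last `max_lines` lines as one string, or None if missing."""
    try:
        with open(path, encoding="utf-8", errors="replace") as fh:
            # A deque with maxlen discards older lines as newer ones arrive.
            return "".join(deque(fh, maxlen=max_lines))
    except FileNotFoundError:
        return None
```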

exo2micro.utils.robust_vmax(im, n_mad=5)

Compute a display vmax robust to bright outliers.

Uses median + n_mad * MAD over nonzero pixels.

Parameters:

im (ndarray) – 2D image array.
n_mad (float) – Number of median absolute deviations above the median (default 5).

Returns:

Robust display maximum.

Return type:

float
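The formula median + n_mad * MAD over nonzero pixels can be written with NumPy as below. This is a sketch: robust_vmax_sketch is a hypothetical name, the MAD is left unscaled (no 1.4826 normal-consistency factor), and the all-zero fallback of 0.0 is an assumption.

```python
import numpy as np


def robust_vmax_sketch(im, n_mad=5):
    """Display maximum from median + n_mad * MAD over nonzero pixels."""
    values = np.asarray(im, dtype=float)
    values = values[values != 0]                  # nonzero pixels only
    if values.size == 0:
        return 0.0                                # ASSUMPTION: fallback for empty images
    median = np.median(values)
    mad = np.median(np.abs(values - median))      # unscaled median absolute deviation
    return float(median + n_mad * mad)


image = np.array([[0, 1, 2], [3, 4, 100]])
print(robust_vmax_sketch(image))   # 8.0: the outlier at 100 barely moves the result
```

For the example image the nonzero values are 1, 2, 3, 4, 100, giving a median of 3 and a MAD of 1, so the result is 3 + 5 * 1 = 8 rather than anything near 100.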

exo2micro.utils.save_checkpoint(image, filepath, sample='', dye='', stage='', params=None, extra_headers=None)

Save an intermediate image as both TIFF and FITS, with metadata.

The TIFF is saved in the 'tiff/' subdirectory and the FITS in the 'fits/' subdirectory of the same parent.

Parameters:

image (ndarray) – 2D image array to save.
filepath (str) – Base filepath WITHOUT extension, e.g. 'processed/CD070/SybrGld_microbe/01_padded_post'. The function appends .tiff and .fits and places them in the appropriate subdirectories.
sample (str) – Sample name for FITS header.
dye (str) – Dye name for FITS header.
stage (str) – Pipeline stage name for FITS header.
params (dict or None) – Non-default parameters to record in FITS header.
extra_headers (dict or None) – Additional FITS header keywords (e.g., warp matrix elements).

exo2micro.utils.subtract_median(image, region=(0, 5000, 0, 5000))

Subtract the median background level estimated from a rectangular region.

Parameters:

image (ndarray) – 2D image array.
region (tuple of 4 ints) – (row_min, row_max, col_min, col_max) region for background estimation.

Returns:

Background-subtracted image.

Return type:

ndarray

exo2micro.utils.survey_raw_channels(raw_dir='raw', crop_size=1000)

Survey all raw TIFF files to report which RGB channels carry signal.

Reads a small centre crop from each file to avoid loading full images into memory.

Parameters:

raw_dir (str) – Root directory containing sample subdirectories (default 'raw').
crop_size (int) – Side length of the centre crop to inspect (default 1000).

Returns:

results – One entry per file with keys: 'path', 'size', 'mode', 'channels'. 'channels' is a dict mapping channel name ('R', 'G', 'B' or 'gray') to {'max': int, 'mean': float, 'nonzero': int}.

Return type:

list of dict

Notes

If raw_dir is missing, empty, or has TIFFs in the wrong place (e.g. directly in raw_dir rather than in per-sample subdirectories), this function prints a human-readable layout diagnosis via diagnose_raw_layout() and returns an empty list.

exo2micro.utils.tiff_to_fits(tiff_file, return_data=False)

Convert a three-channel RGB TIFF file to a FITS file.

Each colour channel is stored as a named image extension (RED1, GREEN2, BLUE3).

Parameters:

tiff_file (str) – Path to the source TIFF file.
return_data (bool) – If True, also return the raw TIFF array (default False).

Returns:

fits_filename (str) – Path to the generated FITS file.
tiff_data (ndarray) – Raw uint8 array of shape (H, W, 3); only if return_data=True.

exo2micro.utils.tifffile_save(image, path): Save image as TIFF using tifffile for full-precision support.

exo2micro.utils.trim_to_signal(post_im, pre_im, threshold=0)

Trim both images to the bounding box of their combined nonzero signal.

Discards large empty margins before padding and registration. This is critical when images have significant zero-padded borders, because those empty regions confuse phase correlation and ECC.

Parameters:

post_im (ndarray) – Post-stain image (2D).
pre_im (ndarray) – Pre-stain image (2D).
threshold (float) – Pixel values <= this are treated as empty background (default 0).

Returns:

post_trimmed (ndarray)
pre_trimmed (ndarray)
bbox (tuple) – (row_min, row_max, col_min, col_max) bounding box applied.
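Trimming to the combined bounding box can be sketched with NumPy as follows. Several details are assumptions: trim_pair is a hypothetical name, both inputs are taken to have the same shape, and row_max and col_max are treated as exclusive slice bounds, which the description above does not specify.

```python
import numpy as np


def trim_pair(post_im, pre_im, threshold=0):
    """Crop both images to the bounding box of pixels above `threshold`."""
    signal = (post_im > threshold) | (pre_im > threshold)   # combined signal mask
    if not signal.any():
        # Nothing above threshold: return the inputs with a full-frame bbox.
        return post_im, pre_im, (0, post_im.shape[0], 0, post_im.shape[1])
    rows = np.flatnonzero(signal.any(axis=1))
    cols = np.flatnonzero(signal.any(axis=0))
    # ASSUMPTION: the upper bounds are exclusive, so they can be used as slices.
    r0, r1 = int(rows[0]), int(rows[-1]) + 1
    c0, c1 = int(cols[0]), int(cols[-1]) + 1
    return post_im[r0:r1, c0:c1], pre_im[r0:r1, c0:c1], (r0, r1, c0, c1)


post = np.zeros((6, 8))
pre = np.zeros((6, 8))
post[2, 3] = 5
pre[4, 6] = 7
post_t, pre_t, bbox = trim_pair(post, pre)
print(bbox, post_t.shape)
```

Using the union of the two masks keeps the images on a shared frame, so a pixel coordinate in one trimmed image still refers to the same location in the other.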