Scale Estimation Methods ======================== This page explains the math behind exo2micro's scale estimation methods. For a lower-level user-facing version, see :doc:`../users/scale_methods`. The goal -------- After alignment, we need a scalar ``s`` such that: .. code-block:: text diff = post − s × pre cleanly separates background from signal. Specifically we want ``s`` to equal the ratio between post-stain and pre-stain autofluorescent background, so that background pixels (where ``post ≈ s × pre``) subtract out near zero and microbe pixels (where ``post > s × pre``) stand out as positive excess. The Moffat-fit method (default) ------------------------------- The canonical v2.3 method fits a Moffat profile to the left wing of the log-ratio distribution. The setup ~~~~~~~~~ Compute per-pixel ratios for the "both have signal" pixels:: r = post[both] / pre[both] and take ``log₁₀``:: x = log10(r) For pixels where the post-stain is pure autofluorescence (same as pre-stain × true ratio), ``r`` clusters tightly around the true background scale. For pixels where the post-stain also contains microbe signal, ``r`` is larger — these pixels live in the right tail of the log-ratio distribution. So the distribution of ``x`` has: - A peak at ``log10(true_background_scale)``. - A left wing from noise fluctuations around that peak. - A right tail from microbe-contaminated pixels. We want to find the peak centre, ignoring the right tail. The Moffat profile ~~~~~~~~~~~~~~~~~~ .. code-block:: text M(x; amp, μ, α, β) = amp × (1 + ((x − μ) / α)²)^(−β) Moffat profiles were originally developed for astronomical point-spread functions. They have: - A sharper peak than a Gaussian (when ``β`` is small), which matches empirical microscopy noise better. - Power-law wings rather than exponential ones, which handles outliers more gracefully. The family smoothly interpolates: ``β = 1`` is a Lorentzian (heavy tails), large ``β`` approaches a Gaussian. We fit ``β`` freely and let the data decide. Why not just fit a Gaussian? Because the pre/post ratio distribution in real microscopy data consistently has sharper peaks and longer tails than a Gaussian would predict. The pre-v2.2 code tried a Voigt profile (Gaussian × Lorentzian convolution) and it was OK, but Moffat is cleaner to fit — one fewer parameter, and ``curve_fit`` converges more reliably. Why not use a simple histogram peak finder? The log-ratio histogram has quantization artifacts near ``log10(1) = 0`` from integer pixel values (e.g. ``post = 5, pre = 5`` produces exactly zero). These form a spurious spike that can fool a naive peak finder. The fitting procedure ~~~~~~~~~~~~~~~~~~~~~ The ``plot_ratio_histogram_simple`` function in ``exo2micro.plotting`` does: 1. Histogram ``x`` with 200 bins over its full range. 2. Smooth with a uniform filter (kernel size ``2*sigma+1`` where ``sigma`` is a constant 3). This gives a clean peak to find. 3. Exclude bins within ``3 × bin_width`` of ``x = 0`` to avoid the quantization spike. 4. Find the peak of the smoothed, spike-excluded histogram. Call this ``μ₀`` — the initial guess. 5. Select the left wing: bins with ``x ≤ μ₀`` excluding the near-zero band. 6. Mirror it across ``μ₀``: for each ``(x, y)`` point in the left wing, synthesize ``(2μ₀ − x, y)``. 7. Fit a Moffat profile to the combined real-left-wing + synthetic-right-wing data using :func:`scipy.optimize.curve_fit`. 8. The fitted ``μ`` is the refined peak centre. 9. ``scale = 10^μ``. Why mirror the left wing? Because the right side of the real distribution is contaminated by microbe signal (the long positive tail we want to ignore). Mirroring the left wing gives ``curve_fit`` a symmetric target that represents what the noise distribution *would* look like without the microbe contribution, and that's what we want to match. Failure modes ~~~~~~~~~~~~~ - **``curve_fit`` doesn't converge.** Rare with 200 bins, but possible on pathological data. The pipeline falls back to the smoothed-histogram peak ``μ₀`` and prints a message. Still a usable scale, just less refined. - **Peak lands in the wrong place.** Can happen when the data has multiple modes (e.g. two distinct tissue types with different autofluorescence ratios). In that case, check the ``ratio_histogram.png`` plot — if the peak is clearly wrong, fall back to a percentile-based method. - **Very few pixels.** If ``len(both) < ~1000``, the histogram is too sparse for a meaningful fit. This usually indicates a failed alignment. The ratio percentile method --------------------------- Sometimes you want a simpler, more transparent alternative. The ``scale_percentile`` parameter produces one. Mathematically:: x = log10(post[both] / pre[both]) scale = 10^percentile(x, p) for a user-chosen percentile ``p``. This is a one-line reduction with no fitting, no histogramming, no assumptions about noise distribution shape. **When it beats the Moffat fit.** When the log-ratio distribution is multi-modal or non-symmetric in a way that the Moffat assumption can't handle, the percentile method can be more robust — you're explicitly telling it "I want the value at this fraction of the sorted distribution", which bypasses any notion of "peak". The median (p = 50) in particular is a very stable estimator that just asks "what's the typical ratio?". **When it loses.** When microbe signal pushes the upper tail around, extreme percentiles (p > 90) can land in the microbe cluster. Always prefer moderate percentiles (20-70) unless you have a strong reason. **Implementation.** See ``SampleDye._compute_scale_percentile`` in ``pipeline.py``. It's three lines — compute both-signal mask, take the per-pixel ratio, take the percentile. The manual method ----------------- ``manual_scale`` is the simplest: the user specifies an exact scale factor and exo2micro uses it verbatim. No estimation, no fitting. Use cases: - Reproducing a published result that specified a particular scale factor. - Sensitivity analysis (running the same sample with several nearby scale values). - Bypassing the Moffat fit when it's clearly misbehaving on an unusual dataset. Coexistence ----------- All three methods can be active simultaneously: - The Moffat fit is always computed (it drives the canonical difference image and the scale line on the excess heatmap). - If ``scale_percentile`` is set, an additional difference image is produced at that percentile scale. - If ``manual_scale`` is set, an additional difference image is produced at that exact scale. The excess heatmap overplots all active scale lines in different colors, so you can visually compare them side-by-side: - **Green** (``#00cc88``) — Moffat fit - **Orange** (``#ff9933``) — ratio percentile - **Pink** (``#ff3366``) — manual value This is the recommended workflow for a new dataset: run with all three active, compare in the excess heatmap, and decide which method you trust most for that sample type. Internal code paths ------------------- The stage-4 diagnostic code in ``pipeline.py`` calls: - :func:`exo2micro.plot_ratio_histogram_simple` — returns the Moffat-fit scale and saves the histogram PNG. - :meth:`SampleDye._compute_scale_percentile` — returns the percentile scale (or ``None`` if no valid pixels). - :meth:`SampleDye._save_difference` — produces TIFF + FITS + PNG for a single scale value, labelled ``'moffat'``, ``'percentile_p'``, or ``'manual'``. - :func:`exo2micro.plot_excess_heatmap` with a ``scales=`` list of ``(label, value, colour)`` tuples, which draws one line per active scale. Legacy scaling methods ---------------------- exo2micro 2.1 and earlier had two additional methods, least-squares (LS) and robust-percentile, implemented in the removed ``scaling.py`` module. Both are preserved verbatim in ``exo2micro.legacy`` for back-compat but are not called by the v2.3 pipeline. The reasons they were dropped: - **LS.** Minimizes squared residuals over a tissue mask. This is biased upward by bright microbe pixels, which pull the dot-product estimate toward higher scale values. On well-aligned images with visible microbe signal, LS commonly oversubtracted by ~10-30%. - **Robust percentile.** The old ``robust_percentile=90`` default was too aggressive — with well-aligned images, the 90th percentile of the ratio distribution consistently landed in the microbe tail and produced scale estimates ~4× too high. Moderate percentiles worked fine; the current ``scale_percentile`` parameter is the same math with a clearer default (``None``, meaning "don't use this method unless you explicitly set a percentile"). If you need the old behaviour for reproducing an earlier result, use ``exo2micro.legacy.optimize_subtraction`` directly:: from exo2micro.legacy import optimize_subtraction opt_scale, scale_sig, tissue_mask, plot_data = optimize_subtraction( post_im, pre_im, method='least_squares')