Scale Estimation Methods

This page explains the math behind exo2micro’s scale estimation methods. For a lower-level user-facing version, see Scale Methods.

The goal

After alignment, we need a scalar s such that:

diff = post − s × pre

cleanly separates background from signal. Specifically we want s to equal the ratio between post-stain and pre-stain autofluorescent background, so that background pixels (where post ≈ s × pre) subtract out near zero and microbe pixels (where post > s × pre) stand out as positive excess.

The Moffat-fit method (default)

The canonical v2.3 method fits a Moffat profile to the left wing of the log-ratio distribution.

The setup

Compute per-pixel ratios for the “both have signal” pixels:

r = post[both] / pre[both]

and take log₁₀:

x = log10(r)

For pixels where the post-stain is pure autofluorescence (same as pre-stain × true ratio), r clusters tightly around the true background scale. For pixels where the post-stain also contains microbe signal, r is larger — these pixels live in the right tail of the log-ratio distribution.

So the distribution of x has:

A peak at log10(true_background_scale).
A left wing from noise fluctuations around that peak.
A right tail from microbe-contaminated pixels.

We want to find the peak centre, ignoring the right tail.

The Moffat profile

M(x; amp, μ, α, β) = amp × (1 + ((x − μ) / α)²)^(−β)

Moffat profiles were originally developed for astronomical point-spread functions. They have:

A sharper peak than a Gaussian (when β is small), which matches empirical microscopy noise better.
Power-law wings rather than exponential ones, which handles outliers more gracefully.

The family smoothly interpolates: β = 1 is a Lorentzian (heavy tails), large β approaches a Gaussian. We fit β freely and let the data decide.

Why not just fit a Gaussian? Because the pre/post ratio distribution in real microscopy data consistently has sharper peaks and longer tails than a Gaussian would predict. The pre-v2.2 code tried a Voigt profile (Gaussian × Lorentzian convolution) and it was OK, but Moffat is cleaner to fit — one fewer parameter, and curve_fit converges more reliably.

Why not use a simple histogram peak finder? The log-ratio histogram has quantization artifacts near log10(1) = 0 from integer pixel values (e.g. post = 5, pre = 5 produces exactly zero). These form a spurious spike that can fool a naive peak finder.

The fitting procedure

The plot_ratio_histogram_simple function in exo2micro.plotting does:

Histogram x with 200 bins over its full range.
Smooth with a uniform filter (kernel size 2*sigma+1 where sigma is a constant 3). This gives a clean peak to find.
Exclude bins within 3 × bin_width of x = 0 to avoid the quantization spike.
Find the peak of the smoothed, spike-excluded histogram. Call this μ₀ — the initial guess.
Select the left wing: bins with x ≤ μ₀ excluding the near-zero band.
Mirror it across μ₀: for each (x, y) point in the left wing, synthesize (2μ₀ − x, y).
Fit a Moffat profile to the combined real-left-wing + synthetic-right-wing data using scipy.optimize.curve_fit().
The fitted μ is the refined peak centre.
scale = 10^μ.

Why mirror the left wing? Because the right side of the real distribution is contaminated by microbe signal (the long positive tail we want to ignore). Mirroring the left wing gives curve_fit a symmetric target that represents what the noise distribution would look like without the microbe contribution, and that’s what we want to match.

Failure modes

``curve_fit`` doesn’t converge. Rare with 200 bins, but possible on pathological data. The pipeline falls back to the smoothed-histogram peak μ₀ and prints a message. Still a usable scale, just less refined.
Peak lands in the wrong place. Can happen when the data has multiple modes (e.g. two distinct tissue types with different autofluorescence ratios). In that case, check the ratio_histogram.png plot — if the peak is clearly wrong, fall back to a percentile-based method.
Very few pixels. If len(both) < ~1000, the histogram is too sparse for a meaningful fit. This usually indicates a failed alignment.

The ratio percentile method

Sometimes you want a simpler, more transparent alternative. The scale_percentile parameter produces one.

Mathematically:

x = log10(post[both] / pre[both])
scale = 10^percentile(x, p)

for a user-chosen percentile p. This is a one-line reduction with no fitting, no histogramming, no assumptions about noise distribution shape.

When it beats the Moffat fit. When the log-ratio distribution is multi-modal or non-symmetric in a way that the Moffat assumption can’t handle, the percentile method can be more robust — you’re explicitly telling it “I want the value at this fraction of the sorted distribution”, which bypasses any notion of “peak”. The median (p = 50) in particular is a very stable estimator that just asks “what’s the typical ratio?”.

When it loses. When microbe signal pushes the upper tail around, extreme percentiles (p > 90) can land in the microbe cluster. Always prefer moderate percentiles (20-70) unless you have a strong reason.

Implementation. See SampleDye._compute_scale_percentile in pipeline.py. It’s three lines — compute both-signal mask, take the per-pixel ratio, take the percentile.

The manual method

manual_scale is the simplest: the user specifies an exact scale factor and exo2micro uses it verbatim. No estimation, no fitting.

Use cases:

Reproducing a published result that specified a particular scale factor.
Sensitivity analysis (running the same sample with several nearby scale values).
Bypassing the Moffat fit when it’s clearly misbehaving on an unusual dataset.

Coexistence

All three methods can be active simultaneously:

The Moffat fit is always computed (it drives the canonical difference image and the scale line on the excess heatmap).
If scale_percentile is set, an additional difference image is produced at that percentile scale.
If manual_scale is set, an additional difference image is produced at that exact scale.

The excess heatmap overplots all active scale lines in different colors, so you can visually compare them side-by-side:

Green (#00cc88) — Moffat fit
Orange (#ff9933) — ratio percentile
Pink (#ff3366) — manual value

This is the recommended workflow for a new dataset: run with all three active, compare in the excess heatmap, and decide which method you trust most for that sample type.

Internal code paths

The stage-4 diagnostic code in pipeline.py calls:

exo2micro.plot_ratio_histogram_simple() — returns the Moffat-fit scale and saves the histogram PNG.
SampleDye._compute_scale_percentile() — returns the percentile scale (or None if no valid pixels).
SampleDye._save_difference() — produces TIFF + FITS + PNG for a single scale value, labelled 'moffat', 'percentile_p<value>', or 'manual'.
exo2micro.plot_excess_heatmap() with a scales= list of (label, value, colour) tuples, which draws one line per active scale.

Legacy scaling methods

exo2micro 2.1 and earlier had two additional methods, least-squares (LS) and robust-percentile, implemented in the removed scaling.py module. Both are preserved verbatim in exo2micro.legacy for back-compat but are not called by the v2.3 pipeline. The reasons they were dropped:

LS. Minimizes squared residuals over a tissue mask. This is biased upward by bright microbe pixels, which pull the dot-product estimate toward higher scale values. On well-aligned images with visible microbe signal, LS commonly oversubtracted by ~10-30%.
Robust percentile. The old robust_percentile=90 default was too aggressive — with well-aligned images, the 90th percentile of the ratio distribution consistently landed in the microbe tail and produced scale estimates ~4× too high. Moderate percentiles worked fine; the current scale_percentile parameter is the same math with a clearer default (None, meaning “don’t use this method unless you explicitly set a percentile”).

If you need the old behaviour for reproducing an earlier result, use exo2micro.legacy.optimize_subtraction directly:

from exo2micro.legacy import optimize_subtraction
opt_scale, scale_sig, tissue_mask, plot_data = optimize_subtraction(
    post_im, pre_im, method='least_squares')