Detecting batch effects (Python port)¶

This notebook is a Python port of the detect_batch_effects.Rmd vignette from biosurf/cyCombine. It shows how to call the two batch-effect diagnostics in cycombinepy:

detect_batch_effect_express: a three-panel summary with a mean-EMD heatmap across markers, per-marker density curves colored by batch, and an MDS plot of per-sample medians when a sample column is available.
detect_batch_effect: the same outputs, plus a UMAP of the uncorrected data and a per-marker MAD bar chart.

Both functions validate requested marker names, required batch metadata, optional sample metadata when present, and finite marker matrices before plotting. Run these diagnostics before correction. If batches already mix in the diagnostic views, correction may not improve the analysis.

Cofactor guidance¶

The arcsinh transform compresses the dynamic range of cytometry intensities. Use the cofactor that matches your instrument:

Technology	Cofactor
Mass cytometry	5
Conventional flow	150
Full-spectrum flow	6000

Note

Documentation builds do not execute this notebook. The FCS download path is for interactive runs. CI covers executable documentation with deterministic synthetic snippets.

Setup¶

import os
import tempfile
import warnings

import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import scanpy as sc

import cycombinepy as pc

warnings.filterwarnings('ignore')
sc.set_figure_params(figsize=(4, 4), dpi=80)
print('cycombinepy', pc.__version__)

Load the Nuñez PBMC dataset¶

We reuse the two figshare-hosted FCS files from the Main cyCombine workflow (Python port) tutorial. The loader below tries the canonical download first, falls back to pre-staged files in $CYCOMBINEPY_DATA_DIR, and finally synthesizes a two-batch AnnData so the notebook executes even in offline environments.

def _drop_noninformative(a: ad.AnnData) -> ad.AnnData:
    keep = [
        v for v in a.var_names
        if v not in ('Time', 'LD') and '-' not in v
    ]
    return a[:, keep].copy()


def _synthetic_two_batch(n_per_batch: int = 3000, seed: int = 0) -> ad.AnnData:
    local_rng = np.random.default_rng(seed)
    markers = [
        'CD3', 'CD4', 'CD8', 'CD19', 'CD14', 'CD16', 'CD56',
        'CD11c', 'HLADR', 'CD45', 'CD25', 'CD127', 'CD38',
        'CD27', 'CD69',
    ]
    n_markers = len(markers)
    n_types = 6
    type_means = local_rng.normal(1.0, 0.9, (n_types, n_markers)).clip(0, None)

    def _one_batch(shift: float) -> np.ndarray:
        per_type = n_per_batch // n_types
        blocks = []
        for mu in type_means:
            blocks.append(local_rng.normal(mu + shift, 0.35, (per_type, n_markers)))
        return np.clip(np.vstack(blocks), 0, None) * 20

    X1 = _one_batch(0.0)
    X2 = _one_batch(0.8)
    X = np.vstack([X1, X2])
    obs = pd.DataFrame({
        'batch': ['batch1'] * len(X1) + ['batch2'] * len(X2),
        'sample': ['b1_s1'] * len(X1) + ['b2_s1'] * len(X2),
    })
    obs.index = obs.index.astype(str)
    a = ad.AnnData(X=X.astype(float), obs=obs)
    a.var_names = markers
    return a


def load_pbmc_two_batches() -> tuple[ad.AnnData, str]:
    # ---- Canonical figshare download ---------------------------------
    temp_dir_obj = tempfile.TemporaryDirectory()
    data_dir = temp_dir_obj.name
    urls = [
        'https://figshare.com/ndownloader/files/55982654',
        'https://figshare.com/ndownloader/files/55982657',
    ]
    downloaded: list[str] = []
    try:
        for url in urls:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
            cd = r.headers.get('Content-Disposition', '')
            name = (
                cd.split('filename=')[1].strip('"\'')
                if 'filename=' in cd
                else os.path.basename(url) + '.fcs'
            )
            path = os.path.join(data_dir, name)
            with open(path, 'wb') as f:
                f.write(r.content)
            downloaded.append(path)
        source = 'figshare'
    except Exception as exc:
        downloaded = []
        source = f'synthetic (download failed: {type(exc).__name__})'
        staged = os.environ.get('CYCOMBINEPY_DATA_DIR')
        if staged and os.path.isdir(staged):
            found = sorted(
                os.path.join(staged, f)
                for f in os.listdir(staged)
                if f.lower().endswith('.fcs')
            )
            if len(found) >= 2:
                downloaded = found[:2]
                source = f'pre-staged ({staged})'

    if downloaded:
        import readfcs
        adatas = []
        for i, path in enumerate(downloaded, start=1):
            a = readfcs.read(path)
            a = _drop_noninformative(a)
            a.obs['batch'] = f'batch{i}'
            a.obs['sample'] = f'b{i}_s1'
            adatas.append(a)
        adata = ad.concat(adatas, join='outer', index_unique='-')
        # Keep a handle on the temp dir for this session.
        adata.uns['_temp_dir_obj_ref'] = id(temp_dir_obj)
    else:
        adata = _synthetic_two_batch(n_per_batch=3000, seed=0)

    adata.obs['batch'] = adata.obs['batch'].astype('category')
    adata.obs['sample'] = adata.obs['sample'].astype('category')
    adata.obs_names_make_unique()
    return adata, source


adata, data_source = load_pbmc_two_batches()
print('data source:', data_source)
adata

# Arcsinh-transform before diagnostics. cyCombine recommends cofactor 5
# for CyTOF; change to 150 (flow) or 6000 (spectral) as appropriate.
pc.transform_asinh(adata, cofactor=5, derand=True, seed=0)
adata

`detect_batch_effect_express`: three-panel summary¶

Call detect_batch_effect_express to summarize batch imbalance with:

an EMD heatmap of mean Earth Mover’s Distance per marker across all batch pairs,
per-marker density curves colored by batch,
an MDS plot of per-sample medians. sample_key is optional; if it is absent, the returned mds figure reports that no sample column was provided.

Pass out_dir=... to save all three as PNGs. The function also returns a dict of matplotlib figures for inline inspection.

figs_express = pc.detect_batch_effect_express(
    adata,
    batch_key='batch',
    sample_key='sample',
    downsample=5000,
    out_dir=None,
    seed=472,
)
figs_express['emd']

figs_express['density']

figs_express['mds']

`detect_batch_effect`: extended diagnostic¶

detect_batch_effect runs the express diagnostics above and adds a PCA-UMAP colored by batch plus a per-marker MAD bar chart. Use it when the correction decision should consider both marker distributions and a low-dimensional embedding.

figs_full = pc.detect_batch_effect(
    adata,
    batch_key='batch',
    sample_key='sample',
    downsample=5000,
    out_dir=None,
    seed=434,
)
figs_full['umap']

figs_full['mad']

Interpreting the diagnostics¶

EMD heatmap: larger per-marker values indicate markers whose distributions differ more across batches. A marker with near-zero EMD is already aligned by this metric.
Density plots: overlapping curves across batches are consistent with lower batch-associated variation. A split by batch suggests technical variation.
MDS of per-sample medians: samples clustering by batch rather than by biology suggest a batch-associated shift.
UMAP colored by batch: batch-wise separation indicates that downstream embedding or clustering may be affected by technical variation.
Per-marker MAD: larger per-batch dispersion differences can occur with instrument drift or sample-preparation variation.

If these diagnostics show a batch effect, proceed to the Main cyCombine workflow (Python port) tutorial to run the correction pipeline.

Detecting batch effects (Python port)¶

Cofactor guidance¶

Setup¶

Load the Nuñez PBMC dataset¶

detect_batch_effect_express: three-panel summary¶

detect_batch_effect: extended diagnostic¶

Interpreting the diagnostics¶

`detect_batch_effect_express`: three-panel summary¶

`detect_batch_effect`: extended diagnostic¶