Main cyCombine workflow (Python port)

This notebook is a Python port of the cyCombine.Rmd vignette from the biosurf/cyCombine R package. It walks through the full cycombinepy batch-correction pipeline:

  1. Download two cytometry FCS files (one per batch)

  2. Load them into a single AnnData

  3. Arcsinh-transform the marker channels

  4. Inspect the batch effect (UMAP colored by batch)

  5. Run cycombinepy.batch_correct (the recommended one-shot API)

  6. Reproduce the same pipeline manually with the modular API (normalize → create_som → correct_data)

  7. Evaluate the correction with EMD reduction

  8. Visualize density + UMAP before vs. after correction

The dataset is the full-spectrum cytometry PBMC sample from Nuñez, Schmid & Power et al. 2023, mirroring the scvi-tools CytoVI tutorial: one donor, measured twice across two batches.

Setup

import os
import tempfile
import warnings

import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import scanpy as sc
import seaborn as sns

import cycombinepy as pc
from cycombinepy.correct import CORRECTED_LAYER
from cycombinepy import plotting as pcpl

warnings.filterwarnings('ignore')
sc.set_figure_params(figsize=(4, 4), dpi=80)
rng = np.random.default_rng(0)
print('cycombinepy', pc.__version__)
cycombinepy 0.1.0.dev0

Download the data

We fetch two separate FCS files (one per batch) from figshare. Each file contains full-spectrum cytometry events from one measurement of the same donor, so any inter-batch differences we observe downstream are purely technical and should be removed by the correction.

The loader below first tries the figshare download. If the environment has no outbound network access (e.g. a CI sandbox), it falls back to pre-staged files in $CYCOMBINEPY_DATA_DIR, and finally to a small synthetic two-batch dataset so the notebook still renders end-to-end. In every case the rest of the notebook runs unchanged.

# ---- Canonical download path (works for end users) ----------------------
temp_dir_obj = tempfile.TemporaryDirectory()
data_dir = temp_dir_obj.name

urls = [
    'https://figshare.com/ndownloader/files/55982654',  # batch 1
    'https://figshare.com/ndownloader/files/55982657',  # batch 2
]

downloaded_files: list[str] = []
data_source = 'figshare'
try:
    for url in urls:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        cd = response.headers.get('Content-Disposition', '')
        if 'filename=' in cd:
            filename = cd.split('filename=')[1].strip('"\'')
        else:
            filename = os.path.basename(url) + '.fcs'
        file_path = os.path.join(data_dir, filename)
        with open(file_path, 'wb') as f:
            f.write(response.content)
        downloaded_files.append(file_path)
except Exception as exc:
    downloaded_files = []
    # ---- Fallback 1: pre-staged files via env var ---------------------
    staged = os.environ.get('CYCOMBINEPY_DATA_DIR')
    if staged and os.path.isdir(staged):
        found = sorted(
            os.path.join(staged, f)
            for f in os.listdir(staged)
            if f.lower().endswith('.fcs')
        )
        if len(found) >= 2:
            downloaded_files = found[:2]
            data_source = f'pre-staged ({staged})'
    if not downloaded_files:
        data_source = f'synthetic (download failed: {type(exc).__name__})'

print('data source:', data_source)
downloaded_files
data source: synthetic (download failed: ProxyError)
[]

Load FCS files into AnnData

We read each FCS file independently with readfcs, tag it with its batch id, and concatenate the two AnnData objects along the cells axis. To keep parity with the scvi-tools tutorial we also drop uninformative channels (Time, LD, and any channel name containing '-').

If the canonical download failed in the previous cell, we build a synthetic two-batch AnnData instead. Its shape mimics a PBMC panel and contains a planted batch shift so there is something real for the correction to remove.

def _drop_noninformative(a: ad.AnnData) -> ad.AnnData:
    keep = [
        v for v in a.var_names
        if v not in ('Time', 'LD') and '-' not in v
    ]
    return a[:, keep].copy()


def _synthetic_two_batch(n_per_batch: int = 3000, seed: int = 0) -> ad.AnnData:
    """Synthetic PBMC-like AnnData with two batches and a planted shift."""
    local_rng = np.random.default_rng(seed)
    markers = [
        'CD3', 'CD4', 'CD8', 'CD19', 'CD14', 'CD16', 'CD56',
        'CD11c', 'HLADR', 'CD45', 'CD25', 'CD127', 'CD38',
        'CD27', 'CD69',
    ]
    n_markers = len(markers)
    n_types = 6
    type_means = local_rng.normal(1.0, 0.9, (n_types, n_markers)).clip(0, None)

    def _one_batch(shift: float) -> np.ndarray:
        per_type = n_per_batch // n_types
        blocks = []
        for mu in type_means:
            blocks.append(local_rng.normal(mu + shift, 0.35, (per_type, n_markers)))
        X = np.vstack(blocks)
        return np.clip(X, 0, None) * 20  # push into raw-intensity range

    X1 = _one_batch(0.0)
    X2 = _one_batch(0.8)
    X = np.vstack([X1, X2])
    obs = pd.DataFrame({'batch': ['batch1'] * len(X1) + ['batch2'] * len(X2)})
    obs.index = obs.index.astype(str)
    a = ad.AnnData(X=X.astype(float), obs=obs)
    a.var_names = markers
    return a


if downloaded_files:
    import readfcs
    adatas = []
    for i, path in enumerate(downloaded_files, start=1):
        a = readfcs.read(path)
        a = _drop_noninformative(a)
        a.obs['batch'] = f'batch{i}'
        adatas.append(a)
    adata = ad.concat(adatas, join='outer', index_unique='-')
else:
    adata = _synthetic_two_batch(n_per_batch=3000, seed=0)

adata.obs['batch'] = adata.obs['batch'].astype('category')
print('source:', data_source)
adata
source: synthetic (download failed: ProxyError)
AnnData object with n_obs × n_vars = 6000 × 15
    obs: 'batch'
# Optional: subsample so the notebook runs quickly on laptops.
target_per_batch = 3000
parts = []
for b in adata.obs['batch'].cat.categories:
    sub = adata[adata.obs['batch'] == b]
    if sub.n_obs > target_per_batch:
        idx = rng.choice(sub.n_obs, target_per_batch, replace=False)
        sub = sub[idx]
    parts.append(sub.copy())
adata = ad.concat(parts, join='outer')
adata.obs['batch'] = adata.obs['batch'].astype('category')
adata.obs_names_make_unique()
adata
AnnData object with n_obs × n_vars = 6000 × 15
    obs: 'batch'

Prepare data — arcsinh transform

Mass- and flow-cytometry intensities span several orders of magnitude, so the standard preprocessing step is an arcsinh transform with a marker-type-appropriate cofactor: 5 for CyTOF (and the cyCombine R default), 150 for conventional flow, 6000 for full-spectrum flow. We use cofactor=5 here to match the R default; for real full-spectrum data a larger cofactor (e.g. 6000) is usually more appropriate, so tune this for your instrument.
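For reference, the core math is just arcsinh of the intensities divided by the cofactor. A plain-numpy sketch (cycombinepy's transform_asinh additionally handles de-randomization via the derand flag; that part is omitted here):

```python
import numpy as np

def asinh_transform(X, cofactor=5.0):
    """Arcsinh transform: ~linear below the cofactor, ~logarithmic far above it."""
    return np.arcsinh(np.asarray(X, dtype=float) / cofactor)

raw = np.array([0.0, 5.0, 500.0, 50_000.0])
# Small values stay near zero; large values are compressed to a log-like scale.
print(asinh_transform(raw))
```

Because the transform is linear near zero, low-signal channels keep their resolution while bright channels no longer dominate distances in clustering and embedding.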

pc.transform_asinh(adata, cofactor=5, derand=True, seed=0)
adata
AnnData object with n_obs × n_vars = 6000 × 15
    obs: 'batch'

Inspect the batch effect

Before correction we compute a quick PCA-UMAP and color the cells by batch. A visible per-batch split means the technical variation would otherwise dominate any downstream clustering or embedding.

fig = pcpl.plot_dimred(adata, kind='umap', color='batch', seed=0)
fig.suptitle('Uncorrected — UMAP colored by batch')
plt.show()

Batch correction — modular API

The same result can be reproduced step-by-step by calling normalize, create_som, and correct_data explicitly. This is useful when you want to swap a component (e.g. try norm_method='rank' or use metaclustering) or inspect the intermediate state.

Here we run the modular pipeline on a copy of the AnnData so the previous batch_correct result stays intact.

modular = adata.copy()
# Reset modular.X to the pre-correction (post-asinh) values.
modular.X = adata.X.copy()
if CORRECTED_LAYER in modular.layers:
    del modular.layers[CORRECTED_LAYER]

# 1. Batch-wise normalize (for the clustering step only)
pc.normalize(modular, method='scale', batch_key='batch')

# 2. FlowSOM clustering on the normalized view
pc.create_som(
    modular,
    xdim=6, ydim=6,
    rlen=5,
    seed=473,
    label_key='cycombine_som',
)

# 3. Per-cluster ComBat on the un-normalized values.
# Reset modular.X to the pre-normalization values before correcting.
modular.X = adata.X.copy()
pc.correct_data(
    modular,
    label_key='cycombine_som',
    batch_key='batch',
    covar=None,
    parametric=True,
)

print('modular layers:', list(modular.layers.keys()))
2026-04-09 10:37:23.303 | DEBUG    | flowsom.main:__init__:82 - Reading input.
2026-04-09 10:37:23.306 | DEBUG    | flowsom.main:__init__:84 - Fitting model: clustering and metaclustering.
2026-04-09 10:37:23.340 | DEBUG    | flowsom.main:__init__:86 - Updating derived values.
modular layers: ['cycombine_corrected']
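For intuition about step 3, a toy numpy stand-in: within each cluster, align every batch's per-marker mean and standard deviation to the pooled cluster statistics. This is a deliberate simplification — real ComBat (and hence correct_data) additionally applies empirical-Bayes shrinkage to the per-batch estimates — but it captures why clustering first matters: the correction is local to each cluster, so cluster-specific biology survives.

```python
import numpy as np

def toy_cluster_align(X, clusters, batches):
    """Per-cluster, per-batch location-scale alignment (simplified ComBat-like step)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for c in np.unique(clusters):
        in_c = clusters == c
        pooled_mu = X[in_c].mean(axis=0)
        pooled_sd = X[in_c].std(axis=0)
        for b in np.unique(batches):
            sel = in_c & (batches == b)
            if sel.sum() < 2:
                continue
            mu = X[sel].mean(axis=0)
            sd = X[sel].std(axis=0)
            sd[sd == 0] = 1.0  # guard against constant markers
            out[sel] = (X[sel] - mu) / sd * pooled_sd + pooled_mu
    return out

rng = np.random.default_rng(0)
# One cluster, two batches, a planted 0.8 shift in batch 2.
X = np.vstack([rng.normal(0.0, 1.0, (200, 3)), rng.normal(0.8, 1.0, (200, 3))])
clusters = np.zeros(400, dtype=int)
batches = np.array([0] * 200 + [1] * 200)

corrected = toy_cluster_align(X, clusters, batches)
gap_before = abs(X[:200].mean() - X[200:].mean())
gap_after = abs(corrected[:200].mean() - corrected[200:].mean())
print(f'batch mean gap: {gap_before:.3f} -> {gap_after:.3f}')
```

After alignment, each batch's per-cluster mean equals the pooled mean by construction, so the batch gap collapses to (numerically) zero.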

Evaluate — EMD reduction

Earth Mover’s Distance (EMD, also called Wasserstein-1) quantifies the distance between the marker-intensity distributions of two batches within the same cell cluster. A good correction should make these distributions nearly identical while preserving per-cluster biology.

compute_emd returns a tidy DataFrame with one row per (cluster, marker, batch-pair); evaluate_emd joins uncorrected and corrected tables and reports percent reduction.
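Each per-(cluster, marker) distance corresponds to the 1-D Wasserstein distance between the two batches' intensity values, which scipy exposes directly. A minimal sketch of what a single table entry measures (using scipy, independent of cycombinepy):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
batch1 = rng.normal(1.0, 0.5, 2000)  # one marker's intensities in batch 1
batch2 = rng.normal(1.6, 0.5, 2000)  # same marker, batch 2, shifted by 0.6

# For two equal-spread normals, EMD is approximately the mean shift.
emd = wasserstein_distance(batch1, batch2)
print(f'EMD before correction: {emd:.3f}')

# A perfect correction makes the distributions coincide, so EMD -> 0.
emd_self = wasserstein_distance(batch1, batch1)
print(f'EMD of a batch with itself: {emd_self:.3f}')
```

Because EMD is computed per cluster and per marker, a large residual value after correction pinpoints exactly which marker in which cluster is still batch-confounded.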

emd_before = pc.compute_emd(
    adata, cell_key='cycombine_som', batch_key='batch', layer=None
)
emd_after = pc.compute_emd(
    adata, cell_key='cycombine_som', batch_key='batch', layer=CORRECTED_LAYER
)
report = pc.evaluate_emd(emd_before, emd_after)
report.head()
cluster marker batch1 batch2 emd_uncorrected emd_corrected reduction reduction_pct
0 1 CD3 batch1 batch2 0.935786 0.027308 0.908478 97.081813
1 1 CD4 batch1 batch2 0.825192 0.057095 0.768097 93.081016
2 1 CD8 batch1 batch2 0.769497 0.055336 0.714161 92.808834
3 1 CD19 batch1 batch2 0.437208 0.010498 0.426710 97.598863
4 1 CD14 batch1 batch2 0.582440 0.039249 0.543190 93.261208
per_marker = (
    report.groupby('marker')[['emd_uncorrected', 'emd_corrected', 'reduction_pct']]
    .mean()
    .sort_values('reduction_pct', ascending=False)
)
per_marker.round(3)
emd_uncorrected emd_corrected reduction_pct
marker
CD16 0.629 0.096 85.012
CD14 0.436 0.057 83.936
CD19 0.564 0.087 83.555
CD25 0.656 0.106 83.357
CD127 0.563 0.104 83.281
CD8 0.466 0.092 82.603
CD11c 0.659 0.096 82.071
CD27 0.990 0.197 81.748
CD4 0.860 0.156 81.297
CD3 1.017 0.166 81.285
CD69 0.630 0.155 81.032
CD38 0.648 0.134 80.607
HLADR 0.691 0.146 79.342
CD56 0.716 0.184 77.585
CD45 1.043 0.283 75.886
fig, ax = plt.subplots(figsize=(6, 4))
long = report.melt(
    id_vars=['cluster', 'marker'],
    value_vars=['emd_uncorrected', 'emd_corrected'],
    var_name='stage',
    value_name='emd',
)
long['stage'] = long['stage'].str.replace('emd_', '', regex=False)
sns.violinplot(
    data=long, x='stage', y='emd',
    order=['uncorrected', 'corrected'], inner='quartile', ax=ax,
)
ax.set_title('EMD distribution across (cluster, marker, batch-pair)')
plt.show()

Visualize — before vs. after correction

Finally we look at the UMAP (on the corrected layer) and the per-marker density plots. Well-corrected data should have (a) batches mixing in UMAP space and (b) per-marker density curves overlapping across batches in the “corrected” row of the density grid.

fig = pcpl.plot_dimred(
    adata, kind='umap', color='batch', layer=CORRECTED_LAYER, seed=0
)
fig.suptitle('Corrected — UMAP colored by batch')
plt.show()
fig = pcpl.plot_density(
    adata,
    batch_key='batch',
    layer=CORRECTED_LAYER,
)
plt.show()

Persisting results

Save the corrected AnnData for downstream analysis:

adata.write_h5ad('cycombine_corrected.h5ad')

The uncorrected expression stays in adata.X, the corrected expression is in adata.layers['cycombine_corrected'], and the SOM cluster assignments are in adata.obs['cycombine_som'].

What next?

  • Use Detecting batch effects (Python port) to get a diagnostic report of the residual batch effect after correction.

  • Pass covar=... or anchor=... to batch_correct if you need to preserve a biological condition or carry a reference sample across batches.

  • Swap norm_method='scale' for 'rank' or 'qnorm' if your data has heavy distributional shifts between batches.
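For intuition on the 'rank' option, a per-batch rank transform maps each marker to its empirical quantiles, which is robust to arbitrary monotone distortions between batches. A plain numpy/scipy sketch of the idea (not cycombinepy's normalize implementation):

```python
import numpy as np
from scipy.stats import rankdata

def rank_normalize(X):
    """Map each column to mid-ranks scaled into (0, 1); robust to heavy shifts."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    ranks = np.apply_along_axis(rankdata, 0, X)  # ranks start at 1, ties averaged
    return (ranks - 0.5) / n

X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [100.0, 20.0]])
print(rank_normalize(X))
```

Applied batch-wise (as in normalize with batch_key), each batch ends up on the same (0, 1) scale before clustering, so the SOM groups cells by relative marker pattern rather than by batch-specific intensity scale.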