Usage guide

This page introduces the cycombinepy mental model and walks through the recommended workflow. For fully runnable end-to-end examples, see the two tutorial notebooks: Main cyCombine workflow (Python port) and Detecting batch effects (Python port).

Data conventions

cycombinepy operates on an AnnData object:

| Slot | Meaning |
| --- | --- |
| adata.X | cells × markers expression (post-asinh) |
| adata.obs["batch"] | batch assignment (required) |
| adata.obs["sample"] | sample / patient id (optional) |
| adata.obs["condition"] | biological condition (optional covariate) |
| adata.obs["anchor"] | reference sample carried across batches |
| adata.obs["cycombine_som"] | SOM cluster labels (set by create_som) |
| adata.layers["cycombine_corrected"] | corrected expression (set by correct_data) |

All functions read from adata.X by default; pass layer=... to read or write a named layer instead. Columns other than batch are optional but enable additional features (covariate-aware correction, anchor-based alignment, diagnostic MDS plots, etc.).

The pipeline in one call

The most common entry point is cycombinepy.batch_correct, which runs the full pipeline in one call:

import cycombinepy as pc
from cycombinepy.correct import CORRECTED_LAYER

pc.batch_correct(
    adata,
    batch_key="batch",
    xdim=8, ydim=8,         # 64-node SOM
    rlen=10,
    norm_method="scale",    # batch-wise z-score before clustering
    covar="condition",      # optional: preserve a biological covariate
    seed=473,
)

# The corrected matrix is now in adata.layers["cycombine_corrected"]
# (constant CORRECTED_LAYER). adata.X is unchanged.

Under the hood this is equivalent to running three steps by hand:

pc.normalize(adata, method="scale", batch_key="batch")
pc.create_som(adata, xdim=8, ydim=8, rlen=10, seed=473,
              label_key="cycombine_som")
# Correction runs on the unnormalized expression; the normalized
# view is only used for clustering.
pc.correct_data(adata, label_key="cycombine_som",
                batch_key="batch", covar="condition")

Use the modular form when you want to swap components (e.g. try norm_method="rank"), reuse an existing clustering, or inspect the intermediate state.

Loading data

From a directory of FCS files

from cycombinepy import io as pcio

adata = pcio.read_fcs_dir(
    "data/",
    metadata="metadata.csv",
    filename_col="Filename",
    batch_key="Batch",
    sample_key="Patient",
    condition_key="condition",
    cofactor=5,            # 5 CyTOF, 150 flow, 6000 spectral
    derand=True,
)

read_fcs_dir wraps pytometry / readfcs and applies the arcsinh transform in one step. Requires pip install "cycombinepy[io]".

From FCS files with no directory metadata

If you only have the FCS files and want to assign batches manually (this is the path used in the Main cyCombine workflow (Python port) tutorial):

import anndata as ad
import readfcs

import cycombinepy as pc

b1 = readfcs.read("batch1.fcs"); b1.obs["batch"] = "batch1"
b2 = readfcs.read("batch2.fcs"); b2.obs["batch"] = "batch2"
adata = ad.concat([b1, b2], join="outer", index_unique="-")
pc.transform_asinh(adata, cofactor=5)

From an existing AnnData

Nothing extra is needed: just make sure adata.obs["batch"] is set before running the pipeline.

Preprocessing

pc.transform_asinh(adata, cofactor=5, derand=True)
  • cofactor: marker-type-appropriate scale factor. Typical values: 5 (mass cytometry), 150 (conventional flow), 6000 (full-spectrum flow).

  • derand=True applies the cyCombine-style derandomization trick (ceil(x) - Uniform(0, 0.9999)) to break ties at low intensity.
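
The derandomization step can be sketched in plain NumPy (an illustrative re-implementation of the formula above, not the library's code):

```python
import numpy as np

def derandomize(x, seed=0):
    # ceil(x) - Uniform(0, 0.9999): identical raw counts map to
    # distinct values, which breaks ties at low intensity.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    return np.ceil(x) - rng.uniform(0.0, 0.9999, size=x.shape)

y = derandomize([3.0, 3.0, 0.2])  # two equal inputs become distinct
```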

Normalization

pc.normalize(adata, method="scale", batch_key="batch")

Available methods:

| Method | Notes |
| --- | --- |
| "scale" | Z-score per batch (default; matches cyCombine’s "scale") |
| "rank" | Percentile rank per batch (matches "rank" with ties handling) |
| "CLR" | Centered log-ratio, centered on the arithmetic mean of log1p |
| "CLR_seu" | Seurat-flavor CLR |
| "CLR_med" | CLR centered on the median |
| "qnorm" | PCHIP monotone-spline quantile normalization |

Normalization is only used to make the SOM clustering resilient to batch shifts; the correction step itself runs on the unnormalized expression.
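
For intuition, "scale" amounts to a per-batch z-score. A NumPy sketch of that idea (the library's normalize() is the authoritative implementation):

```python
import numpy as np

def scale_per_batch(X, batch):
    # Z-score each marker within each batch; zero-variance markers
    # are centered but not divided by zero.
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    out = np.empty_like(X)
    for b in np.unique(batch):
        m = batch == b
        mu = X[m].mean(axis=0)
        sd = X[m].std(axis=0)
        out[m] = (X[m] - mu) / np.where(sd == 0, 1.0, sd)
    return out
```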

Clustering

pc.create_som(adata, xdim=8, ydim=8, rlen=10, seed=473)

Trains a FlowSOM on the normalized marker view and writes 1-indexed integer cluster ids to adata.obs["cycombine_som"]. Pass n_clusters=20 to metacluster the SOM nodes via consensus clustering.

Per-cluster ComBat correction

pc.correct_data(
    adata,
    label_key="cycombine_som",
    batch_key="batch",
    covar="condition",      # optional
    anchor="reference_id",  # optional
    ref_batch=None,
    parametric=True,
)

Runs ComBat inside each SOM cluster in isolation, capping the corrected values to the per-cluster min/max of the input (matching R cyCombine lines 524–531). If a cluster contains cells from only one batch — or if the covariate design is confounded with the batch variable — the cluster is left unchanged.
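
The min/max cap can be sketched as follows (an illustrative re-implementation of the capping rule described above, not the library code):

```python
import numpy as np

def cap_to_cluster_range(corrected, original, clusters):
    # Within each cluster, clip corrected values to the per-marker
    # min/max of the uncorrected input, so ComBat can never push a
    # value outside the range observed in that cluster.
    corrected = np.array(corrected, dtype=float)
    original = np.asarray(original, dtype=float)
    clusters = np.asarray(clusters)
    for c in np.unique(clusters):
        m = clusters == c
        corrected[m] = np.clip(corrected[m],
                               original[m].min(axis=0),
                               original[m].max(axis=0))
    return corrected
```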

Evaluating a correction

from cycombinepy.correct import CORRECTED_LAYER

emd_before = pc.compute_emd(adata, cell_key="cycombine_som", layer=None)
emd_after  = pc.compute_emd(adata, cell_key="cycombine_som",
                            layer=CORRECTED_LAYER)
report = pc.evaluate_emd(emd_before, emd_after)
report.groupby("marker")["reduction_pct"].mean()

compute_emd uses scipy.stats.wasserstein_distance to compute per-(cluster, marker, batch-pair) 1-D Earth Mover’s Distance; evaluate_emd merges uncorrected vs. corrected tables and reports absolute and percent reduction. compute_mad / evaluate_mad provide the same interface for Median Absolute Deviation.
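
The per-pair distance is the ordinary 1-D Wasserstein distance. A sketch of what is evaluated for a single (cluster, marker) slice (illustrative, not the library implementation):

```python
from itertools import combinations

import numpy as np
from scipy.stats import wasserstein_distance

def emd_per_batch_pair(values, batches):
    # 1-D EMD between every pair of batches for one marker within
    # one cluster; compute_emd aggregates these across clusters
    # and markers.
    values = np.asarray(values, dtype=float)
    batches = np.asarray(batches)
    return {
        (b1, b2): wasserstein_distance(values[batches == b1],
                                       values[batches == b2])
        for b1, b2 in combinations(np.unique(batches), 2)
    }
```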

Detecting batch effects

Before deciding to correct, run the diagnostic plots to confirm there is a batch effect worth correcting:

figs = pc.detect_batch_effect_express(
    adata, batch_key="batch", sample_key="sample",
    downsample=10000, out_dir="before/",
)

# Or the full report (adds UMAP + per-marker MAD):
figs = pc.detect_batch_effect(
    adata, batch_key="batch", sample_key="sample",
    downsample=10000, out_dir="before/", seed=434,
)

Both functions return a dict of matplotlib figures and optionally save them as PNGs under out_dir. See the Detecting batch effects (Python port) notebook for a walkthrough.

Plotting helpers

from cycombinepy import plotting as pcpl
from cycombinepy.correct import CORRECTED_LAYER

pcpl.plot_dimred(adata, kind="umap", color="batch",
                 layer=CORRECTED_LAYER)
pcpl.plot_density(adata, batch_key="batch", layer=CORRECTED_LAYER)
pcpl.plot_emd_heatmap(emd_df)

These are thin, opinionated wrappers around scanpy.pl.umap and seaborn.kdeplot that handle the before/after layer logic. For fine control, call scanpy/seaborn directly.