Data analysis

AI-assisted exploration, QC, and interpretation on scRNA-seq data

Note: Learning objectives
  • Use AI to accelerate exploratory single-cell analysis without outsourcing the scientific judgement.
  • Build a QC pass for an scRNA-seq dataset with AI assistance, while owning every threshold.
  • Evaluate AI-suggested clustering parameters and annotation calls critically.
Tip: How this page relates to Module 3

Module 3 is the runnable procedure. This page is the AI-fluency lens on those same steps: where AI helps, where it fails, and which decisions stay yours.

Where AI helps most, and least

AI helps with: boilerplate I/O (sc.read_10x_mtx, sc.datasets.pbmc3k, mt-flagging); standard pipeline scaffolds (Scanpy, Seurat); plot recipes (violin, scatter, UMAP, dotplot); and explaining unfamiliar output (“what does this scrublet histogram mean?”).

AI falls short on: tissue-specific thresholds (the textbook 5% mt cutoff is calibrated for PBMCs; neurons and muscle break that rule); reading clustering for biology (Leiden gives you clusters, but identity comes from markers and prior knowledge); and borderline cases (a high-mt tail can be ambient RNA, dying cells, real biology, or all three at once; the AI will pick one story and tell it confidently, so don't let it decide for you).

A template workflow

  1. Describe the dataset to the AI: platform, organism, tissue, expected cell types, library version, total cells. This is the Description D in practice.
  2. Ask for a QC plan, not a QC answer. Critique the plan. Then execute.
  3. Run each step yourself, or via an AI coding assistant you supervise. Look at every plot.
  4. Write interpretations in your own words. Use AI for prose polishing, not reasoning.

Worked example: QC on 10x PBMC 3k

The course dataset is the 10x PBMC 3k: 2,700 PBMCs from a healthy donor, 10x Chromium v1, about 8 well-characterised cell types. Scanpy bundles it as sc.datasets.pbmc3k(). The runnable code is in Module 3. The commentary below proceeds in the same order.

Step 1: frame the dataset for the AI

import scanpy as sc
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
print(adata)              # n_obs × n_vars = 2700 × 32738

The AI writes the load, dedup, and print pattern from a one-line description. Your job is to tell the AI explicitly that this is human PBMCs from a 10x Chromium v1 run with a healthy donor, expecting about 8 cell types. Without that context every later suggestion (mt threshold, doublet rate, marker panel, organism prefix) is a guess.
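One way to keep that framing consistent across prompts is to hold the dataset description in a small dict and prepend it to every request. This is a hedged sketch, not part of the Module 3 code; the dict keys and the `prompt_header` name are illustrative assumptions.

```python
# Reusable dataset context for prompt framing (names are illustrative).
dataset_context = {
    'organism': 'human',
    'tissue': 'PBMC',
    'platform': '10x Chromium v1',
    'donor': 'healthy',
    'expected_cell_types': 8,
    'n_cells': 2700,
}

# Flatten into a one-line header to paste at the top of each prompt.
prompt_header = ", ".join(f"{k}={v}" for k, v in dataset_context.items())
```

Pasting the same header into every prompt means the AI never has to guess the organism, platform, or expected complexity.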

Step 2: calculate QC metrics

adata.var['mt'] = adata.var_names.str.startswith('MT-')   # human prefix
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None,
                           log1p=False, inplace=True)

The prefix is MT- for human and mt- for mouse. An AI that does not know your organism will pick one. On a mouse dataset it silently flags zero mt-genes. Always specify organism in the prompt. Always confirm adata.var['mt'].sum() > 0 before trusting the result.
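The silent-failure mode is easy to demonstrate without Scanpy. The sketch below stands in for `adata.var_names` with toy gene lists (an assumption for illustration); the point is that the wrong prefix returns zero matches rather than an error.

```python
def count_mt_genes(var_names, prefix='MT-'):
    """Count genes whose name starts with the mitochondrial prefix."""
    return sum(name.startswith(prefix) for name in var_names)

# Toy gene lists standing in for adata.var_names (assumption).
human_genes = ['MT-CO1', 'MT-ND1', 'CD3D', 'MS4A1']
mouse_genes = ['mt-Co1', 'mt-Nd1', 'Cd3d', 'Ms4a1']

assert count_mt_genes(human_genes, 'MT-') == 2   # human prefix matches
assert count_mt_genes(mouse_genes, 'MT-') == 0   # silent failure on mouse
assert count_mt_genes(mouse_genes, 'mt-') == 2   # correct mouse prefix
```

In a real run, the equivalent check is `adata.var['mt'].sum() > 0`, exactly as the text says.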

Step 3: read the violin plots

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

The AI can describe what these metrics mean. It cannot read your plots. Look at the shape:

  • A clean PBMC 3k violin has a tight n_genes_by_counts centred at 1,000 to 1,500, a thin upper tail, and pct_counts_mt mostly below 5%.
  • Bimodal n_genes_by_counts suggests a possible doublet population, cell-type imbalance, or empty-droplet contamination.
  • A long upper total_counts tail without a corresponding n_genes_by_counts shoulder suggests ambient RNA rather than doublets.

“What threshold should I use?” returns the textbook answer. “Given a clean unimodal violin centred at 1,200 genes with a thin upper shoulder above 2,500, what max-genes threshold removes plausible doublets without trimming real cells?” returns a calibrated one.
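Numbers help you phrase that calibrated question. A sketch of the summary statistics to read alongside the violins, using synthetic values in place of `adata.obs['n_genes_by_counts']` (an assumption so the snippet is self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic n_genes_by_counts standing in for adata.obs values (assumption:
# a clean unimodal PBMC-like distribution centred near 1,200).
n_genes = rng.normal(1200, 250, 2700).clip(200, None)

# The numbers that turn "what threshold?" into a calibrated prompt:
# centre, spread, and where the upper shoulder starts.
median = np.median(n_genes)
q99 = np.quantile(n_genes, 0.99)
print(f"median={median:.0f}, 99th percentile={q99:.0f}")
```

Quoting the median and upper percentile in the prompt is what moves the AI from the textbook answer to a dataset-specific one.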

Step 4: choose thresholds and filter

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]

The AI drafts the four lines. Every number is yours. PBMCs tolerate a 5% mt cutoff because PBMCs are tough. Neurons and muscle regularly show 15 to 25% mt in healthy cells, and a 5% cutoff would silently delete most of the experiment. Thresholds are Discernment. Code is Delegation.
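If you want a data-driven starting point rather than a textbook constant, one common heuristic is a median-plus-MADs cutoff. This is a sketch of that heuristic, not the course's prescribed method; it assumes a unimodal QC metric whose outliers sit in the upper tail, and the synthetic `pct_mt` values are illustrative.

```python
import numpy as np

def mad_upper_threshold(values, n_mads=3.0):
    """Upper cutoff at median + n_mads * scaled MAD.
    A data-driven alternative to fixed textbook numbers (assumption:
    unimodal metric with outliers in the upper tail)."""
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return med + n_mads * 1.4826 * mad   # 1.4826 scales MAD to ~sigma

rng = np.random.default_rng(1)
# Synthetic pct_counts_mt standing in for adata.obs values (assumption).
pct_mt = np.abs(rng.normal(2.5, 1.0, 2700))
cutoff = mad_upper_threshold(pct_mt)
```

The output is a starting point to inspect against the violin, not a number to accept blindly; the threshold is still yours.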

Step 5: doublet detection (Scrublet)

import scrublet as scr
scrub = scr.Scrublet(adata.X)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
adata.obs['doublet_score'] = doublet_scores
adata.obs['predicted_doublet'] = predicted_doublets

Ask whether the predicted doublet rate matches your loading concentration. PBMC 3k loaded about 3,000 cells, with an expected rate of 2 to 4%. If Scrublet flags 12%, something is off: either an over-aggressive call or an over-loaded input. Either way, that is a flag for you, not for the AI to silently accept.
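That sanity check can be written down. The sketch below encodes the 10x rule of thumb of roughly 0.8% doublets per 1,000 cells loaded (an assumption; check your platform's documentation) and flags a Scrublet call that is far above expectation.

```python
def doublet_rate_check(n_predicted, n_cells, loaded_cells=3000,
                       rate_per_1000=0.008):
    """Compare the predicted doublet fraction against the 10x rule of
    thumb (~0.8% per 1,000 cells loaded; an assumption, not a law).
    Returns (observed, expected, flag) where flag means 'investigate'."""
    expected = loaded_cells / 1000 * rate_per_1000
    observed = n_predicted / n_cells
    return observed, expected, observed > 2 * expected

# A Scrublet call of 330 doublets in 2,700 cells (~12%) vs ~2.4% expected.
obs, exp, flagged = doublet_rate_check(n_predicted=330, n_cells=2700)
```

With these inputs `flagged` is True: either the call is over-aggressive or the input was over-loaded, and either way it is your decision point.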

Step 6: normalise, log-transform, and select HVGs

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]

target_sum=1e4 is convention, not law. It is appropriate when it matches your median library size. HVG defaults are also convention. Expect about 1,800 to 2,000 HVGs for PBMC 3k. If you get 200 or 8,000, the defaults are wrong and you retune.
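The retune trigger can be made explicit. A minimal sketch, assuming the expected HVG band of roughly 1,500 to 2,500 for PBMC 3k (the band itself is a judgement call to tune per dataset):

```python
def hvg_count_ok(n_hvg, expected=(1500, 2500)):
    """Flag HVG selection that lands far outside the expected band
    (~1,800-2,000 for PBMC 3k; the band is an assumption to tune)."""
    low, high = expected
    return low <= n_hvg <= high

assert hvg_count_ok(1900)        # in band: defaults plausible
assert not hvg_count_ok(200)     # far too few: defaults too strict, retune
assert not hvg_count_ok(8000)    # far too many: defaults too loose, retune
```

In a real run, `n_hvg` would be `int(adata.var.highly_variable.sum())`.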

Step 7: embed and cluster

sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color='leiden')

This is the most useful and most dangerous step. Useful, because the AI remembers the boilerplate, suggests reasonable starting parameters, and explains why n_pcs matters. Dangerous, because it will name your clusters from Leiden IDs alone if you let it. That is a confident hallucination: identity comes from markers and biological context, not cluster numbers. Choose resolution, n_neighbors, and n_pcs yourself. Run the elbow plot. Look at the UMAP. Ask whether the cluster count matches your expected cell-type count (about 8 for PBMC 3k). Then hand off to Module 5 for annotation: by markers, not by guessing.
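The elbow-plot decision can also be approximated numerically. A sketch of one simple heuristic, assuming the variance ratios come from `adata.uns['pca']['variance_ratio']` after `sc.tl.pca` (here replaced by a synthetic decaying spectrum so the snippet is self-contained); it is a starting point to compare against the plot, not a replacement for looking at it.

```python
import numpy as np

def elbow_n_pcs(variance_ratio, threshold=0.9):
    """Smallest number of PCs capturing `threshold` of the variance
    explained by the computed components (one simple heuristic among
    several; always cross-check against the elbow plot itself)."""
    cum = np.cumsum(variance_ratio) / np.sum(variance_ratio)
    return int(np.searchsorted(cum, threshold) + 1)

# Synthetic decaying spectrum standing in for real PCA output (assumption).
vr = 1.0 / np.arange(1, 51) ** 1.5
n_pcs = elbow_n_pcs(vr)
```

You would still pass your own chosen value to `sc.pp.neighbors(adata, n_pcs=...)`; the heuristic only anchors the discussion with the AI.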

Tip: Where the 4 D’s showed up
  • Description: telling the AI it is human PBMCs from 10x Chromium v1 so it picks the right mt prefix and doublet expectation.
  • Discernment: reading violin plots, doublet score distribution, and UMAP shape yourself, and rejecting one-size-fits-all thresholds.
  • Diligence: owning every threshold and parameter, and documenting choices in the notebook for the disclosure rubric.
  • Delegation: boilerplate I/O, plotting, and per-cell metric calls, where AI is reliable when given organism context.

Common failure modes

  • Trusting the textbook threshold. PBMC 5%, neuron 20%, muscle 25%. Ask for tissue-specific reasoning, not a universal answer.
  • Asking for a QC answer instead of a QC plan. “Is this dataset clean?” is the wrong question. “What evidence would I need to call this dataset clean?” is the right one.
  • Letting the AI name your clusters. Cluster identity is markers and prior knowledge, not Leiden numbers.
  • Wrong organism prefix. MT- for human, mt- for mouse. A mismatch silently flags zero mt-genes and corrupts every threshold downstream.
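The prefix failure mode is cheap to guard against in code. A hedged sketch of a lookup that fails loudly instead of silently, assuming standard HGNC/MGI-style gene symbols (the helper name and its organism keys are illustrative):

```python
def mt_prefix(organism):
    """Mitochondrial gene-name prefix for common organisms (assumption:
    standard HGNC/MGI-style symbols). Raises instead of guessing."""
    prefixes = {'human': 'MT-', 'mouse': 'mt-'}
    try:
        return prefixes[organism.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown organism {organism!r}: check your annotation's "
            "mitochondrial naming before flagging")

assert mt_prefix('human') == 'MT-'
assert mt_prefix('Mouse') == 'mt-'
```

Failing loudly on an unknown organism is the point: a wrong-but-plausible prefix corrupts every downstream threshold without raising anything.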

Exercises

  1. Load PBMC 3k. Without consulting an AI, sketch a QC plan in five bullets before any code. Compare with what the AI suggests. Where do you disagree?
  2. Take a small dataset of your own, or a public one from your subfield. Frame it for an AI in three sentences (platform, organism, expected biology). Then ask for a QC plan. Critique it before running.
  3. Run the worked example end-to-end. Record one decision where you overrode the AI’s first suggestion, and one where you accepted it. The disclosure rubric for the Week 3 mini-project asks for both.
  4. You are running QC on a mouse lung scRNA-seq dataset. The AI suggests adata.var['mt'] = adata.var_names.str.startswith('MT-'). What goes wrong, and what is the fix?
  5. Your PBMC dataset shows a pct_counts_mt median of 12% with a tail to 25%. The AI recommends a 5% cutoff “as standard for PBMCs”. Should you accept it? Why or why not?
  6. Scrublet on a 3,000-cell PBMC sample flags 11% as predicted doublets. What does that flag, and what would you check before filtering?

Answers:

  4. The mouse mt-gene prefix is mt- (lowercase), not MT-. The flag column will be all False, pct_counts_mt will be 0 for every cell, and you will silently fail to filter dying cells. Fix: adata.var_names.str.startswith('mt-').
  5. No. The suggestion ignores your data. A median of 12% means most cells would be filtered. Either your dissociation was harsh (rerun upstream) or 5% is the wrong cutoff for this sample. Inspect the violin and pick a threshold above the main body but below the dying-cell tail.
  6. The expected doublet rate for about 3,000 loaded cells is 2 to 4%. 11% is roughly three times that. Check the loading concentration (was it over-loaded?), Scrublet’s expected-doublet-rate parameter, and whether two real but transcriptionally similar cell types are being misread as doublets.

Further reading