Week 3: scRNA-seq I (QC to clustering)
AI-assisted single-cell QC, normalisation, and clustering on PBMC 3k
- Load a 10x scRNA-seq count matrix into Scanpy and exercise the Description D by framing the dataset for an AI assistant.
- Apply tissue-aware QC thresholds and exercise the Discernment D by reading the violin plots yourself before accepting an AI suggestion.
- Build the PCA, neighbours, UMAP, and Leiden pipeline, using AI for boilerplate while owning parameter choices.
- Produce an AI-free baseline of the QC pass and compare it to your AI-assisted iteration.
Suggested pacing
This is a heavier unit than Weeks 1 and 2. Plan on five to seven hours.
| Chunk | What | Time |
|---|---|---|
| 1 | Tooling setup: confirm Colab works, or a local environment loads PBMC 3k (run the Python starter) | 30 min |
| 2 | Read code assistance and data analysis | 1 hr |
| 3 | Skim Modules 1 and 2; work through Module 3 in Colab | 1.5 hrs |
| 4 | Work through Module 4 in Colab | 1 hr |
| 5 | Hands-on practice exercises | 1 hr |
| 6 | Knowledge check | 30 min |
| 7 | Project: PBMC 3k mini-project (AI-free baseline plus AI-assisted iteration) | 2 hrs (often more; give it a day) |
If you have not used Scanpy before, add about an hour for orientation.
Setup
- A working Google Colab account (the course default; Modules 3 to 5 run in the browser, no install needed), or a local Python 3.11 or later environment with `scanpy`, `scrublet`, `leidenalg`, `pandas`, `numpy`, and `matplotlib`, pinned with `conda` or `uv`.
- A coding assistant set up and working in your editor (Claude Code, Cursor, or VS Code with Copilot).
- The course dataset confirmed loadable on your machine. Run the Python starter before starting Module 3.
Course dataset
Weeks 3 and 4 use the 10x PBMC 3k dataset: 2,700 peripheral blood mononuclear cells from a healthy donor, sequenced on Illumina NextSeq 500 with 10x Chromium v1. The matrix is bundled with Scanpy and loads in one call:
```python
import scanpy as sc

adata = sc.datasets.pbmc3k()
# AnnData object with n_obs × n_vars = 2700 × 32738
```

PBMC 3k is the “hello world” dataset for single-cell analysis. It is small, well-characterised, and runs end-to-end in a free Colab session. Its eight canonical cell types span T, B, and NK cells, monocytes, and dendritic cells. We use it for both the Week 3 mini-project and the Week 4 final project.
Run the Python starter to confirm the data loads on your machine. The starter stops at “load, confirm shape, one summary table”. The mini-project baseline is yours to write (see the AI-use policy).
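If you want to sanity-check your setup before Module 3, a minimal sketch of the starter’s “load, confirm shape, one summary table” steps might look like this (the function name is illustrative, not the starter’s actual code; assumes Scanpy is installed):

```python
def load_and_summarise():
    """Load PBMC 3k, confirm the shape, and print one QC summary table."""
    import scanpy as sc  # deferred import so the sketch reads standalone

    adata = sc.datasets.pbmc3k()
    assert adata.shape == (2700, 32738), "unexpected matrix shape"

    # Flag mitochondrial genes (human symbols use the uppercase 'MT-' prefix)
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

    # One summary table: the per-cell QC metrics the Module 3 violins will show
    print(adata.obs[["n_genes_by_counts", "total_counts", "pct_counts_mt"]].describe())
    return adata
```

If the shape assertion or the `pct_counts_mt` column surprises you, that is exactly the calibration signal Exercise 3.1 is after.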
Readings
Read the conceptual pages first, then run the modules.
- Code assistance. Focus on the debugging pattern (reproducer, minimal example, ask) and the AI-assisted test-writing example. This is how you’ll fix things when Module 3 inevitably breaks somewhere.
- Data analysis. Focus on the AI-fluency lens on QC. The page walks through the same workflow Module 3 runs, but with explicit accept/reject reasoning at each AI suggestion.
- Module 1: raw data QC. FastQC interpretation for 10x reads. Reading only, no run required.
- Module 2: alignment and count matrix. Cell Ranger or STARsolo. Reading only unless you have HPC and want to run it.
Then run, in Colab or your local environment:
- Module 3: preprocessing in Scanpy. Load, QC metrics, violin plots, threshold filter, normalise, log-transform, and HVG selection.
- Module 4: clustering and UMAP. PCA, neighbours, UMAP, Leiden.
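The Module 4 steps chain together in a fixed order, since each consumes the previous step’s output. As a sketch (the parameter values are common defaults, not prescriptions; the function name is illustrative):

```python
def cluster_pbmc(adata, n_pcs=50, n_neighbors=15, resolution=0.5):
    """Standard Scanpy clustering chain: PCA -> neighbours -> UMAP + Leiden.

    Assumes `adata` is already normalised, log-transformed, and HVG-subset.
    """
    import scanpy as sc  # deferred import so the sketch reads standalone

    sc.pp.scale(adata, max_value=10)              # zero-mean, unit-variance per gene
    sc.tl.pca(adata, n_comps=n_pcs)               # denoise into n_pcs axes
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs)  # kNN graph in PCA space
    sc.tl.umap(adata)                             # 2D projection of the kNN graph
    sc.tl.leiden(adata, resolution=resolution)    # clusters on the same graph
    return adata
```

Note that UMAP and Leiden both read the neighbour graph, not each other, which is why the knowledge check asks what breaks if you reorder them.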
Hands-on practice
These exercises surface where the AI’s defaults will fail you on real data. Do each one before moving on.
Exercise 3.1: predict, then run
Before running the Python starter, write down somewhere visible (on paper, in a notebook cell, anywhere): what you expect `adata.shape` to be, what `adata.var_names[:5]` will look like, and what range you expect `pct_counts_mt` to fall in for healthy donor PBMC. Then run the starter and compare.
You’re calibrated if you got at least two of three right. Common surprises:
- The matrix is `n_obs × n_vars = 2700 × 32738`. The var dimension is the full transcriptome, not just expressed genes; most of those columns are zero.
- The var names are gene symbols (uppercase for human), not Ensembl IDs.
- For healthy PBMC, the `pct_counts_mt` median should be at most 5%, with a tail to about 10%. If your prior was “20% or more”, you were thinking of a damaged or stressed dataset.
The exercise isn’t a quiz. It is about noticing whether your prior matches the data. If you weren’t sure what to predict, that is the signal. Read the data analysis page before doing the project.
Exercise 3.2: threshold defence
In Module 3 you choose threshold values for `pct_counts_mt`, `n_genes_by_counts`, and minimum cells per gene. For one of those thresholds, write a one-paragraph defence (eight sentences or fewer) of the value you chose: what does the violin plot show, what tissue-specific knowledge informed the threshold, and what would change your mind?
A good defence references the histogram you actually plotted, not a generic “5% is standard”. Strong defences include:
- A specific value drawn from the data (median, tail point), not a literature default.
- Tissue-specific reasoning. PBMC is a healthy fresh-frozen sample, so mitochondrial enrichment above 10% likely indicates dying cells, not biology.
- A counterfactual. What would the violin look like for you to choose 8% instead of 5%? “If the median were 8% with a heavy tail to 25%, I’d suspect a dissociation issue and check upstream QC.”
A weak defence: “I used 5% because the AI suggested it and the data analysis page used 5%”. That is not Discernment. That is deference.
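One way to anchor a defence in numbers rather than eyeballing the violin is to pull the distribution’s summary points straight from `adata.obs`. A sketch using a synthetic `pct_counts_mt` series (swap in your own `adata.obs['pct_counts_mt']`; the gamma parameters below are made up to mimic a healthy run):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for adata.obs["pct_counts_mt"] on a healthy PBMC run
rng = np.random.default_rng(0)
pct_mt = pd.Series(rng.gamma(shape=2.0, scale=1.5, size=2700), name="pct_counts_mt")

median = pct_mt.median()
p95, p99 = pct_mt.quantile([0.95, 0.99])

# A data-driven cutoff: the 99th percentile, capped at a tissue-informed ceiling
cutoff = min(p99, 10.0)
kept = (pct_mt < cutoff).mean()

print(f"median={median:.1f}%  p95={p95:.1f}%  p99={p99:.1f}%  cutoff={cutoff:.1f}%")
print(f"cells kept at cutoff: {kept:.1%}")
```

Quoting the actual median and tail points, plus the fraction of cells a cutoff discards, is exactly the kind of specific evidence the rubric rewards over “5% is standard”.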
Exercise 3.3: AI-assisted vs. AI-free diff
After completing your AI-free baseline (about 25 lines, no AI) and the AI-assisted iteration (Modules 3 and 4 with AI help), list three substantive differences between the two outputs. For each, classify it as:
- Improvement: the AI suggestion was right and you accepted it.
- Regression: the AI suggestion was wrong and you (should have) rejected it.
- Equivalent: different but both defensible.
A useful diff list is specific and reproducible. Examples of strong entries:
- “AI used `sc.pp.normalize_total(adata, target_sum=1e4)` instead of my baseline’s `target_sum=None`. Improvement. An explicit target sum is the convention in most tutorials and makes downstream comparisons stable across cells.”
- “AI omitted my `min_genes=200` filter on cells. Regression. Without it, the HVG selection picks up empty droplets as ‘cells with unique expression patterns’, which contaminates the PCA.”
- “AI used 30 PCs for the neighbour graph; I used 50. Equivalent. The elbow plot is ambiguous between 20 and 50, and both produce stable Leiden clusters at resolution 0.5.”
If your diff list is just “the AI version is shorter” or “the AI version uses different variable names”, you didn’t actually compare outputs. Go back and run both pipelines through the same downstream check (Leiden cluster count, top markers per cluster).
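For the downstream check, a cluster-vs-cluster contingency table is the quickest way to see whether two runs disagree on structure or merely on labels. A sketch with made-up labels (substitute `adata_baseline.obs['leiden']` and `adata_ai.obs['leiden']`, aligned on the same cells):

```python
import pandas as pd

# Stand-ins for the Leiden labels from your two runs, one entry per cell
baseline = pd.Series(["0", "0", "1", "1", "2", "2", "2", "3"], name="baseline")
ai_run   = pd.Series(["0", "0", "1", "1", "2", "2", "4", "3"], name="ai_run")

# Rows: baseline clusters; columns: AI-run clusters. Mass concentrated in one
# column per row means relabelling or sub-splitting; scattered mass means the
# two runs genuinely disagree about which cells belong together.
table = pd.crosstab(baseline, ai_run)
print(table)

n_baseline = baseline.nunique()
n_ai = ai_run.nunique()
print(f"cluster counts: baseline={n_baseline}, ai={n_ai}")
```

Here the AI run splits baseline cluster 2 into clusters 2 and 4, which is the “sub-cluster of an existing one” pattern: a resolution effect, usually defensible.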
Knowledge check
1. The AI suggests `adata.var['mt'] = adata.var_names.str.startswith('MT-')` for a mouse lung dataset. What goes wrong, and what is the fix?
2. Your PBMC dataset shows a `pct_counts_mt` median of 12% with a tail to 25%. The AI recommends a 5% cutoff “as standard for PBMCs”. Should you accept it? Why or why not?
3. Scrublet on a 3,000-cell PBMC sample flags 11% as predicted doublets. What does that flag, and what would you check before filtering?
4. The standard Scanpy clustering pipeline is PCA, neighbours, UMAP, then Leiden. Why this order, and what goes wrong if you compute UMAP before neighbours?
5. You ran `sc.pp.highly_variable_genes` on un-normalised data and got plausible-looking genes back. Why is this a bug even though the output looks fine?
6. The AI generates a UMAP and tells you “cluster 4 is T cells based on the embedding shape”. What is the Discernment problem with this sentence?
7. You’ve done your AI-free baseline, then run the AI-assisted iteration, and the cluster count differs (8 vs. 11). The AI says “this is normal variability”. What do you check before accepting that explanation?
Answers:

1. Mouse mitochondrial genes are `mt-Nd1`, `mt-Co1`, and so on, with a lowercase prefix. The `'MT-'` filter matches zero genes in mouse, so your `pct_counts_mt` is silently 0 for every cell, and no cells get filtered for mitochondrial enrichment. Fix: use `'mt-'` for mouse, or detect dynamically with `startswith(('MT-', 'mt-'))`. This is one of the most common organism-convention bugs in scRNA-seq.
2. No. The 5% rule is a heuristic for healthy PBMC. If your median is 12%, the dataset is either stressed or damaged at sample prep, or contains a tissue with naturally higher mitochondrial expression (some immune cells run hotter). Accepting 5% would discard most of your data. Plot the violin, look at the distribution, and pick a cutoff that splits “biological tail” from “dying cells”, likely 20 to 25% here, with an upstream check on what happened at dissociation.
3. Scrublet flags probable doublets: droplets that received two cells. 11% is high for a 3,000-cell sample (target loading is usually below a 5% doublet rate). Before filtering, check: (a) was the 10x loading concentration too high? (b) does Scrublet’s UMAP place the predicted doublets in a “between-cluster” position consistent with multiplets? (c) are any predicted-doublet clusters actually rare-but-real cell types? Filtering blindly at 11% may delete real biology.
4. The order encodes information dependencies. PCA denoises the high-dimensional gene-expression matrix into a few dozen meaningful axes. Neighbours builds a kNN graph in PCA space, which is what UMAP and Leiden both consume. UMAP projects the kNN graph to 2D for visualisation. Leiden clusters on the same kNN graph, independent of UMAP. Computing UMAP before neighbours doesn’t make sense; UMAP needs the graph as input. Computing Leiden on raw gene space (skipping PCA) is a common mistake and produces noise-dominated clusters.
5. HVG selection on un-normalised data picks up genes with high variance because they have high counts, not because they have biologically informative variability. Library-size effects dominate. The output “looks fine” because real biology genes (B-cell markers, T-cell markers, and so on) are also highly expressed, so they survive; but so do housekeeping genes that vary purely with sequencing depth. Always normalise and log-transform before HVG selection. The Scanpy docs are explicit about this; the AI sometimes is not.
6. UMAP shape is not biological evidence. Cluster 4 might be T cells, but the way to know is to look at marker genes in cluster 4 (`CD3D`, `CD3E`, `TRAC`), not “the embedding looks like a T-cell blob”. UMAP geometry is dominated by neighbour-graph topology, not cell biology. Treating shape as identity is exactly the kind of pattern-match the AI is good at and you have to override.
7. Cluster count is sensitive to resolution (Leiden’s main hyperparameter), to the number of PCs, to the number of neighbours in the kNN graph, and to random seeds. Before accepting “normal variability”, check: was the seed the same? Same Leiden resolution? Same PC count? Same neighbour count? If all four match and the count still differs, look at what is in the new clusters. Are they sub-clusters of an existing one (a resolution effect, often defensible) or completely different cells (a real disagreement, worth investigating)? “Normal variability” is rarely the right answer when the numbers differ and the inputs do not.
Project: PBMC 3k mini-project
A scoped AI-assisted scRNA-seq analysis. Produce a reproducible Quarto or Jupyter notebook (or Colab link) containing:
- AI-free baseline (about 25 lines). Hand-write `sc.datasets.pbmc3k()`, an mt-gene flag, three QC violin plots, a threshold filter, and log-normalisation. Run it. Save the output. No AI assistance for this section. See the AI-use policy.
- AI-assisted iteration. Extend the baseline with HVG selection, scaling, PCA, neighbours, UMAP, and Leiden clustering. AI assistance is encouraged here. Use Module 4 as your reference.
- Diff and discussion. List every substantive AI suggestion. For each, mark accepted or rejected and note why. The strength of your work here is in the quality of your discernment. Accepting and rejecting both have to be defensible.
- Reproducibility artifact. A pinned environment file (`environment.yml` or `uv.lock`) committed alongside the report, or a working Colab link. State which.
- Disclosure statement per the rubric.
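If you go the `environment.yml` route, a minimal pinned file might look like the sketch below. The version pins are illustrative, not requirements: pin whatever your working environment actually reports, and if a package is missing from your channels, list it under a `pip:` subsection instead.

```yaml
name: scrna-week3
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - scanpy
  - scrublet
  - leidenalg
  - pandas
  - numpy
  - matplotlib
```

Recreate with `conda env create -f environment.yml`, and verify the pins match your run before committing.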
Self-rubric: PBMC 3k mini-project
| Dimension | 0 | 1 |
|---|---|---|
| AI-free baseline correctness | Doesn’t run, missing required steps, or shows clear AI fingerprints (style or idioms not yours) | About 25 lines, runs end-to-end, hand-written, with three violins, a filter, and log-normalisation |
| AI-assisted iteration quality | Steps in wrong order: HVG before normalise, UMAP without neighbours, and so on | Pipeline runs in the correct order, produces stable Leiden clusters, and renders UMAP plus cluster labels |
| Discernment log | Generic (“accepted AI suggestions”), with no rejected suggestions named | At least 3 specific accepts and 1 specific reject, each with a one- or two-sentence reason |
| Reproducibility | No environment artifact, or a Colab link that fails to open | A pinned env file or a working Colab link committed. Specify which. |
| Disclosure | Vague or absent | Tool, version, tier, concrete uses, what was verified, and at least one rejected suggestion |
Score 4 or 5 of 5: project is done. Score 2 or 3 of 5: pick one row at a 0 and fix it before moving on. Score 0 or 1 of 5: revisit Module 3 and the data analysis reading. The project is testing your discernment, not your typing.
Going further
- If you want to scope a Week 4 final project that builds directly on this one (Path A), draft a one-paragraph plan now, while the dataset is fresh. Note the question, the expected approach, and what the AI will and will not help with. The plan is for you. It costs nothing to write and saves a lot of Week 4.
- For a deeper take on QC choices, the original Scanpy tutorial is short and the prose is excellent.
- If you want to swap in a different dataset to feel ownership of the choices, the Tabula Muris organ datasets are good next steps. But PBMC 3k is the dataset the rest of the course is calibrated against.