Week 3 starter (Python)

Load PBMC 3k, confirm shape, render one summary table

Important: What this is, and what it is not

This starter loads the PBMC 3k dataset, confirms the AnnData is what you think it is, and renders one summary table. It does not perform the QC analysis. The QC analysis is the Week 3 mini-project’s AI-free baseline: you write it yourself, by hand, before you involve an AI assistant. See the syllabus AI-use policy.

If you copy from this starter into your baseline, you are short-circuiting the assignment. The starter exists so you do not waste the first 30 minutes of the workshop on installs and shapes.

Setup

The course default is Google Colab. Open colab.research.google.com, create a new notebook, and run the cells below. On the first run, install Scanpy:

!pip install -q scanpy

If you prefer a local install, the repo ships pinned environments. Pick one:

# conda (recommended for Apple Silicon, where igraph wheels can misbehave on pip)
conda env create -f environment.yml
conda activate ai-fluency-for-bio

# or pip / uv
uv pip install -r requirements.txt

environment.yml and requirements.txt install the full course stack (Scanpy, scrublet, leidenalg, igraph, jupyterlab), enough for every starter, module, and the Week 3 mini-project. You only need the bare pip install scanpy above if you are running this single starter on Colab and nothing else.

Load the dataset

Scanpy ships a loader for PBMC 3k, so the load is one call (the file is downloaded and cached automatically on first use, so there is no download to manage yourself):

import scanpy as sc

adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()

Confirm the shape

print(adata)
# expect: AnnData object with n_obs × n_vars = 2700 × 32738
assert adata.n_obs == 2700, "unexpected cell count"
assert adata.n_vars == 32738, "unexpected gene count"

If either assertion fails, stop. Either Scanpy’s bundled dataset has changed in a new version (check sc.__version__) or you loaded the wrong matrix. The downstream baseline will produce nonsense if the shape is not what you expect.
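
If an assertion does fail, it can help to collect the facts before asking for help. The following is a minimal sketch, not part of Scanpy or the course repo: the helper name `describe_mismatch` is hypothetical, and it only reads the `n_obs` and `n_vars` attributes, so a stand-in object is used here to keep the example runnable without Scanpy.

```python
from types import SimpleNamespace

EXPECTED_SHAPE = (2700, 32738)  # cells x genes, as asserted above


def describe_mismatch(adata, expected=EXPECTED_SHAPE):
    """Return a list of human-readable shape problems (empty if none).

    Hypothetical helper for illustration; reads only n_obs and n_vars.
    """
    problems = []
    if adata.n_obs != expected[0]:
        problems.append(f"cells: got {adata.n_obs}, expected {expected[0]}")
    if adata.n_vars != expected[1]:
        problems.append(f"genes: got {adata.n_vars}, expected {expected[1]}")
    return problems


# Stand-in object with a made-up gene count, so the sketch runs anywhere.
fake = SimpleNamespace(n_obs=2700, n_vars=2000)
print(describe_mismatch(fake))
# → ['genes: got 2000, expected 32738']
```

In the notebook you would call `describe_mismatch(adata)` and report the result together with `sc.__version__`.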

One summary table

import pandas as pd

# For each cell, how many genes have at least one count?
nonzero_per_cell = (adata.X > 0).sum(axis=1)
nonzero_per_cell = pd.Series(
    nonzero_per_cell.A1 if hasattr(nonzero_per_cell, 'A1') else nonzero_per_cell
)

summary = pd.DataFrame({
    "metric": ["min", "median", "max"],
    "genes_with_counts_per_cell": [
        int(nonzero_per_cell.min()),
        int(nonzero_per_cell.median()),
        int(nonzero_per_cell.max()),
    ],
})
print(summary)

Read the output. The median should be roughly 800 to 1,000, and the maximum should reach a few thousand. The minimum is where to look for trouble: cells with very few detected genes are candidates for empty droplets or dying cells, the kind of cells your QC will filter out. Recognising this is part of the Discernment move you exercise in the baseline.
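
To see what the per-cell count in the table is doing, here is a minimal sketch on a made-up 3 × 4 matrix (the values are illustrative, not PBMC data). It assumes the counts are stored as a SciPy `csr_matrix`, where a row sum comes back as a 2-D `numpy.matrix`; that is the case the starter's `.A1` check handles. `np.asarray(...).ravel()` is used here as an alternative way to flatten the result.

```python
import numpy as np
from scipy import sparse

# Toy count matrix: 3 cells (rows) x 4 genes (columns); values are made up.
counts = sparse.csr_matrix(np.array([
    [0, 2, 0, 1],   # cell 0: 2 genes detected
    [0, 0, 0, 0],   # cell 1: nothing detected
    [5, 1, 3, 7],   # cell 2: all 4 genes detected
]))

# Comparing to 0 gives a boolean matrix; summing along axis=1
# counts the nonzero entries in each row, i.e. detected genes per cell.
per_cell = (counts > 0).sum(axis=1)

# Flatten the 2-D result to a plain 1-D array.
genes_per_cell = np.asarray(per_cell).ravel()
print(genes_per_cell)
# → [2 0 4]
```

The all-zero row is the toy version of a cell that would sit at the very bottom of the real summary table.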

That is the end of the starter. From here, the baseline is yours.

Where to go next