Protocol design

AI as a sparring partner for experimental planning

Note: Learning objectives
  • Use AI to generate, critique, and refine experimental protocols.
  • Apply the 4 D’s to protocol design: delegate the draft, describe the constraints, discern the flaws, take diligent ownership.
  • Avoid the characteristic failure mode: superficially complete but scientifically naive protocols.

Where AI helps

  • Draft generation. A first-pass protocol outline given a goal, a reagent list, and constraints.
  • Checklist generation. Controls you may have forgotten, QC steps, and a back-of-envelope sample-size calculation (a minimal sketch follows this list).
  • Critique. “What could go wrong with this protocol?” is a surprisingly strong prompt.
  • Translation between formats. Lab notebook to methods section to SOP.
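
The sample-size sketch promised above, as a minimal illustration: a two-sample normal approximation in Python. The effect size, alpha, and power are illustrative defaults; a real scRNA-seq design also has to reckon with cells per sample and pseudoreplication, which this ignores.

```python
import math
from scipy.stats import norm

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples per group for a two-sample comparison, normal approximation.

    effect_size is Cohen's d: (difference in means) / (pooled SD).
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# A pilot hoping to detect only a large effect (d = 1.0) at 80% power:
print(n_per_group(1.0))  # -> 16 per group
```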

Where AI reliably falls short

  • Specific reagent choices for your organism, cell line, or platform.
  • Anything that depends on recent protocol updates (a new library-prep chemistry version, for example).
  • Judgements that require tacit lab knowledge (“our freezer is unreliable”, “this supplier is slow”).

A sparring-partner workflow

  1. You draft the scientific question and hypothesis.
  2. AI drafts a candidate protocol given your constraints.
  3. You critique the protocol against your own experience and your PI’s feedback.
  4. AI in critique mode lists failure modes and missing controls.
  5. You reconcile, finalise, and document provenance.
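
For readers who script their AI interactions, the loop above reduces to something like the following sketch. The `ask` helper is hypothetical, a stand-in for whichever chat interface or API you actually use; only steps 2 and 4 are delegated.

```python
# A sketch of the sparring-partner loop. `ask` is a hypothetical stand-in
# for your AI tool of choice; it is not a real API.

def ask(prompt: str) -> str:
    # Placeholder: substitute a call to whatever model you use.
    return f"[model response to: {prompt[:50]}...]"

constraints = "..."  # step 1: your question, platform, budget, timing

draft = ask("Draft a first-pass protocol given these constraints:\n" + constraints)  # step 2
# Step 3 happens offline: you mark up the draft against your own experience.
critique = ask("Critique this protocol. List failure modes and missing controls:\n" + draft)  # step 4
# Step 5 is also yours: reconcile, finalise, and record which change came from where.
print(draft, critique, sep="\n")
```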

Worked example: designing an scRNA-seq pilot

A postdoc wants to characterise the transcriptional states of lung macrophages following influenza A infection in mice. Three time points (Day 0, Day 4, Day 8), two biological replicates each, six samples total, 10x Chromium platform, fresh lung tissue. Here is the sparring-partner workflow from question to finalised protocol outline.

Step 1: describe the question and constraints precisely

The postdoc writes:

I am planning a small scRNA-seq pilot to characterize transcriptional states of lung macrophages following influenza A infection in mice. Timeline: Day 0 (uninfected), Day 4, Day 8 post-infection. Two biological replicates per time point = 6 samples total. Platform: 10x Chromium. Budget: 6 capture reactions. Target cell types: alveolar and interstitial macrophages. Processing window: ~30 minutes after lung harvest. What is a first-pass protocol I should critique?

This is the Description step: cell type, timeline, replicate number, platform, budget, and time constraint all in the same message. Without them the AI defaults to a generic tissue-dissociation protocol that fits no one in particular.
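
One way to make the Description step mechanical, if you build prompts programmatically: collect the constraints as structured fields and render them into a single message so none get dropped. The field names here are illustrative, not a schema; the values mirror the postdoc's prompt above.

```python
# Pack every constraint into one prompt. Field names are illustrative.
constraints = {
    "goal": "scRNA-seq pilot: transcriptional states of lung macrophages "
            "after influenza A infection in mice",
    "time points": "Day 0 (uninfected), Day 4, Day 8 post-infection",
    "replicates": "2 biological replicates per time point (6 samples total)",
    "platform": "10x Chromium",
    "budget": "6 capture reactions",
    "target cells": "alveolar and interstitial macrophages",
    "processing window": "~30 minutes after lung harvest",
}

prompt = "I am planning an experiment.\n" + "\n".join(
    f"- {field}: {value}" for field, value in constraints.items()
) + "\nWhat is a first-pass protocol I should critique?"
print(prompt)
```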

Step 2: AI drafts the candidate protocol

The AI returns a structured protocol that goes from lung harvest through mincing and enzymatic digestion (collagenase IV plus DNase I, 37°C, 30 min), mechanical dissociation, RBC lysis, cell counting, dilution to the target concentration, Chromium capture, library prep, and sequencing. The enzyme choices and RBC lysis step are appropriate. The protocol specifies a cell loading target of 10,000 cells per capture reaction on Chromium v3.1 chemistry.

Both numbers look authoritative. Both require scrutiny.

Step 3: human critique against tacit knowledge

The first place the AI was confidently wrong is the chemistry version. It recommends “Chromium v3.1 chemistry (CG000204)” because that version dominates its training data. The current kit as of late 2024 is the GEM-X Single Cell 3’ v4 (CG000731). These are not drop-in substitutes. GEM-X v4 uses a different bead design, different loading parameters, and different reagent volumes. Using v3.1 cell-loading concentrations on v4 hardware degrades capture efficiency. The AI did not know the product line had changed. It cited a version number with full confidence.

Verify chemistry and loading parameters against the current 10x user guide, not the AI’s output, before ordering reagents.

The second place the AI was wrong is the biology: the cell loading target. 10,000 cells per reaction is the standard recommendation for smaller cell types (PBMCs, tumour organoid dissociations). Macrophages are large cells, around 15 to 20 µm. Loading macrophages at 10,000 per reaction over-represents them in the inlet, increases chip-clogging risk, and elevates doublet rates. A more appropriate target for large primary cells is 5,000 to 7,000 cells per reaction. The AI gave the textbook number, not the cell-type-appropriate one.
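
To see why the loading number matters, a back-of-envelope estimate. The 0.8%-per-1,000-cells multiplet figure below is a commonly quoted rule of thumb for 10x 3’ chemistries, not a value from the current GEM-X v4 guide; treat it as order-of-magnitude and verify against the vendor documentation. Large cells such as macrophages will run higher still, which is the point of loading fewer.

```python
# Rule-of-thumb multiplet estimate: ~0.8% per 1,000 cells recovered.
# An approximation for standard-sized cells, not a vendor-verified figure.
RULE_OF_THUMB = 0.008  # multiplet fraction per 1,000 cells recovered

def expected_multiplet_rate(cells_recovered: int) -> float:
    return RULE_OF_THUMB * cells_recovered / 1000

for n in (10_000, 7_000, 5_000):
    rate = expected_multiplet_rate(n)
    print(f"{n:>6} cells recovered -> ~{rate:.1%} multiplets (~{rate * n:.0f} cells)")
# 10,000 -> ~8.0% (~800 cells); 5,000 -> ~4.0% (~200 cells)
```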

The postdoc also catches that the protocol has no stated cell viability cutoff. Loading low-viability preparations increases ambient RNA contamination, a known confounder in scRNA-seq. She adds a minimum viability requirement (above 85% by trypan blue exclusion or a DAPI gate) before the loading step.

Step 4: AI critique mode

The postdoc sends the draft back:

Critique this protocol. List potential failure modes and controls that are missing.

The AI is usefully right in one place. It flags that the six samples are processed sequentially but come from the same experimental condition per day, and it suggests considering hashtag oligonucleotide (HTO) multiplexing: stain all six samples with distinct HTOs, pool before capture, and demultiplex computationally. This halves the number of Chromium reactions needed (three pooled captures instead of six single-sample ones), reduces batch effects from separate capture events, and enables doublet detection from HTO signal. The postdoc had not planned for this. It is a genuine improvement.
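
For intuition about what demultiplexing involves downstream, a deliberately simplified sketch: CLR-normalise the hashtag counts per cell, then call a singlet only when one hashtag clearly dominates. Real tools (Seurat’s HTODemux, for example) fit per-hashtag background distributions and do much more; the margin threshold here is arbitrary.

```python
import numpy as np

def demultiplex_hto(counts: np.ndarray, margin: float = 1.0) -> list[str]:
    """counts: cells x hashtags matrix of raw HTO counts."""
    log_counts = np.log1p(counts)
    clr = log_counts - log_counts.mean(axis=1, keepdims=True)  # centred log per cell
    order = np.sort(clr, axis=1)
    top, runner_up = order[:, -1], order[:, -2]
    best = clr.argmax(axis=1)
    calls = []
    for i in range(counts.shape[0]):
        if top[i] - runner_up[i] >= margin:
            calls.append(f"HTO{best[i] + 1}")       # one hashtag clearly dominates
        else:
            calls.append("doublet_or_ambiguous")    # two strong hashtags
    return calls

demo = np.array([[250, 3, 5], [4, 180, 2], [120, 95, 3]])
print(demultiplex_hto(demo))  # -> ['HTO1', 'HTO2', 'doublet_or_ambiguous']
```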

The AI also misses something. It does not catch the chemistry version error. It critiques the protocol it generated using v3.1 parameters, and it is not aware that those parameters are outdated.

Step 5: reconcile and document

The finalised protocol outline incorporates:

Change | Source | Rationale
GEM-X v4 (CG000731), not v3.1 | Human (verified against 10x docs) | Product line updated; loading parameters differ
Cell loading 5,000 to 7,000 per reaction | Human (tacit knowledge) | Macrophage size increases doublet risk at 10k
Viability above 85% gate before loading | Human | Controls ambient RNA contamination
HTO multiplexing, 3 pooled captures | AI critique, accepted | Reduces batch effects, enables doublet detection, saves reagent cost
No-enzyme digest negative control | AI critique, rejected | Non-standard for primary-tissue scRNA-seq; cost not justified for a pilot

Document the provenance of each decision. The disclosure rubric will ask which steps were AI-assisted.
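
If your lab keeps protocols under version control, one low-friction way to record provenance is a machine-readable block alongside the protocol, mirroring the table above. The field names are illustrative, not any standard; a record like this answers the disclosure rubric’s question directly.

```python
# One record per decision: what changed, who decided, and why.
provenance = [
    {"change": "GEM-X v4 (CG000731), not v3.1",
     "source": "human, verified against 10x docs",
     "rationale": "product line updated; loading parameters differ"},
    {"change": "cell loading 5,000-7,000 per reaction",
     "source": "human, tacit knowledge",
     "rationale": "macrophage size increases doublet risk at 10k"},
    {"change": "HTO multiplexing, 3 pooled captures",
     "source": "AI critique, accepted",
     "rationale": "fewer batch effects; doublet detection; reagent savings"},
    {"change": "no-enzyme digest negative control",
     "source": "AI critique, rejected",
     "rationale": "non-standard for primary tissue; cost not justified"},
]

for decision in provenance:
    print(f"{decision['change']} [{decision['source']}]")
```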

Tip: Where the 4 D’s showed up
  • Description: specifying cell type, timeline, platform, budget, and processing window in one prompt drove the quality of the AI’s first draft. Generic prompts return generic protocols.
  • Delegation: draft generation, checklist enumeration, and the critique pass are good delegation targets. AI is fast and covers common cases.
  • Discernment: catching the outdated chemistry version (which the AI stated confidently), adjusting cell loading for macrophage biology, and evaluating which critique suggestions to keep and which to drop.
  • Diligence: the final protocol is yours. If an AI-suggested step introduces a failure mode, that is your failure mode. Document provenance, verify reagent versions against current vendor guides, and have your PI sign off before ordering.

Common failure modes

  • Treating the draft as the protocol. The AI’s output is a starting point for critique, not a finished SOP.
  • Accepting version numbers at face value. Chemistry kits, software packages, and instrument firmware all update faster than training data. Always verify version-specific parameters against the current vendor documentation; a minimal version-registry sketch follows this list.
  • Skipping the critique pass. Asking the AI to critique its own output is cheap and often surfaces useful gaps. It does not replace domain-expert review, but it is better than no systematic check.
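
The registry-check sketch promised above: one way to operationalise version verification is a lab-maintained record of when each kit or document version was last checked against vendor documentation, plus a script that flags stale entries. Everything here, the registry, the dates, the threshold, is hypothetical.

```python
from datetime import date

# Hypothetical lab-maintained registry: when each kit/document version
# was last verified against the vendor's current documentation.
REGISTRY = {
    "10x 3' GEM-X v4 (CG000731)": date(2024, 11, 1),
}

def stale(item: str, max_age_days: int = 90) -> bool:
    """Flag registry entries nobody has re-verified recently."""
    last_checked = REGISTRY.get(item)
    return last_checked is None or (date.today() - last_checked).days > max_age_days

for item in REGISTRY:
    if stale(item):
        print(f"Re-verify against vendor docs before ordering: {item}")
```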

Exercises

  1. Ask an AI to design a protocol in your area. List everything it got subtly wrong.
  2. Ask the same AI to critique its own protocol. Compare lists. What did it miss in self-critique that you caught?
  3. For the scRNA-seq pilot above, look up the current 10x GEM-X v4 user guide and find the recommended cell-loading concentration for large primary cells. How far off was the AI’s number?
  4. The AI confidently recommended Chromium v3.1 chemistry when the current kit is GEM-X v4. What is the structural reason an AI is likely to be wrong about chemistry version numbers, regardless of which AI tool you use?
  5. The AI suggested a no-enzyme-digest negative control during the critique pass, and the postdoc rejected it. Was the rejection a Discernment success or a Diligence failure? How would you tell?
  6. You ask an AI to critique its own protocol and it returns five suggestions, three useful and two off-base. Why is the critique pass still worth doing even with the off-base suggestions?

Answers (exercises 4 to 6)

  4. Training data has a cutoff, and vendor product lines update faster than that cutoff. The AI’s “current” chemistry recommendation reflects what was current when its training data was assembled, usually one to two years prior. Even a model with web access often pulls cached or older documentation that ranks high in search results. The fix is the same regardless of model: verify version-specific parameters against the current vendor user guide before ordering.
  5. Discernment success. A no-enzyme-digest negative control is non-standard for primary-tissue scRNA-seq, costs reagent budget, and adds a sample without testing a hypothesis the experiment cares about. The postdoc rejected it on cost-benefit grounds with a documented reason. That is exactly what Discernment looks like. A Diligence failure would be rejecting the suggestion without a reason, or accepting it because the AI suggested it. The presence of a written rationale in the protocol is what distinguishes the two.
  6. The critique pass is a cheap way to surface gaps you didn’t think to check yourself. Even at 60% useful, the cost is one prompt and reading five bullet points. The off-base suggestions are signal, not noise. They show what kind of mistakes the AI is prone to in your domain, which calibrates how much to trust the AI elsewhere. The wrong move is to take 5 of 5 as gospel. The right move is to skim, accept the useful ones, document why you rejected the others, and move on.

Further reading

  • Boiko, D. A., MacKnight, R., Kline, B., & Gomes, G. (2023). Autonomous chemical research with large language models. Nature 624: 570–578. Demonstrates AI agents autonomously designing and executing chemistry experiments. A useful reference for understanding what AI can and cannot yet transfer to biological protocol design.
  • Begley, C. G., & Ellis, L. M. (2012). Raise standards for preclinical cancer research. Nature 483: 531–533. The reproducibility critique that motivates why protocol rigour, documentation, and independent verification matter. The failure modes it catalogues are exactly what AI-drafted protocols can inherit if adopted uncritically.
  • 10x Genomics. Chromium GEM-X Single Cell 3’ v4 Gene Expression User Guide (CG000731). The current (2024) vendor protocol for the platform used in the worked example. The first place to check before accepting any AI-generated loading concentrations or chemistry parameters.