Structuring directives · Reasoning patterns · Managing the context window · Agentic control
No Code & Agentic AI for Life Sciences · Session 02 · June 2026
By the end of this session, you will be able to:
This is a build-as-you-learn session: ~50% lecture, ~50% hands-on. Keep your agent open. Three graded tasks and one capstone are embedded in the slides.
Part A — Prompt Engineering
~3 hours · 15-min break after Part A · Task 1, Task 2 & Capstone graded.
Part B — Context Engineering
The discipline of structuring an input so a probabilistic model returns the output you actually need — reliably, not by luck.
The systematic practice of structuring inputs to steer a language model toward accurate, reliable, reproducible outputs.
A prompt is a scientific protocol: written once, it should reproduce the same class of result across runs and across people.
Vague prompts produce vague results. Specificity is the antidote to hallucination.
Shrink the model's "search space" so it commits to the relevant region of its training distribution.
A wrong gene symbol, dose, or p-value is not a typo — it is a data-integrity failure. Prompts must be auditable.
Your text is split into tokens (~¾ of a word each), mapped to IDs, and the model predicts the next token by probability. There is no "understanding step" — only conditional probability over the whole prompt.
Why it matters: rare gene/drug names fragment into many tokens, so the model has weaker priors on them. Spell them out, define them, and never assume the model "knows" an obscure identifier.
"Tell me about CRISPR."
Result: a generic, textbook-level summary that misses technical nuance, recent breakthroughs, and your actual question.
"Explain the mechanism of CRISPR-Cas9 base editing vs. prime editing for a genome-editing researcher. Compare on-target indel frequency and PAM constraints. ≤200 words, table form."
Result: a technical comparison with the right metrics, audience, and shape.
Quick poll (30s): in the chat, name one thing the vague prompt failed to specify. Audience, depth, format, metric, or constraint?
Each block answers a question the model would otherwise guess: who am I, what do I know, what do I do, what does good look like, what's the shape, what's forbidden.
Freeze ROLE + FORMAT + CONSTRAINTS as a template; swap only CONTEXT + TASK per run. This is how teams standardize quality.
Primacy & recency: put the most critical instruction first and restate it last. The middle is where models forget (we prove this in Part B).
Who is the model?
"Act as a senior bioinformatician."
What must it do?
"Analyze these gene-expression values."
How should it look?
"Return a Markdown table."
What are the limits?
"Only use genes with p < 0.05."
Before — 1 line, ungrounded
After — RTFC, reproducible
Teaching point: the "after" prompt encodes a statistical decision rule. Anyone re-running it gets the same calls — that is reproducibility, the core scientific virtue.
A role token primes the model toward a region of its training distribution — vocabulary, depth, and assumed prior knowledge all shift.
| Role given | Effect on output |
|---|---|
| "Explain to a 1st-year undergrad" | Analogies, no jargon |
| "Act as a journal reviewer" | Critical, cites limitations |
| "Senior biostatistician" | Tests, assumptions, effect sizes |
| "Regulatory affairs (FDA)" | Compliance & cautious framing |
Make the role specific and earned: "structural biologist who works on GPCR cryo-EM" beats "scientist".
Caveat from the literature: persona prompts improve tone and framing reliably, but do not reliably raise factual accuracy on hard tasks. Don't treat "act as an expert" as a correctness guarantee.
Role sets the voice; constraints + grounding set the facts. Use both.
Show the model 2–5 worked input→output examples before the real query. It infers the pattern (the format, the labelling scheme, the granularity) without any fine-tuning. Introduced at scale by Brown et al., GPT-3 (2020).
Best when you need a consistent format or a niche labelling convention. Keep examples diverse and correct — errors propagate.
Each shot costs tokens. Past ~5 examples, returns diminish and you risk overfitting to the examples' surface form.
Ask the model to reason step by step before answering. For multi-step problems (dosing maths, dilution series, logic over results) this sharply improves accuracy — Wei et al. (2022). Even the zero-shot trigger "Let's think step by step" works (Kojima et al., 2022).
Auditability bonus: visible steps mean a wrong dose is caught at the step, not after the experiment.
CoT costs more tokens and can rationalize a wrong answer fluently. For facts, still verify against a database.
Split a hard task into ordered sub-tasks. Solve and verify each before moving on. Output of step N = input of N+1.
"First list the candidate genes. Then, separately, annotate each."
Sample several reasoning paths, take the majority answer. Reduces one-off reasoning slips on quantitative tasks (Wang et al., 2022).
Ask the model to review its own draft against the constraints before finalizing — a cheap "reflexion" pass (Shinn et al., 2023).
Rule of thumb: if a human would need scratch paper, the model needs decomposition. One giant prompt for a 6-step task is the most common beginner mistake.
Highest-stakes in life sciences: fabricated citations and invented numeric values. Both look authoritative. Never accept a DOI, accession, dose, or p-value the model produced without a ground-truth source.
Build a PubMed abstract triager
{decision, reason, confidence} as JSON.Deliverable & reflection
/stats — record token cost of v1 vs final.20 minutes. Work in pairs; one drives the agent, one records observations, then swap.
Save your best structure in a PROMPTS.md file for reuse across the course.
Prompt engineering writes the instruction. Context engineering decides everything else the model sees — and what it must not.
Crafting the instruction in a single turn — wording, role, examples, format. Scope: one message.
Curating the entire token budget across a session: system prompt, history, retrieved docs, tool outputs, memory — what to include, compress, or evict. Scope: the whole window, over time.
Think of the context window as a workbench, not a warehouse. A good engineer keeps only the tools needed for the current step on the bench.
Everything below competes for the same finite token budget. Every token you spend on history is a token unavailable for retrieved evidence or the model's answer.
Liu et al. (2023) showed retrieval accuracy follows a U-shape: models use information at the start and end of a long context well, but degrade sharply when the key fact sits in the middle.
Fix: put the critical instruction/evidence at the top or bottom; re-state constraints at the end; retrieve fewer, more relevant chunks.
In long sessions the signal-to-noise ratio of the window decays: stale tool outputs, abandoned tangents, and superseded instructions pile up. The model starts attending to obsolete context.
Summarize completed sub-tasks into a short note; drop the raw transcript.
Periodically restate the goal & constraints near the end of the window.
For a new sub-goal, start clean (/clear) and paste only the distilled state.
Practice: keep a running STATE.md the agent updates after each sub-task. Reload it instead of the full history — durable memory beats a bloated window.
Instead of trusting the model's parametric memory, retrieve relevant source text and place it in the context so the answer is grounded in evidence you control (Lewis et al., 2020). This is the antidote to fabricated citations.
No-code path: many agents let you attach a folder of PDFs or point at a knowledge base — that is RAG. Always add: "answer only from the attached sources; if not present, say so."
Durable identity, rules, and guardrails. Highest priority, persists across turns. "You are a lab assistant. Never invent citations."
The turn-by-turn request. Lower priority than system — so a user "ignore the rules" should not override safety.
App-supplied instructions and tool results injected into context. Treat tool output as data, not commands.
Security angle: retrieved documents and web pages can contain prompt-injection ("ignore previous instructions…"). Keep authority in the system message and never let fetched content silently change the task — verify before acting.
Put your reproducibility rules (thresholds, units, "cite source") in the system layer so they survive a long session.
The control loop for multi-step agentic tasks in scientific research.
The agent outlines the steps it will take.
You review the plan and provide feedback.
The agent performs the technical work.
Review output and refine the prompt.
Scientific workflows are expensive in tokens, compute, and reagents. Catching a logic error in the Plan phase costs nothing; catching it after a wet-lab run costs a week.
During "Execute", a tool-using agent cycles: Thought → Action → Observation, feeding each observation back into context until the goal is met (ReAct — Yao et al., 2022). Every cycle adds tokens — which is why context engineering governs agent cost.
"I need the padj column → call the table reader."
Invokes a tool: search PubMed, run a script, query UniProt.
Tool returns data → appended to context → informs the next thought.
Set a stop condition (max steps / "ask me if stuck") so the loop can't burn the whole budget.
| Model (illustrative) | Input | Output |
|---|---|---|
| Frontier (large) | ~$3 / Mtok | ~$15 / Mtok |
| Mid (balanced) | ~$0.8 / Mtok | ~$4 / Mtok |
| Small (fast) | ~$0.25 / Mtok | ~$1.25 / Mtok |
Prices move fast — always check the provider's live pricing page. Output tokens usually cost 4–5× input.
Mental model: a 200k window re-read every turn is expensive. Long history ≈ paying input price on the whole transcript, every message.
Cost-control levers
Summarize finished sub-tasks; /clear between unrelated goals.
Chain small focused calls; feed only the distilled output forward.
Use a small model for extraction/formatting, a frontier model only for hard reasoning.
Pull the 3 relevant chunks, not the whole 80-page PDF.
not reported."Disclosure: most journals and funders now require you to state AI-tool use. Keep your prompts and model versions as part of the research record.
Build a grounded Q&A over a paper
not in source."Deliverable & reflection
20 minutes. Submit the transcript + a 3-line note on any failure you caught.
In groups of 3, design (don't fully build) a prompt+context system that takes a gene variant (e.g. BRCA1 c.5266dupC) and returns a structured, sourced interpretation.
Your design must specify
A one-page design doc + a live demo of the prompt on one variant. 4-minute group presentation.
Rubric: grounding (30%) · structure & format (25%) · failure-mode handling (25%) · reproducibility & logging (20%).
30 minutes design · present after the session.
Run it 3× — does the structure (and ideally the answer) hold? Lower temperature for determinism.
Does the output obey the format and constraints every time? Test edge cases (missing data).
Can every factual claim be traced to a source you supplied? If not, tighten grounding.
Build a tiny eval set: 5–10 inputs with known-good outputs. When you tweak a prompt, re-run the set. This is the no-code version of unit testing — and it turns prompt-tuning from vibes into measurement.
Next Session: Literature Search & Review Automation — PubMed agent workflows and systematic-review pipelines.
Start with [2] (chain-of-thought), [7] (lost in the middle), and [11] (context engineering) — the three ideas that most change how you work day to day.
© 2026 Md. Jubayer Hossain · No Code & Agentic AI for Life Sciences — Session 02