My previous employment was at a company in the digital asset management space, working, toward the end, on short-form commercial videos.
One of the USPs was consistent character generation, which held until Nano Banana, and soon other mainstream models, caught up. So the switch was to generating short-form commercial videos. Once the shop closed, I spent time reflecting on how things went, and on how the competitive landscape changes with AI in the picture. One of those thoughts gave birth to krearts
During the time I was also:
Beyond code, this blog documents some observations from building a research-oriented framework, and my understanding of this space.
When people see ChatGPT, Claude, Grok — the assumption is that building an LLM product is just API calls with a nice UI. Upload a PDF, call the model, return the answer. Ship it.
The gap between using these tools and building with them is massive.
Attachment parsing alone is a project. Scientific PDFs have multi-column layouts, inline equations, tables that span pages, figures with captions that reference other figures. Getting clean text out of that isn’t pdf2text. It’s a pipeline — layout detection, table extraction, figure-caption association, section boundary identification. And every format is different. A Nature paper looks nothing like an arXiv preprint.
Then there’s retrieval. Without preprocessing and indexing:
The first prototype worked exactly this way: no tagging of sections, no embeddings, only enriched extractions. It was slow, and it had to be tweaked continuously to support new types of queries.
Corrective RAG, structured extraction, citation grounding — each turned out to be its own subsystem with its own failure modes. The “wrapper” ended up being the smallest part of the system.
A user types: “What compositions showed the lowest formation energy across these three papers?”
The naive assumption is “send to LLM, get answer.” In practice, there’s a whole pipeline before the model ever sees the question:
Query expansion. The raw question gets rewritten. “Formation energy” might need to expand to “enthalpy of formation,” “DFT-computed stability,” or specific notation like ΔHf. A single user question becomes multiple retrieval queries.
Intent classification. Is this a comparison across papers? A lookup in a single table? A synthesis question that needs reasoning? The retrieval strategy changes depending on the answer.
Hybrid retrieval. Full-text search (FTS), n-gram matching, and semantic embeddings each catch different things. FTS finds exact terms. N-grams handle partial matches and chemical formulas that embedding models mangle. Embeddings capture semantic similarity — “thermal stability” matching “resistance to decomposition.”
None of this retrieval composition is new. Google has done this for decades. The difference is composing these with an LLM reasoning loop instead of hand-tuned ranking signals. Reciprocal Rank Fusion (RRF) merges results from different retrieval methods into a single ranked list. The LLM then reasons over the top results instead of just returning links.
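RRF itself is small enough to sketch. A minimal version, assuming each retriever returns a ranked list of chunk IDs (the IDs and result lists below are made up for illustration):

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of chunk IDs.

    Each item's score is the sum of 1/(k + rank) over every list it
    appears in; k=60 is the constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from three retrievers for the same query:
fts      = ["c3", "c1", "c7"]
ngram    = ["c1", "c9"]
semantic = ["c1", "c3", "c2"]

merged = rrf_merge([fts, ngram, semantic])
# "c1" ranks first: it appears at or near the top of all three lists.
```

The appeal of RRF is exactly this simplicity: no score normalization across retrievers, only ranks, so FTS scores and cosine similarities never have to be made comparable.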
What this replaces. Traditionally, doing this well meant deploying Elasticsearch or Solr — heavy infrastructure with its own operational cost, query DSLs, analyzers, synonym dictionaries, spell-check configs, and tokenizer tuning.
With an LLM and vector search, a lot of that goes away:
Vector search (pgvector) plus FTS on Postgres replaces what used to require a dedicated search deployment. The infrastructure footprint shrinks dramatically: what used to be a separate cluster with its own ops burden becomes a Postgres extension and an API call.
Reranking. The initial retrieval casts a wide net. A reranker (cross-encoder or LLM-based) scores each chunk against the original question for fine-grained relevance. This is where you go from “related passages” to “the actual answer is in these three paragraphs.”
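A sketch of the reranking stage. In production the scorer would be a cross-encoder or an LLM judge; here a toy token-overlap scorer stands in so the shape is runnable (the chunks are invented examples):

```python
def rerank(question, chunks, scorer, top_n=3):
    """Score every candidate chunk against the original question and
    keep the top_n. The scorer is pluggable: a cross-encoder, an LLM
    judge, or (here) a toy lexical stand-in."""
    return sorted(chunks, key=lambda c: scorer(question, c), reverse=True)[:top_n]

def overlap_scorer(question, chunk):
    # Toy stand-in: fraction of question tokens present in the chunk.
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

chunks = [
    "formation energy of Li2FePO4 computed with DFT",
    "synthesis furnace temperature profiles",
    "lowest formation energy across candidate compositions",
]
top = rerank("lowest formation energy", chunks, overlap_scorer, top_n=1)
```

The wide-net/fine-scoring split is the design choice: cheap retrieval over everything, expensive scoring only over the survivors.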
Without an embedding and indexing pipeline, every query pays the full cost. Parse the PDF. Chunk it. Embed the chunks. Search. Answer. For a 40-page paper, that’s 30+ seconds before the user sees anything.
The alternative is progressive indexing: make the paper useful immediately and build deeper indexes in the background.
This works in tiers:
A query that arrives at tier 1 gets FTS-only retrieval. Not perfect, but fast and useful. By the time the researcher has read the first answer and typed a follow-up, tier 2 or 3 is ready.
The key insight: researchers don’t upload a paper and immediately ask their hardest question. They start with “what’s this about?” and work their way to specifics. Progressive indexing matches the system’s readiness to the user’s actual behavior.
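The tier bookkeeping can be sketched roughly like this (the tier names and the tier-to-strategy mapping are illustrative, not the system's actual ones):

```python
from enum import IntEnum

class Tier(IntEnum):
    RAW = 0        # file stored, nothing parsed yet
    FTS = 1        # text extracted, full-text searchable
    SECTIONED = 2  # chunks classified by section
    EMBEDDED = 3   # vectors built, hybrid retrieval available

class DocumentIndex:
    """Tracks each document's index tier; retrieval asks what's ready."""
    def __init__(self):
        self.tiers = {}

    def ingest(self, doc_id):
        # Ingest returns immediately; background jobs upgrade later.
        self.tiers[doc_id] = Tier.FTS

    def upgrade(self, doc_id):
        t = self.tiers[doc_id]
        if t < Tier.EMBEDDED:
            self.tiers[doc_id] = Tier(t + 1)

    def strategy_for(self, doc_id):
        # Pick the best retrieval strategy the current tier supports.
        return {Tier.RAW: "none", Tier.FTS: "fts",
                Tier.SECTIONED: "fts+sections",
                Tier.EMBEDDED: "hybrid"}[self.tiers[doc_id]]

idx = DocumentIndex()
idx.ingest("paper-42")                    # usable immediately at tier 1
first = idx.strategy_for("paper-42")      # "fts"
idx.upgrade("paper-42"); idx.upgrade("paper-42")
later = idx.strategy_for("paper-42")      # "hybrid"
```

The retriever never waits on an indexer; it dispatches on whatever tier exists right now.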
This is the same pattern behind Cinestar’s five-phase video indexing pipeline — make a video searchable the moment it’s uploaded (phase 0, basic metadata), then progressively refine with multi-modal enrichment, coarse segmentation, fine segmentation, and cross-reference passes.
The domain is different but the architecture is identical: immediate utility, background refinement, each tier unlocking better search quality.
Graph RAG has real value. But the costs are real too.
Building a knowledge graph over one document requires:
For a single document, this is expensive relative to the payoff. A well-chunked document with good metadata gets roughly 80% of the way there.
The better alternative — section_covers.
No matter how unstructured a PDF layout looks, the domain and the humans in it have a structure. Every scientific paper has an implicit hierarchy — title, abstract, hypothesis, methods, results, conclusion. The sections might be named differently, merged together, or split across pages, but the structure is always there. Researchers read papers this way instinctively.
The idea: teach the LLM this structure through the prompt, and have it classify each chunk during ingestion. The classification is an array — ["methods", "results", "datasets"] — not a single label, because sections overflow. A “methods” section often contains datasets and preliminary results too.
How the LLM knows the structure. The agent prompt is hierarchically organized with custom tags — identity, capabilities, workflows, security, output rules — each scoped and nested. Within this, the paper-reading workflows define phased strategies:
The prompt doesn’t just say “read the paper.” It encodes how a researcher reads — which sections to check first for which kind of question, when to fall back to broader reading, how to cross-reference across documents.
At query time, filtering by section type is a simple indexed array lookup. “Show me just the methods across these three papers” — no graph traversal needed. The LLM can then drive a ReAct loop to compare across sections, navigate the hierarchy, and synthesize — all without an entity graph.
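A sketch of the query-time filter over in-memory chunks; in Postgres the same thing is an indexed array containment test. The chunks below are invented:

```python
chunks = [
    {"doc": "paper-1", "text": "We synthesized samples at 700 C...",
     "section_covers": ["methods", "datasets"]},
    {"doc": "paper-1", "text": "Formation energies are listed in Table 2.",
     "section_covers": ["results"]},
    {"doc": "paper-2", "text": "Samples were annealed for 12 h...",
     "section_covers": ["methods", "results"]},
]

def covering(chunks, section):
    """All chunks whose section_covers array includes the section.
    The array label means overflow is free: a methods chunk that also
    reports preliminary results shows up under both filters."""
    return [c for c in chunks if section in c["section_covers"]]

methods = covering(chunks, "methods")   # two chunks, across both papers
```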
Researchers don’t work with one paper. They work with a workspace — dozens of papers, experimental notes, simulation results, reviewer feedback. Over weeks and months, connections accumulate:
That’s where a knowledge graph becomes valuable. Not blind LLM extraction — asking a model to “extract all entities” from a paper produces confident garbage. What works is:
Graph RAG for single-document retrieval is usually overkill. Graph RAG as an epistemic knowledge web built over months of research — that’s where it becomes worth the investment.
Models get switched. Pricing changes, a new model drops with better structured output, an open-source option gets good enough for a subtask. Each time, it means rewriting prompts and fixing output parsing — unless the system is built for it.
The prompt is the cage geometry — it shapes the output. But real portability comes from treating LLM output as untrusted input.
When validation fails, the system retries with a corrective prompt that includes the validation error. Self-correcting loops. In practice, most queries resolve in one pass. Some need two. When it takes three, the problem is almost always in the prompt design or the retrieval, not the model.
This leads to a useful model-agnostic metric: not “which model is best” but “how many correction cycles does this model need for this task.” GPT-4 might need one cycle where Claude needs two, or vice versa, depending on the task class. The system handles both.
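The correction loop and the cycle metric can be sketched together. The call_model stand-in below is hypothetical; a real implementation would call a provider API:

```python
import json

def validate(raw):
    """Treat LLM output as untrusted input: parse, then check shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"invalid JSON: {e}"
    if "answer" not in data:
        return None, "missing required field: answer"
    return data, None

def with_correction(call_model, prompt, budget=3):
    """Retry with the validation error appended as corrective context.
    Returns (result, cycles); the cycle count is the model-agnostic
    metric: how many passes did this model need for this task?"""
    for cycle in range(1, budget + 1):
        raw = call_model(prompt)
        data, error = validate(raw)
        if error is None:
            return data, cycle
        prompt += f"\n\nYour last output failed validation: {error}. Return valid JSON."
    raise RuntimeError(f"no valid output within {budget} attempts")

# Stand-in model that fails once, then complies:
outputs = iter(['{"anwser": 42}', '{"answer": 42}'])
result, cycles = with_correction(lambda p: next(outputs), "Q: ...")
# cycles == 2: one correction pass was needed
```

Running the same suite against two models and comparing the cycle distributions is the whole benchmark.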
One observation: most open-source models share failure modes — they’re fine-tuned from the same bases. A correction loop that handles Llama’s JSON formatting quirks tends to handle Mistral’s too. Build the cage right and the wind can change direction.
Foundation models are trained on a lot of data. The assumption is that, given how embedding spaces work, the LLM would be the one model that sees all the patterns, only more powerful.
Maybe it does. But what it does and doesn't capture depends on how it was trained. Combining different domains means mapping them into the same embedding space, and general LLMs are not an answer to that.
In video generation, the underlying models have to be constrained enough to guard how the LLM generates media. The nuances are real, at multiple levels:
| Level | What it determines | Example |
|---|---|---|
| Macro | Physics, environment, scene composition | Bags are not displayed the same way as watches |
| Micro | Audience, messaging, tone | A wrist watch is not advertised the same way as a wall clock. A mechanical watch is not advertised the same way as a quartz. |
| Cultural | Aesthetics, conventions, expectations | A Japanese website looks nothing like a US-built website. Japanese fashion magazines are structured for data extraction — what to wear, how to pair it, what it’s for. |
LLMs tend to take the path of least work. Hence the need for detailed planning, guardrails, and extensive instruction following. If these foundational models had all this world knowledge baked in, why would anyone need custom embedding models?
This sounds more like a Mixture of Experts, but constrained to a domain.
When working with non-textual data, it's nearly impossible for an LLM to predict real-life outcomes. Back in very late 2025, none of the LLMs were consistently good at:
The failure has multiple layers, and it’s well-documented:
The precise term from the literature: statistical co-occurrence without compositional constraint satisfaction. The models know what things look like together, but can’t enforce constraints between them.
LLMs are probabilistic systems.
If you expect a probabilistic system to reliably (do X), it will probably, reliably (do X)
GPT has a lot of information about how good hot chocolate is made. It can tell you how different countries and regions like theirs, if you ask for it. It has access to all kinds of recipes, ratings, discussions — enough for a fair idea.
But an LLM has never seen or tasted hot chocolate. It relies on what I call "collective truth", so it can't accurately predict how changes in quantities lead to differences in taste.
This is the same with materials research — CHGNet and other tools provide contextual models with computation baked in, rather than an LLM trying to predict outcomes.
When I asked the LLM to adjust quantities for 4 people, the amounts of water and sugar were way off from reality. What actually worked:
For automation, this means: as long as the LLM has access to eyes, ears, and other senses into the real world, foundation models can actively guide toward real-life usable outcomes.
The same principles apply to sequential or batched image and video generation — with a corrective feedback loop. But costs shoot up.
With enough effort — YOLO for vision, RPi Pico or ESP32 to capture images in batches, actual instruments and sensors providing continuous feedback — most real-life applications of AI will come when we add sensory elements to it. The LLM continuously validates against the desired outcome at each stage.
The obvious first move is a web app. Upload papers, ask questions, get answers. Fastest to ship, easiest to demo. For a lot of teams, it’s the right choice. But once the architecture is model-agnostic, it doesn’t actually need to be centralized.
The workspace, the agents, the retrieval layer — all of it can run on a researcher’s machine or a lab’s own infrastructure. That opens up delivery options worth considering.
| Web / SaaS | Editor (JetBrains / Cursor-style) | VSCode / Codium plugin | |
|---|---|---|---|
| Where the workspace lives | Cloud | Researcher’s machine or lab infra | Researcher’s machine |
| Model inference | Hosted endpoints | Local, lab cluster, or hosted — their choice | Same |
| Data residency | Provider-managed | User-managed | User-managed |
| Git-backed tracking | Possible but uncommon | Natural fit — queries, experiments, hypotheses all versioned | Same |
| Lab equipment access | Via API tunnels | Direct — instruments register as tools | Direct |
| Distribution | URL | Installer / package manager | Marketplace (shared across Codium-based editors) |
| Trade-off | Fastest to ship. Data residency is a conversation with every enterprise customer. | Most integrated, most opinionated. Bigger upfront investment. | Lowest barrier to adoption. Less integrated long-term. |
No single right answer:
The interesting observation is that model-agnostic design (the cage and the wind) is what makes these options possible at all.
Once inference is swappable — hosted, self-hosted, open-source, Model Garden, whatever — the delivery question becomes about where the data lives, not where the model lives.
In terms of business models, at the time of writing this, I see three prevalent ones.
moondream gives us a pretty good idea of what a focused model can do.
The materials hypothesis engine, which uses CHGNet, GPAW, etc., to run actual predictions and DFT relaxation.

The first one, however, would probably not survive unless it sits in a niche the big players don't want to focus on and public data is unavailable. Without a proper moat, the fear of being invalidated, or of competitor proliferation, is much higher. The DocuSign incident is a glaring example: OpenAI launched DocuGPT, and DocuSign's stock dropped 17% overnight. Open-source alternatives like OpenSign and Documenso were already circling.
When everyone has a gun, you need a bigger gun (leverage).
Git-based tracking. When the workspace is local, versioning research artifacts in git becomes natural. Every query, experiment config, hypothesis — versioned. Branch a research direction. Diff two experimental setups. Revert a dead end. Researchers already think in version control for code; extending it to research is a small step.
The device-driver pattern. With local or lab-hosted infrastructure, physical equipment can connect directly. The framework defines the tool interface; labs implement the connector for their instruments — a synthesis furnace, an XRD machine, a spectrophotometer. Same interface as any other tool in the registry. This is harder to pull off through a cloud intermediary, though not impossible.
Every research domain already has validated tools:
Researchers know and trust these tools. Building an LLM system that tries to replace them means reimplementing domain logic badly and asking an LLM to do math it will get wrong.
The better pattern: give the agent access to these tools. The LLM understands the question, picks the right tool, formats the input, interprets the output. The computation stays with code that’s been validated by the domain for decades. Same for existing ML models — if a trained classifier or prediction model exists for a subtask, use it. The agent orchestrates; the specialists compute.
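A rough sketch of the pattern. The tool name, description, and lookup-table handler below are all stand-ins; the point is that the number comes from registered, validated code, never from the LLM:

```python
def formation_energy(formula):
    # Stand-in for a real CHGNet/Pymatgen computation. Hypothetical
    # values; in the real system this dispatches to domain code.
    table = {"Li2FePO4": -2.41}
    return table[formula]

# Registry: the agent reads names + descriptions and picks a tool.
TOOLS = {
    "formation_energy": {
        "description": "Formation energy for a composition (eV/atom)",
        "handler": formation_energy,
    },
}

def run_tool(name, **kwargs):
    """The LLM selects the tool and formats the input; the handler,
    which is decades-validated domain code, does the computation."""
    return TOOLS[name]["handler"](**kwargs)

e = run_tool("formation_energy", formula="Li2FePO4")
```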
Overall, as I understand it, using LLMs only on textual data is a pretty limited application of this technology. Yes, one can convert natural language to an SQL query. But the system you might have built, using agents and tools to produce a valid, grounded query, might as well be one-shottable with a long enough context window.
So there isn't much inherent value in building such systems. Looking back, I realize that if an MVP or bootstrapped product goes head-on against features already available on an existing SaaS platform, the inevitable, unavoidable, unanswerable question will be presented: how is that different from GPT?
In all honesty, it's probably not, because building a production-grade ChatGPT-like interface, with all the functionality and edge cases, takes time.
It would be similar to asking Build me Twitter. Sure, but why?
The users do not and should not care about such engineering challenges.
Chat becomes just another interface for communicating with the system, much like REST. And then the comparison against GPT or other models never comes up.
Overall, I don't feel much has changed in terms of how one does business. When the internet came along, there were loads of websites built with no actual substance; as time went by, the web became a medium, an enabler, for actual labour.
I think the same will happen with AI: most companies whose identities are built around LLMs will not survive in the end. Anything that was generated by an LLM, or is one-shottable by one, will eventually get replicated and saturated.
So one way, I guess, could be to build wrappers very fast, across multiple domains. I mean, there are hundreds and thousands of n8n workflows, skills, subagents, entire repositories of "leaked" prompts for agentic coding tools.
The only leverage these tools have is funding. If funding, networking, and marketing were readily available, everyone could sell their own spinoff of opencode.
Which, again, is business as usual. There are a lot of Twitter clones and Reddit alternatives, but the rate of success doesn't depend on code execution alone.
Code execution at the base level has always been cheap, especially in India, where the culture is mostly managerially driven. CxOs all over the world earn disproportionately more, and not because they can write code.
And it’s getting cheaper. Cross-provider deployment can be automated pretty easily now — Terraform, Pulumi, SST, whatever your flavor. Code generation is commoditized. Porting concepts from one language’s ecosystem to another is a weekend project with a coding agent. Elixir’s supervision trees in Go, Python’s Ray actors in Rust, Ruby’s convention-over-configuration patterns anywhere — the implementation barrier between “I know this pattern exists” and “it’s running in my stack” has collapsed. Which means the moat is never in the code.
I don't think this is a new observation, but it's still worth acknowledging: AI has caused some cultural shifts in organizations. It has also somehow set the wrong expectations for a lot of people.
Osho once said, “Democracy basically means government by the people, of the people, for the people — but the people are retarded.” I think to some extent it's terribly true, and it's more prevalent and visible today because of social media.
Only a handful of companies have presented their experiments as-is to the real world. The rest are marketing gimmicks like "LLM wrote a C compiler from scratch". Yeah, I mean, great that your LLM could do it, but:
I don't blame anyone; everyone is doing their part to survive. Let's not forget that such waves of data science have already hit us twice before, and both times the companies were all in losses.
So building a sustained customer-facing model is a lot of pressure.
But you don't need to XD. It begins with educating everyone alike: juniors, seniors, management, stakeholders.
An org should first take the time to do planned research on what's possible, to set the record straight. Some basic expectations on either end:
For developers it can be:
For non-developers, to understand:
An LLM doesn't reduce the complexity of a business problem; if anything, it adds workload, because now one has to think about how to bridge the probabilistic and deterministic parts.
An LLM does make you faster, but the quality of the code, the philosophy, the foresight: none of that comes from the LLM. A novice coder now produces more bad code, faster, and vice versa.
It's medically stupid to make an LLM learn a very well established set of rules and turn it into a probabilistic model in real life. Maybe it can be used to find a better, faster, alternate way, but that exploration can't be the production process.
There is no point in making an LLM add two numbers, or add two numbers on a GPU drawing 16 amps.
The pattern that fell out of building this is domain-agnostic. What follows is the architecture — contracts, conventions, build order, and acceptance criteria. Pick your language, pick your domain. The shape stays the same.
Before building anything, define these for the target domain. Everything downstream depends on them.
| What | Example (materials science) | Example (biotech) | Example (ML research) |
|---|---|---|---|
| Document formats | Scientific PDFs, CIF files, VASP output | PDB files, FASTA, clinical trial PDFs | arXiv PDFs, Jupyter notebooks, model cards |
| Domain tools | Pymatgen, ASE, Materials Project API | BLAST, UniProt API, RDKit | scikit-learn, HuggingFace model hub, W&B API |
| Domain ML models | Crystal system classifiers, GNNs for molecular dynamics | Protein structure predictors, toxicity models | Benchmark evaluators, dataset quality scorers |
| Lab equipment | Furnaces, XRD, spectrophotometers | Sequencers, PCR machines, plate readers | GPU clusters, training pipelines, eval harnesses |
| Domain-specific query patterns | Chemical formulas (Li₂FePO₄), crystal notation | Gene names (BRCA1), protein IDs (P53_HUMAN) | Model identifiers, dataset names, metric names |
| Validation rules | Formation energy ∈ [-10, +10] eV/atom; temperature > 0K | Gene names match HGNC; dosage within safe range | Accuracy ∈ [0, 1]; loss is non-negative |
| Experiment schema | Hypothesis → synthesis params → characterization → result | Hypothesis → protocol → assay → measurement | Hypothesis → hyperparams → training run → eval metrics |
workspace/
├── contracts/ ← message types, shared by all components
│ ├── messages ← every inter-component interaction is a typed message
│ └── schemas ← domain entity schemas (documents, experiments, graph nodes)
│
├── document_store/ ← ingest, parse, chunk, index
│ ├── parsers/ ← one parser per document format, registered by MIME type
│ ├── chunkers/ ← section-aware splitting (not blind fixed-size)
│ └── indexers/ ← progressive: raw → sectioned → embedded → fully indexed
│
├── retriever/ ← hybrid search across whatever indexes exist
│ ├── strategies/ ← fts, semantic, hybrid (RRF fusion)
│ └── domain_matcher ← scores chunks using domain-specific logic
│
├── tool_registry/ ← domain tools, ML models, lab connectors — one interface
│ ├── tools/ ← each tool: name, description, input/output schema, handler
│ └── connectors/ ← lab equipment drivers (same tool interface)
│
├── experiment_tracker/ ← runs, parameters, results, lineage — git-backed
│
├── agents/ ← long-running stateful processes
│ ├── supervisor ← spawns, monitors, restarts, checkpoints
│ ├── research_agent ← handles user queries, invokes tools
│ ├── indexer_agent ← background progressive indexing
│ ├── experiment_agent ← designs experiments, monitors runs, logs results
│ └── watcher_agent ← monitors external sources for new documents
│
├── workspace_state/ ← shared blackboard
│ ├── documents ← registry of ingested docs and their index tier
│ ├── history ← query/response log with tool call traces
│ ├── experiments ← run registry with full lineage
│ └── graph ← knowledge graph: entities, relations, evidence chains
│
├── validation/ ← pure functions, no LLM calls, deterministic
│ ├── schema ← does output match expected structure
│ ├── citations ← does every claim trace to a source passage
│ ├── domain_rules ← is output physically/logically plausible
│ └── experiment_safety ← are params within safe bounds before reaching equipment
│
└── domains/
└── {domain_name}/ ← all domain-specific config in one place
├── parsers ← document format implementations
├── tools ← tool registrations + connector configs
├── matcher ← domain query pattern scorer
├── rules ← validation rules + safety bounds
└── experiment ← what constitutes hypothesis, run, result
Components have dependencies. Build in this order — each layer only depends on layers above it.
| Phase | Component | Depends on | Done when |
|---|---|---|---|
| 1 | contracts/ | Nothing | Message types defined for: tool_call/tool_result, retrieve/results, ingest/indexed, experiment_create/experiment_result, state_read/state_write. All components will import these. |
| 2 | workspace_state/ | contracts | Can store and retrieve documents, history entries, experiment runs, and graph nodes/edges. Supports concurrent reads. |
| 3 | validation/ | contracts | Each validator accepts output + rules, returns ok or error with details. No network calls, no LLM calls. Experiment safety validator rejects out-of-bounds parameters. |
| 4 | document_store/ | contracts, workspace_state | Can ingest a file, run it through a parser, chunk it, and register it in workspace state. Progressive indexing: ingest returns immediately at tier 1, background jobs upgrade tiers. Search works against whatever tiers exist. |
| 5 | retriever/ | contracts, document_store, workspace_state | Accepts a query and strategy (fts/semantic/hybrid). Returns ranked chunks with scores and source references. Domain matcher plugs in as a scoring function. |
| 6 | tool_registry/ | contracts, validation | Tools register with typed schemas. Call dispatches to handler, validates output against schema. Lab connectors implement the same interface. Agent can list available tools and their descriptions. |
| 7 | experiment_tracker/ | contracts, workspace_state, validation | Can create a run from a hypothesis + params, log results, trace lineage back to source hypothesis/papers/queries. Git-backed: each run is a commit, params are diffable. |
| 8 | agents/ | Everything above | Each agent is a long-running process with its own lifecycle. Supervisor manages spawn/monitor/restart/checkpoint. Agents communicate through workspace_state, not direct calls. |
Everything is a message. Components never call each other directly. Every interaction — tool invocation, retrieval request, state update, experiment result — is a typed message on a bus. This maps to whatever concurrency model the language provides (actors, channels, async queues). The constraint: no shared mutable state between components, only messages. This is what makes the system distributable without a rewrite.
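A minimal in-process version of the convention, with dataclass message types and a plain queue standing in for the bus (the message names are illustrative; swapping the queue for NATS or Redis Streams keeps the same contract):

```python
from dataclasses import dataclass
from queue import Queue

@dataclass(frozen=True)
class ToolCall:        # contracts/messages: typed and immutable
    tool: str
    args: dict

@dataclass(frozen=True)
class ToolResult:
    tool: str
    output: object

bus = Queue()  # dev: in-process; prod: NATS / RabbitMQ / Redis Streams

# Producer side: an agent requests a tool call; no direct function call.
bus.put(ToolCall(tool="fts_search", args={"query": "formation energy"}))

# Consumer side: a worker dispatches on the message type it receives.
msg = bus.get()
if isinstance(msg, ToolCall):
    reply = ToolResult(tool=msg.tool, output=["chunk-1", "chunk-7"])
```

Because no component holds a reference to another, moving a consumer to a different process, or a different machine, changes the bus implementation and nothing else.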
Tools are the domain extension point. A tool has: name, description, input schema, output schema, handler. The agent reads the tool catalog at runtime and picks tools based on descriptions and schemas — not hardcoded dispatch. Domain computation, ML models, and lab equipment all enter the system as tools. If a validated domain tool or model exists for a subtask, register it. The LLM orchestrates; specialists compute.
Lab equipment follows the device-driver model. The framework defines the tool interface. Labs implement the connector for their specific equipment — communication protocol, safety interlocks, data formatting. From the agent’s perspective, measuring an XRD pattern and computing a phase diagram are the same operation: call a tool, get a result.
Agents are stateful, long-running processes. Not request handlers. Each agent maintains context across interactions — loaded documents, active hypotheses, running experiments. The agent loop: receive → classify intent → plan steps → execute (with validation at each step) → on failure, retry with corrective context (budget: N attempts) → accumulate results → update workspace state → respond. If an agent crashes, the supervisor restarts it from its last checkpoint. If an indexer dies mid-document, the research agent keeps serving from whatever tiers are already built.
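A toy sketch of the checkpoint/restart half of that lifecycle (the state shape is invented; a real agent would checkpoint to workspace_state):

```python
import pickle

class Agent:
    """Long-running and stateful: context survives across interactions."""
    def __init__(self, state=None):
        self.state = state or {"loaded_docs": [], "hypotheses": []}

    def handle(self, doc_id):
        # Each interaction mutates context, then checkpoints it.
        self.state["loaded_docs"].append(doc_id)
        return self.checkpoint()

    def checkpoint(self):
        return pickle.dumps(self.state)

agent = Agent()
snap = agent.handle("paper-42")

# Supervisor side: the process died; restart from the last checkpoint
# instead of losing the accumulated context.
restarted = Agent(state=pickle.loads(snap))
```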
Experiments are first-class. A run records: triggering hypothesis, parameters, tools called, data in, results out. Full lineage — traceable back through the graph to the papers and queries that generated the hypothesis. Git-backed: each run is a commit, params are diffable, branching an experiment direction keeps the history clean.
The epistemic loop. The knowledge graph grows through a cycle:
The graph isn’t a retrieval index. It’s accumulated understanding — which claims have been tested, which hypotheses failed, which things were actually verified vs. only predicted.
Validation is not optional. Every agent step passes through validation. Validators are pure functions — deterministic, no LLM calls. Four categories: schema (structure), citations (grounding), domain rules (plausibility), experiment safety (bounds before reaching equipment). The correction loop feeds validation errors back as retry context. Experiment safety validation is the guardrail between an AI system and real equipment. No exceptions.
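A sketch of a domain-rules validator as a pure function, using the example bounds from the contract table (formation energy within [-10, +10] eV/atom, temperature above 0 K):

```python
def validate_domain_rules(result):
    """Pure function: deterministic, no LLM calls, no network.
    Returns (ok, errors); errors feed the correction loop as context."""
    errors = []
    e = result.get("formation_energy_ev")
    if e is None or not -10 <= e <= 10:
        errors.append(f"formation energy out of range: {e}")
    t = result.get("temperature_k")
    if t is None or t <= 0:
        errors.append(f"non-physical temperature: {t}")
    return (not errors), errors

ok, errs = validate_domain_rules(
    {"formation_energy_ev": -2.4, "temperature_k": 700})
bad, errs2 = validate_domain_rules(
    {"formation_energy_ev": -99, "temperature_k": 0})
```

The experiment-safety validator has the same shape, with the stricter rule that a rejection blocks the tool call entirely rather than triggering a retry.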
The framework needs backing services. Start embedded, graduate to distributed.
| Concern | Dev (laptop) | Production |
|---|---|---|
| Message bus | In-process (language’s native concurrency) | NATS, RabbitMQ, or Redis Streams |
| Document storage | Filesystem + SQLite | Object storage + Postgres |
| Search indexes | SQLite FTS5 + sqlite-vss (or pgvector) | Postgres tsvector + dedicated vector store |
| Agent runtime | Single process, multiple actors/goroutines/tasks | Distributed nodes (OTP, Ray, K8s pods) |
| Job queue | In-process task queue | Durable job queue (language-appropriate) |
| Experiment storage | Git + SQLite | Git + Postgres + object storage for artifacts |
| Model inference | API calls to foundation model providers | Self-hosted (vLLM, TGI), Model Garden, Azure ML, or local GPUs |
The message bus convention means moving from dev to production is configuration, not a rewrite.
A new domain plugs in under domains/{domain_name}/.

Not being a materials science expert meant there was no way to eyeball whether the system was producing good hypotheses. So the build happened in two directions simultaneously.
Same approach as building a compiler — lexer, parser, codegen, each independently testable:
| Subsystem | What it does | Testable in isolation? |
|---|---|---|
| Document pipeline | Parse, chunk, index scientific PDFs | Yes — output is structured text, verifiable |
| Retrieval | Hybrid search across indexed papers | Yes — relevance scoring against known queries |
| Domain tools | CHGNet, GPAW, materials databases | Yes — known inputs, known outputs |
| Agents | Orchestrate tools, maintain context | Yes — given fixed retrieval, does the plan make sense |
| Experiment runner | Execute computational experiments | Yes — scripts produce reproducible results |
Each piece could be validated without domain expertise. The document pipeline either extracts tables correctly or it doesn’t. CHGNet either returns a valid energy prediction or it doesn’t.
The harder question: does the whole system, end-to-end, produce hypotheses that are actually good?
The approach:
The conversational pattern matters. A direct request — “what caused the LK-99 results?” — would test whether the model memorized the answer. A conversational exploration tests whether the system architecture can guide reasoning through literature, contradictions, and evidence toward a defensible hypothesis.
The hypothesis engine was essentially reverse-engineered from this evaluation. The question “how do you know if it’s any good?” shaped every architectural decision — what agents exist, how they communicate, what tools they call, how hypotheses get ranked.
This evaluation approach generalizes. Once tests exist for LLM-driven workflows, they become a model selection ground — run the same test suite against different models, collect golden results, compare.
But testing probabilistic systems is fundamentally different from testing deterministic code. Three patterns that worked:
1. Black box / outcome-only testing
Treat the LLM like a private method. Don’t assert on internal reasoning. Only check:
This is the most robust approach — it survives model swaps without rewriting tests.
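Sketched as a test (the pipeline stub and its output shape are hypothetical; a real test would call the actual system):

```python
def answer_pipeline(question):
    # Stand-in for the full LLM pipeline under test.
    return {
        "answer": "Li2FePO4 showed the lowest formation energy.",
        "citations": [{"doc": "paper-1", "chunk": "c7"}],
    }

def test_outcome_only():
    """Black box: assert on outcome properties, never on reasoning."""
    out = answer_pipeline("Which composition had the lowest formation energy?")
    assert isinstance(out["answer"], str) and out["answer"]   # non-empty
    assert out["citations"], "every answer must be grounded"
    for c in out["citations"]:
        assert {"doc", "chunk"} <= c.keys()                   # traceable

test_outcome_only()
```

Nothing here asserts on a chain of thought or an intermediate plan, which is why the same test runs unchanged against a different model.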
2. Value-in-collection assertions
When the output should contain specific elements but order doesn’t matter:
Not “the answer is X” but “the answer contains X, Y, Z.”
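A small helper makes the pattern concrete (the answer text is invented):

```python
def assert_contains(answer_text, required):
    """Pass if every required element appears somewhere in the answer,
    regardless of order or surrounding prose."""
    missing = [r for r in required if r not in answer_text]
    assert not missing, f"answer missing: {missing}"

answer = ("Across the three papers, Li2FePO4 and NaFePO4 were stable; "
          "LiCoO2 was not.")
assert_contains(answer, ["Li2FePO4", "NaFePO4", "LiCoO2"])
```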
3. Workflow / DAG testing
A goal can have multiple valid pathways, even with cycles. But:
| What stays the same | What can vary |
|---|---|
| The set of nodes visited | The order of traversal |
| The final memory state | The number of correction cycles |
| The tools invoked | Which tool was called first |
| The types of intermediate results | The exact values |
Test the DAG shape, not the exact path. If “retrieve → extract → validate → synthesize” is the expected workflow, assert that all four steps happened and the memory state after each step contains what downstream steps need. The path between them — whether the agent took one cycle or three, whether it backtracked — is the probabilistic part. The nodes and final state are the deterministic contract.
This turns model comparison from “which one feels better” into “which one reaches the same nodes in fewer cycles, with fewer validation failures.”
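Sketched as a test over a recorded trace (the trace and node names are illustrative):

```python
def run_agent_trace():
    # Stand-in for the recorded tool-call trace of a real run. Two runs
    # may differ in ordering and in the number of correction cycles;
    # here "extract" happened to take one retry.
    return ["retrieve", "extract", "extract", "validate", "synthesize"]

EXPECTED_NODES = {"retrieve", "extract", "validate", "synthesize"}

def test_dag_shape():
    trace = run_agent_trace()
    # Assert the node set, not the exact path:
    assert set(trace) == EXPECTED_NODES
    # An ordering constraint that IS part of the contract:
    assert trace.index("retrieve") < trace.index("synthesize")

test_dag_shape()
```

Counting repeated nodes in the trace is also where the cycles-per-task metric falls out for free.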
This post is the map. The territory is in the implementation.