From Paper Chaos to Testable Hypotheses

A materials science researcher reads a new preprint about LK-99. Then another. Then eight more, each contradicting the last. Some claim superconductivity. Others blame Cu2S impurities. One reports diamagnetic response above the transition. Another shows XRD peaks that should not be there.

The researcher is not confused because they lack expertise. They are stuck because the literature itself is conflicted, and the cost of testing the wrong hypothesis first is three weeks of compute time and a grad student’s morale.

This is the problem we built for.

What a Researcher Actually Needs

Before diving into architecture, consider what a materials researcher dealing with a contested claim actually needs:

  1. A ranked reading of the evidence — not “here are ten papers”, but “here is what they collectively say, where they disagree, and which disagreement is most testable.”
  2. A concrete test plan — not “consider running first-principles calculations”, but a workspace with scripts, validation targets, and success criteria ready to execute.
  3. Material candidates evaluated before the expensive run — not “Cu2S is worth investigating”, but “here is the energy landscape for Cu2S, Ag2S, and Cu2Se from ML predictions, and here is which one actually discriminates between hypotheses.”
  4. An audit trail that a PI can review — every hypothesis traces back to specific paper sections, every experiment has explicit falsification criteria, every result is JSON-serializable.

That is not a chatbot feature list. That is a research operations surface.

The Core Loop

Our system runs a three-part loop:

  1. Read and cross-reference a paper set.
  2. Generate ranked hypotheses with falsification criteria.
  3. Materialize those hypotheses into executable experiment workspaces.

graph LR
    P[Paper Set] --> H[Hypothesis Generation]
    H --> W[Workspace Assembly]
    W --> E[Experiment Execution]
    E --> J[Final Evaluation]
    J --> N[Next Iteration]
    N --> H
    style P fill:#f3f4f6,stroke:#6366f1,color:#111827
    style N fill:#eef2ff,stroke:#6366f1,color:#111827

The important detail: each stage emits structured outputs, not freeform text. That keeps the system composable and debuggable.

Stage 1: Hypothesis Generation Across a Paper Chain

The hypothesis engine does not produce one “best guess” and stop. It builds a ranked set of testable claims, each with:

  • a mechanism statement
  • supporting and counter evidence from specific paper sections
  • explicit falsification criteria
  • suggested experimental methods with estimated cost
  • physical parameters extracted from the corpus

Think of it as a compiler pass over literature. Raw prose in, typed claims out.

A typical deep pass over a paper chain uses a phased strategy:

  • scan the full corpus for contradictions and convergences
  • deep-read high-signal sections (results, methods, supplementary data)
  • synthesize contradictions into ranked hypotheses with confidence scores

This separation matters. Without it, systems either stay shallow or burn cycles in unstructured exploration. A researcher already knows how to deep-read one paper. What they cannot do efficiently is cross-reference ten papers, identify the three testable disagreements, and rank them by information gain.
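One way to make "rank them by information gain" concrete is sketched below. It is an assumption about how such a ranking could work, not a description of the engine's actual scoring: it treats each experiment as perfectly decisive, so the expected information gain equals the binary entropy of the prior confidence, and divides by an estimated cost. The `Candidate` type, the cost figures, and the H002/H003 confidences are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    hypothesis_id: str
    confidence: float    # prior probability that the hypothesis is true
    cost_seconds: float  # estimated cost of the experiment that decides it


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no outcome that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def rank_by_gain_per_cost(candidates):
    """Sort so the most informative experiment per second comes first."""
    return sorted(
        candidates,
        key=lambda c: binary_entropy(c.confidence) / c.cost_seconds,
        reverse=True,
    )


# Only H001's confidence (0.75) comes from the LK-99 run; the rest is illustrative.
candidates = [
    Candidate("H001", confidence=0.75, cost_seconds=60.0),
    Candidate("H002", confidence=0.50, cost_seconds=600.0),
    Candidate("H003", confidence=0.95, cost_seconds=60.0),
]
ranking = [c.hypothesis_id for c in rank_by_gain_per_cost(candidates)]
```

Under these assumptions a near-certain hypothesis ranks low even when it is cheap to test, because confirming it teaches little.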

What a Generated Hypothesis Actually Looks Like

This is a trimmed output from a real run against the LK-99 paper corpus:

{
  "hypothesis_id": "H001",
  "statement": "The anomalous resistivity drop observed in LK-99 at ~385 K is caused by a first-order structural phase transition in Cu2S impurity phases, not superconductivity.",
  "confidence": 0.75,
  "mechanism": "Cu2S undergoes a known monoclinic-to-hexagonal phase transition at 377-380 K, causing a sudden change in resistivity of Cu2S inclusions embedded in the Pb-apatite matrix.",
  "supporting_evidence": [
    {"paper_id": "2307.12008", "finding": "Sharp resistivity drop at ~385 K in polycrystalline LK-99"},
    {"paper_id": "2308.01516", "finding": "Cu2S phase detected via XRD in LK-99 samples"}
  ],
  "counter_evidence": [
    {"paper_id": "2307.12037", "finding": "Diamagnetic response persists above 385 K in some samples"}
  ],
  "falsification_criteria": [
    "Cu2S phase transition temperature differs from observed LK-99 transition by > 20 K",
    "Cu2S resistivity change at transition is < 10% of LK-99 observed change",
    "No Cu2S detected in XRD patterns of samples showing resistivity drop"
  ]
}

Every hypothesis carries explicit falsification criteria. The system is not trying to confirm a belief. It is trying to find the fastest way to rule one out.

For a researcher, this means the first conversation with a PI about which experiment to run next starts with data, not intuition. For a group with limited compute budget, it means the most information-dense experiment gets priority.
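The JSON shown earlier maps naturally onto a typed record. The sketch below is a minimal illustration of "typed claims" using the field names from that output; the `Hypothesis` and `Evidence` classes and the two validation rules are assumptions, not the system's actual schema. The sample payload is a shortened version of H001.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    paper_id: str
    finding: str


@dataclass(frozen=True)
class Hypothesis:
    hypothesis_id: str
    statement: str
    confidence: float
    mechanism: str
    supporting_evidence: tuple
    counter_evidence: tuple
    falsification_criteria: tuple


def parse_hypothesis(raw: str) -> Hypothesis:
    """Parse a hypothesis JSON string and reject untestable claims."""
    data = json.loads(raw)
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if not data["falsification_criteria"]:
        raise ValueError("a hypothesis needs at least one falsification criterion")
    return Hypothesis(
        hypothesis_id=data["hypothesis_id"],
        statement=data["statement"],
        confidence=float(data["confidence"]),
        mechanism=data["mechanism"],
        supporting_evidence=tuple(Evidence(**e) for e in data["supporting_evidence"]),
        counter_evidence=tuple(Evidence(**e) for e in data["counter_evidence"]),
        falsification_criteria=tuple(data["falsification_criteria"]),
    )


# Shortened H001 payload, for illustration.
raw = json.dumps({
    "hypothesis_id": "H001",
    "statement": "Resistivity drop is caused by a Cu2S phase transition.",
    "confidence": 0.75,
    "mechanism": "Cu2S monoclinic-to-hexagonal transition at 377-380 K.",
    "supporting_evidence": [
        {"paper_id": "2308.01516", "finding": "Cu2S phase detected via XRD"}
    ],
    "counter_evidence": [],
    "falsification_criteria": ["No Cu2S detected in XRD patterns"],
})
h001 = parse_hypothesis(raw)
```

Rejecting a hypothesis with an empty criteria list at parse time is one way to keep falsifiability a structural requirement rather than a convention.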

What This Produced in a Real Run

We ran this pipeline end-to-end against the LK-99 corpus: ten papers spanning the original claim, replication attempts, and debunking analyses.

The hypothesis engine generated three ranked hypotheses. H001 (Cu2S phase transition, confidence 0.75) was selected for workspace execution. The system then generated experiment instructions, provisioned a Docker container with compute access, and ran five experiments autonomously:

| Stage | Duration | What happened |
|---|---|---|
| Hypothesis generation | ~30 seconds | 3 ranked hypotheses with falsification criteria |
| Instruction synthesis | ~15 seconds | instructions.md + hypothesis.json + physical_parameters.json |
| Workspace provisioning | ~2.5 minutes | Container with compute API client, templates, data files |
| Experiment execution | ~16.5 minutes | 5 Python scripts written, executed, outputs verified |
| Analysis report | ~3.5 minutes | Structured verdict with evidence chain |

The runner agent wrote each script, called the compute gateway for ML predictions and materials database queries, verified outputs against validation targets, and moved to the next experiment. When one script produced an unexpected energy value, the agent flagged it in the results rather than silently continuing.

The final analysis report scored 5 out of 5 on target keyword coverage and produced a composite evaluation score of 1.0 against the held-out resolution paper. The system correctly identified the Cu2S phase transition as the dominant mechanism, consistent with the later consensus in the literature.

Total wall-clock time from “here are ten papers” to “here is a scientific verdict with evidence”: approximately 23 minutes.

For context, a graduate student doing this manually — reading the papers, setting up the calculations, running them, and writing an analysis — would typically need one to two weeks.
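The 23-minute figure is simply the sum of the stage durations reported for this run. A short check of the arithmetic, using only those reported (approximate) values:

```python
# Stage durations in minutes, as reported for the LK-99 run (all approximate).
stage_minutes = {
    "hypothesis_generation": 0.5,    # ~30 seconds
    "instruction_synthesis": 0.25,   # ~15 seconds
    "workspace_provisioning": 2.5,
    "experiment_execution": 16.5,
    "analysis_report": 3.5,
}

total_minutes = sum(stage_minutes.values())
execution_share = stage_minutes["experiment_execution"] / total_minutes
```

The total comes to 23.25 minutes, and experiment execution accounts for roughly 71% of it, so that stage is where further speedups would matter most.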

Material Discovery Is Part of the Loop, Not a Separate Workflow

The biggest shift is this: hypothesis generation and material discovery are not sequential stages. They are coupled in the same agent loop.

When the agent proposes a mechanism, it can immediately test feasibility against candidate materials, known structures, and computed properties through its tool set. That changes the role of discovery from “later validation” to “real-time pruning.”

The design_material Tool: From One Hypothesis to a Material Family

This is where the loop gets interesting for researchers who think in terms of compositional space, not individual compounds.

The system includes a design_material tool that proposes new material compositions by element substitution on known crystal structures. Given a source structure from a materials property database and a substitution map, it:

  1. Applies the substitution to the crystal structure (e.g., replace all Cu atoms with Ag)
  2. Runs an ML property prediction on the modified structure (~1 second)
  3. Returns predicted energy, forces, stability assessment, lattice parameters, and a CIF file
  4. Registers the designed material as a trackable entity in the answer state

This is not a lookup. It is a structural transformation followed by a physics prediction.
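Step 1, the structural transformation, can be sketched without any materials library. The code below is a toy illustration only: the `Site` type, the helper names, and the three-atom cell with made-up coordinates are assumptions, not the real chalcocite structure or the tool's implementation. The ML prediction in steps 2 to 4 is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    element: str
    frac_coords: tuple  # fractional coordinates (x, y, z)


def substitute(sites, substitutions):
    """Return new sites with elements swapped; positions are left unchanged."""
    return [
        Site(substitutions.get(site.element, site.element), site.frac_coords)
        for site in sites
    ]


def formula(sites):
    """Build a simple formula string with elements in alphabetical order."""
    counts = {}
    for site in sites:
        counts[site.element] = counts.get(site.element, 0) + 1
    return "".join(
        f"{element}{count if count > 1 else ''}"
        for element, count in sorted(counts.items())
    )


# Toy cell with hypothetical coordinates, for illustration only.
cu2s_toy = [
    Site("Cu", (0.00, 0.00, 0.00)),
    Site("Cu", (0.50, 0.50, 0.00)),
    Site("S", (0.25, 0.25, 0.50)),
]
ag2s_toy = substitute(cu2s_toy, {"Cu": "Ag"})
```

Because the substituted cell keeps the parent's atomic positions, it is only a starting guess; that is why the tool follows the substitution with a property prediction rather than returning the edited structure directly.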

Derived Hypotheses: How One Experiment Spawns the Next

Here is the pattern that makes this a loop rather than a pipeline:

Suppose H001 states that Cu2S impurity phase transitions explain the LK-99 resistivity anomaly. The agent runs experiments on Cu2S and the evidence supports the hypothesis. A naive system would stop here and declare victory.

But a materials researcher would immediately ask: Does this mechanism generalize? What about Ag2S? Cu2Se? What if I substitute the cation and the transition disappears?

This is exactly what the next iteration does. The agent can call design_material to create Ag2S from the Cu2S template:

design_material(
  material_id="mp-560588",     # Cu2S (chalcocite)
  substitutions={"Cu": "Ag"},  # Replace Cu with Ag
  relax=True                   # Predict properties with ML potential
)

The ML potential returns the predicted energy landscape for Ag2S. If its transition energetics differ significantly from those of Cu2S, Ag2S becomes a useful negative control for H001: a structurally related sulfide where the mechanism should not produce the same resistivity anomaly.

This generates a derived hypothesis:

H001a: “If the Cu2S phase transition mechanism is correct, substituting Cu with Ag should eliminate the resistivity anomaly near 385 K, because Ag2S undergoes its monoclinic-to-cubic (acanthite-to-argentite) transition near 450 K, outside the LK-99 observation window.”

The derived hypothesis carries its own falsification criteria, its own material references, and its own experiment plan. It feeds back into the loop as a new workspace.
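The selection logic behind a negative control like H001a can be sketched as a simple window test. This is an illustrative assumption about how candidates might be filtered, reusing the 20 K tolerance from H001's first falsification criterion and the transition temperatures quoted in the text; the function names are hypothetical.

```python
OBSERVED_ANOMALY_K = 385.0  # LK-99 resistivity drop, from H001
WINDOW_K = 20.0             # tolerance from H001's first falsification criterion

# Transition temperatures as quoted in the text (Cu2S: lower end of 377-380 K).
TRANSITION_K = {"Cu2S": 377.0, "Ag2S": 450.0}


def can_explain_anomaly(material: str) -> bool:
    """True if the material's transition falls inside the observation window."""
    return abs(TRANSITION_K[material] - OBSERVED_ANOMALY_K) <= WINDOW_K


def derive_negative_controls(parent: str, candidates):
    """Candidates whose transition misses the window while the parent's hits it."""
    if not can_explain_anomaly(parent):
        return []
    return [m for m in candidates if not can_explain_anomaly(m)]


negative_controls = derive_negative_controls("Cu2S", ["Ag2S"])
```

Cu2S sits 8 K from the anomaly and passes the test; Ag2S sits 65 K away and fails it, which is what qualifies it as a control.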

graph TD
    H1[H001: Cu2S Phase Transition] --> DM[design_material: Cu to Ag]
    DM --> P[ML Prediction: Ag2S]
    P --> H1a[H001a: Ag2S Negative Control]
    H1a --> W[New Workspace]
    W --> E[Experiment: Compare Cu2S vs Ag2S]
    E --> V[Verdict: Mechanism Specificity]
    V --> H2[H002: Next Hypothesis]
    style H1 fill:#f3f4f6,stroke:#6366f1,color:#111827
    style V fill:#eef2ff,stroke:#6366f1,color:#111827

This is the difference between an agent that answers questions and a system that designs experiments. The researcher gets a systematic exploration of compositional space, not a single-point answer.

In practice, this means:

  • impossible candidates are eliminated before expensive first-principles runs
  • promising material families are expanded with structure-aware variants
  • each hypothesis carries discovery context with database IDs, not just text rationale
  • negative controls are generated automatically, not as an afterthought

For a university group with limited compute budget, this reduces dead-end simulations by pruning with ML screening (~1 second) before committing to first-principles calculations (~20-120 seconds each). For an engineering team, it reduces iteration latency from days to minutes. For investors evaluating a materials startup, it provides a measurable, auditable trail of what was explored and why.

Tool Impact Inside the Hypothesis Agent

A good agent toolset is not about having many tools. It is about making each tool change a decision boundary.

In our hypothesis loop, tool calls do three high-value jobs:

  • Evidence compression: distill multi-paper contradictions into comparable hypothesis objects with confidence scores and falsification criteria
  • Discovery grounding: retrieve known materials, design new compositions, and validate structural feasibility before downstream execution
  • Execution readiness: package selected hypotheses into reproducible workspaces with explicit commands, status tracking, and analysis hooks

The compute tools available inside the loop span the full fidelity ladder:

| Tool | Method | Cost | Use case |
|---|---|---|---|
| predict_properties | ML potential | ~1s | Fast screening across many structures |
| design_material | Substitution + ML | ~1s | Compositional exploration |
| simulate_md | ML molecular dynamics | ~30s | Thermal stability, phase transitions |
| compute_phase_diagram | Thermodynamic DB | ~2s | Stability analysis |
| predict_first_principles | First-principles | ~20s | High-fidelity single-point energy |
| relax_first_principles | First-principles | ~60s | Geometry optimization |
| compute_band_structure | First-principles | ~120s | Electronic structure |
| compute_dos | First-principles | ~120s | Density of states |

The agent navigates this fidelity ladder during hypothesis evaluation: screen with ML predictions, confirm with first-principles calculations, report with explicit error bars. A researcher manually managing this ladder across ten candidate materials would spend most of their time on bookkeeping. The system handles the bookkeeping and lets the researcher focus on the science.

The result is a system that does not just “suggest ideas.” It commits ideas into testable operating artifacts with compute results attached.
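The screen-then-confirm pattern on the fidelity ladder can be sketched as below. The tool names and per-call costs are taken from the table; the `screen_then_confirm` function, the candidate names, and the fake energies are hypothetical stand-ins, and no real compute call is made.

```python
# Approximate per-call costs in seconds, from the tool table.
COST_SECONDS = {"predict_properties": 1, "relax_first_principles": 60}


def screen_then_confirm(candidates, ml_energy, top_k):
    """Rank all candidates with the cheap model, then budget the top_k for relaxation."""
    ranked = sorted(candidates, key=ml_energy)  # lowest predicted energy first
    shortlist = ranked[:top_k]
    screening_cost = len(candidates) * COST_SECONDS["predict_properties"]
    confirm_cost = len(shortlist) * COST_SECONDS["relax_first_principles"]
    return shortlist, screening_cost + confirm_cost


# Fake, deterministic "ML energies" for 50 hypothetical candidates.
fake_energy = {f"cand-{i:02d}": ((i * 37) % 50) / 10.0 for i in range(50)}
candidates = sorted(fake_energy)

shortlist, ladder_cost = screen_then_confirm(candidates, fake_energy.get, top_k=5)
brute_force_cost = len(candidates) * COST_SECONDS["relax_first_principles"]
```

With these numbers the ladder costs 350 seconds against 3000 seconds for relaxing all fifty candidates from first principles, at the price of trusting the ML ranking to keep the right five.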

Stage 2: Workspace Synthesis, Not Just “Generated Code”

Once a hypothesis is selected, we do not generate one throwaway script.

We generate a full research workspace with:

  • instructions.md (research context, experiment list, success criteria)
  • data/hypothesis.json (structured claim + metadata)
  • data/physical_parameters.json (numeric anchors from the literature)
  • experiment command templates and agent roles
  • analysis/report scaffolding

This is the difference between a demo and an operations surface.

A single script is a sprint. A workspace is a race program.

graph TD
    H[Hypothesis + Metadata] --> I[instructions.md]
    H --> D[data/*.json]
    I --> S[Workspace Skeleton]
    D --> S
    S --> Q[Queued Execution Steps]
    Q --> R[Results + Reports]
    style H fill:#f3f4f6,stroke:#6366f1,color:#111827
    style R fill:#eef2ff,stroke:#6366f1,color:#111827
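A minimal sketch of workspace assembly follows, assuming the file layout listed above. The `assemble_workspace` function, the generated file contents, and the SHA-256 fingerprint are illustrative assumptions, not the system's implementation. Serializing JSON with sorted keys means the same inputs yield byte-identical files.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def assemble_workspace(root: Path, hypothesis: dict, physical_parameters: dict) -> str:
    """Write the workspace skeleton and return a fingerprint of its contents."""
    files = {
        "instructions.md": (
            "# Research context\n\n"
            f"Hypothesis {hypothesis['hypothesis_id']}: {hypothesis['statement']}\n"
        ),
        "data/hypothesis.json": json.dumps(hypothesis, sort_keys=True, indent=2) + "\n",
        "data/physical_parameters.json": (
            json.dumps(physical_parameters, sort_keys=True, indent=2) + "\n"
        ),
    }
    digest = hashlib.sha256()
    for rel_path in sorted(files):  # fixed order keeps the fingerprint stable
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(files[rel_path], encoding="utf-8")
        digest.update(rel_path.encode("utf-8"))
        digest.update(files[rel_path].encode("utf-8"))
    return digest.hexdigest()


hypothesis = {"hypothesis_id": "H001", "statement": "Cu2S phase transition", "confidence": 0.75}
physical_parameters = {"T_Cu2S_transition_K": 377}

# Same content with a different key order should produce the same fingerprint.
reordered = dict(reversed(list(hypothesis.items())))
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    digest_a = assemble_workspace(Path(dir_a), hypothesis, physical_parameters)
    digest_b = assemble_workspace(Path(dir_b), reordered, physical_parameters)
```

A fingerprint like this gives a cheap way to confirm that two workspaces were built from the same hypothesis and parameters.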

A Generated Experiment Plan

Here is a trimmed excerpt from an experiment specification produced for the Cu2S phase transition hypothesis:

### Experiment 1: Cu2S Phase Transition Energy Landscape

**Objective:** Determine the energy difference between monoclinic and hexagonal Cu2S phases.

**Approach:** Use ML potential to compute total energies for both Cu2S polymorphs. Calculate relative stability and approximate transition temperature.

**Expected Output:** `results/cu2s_energy_difference.json`

**Validation Targets:** T_Cu2S_transition_K = 377 K +/- 10 K

Four experiments like this are generated from a single hypothesis. Each has explicit inputs, outputs, and validation targets. The workspace includes data files, a compute API client, and agent roles to execute every step autonomously.
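A validation target such as "377 K +/- 10 K" reduces to a tolerance check. The sketch below shows one way to encode it; the `ValidationTarget` class is a hypothetical illustration, and only the target name and numbers come from the excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationTarget:
    name: str
    expected: float
    tolerance: float

    def is_met(self, measured: float) -> bool:
        """True when the measured value lies within expected +/- tolerance."""
        return abs(measured - self.expected) <= self.tolerance


# Target from Experiment 1 of the generated plan.
target = ValidationTarget("T_Cu2S_transition_K", expected=377.0, tolerance=10.0)
```

A runner could call `target.is_met(value)` on each result and record the boolean next to the raw number, so the report states both what was measured and whether it passed.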

Stage 3: Internal Compute Gateway (Security by Topology)

A key requirement is running materials computation from experiment containers without leaking upstream credentials.

This is not a theoretical concern. Experiment containers run agent-generated code. They call ML prediction services, first-principles solvers, and materials databases. Each of those services requires authentication — API keys, database credentials, service tokens. If the experiment container holds those credentials directly, a single code generation error could expose them in logs, error messages, or network traces.

We solve this with an internal compute gateway that enforces security through network topology, not policy:

  • experiment containers call a local gateway endpoint on a private Docker network
  • the gateway forwards requests to upstream compute services, injecting credentials server-side
  • containers never see, store, or transmit upstream secrets
  • if the gateway cannot reach a service, it returns a clean 503: the container knows something is unavailable, but learns nothing about the upstream topology

graph LR
    C[Experiment Container] --> G[Internal Compute Gateway]
    G --> M[ML Property Service]
    G --> F[First-Principles Service]
    G --> DB[Materials Property Database]
    style C fill:#f3f4f6,stroke:#6366f1,color:#111827
    style DB fill:#eef2ff,stroke:#6366f1,color:#111827

The 503 response is a deliberate design choice. When the gateway has no upstream URL configured for a service, the response is identical to a temporary outage: a generic “service unavailable.” The experiment container cannot distinguish between “this service is not configured” and “this service is temporarily down.” That ambiguity is the security property. An experiment that probes endpoint availability learns nothing about the infrastructure behind the gateway.

For developers, this gives a clean client module inside each workspace — a single base URL, no credential management. For security teams, it constrains the credential blast radius by design: even if a container is fully compromised, the attacker gains access to zero upstream credentials and cannot map the service topology.

For researchers, the benefit is simpler: the compute client inside the workspace just works. They do not need to manage API keys for ML services, first-principles solvers, or the materials database. They point at the gateway, and the gateway handles the rest.
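The indistinguishable-503 behavior can be sketched as a small routing function. Everything here is an illustrative assumption: the `route` signature, the bearer-token header, the service names, and the injected `send` transport are hypothetical, and no real network call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    status: int
    body: str


# One shared response object: unconfigured and unreachable look identical.
UNAVAILABLE = Response(503, "service unavailable")


def route(service, upstreams, secrets, send):
    """Forward a request to an upstream, injecting credentials on the gateway side."""
    url = upstreams.get(service)
    if url is None:
        return UNAVAILABLE  # service not configured
    try:
        headers = {"Authorization": f"Bearer {secrets[service]}"}
        return send(url, headers)
    except (ConnectionError, KeyError):
        return UNAVAILABLE  # upstream down or credential missing: same answer


def healthy_send(url, headers):
    return Response(200, "ok")


def failing_send(url, headers):
    raise ConnectionError("connection refused")


upstreams = {"ml": "http://ml.internal"}  # hypothetical upstream URL
secrets = {"ml": "example-token"}         # lives only inside the gateway

configured = route("ml", upstreams, secrets, healthy_send)
unconfigured = route("first_principles", upstreams, secrets, healthy_send)
unreachable = route("ml", upstreams, secrets, failing_send)
```

The caller only ever supplies a service name, so neither the upstream URL nor the token appears in anything the container sends or receives in the failure paths.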

What This Means in Practice

For researchers reading contested claims

You drop a paper set into a workspace and get back ranked hypotheses with explicit evidence chains, not a summary paragraph. Each hypothesis tells you which papers support it, which contradict it, and what specific measurement would falsify it. That is the starting point for a lab meeting, not a literature review.

The system then generates a full experiment plan and executes it. In our LK-99 run, results arrived in about 23 minutes, against the one to two weeks we estimate for a manual pass. If the results support the hypothesis, the system proposes derived hypotheses by exploring related material compositions. If the results falsify it, the next-highest-ranked hypothesis gets its turn.

For university materials groups with limited compute

The fidelity ladder matters here. ML screening at ~1 second per structure lets you evaluate fifty candidate materials before committing to a single first-principles run, which costs roughly 20 to 120 seconds depending on the calculation. The design_material tool generates structural variants programmatically, so compositional exploration does not require manual CIF editing.

A group that previously explored three compositions per week can now screen fifty and confirm the top five from first principles in a single afternoon.

For PIs reviewing student work

Every hypothesis carries a JSON-serializable audit trail: which papers contributed, what confidence level was assigned, what would falsify it, and what experiments were run. The analysis report includes explicit validation targets and whether they were met.

This means a PI can review a student’s hypothesis-to-experiment cycle by reading structured outputs, not by re-running calculations or trusting narrative summaries.

For engineering teams and investors

The moat is in orchestration quality: the speed and reliability of moving from ambiguous evidence to executable, auditable experimental decisions.

Progress is not a narrative. It is a data structure. Each iteration of the loop produces:

  • a versioned hypothesis with provenance
  • a workspace with deterministic assembly (same inputs, same workspace)
  • experiment results with explicit success/failure criteria
  • a structured verdict that feeds the next iteration

For investors evaluating a materials discovery company, this is the difference between “we ran some simulations” and “here is the decision tree, here are the branch points, and here is the evidence at each node.” The loop is auditable end-to-end because every intermediate artifact is JSON-serializable and every tool call is logged with inputs and outputs.

The “Withheld Final Paper” Check

One of the most useful validation patterns we use is simple:

  • build hypotheses from an initial paper chain
  • hold out the final resolution paper
  • score generated hypotheses against the held-out resolution

In our LK-99 run, the system scored 5/5 on target keyword coverage against the withheld consensus paper, with a composite evaluation score of 1.0. The hypothesis it ranked highest was the one the community eventually converged on.

That gives you a practical measure of whether your system is learning signal or just echoing noise.

In racing terms, it is not enough for your strategy simulator to look smart during the race. It has to match post-race truth.
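The keyword-coverage part of this check can be sketched as a case-insensitive substring count. The keyword list and report text below are hypothetical examples, not the ones used in the LK-99 run, and the composite evaluation score is not modeled because its formula is not given here.

```python
def keyword_coverage(report: str, target_keywords):
    """Return (hits, total): how many target keywords appear in the report."""
    text = report.lower()
    hits = sum(1 for keyword in target_keywords if keyword.lower() in text)
    return hits, len(target_keywords)


# Hypothetical keywords drawn from a held-out resolution paper.
target_keywords = [
    "Cu2S",
    "phase transition",
    "impurity",
    "resistivity",
    "not superconductivity",
]

# Hypothetical report text, for illustration only.
report = (
    "The resistivity drop is explained by a structural phase transition in the "
    "Cu2S impurity phase, not superconductivity."
)
hits, total = keyword_coverage(report, target_keywords)
```

Substring matching is a blunt instrument: it rewards vocabulary overlap, so it works best alongside a check that the top-ranked mechanism actually matches the resolution.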

Engineering Principles We Optimized For

  1. Structured outputs over narrative outputs — every hypothesis, experiment, and verdict is a typed data structure
  2. Deterministic assembly over ad hoc generation — same hypothesis produces the same workspace, every time
  3. Discovery integrated into hypothesis generation: design_material and property prediction run inside the reasoning loop, not after it
  4. Secure internal routing over distributed secret handling — credentials never enter experiment containers
  5. Asynchronous execution with explicit status lifecycle — workspaces report provisioning, running, completed, or failed states via SSE
  6. Evaluation loops that can disagree with your prior belief — falsification criteria are first-class citizens, not afterthoughts

Those principles let us support multi-paper reasoning and material discovery without turning the system into a black box.
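The status lifecycle in principle 5 can be sketched as a small state machine. The four state names come from the list above; the allowed transitions, the `advance` helper, and the event name are assumptions. The `sse_event` output follows the standard Server-Sent Events text format.

```python
from enum import Enum


class Status(Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transitions: terminal states allow no further moves.
ALLOWED = {
    Status.PROVISIONING: {Status.RUNNING, Status.FAILED},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED},
    Status.COMPLETED: set(),
    Status.FAILED: set(),
}


def advance(current: Status, new: Status) -> Status:
    """Move to a new status, rejecting transitions the lifecycle does not allow."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


def sse_event(status: Status) -> str:
    """Format a status update as one Server-Sent Events message."""
    return f"event: status\ndata: {status.value}\n\n"


state = advance(Status.PROVISIONING, Status.RUNNING)
message = sse_event(state)
```

Making illegal transitions raise keeps a workspace from silently moving out of a terminal state, which is what lets a client trust the last event it received.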

If you want to see this loop in action, open the app and run a hypothesis-to-workspace cycle on your own paper set.

