My previous employment was at a company in the digital asset management space; towards the end, we worked on generating short-form commercial videos.
One of our USPs was consistent character generation, which held up until Nano Banana, and soon other mainstream models, caught up. So the pivot was to generating short-form commercial videos. Once the shop closed, I spent time reflecting on how things went, and on how the landscape changes with AI in the picture. One of those thoughts gave birth to krearts.
During the time I was also:
This blog documents some of the observations from building a research-oriented framework, and my understanding of this space, beyond the code itself.
When people see ChatGPT, Claude, Grok — the assumption is that building an LLM product is just API calls with a nice UI. Upload a PDF, call the model, return the answer. Ship it.
The gap between using these tools and building with them is massive.
Attachment parsing alone is a project. Scientific PDFs have multi-column layouts, inline equations, tables that span pages, figures with captions that reference other figures. Getting clean text out of that isn’t pdf2text. It’s a pipeline — layout detection, table extraction, figure-caption association, section boundary identification. And every format is different. A Nature paper looks nothing like an arXiv preprint.
Then there’s retrieval. The first prototype skipped preprocessing and indexing entirely: no tagging of sections, no embeddings, only enriched extractions. It was slow, and it had to be tweaked continuously to support new types of queries.
Corrective RAG, structured extraction, citation grounding — each turned out to be its own subsystem with its own failure modes. The “wrapper” ended up being the smallest part of the system.
CrewAI should get more credit; it is to agents what Rails is to Ruby. Especially the memory management, extensibility, tooling, and integrations. It also has built-in support for a variety of document formats.
For a smaller startup with capital, one could throw GPUs and Docling at the problem; otherwise, choose to build the intermediate stages yourself.
The same could be said for LangChain and LangGraph, though I would relate those more to Sinatra.
N.B.: Watch out for CrewAI's event management system; it is an app-level global situation. But one can use async context to map events back to connections.
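The mapping can be sketched with Python's `contextvars`. This is the general asyncio pattern, not CrewAI's actual event API; the handler and connection names are hypothetical:

```python
import asyncio
import contextvars

# Each connection handler sets this before doing agent work; a process-global
# event handler reads it to attribute events to the right connection.
# (Hypothetical sketch of the pattern, not CrewAI's actual event API.)
current_connection: contextvars.ContextVar = contextvars.ContextVar("current_connection")

events_seen = []

def on_agent_event(event_name):
    # Global handler: without the context var it could not tell which of many
    # concurrent connections triggered this event.
    events_seen.append((current_connection.get(), event_name))

async def handle_connection(conn_id):
    current_connection.set(conn_id)     # task-local thanks to contextvars
    await asyncio.sleep(0)              # yield so handlers interleave
    on_agent_event("task_started")      # fires "globally", maps back to conn_id

async def main():
    await asyncio.gather(handle_connection("conn-a"), handle_connection("conn-b"))

asyncio.run(main())
```

Each task spawned by `gather` gets its own copy of the context, so concurrent connections never clobber each other's `current_connection`.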
A user types: “What compositions showed the lowest formation energy across these three papers?”
The naive assumption is “send to LLM, get answer.” In practice, there’s a whole pipeline before the model ever sees the question:
Query expansion. The raw question gets rewritten. “Formation energy” might need to expand to “enthalpy of formation,” “DFT-computed stability,” or specific notation like ΔHf. A single user question becomes multiple retrieval queries.
Intent classification. Is this a comparison across papers? A lookup in a single table? A synthesis question that needs reasoning? The retrieval strategy changes depending on the answer.
Hybrid retrieval. Full-text search (FTS), n-gram matching, and semantic embeddings each catch different things. FTS finds exact terms. N-grams handle partial matches and chemical formulas that embedding models mangle. Embeddings capture semantic similarity — “thermal stability” matching “resistance to decomposition.”
Reranking. The initial retrieval casts a wide net. A reranker (cross-encoder or LLM-based) scores each chunk against the original question for fine-grained relevance. This is where you go from “related passages” to “the actual answer is in these three paragraphs.”
Reciprocal Rank Fusion (RRF) merges results from different retrieval methods into a single ranked list. The LLM then reasons over the top results instead of just returning links.
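RRF itself is only a few lines. A minimal sketch with made-up document IDs, using the standard `1 / (k + rank)` formula and the conventional `k = 60`:

```python
def rrf_merge(result_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
    `result_lists` holds ranked doc-id lists, best first."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the three retrieval methods described above.
fts_hits = ["p7", "p2", "p9"]        # exact-term matches
ngram_hits = ["p2", "p7", "p4"]      # partial / chemical-formula matches
embedding_hits = ["p2", "p5", "p7"]  # semantic matches

merged = rrf_merge([fts_hits, ngram_hits, embedding_hits])
```

Documents that appear high in several lists bubble to the top even if no single method ranked them first; here `p2` wins because two of the three methods put it first.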
None of this retrieval composition is new. Google has done this for decades. The difference is composing these with an LLM reasoning loop instead of hand-tuned ranking signals.
What this replaces: Traditionally, doing this well meant deploying Elasticsearch or Solr — heavy infrastructure with its own operational cost, query DSLs, analyzers, synonym dictionaries, spell-check configs, and tokenizer tuning.
With an LLM and vector search, a lot of that goes away:
Vector search (pgvector) plus FTS on Postgres replaces what used to require a dedicated search deployment. The maintenance and migration costs associated with Elasticsearch and its Java-baked ecosystem reduce to something much simpler: GPUs, a database, and API calls.
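A hybrid retrieval query along those lines might look as follows. The schema (`chunks`, `body_tsv`, `embedding`) and the in-SQL RRF fusion are illustrative, assuming the `pgvector` extension is installed:

```python
# Hypothetical schema: chunks(id, body, body_tsv tsvector, embedding vector(1536)).
# One round trip does both retrievals; ranks are fused with RRF directly in SQL.
HYBRID_SEARCH = """
WITH fts AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(body_tsv, q) DESC) AS rank
    FROM chunks, plainto_tsquery('english', %(query)s) AS q
    WHERE body_tsv @@ q
    ORDER BY rank
    LIMIT 50
),
sem AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> %(query_vec)s) AS rank
    FROM chunks
    ORDER BY embedding <=> %(query_vec)s
    LIMIT 50
)
SELECT id,
       COALESCE(1.0 / (60 + fts.rank), 0) + COALESCE(1.0 / (60 + sem.rank), 0) AS rrf
FROM fts FULL OUTER JOIN sem USING (id)
ORDER BY rrf DESC
LIMIT 10;
"""
```

No analyzers, no synonym dictionaries, no cluster to babysit: two CTEs and a join replace the search deployment for this workload.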
Without an embedding and indexing pipeline, every query pays the full cost. Parse the PDF. Chunk it. Embed the chunks. Search. Answer. For a 40-page paper, that’s 30+ seconds before the user sees anything.
The alternative is progressive indexing: make the paper useful immediately and build deeper indexes in the background.
This works in tiers:
A query that arrives at tier 1 gets FTS-only retrieval. Not perfect, but fast and useful. By the time the researcher has read the first answer and typed a follow-up, tier 2 or 3 is ready.
The key insight: researchers don’t upload a paper and immediately ask their hardest question. They start with “what’s this about?” and work their way to specifics. Progressive indexing matches the system’s readiness to the user’s actual behavior.
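The tier progression can be sketched as follows; the tier names and retrieval strategies are illustrative, with the actual parsing and embedding work stubbed out:

```python
import asyncio

async def build_tier(paper_id, tier):
    # Stand-in for real work: tier 1 = FTS index, tier 2 = chunk + embed,
    # tier 3 = section tagging. Tier contents are illustrative.
    await asyncio.sleep(0)

class ProgressiveIndex:
    """Serve queries at whatever tier is ready; keep indexing in the background."""
    STRATEGIES = {0: "none", 1: "fts_only", 2: "hybrid", 3: "hybrid+section_filter"}

    def __init__(self, paper_id):
        self.paper_id = paper_id
        self.ready_tier = 0

    async def build(self):
        for tier in (1, 2, 3):
            await build_tier(self.paper_id, tier)
            self.ready_tier = tier  # each completed tier unlocks a better strategy

    def retrieval_strategy(self):
        return self.STRATEGIES[self.ready_tier]

async def main():
    idx = ProgressiveIndex("paper-001")
    first = idx.retrieval_strategy()   # user asks before any indexing finishes
    await idx.build()                  # in production this runs as a background task
    return first, idx.retrieval_strategy()

first_strategy, final_strategy = asyncio.run(main())
```

A query routed through `retrieval_strategy()` automatically gets the best search the system can currently offer, which is exactly the tier-matching behavior described above.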
This is the same pattern behind Cinestar’s five-phase video indexing pipeline — make a video searchable the moment it’s uploaded (phase 0, basic metadata), then progressively refine with multi-modal enrichment, coarse segmentation, fine segmentation, and cross-reference passes.
The domain is different but the architecture is identical: immediate utility, background refinement, each tier unlocking better search quality.
Graph RAG has real value. But the costs are real too.
Building a knowledge graph over one document requires:
For a single document, this is expensive relative to the payoff. A well-chunked document with good metadata gets roughly 80% of the way there.
The better alternative — section_covers.
No matter how unstructured a PDF layout looks, the domain and the humans in it have a structure. Every scientific paper has an implicit hierarchy — title, abstract, hypothesis, methods, results, conclusion. The sections might be named differently, merged together, or split across pages, but the structure is always there. Researchers read papers this way instinctively.
The idea: teach the LLM this structure through the prompt, and have it classify each chunk during ingestion. The classification is an array — ["methods", "results", "datasets"] — not a single label, because sections overflow. A “methods” section often contains datasets and preliminary results too.
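The shape of that data can be sketched as below, with a keyword stand-in where the real system would make an LLM classification call; the cue lists are made up:

```python
# Sketch of multi-label section tagging (section_covers) at ingestion time.
# classify_chunk would normally be an LLM call against the section taxonomy;
# a keyword stand-in keeps the data shape visible.
def classify_chunk(text):
    cues = {
        "methods": ["we performed", "protocol", "dft"],
        "results": ["we observed", "formation energy"],
        "datasets": ["dataset", "materials project"],
    }
    covers = [s for s, words in cues.items() if any(w in text.lower() for w in words)]
    return covers or ["other"]

chunk = "We performed DFT relaxations on 200 structures from the Materials Project dataset."
covers = classify_chunk(chunk)   # an array, because sections overflow

# Query time: "show me just the methods" is a plain membership filter,
# no graph traversal needed.
index = [{"id": 1, "section_covers": covers}]
methods_chunks = [c for c in index if "methods" in c["section_covers"]]
```

The same chunk lands in both the "methods" and "datasets" buckets, which is the point: one label per chunk would lose the overflow.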
How the LLM knows the structure. The agent prompt is hierarchically organized with custom tags — identity, capabilities, workflows, security, output rules — each scoped and nested. Within this, the paper-reading workflows define phased strategies:
The Researcher agent is given detailed instructions on how a “researcher” reads — which sections to check first for which kind of question, when to fall back to broader reading, how to cross-reference across documents.
At query time, filtering by section type is a simple indexed array lookup. “Show me just the methods across these three papers” — no graph traversal needed. The LLM can then drive a ReAct loop to compare across sections, navigate the hierarchy, and synthesize — all without an entity graph.
Researchers don’t work with one paper. They work with dozens of papers, experimental notes, simulation results, reviewer feedback. Over weeks and months, connections accumulate:
That’s where a knowledge graph becomes valuable. Not blind LLM extraction — asking a model to “extract all entities” from a paper produces confident garbage.
What works is:
Graph RAG for single-document retrieval is usually overkill. Graph RAG as an epistemic knowledge web built over months of research — that’s where it becomes worth the investment.
Models get switched. Pricing changes, a new model drops with better structured output, an open-source option gets good enough for a subtask. Each time, it means rewriting prompts and fixing output parsing — unless the system is built for it.
The prompt is the cage geometry — it shapes the output. But real portability comes from treating LLM output as untrusted input.
When validation fails, the system retries with a corrective prompt that includes the validation error. Self-correcting loops. In practice, most queries resolve in one pass. Some need two. When it takes three, the problem is almost always in the prompt design or the retrieval, not the model.
This leads to a useful model-agnostic metric: not “which model is best” but “how many correction cycles does this model need for this task.” GPT-4 might need one cycle where Claude needs two, or vice versa, depending on the task class. The system handles both.
One observation: most open-source models share failure modes — they’re fine-tuned from the same bases. A correction loop that handles Llama’s JSON formatting quirks tends to handle Mistral’s too. Build the cage right and the wind can change direction.
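The correction loop can be sketched as follows, with the model call stubbed to fail validation once; the JSON schema and error messages are illustrative:

```python
import json

def call_llm(prompt, attempt):
    # Stand-in for a real model call: fails validation once, then succeeds.
    return '{"energy": "low"}' if attempt == 1 else '{"energy": -1.82, "unit": "eV/atom"}'

def validate(raw):
    data = json.loads(raw)  # raises on malformed JSON
    if not isinstance(data.get("energy"), (int, float)):
        raise ValueError("`energy` must be numeric")
    return data

def run_with_correction(prompt, max_cycles=3):
    """Treat LLM output as untrusted input: validate, and on failure retry
    with the validation error folded back into the prompt."""
    cycles = 0
    for attempt in range(1, max_cycles + 1):
        raw = call_llm(prompt, attempt)
        try:
            return validate(raw), cycles
        except (ValueError, json.JSONDecodeError) as err:
            cycles += 1
            prompt += f"\nYour last output failed validation: {err}. Return corrected JSON."
    raise RuntimeError(f"unresolved after {max_cycles} cycles")

result, correction_cycles = run_with_correction("Extract the formation energy as JSON.")
```

The returned `correction_cycles` is exactly the model-agnostic metric: count it per model and per task class, and swapping providers becomes a measurement, not a rewrite.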
I wish I could write passionately about these things, but I don’t really find much challenge or learning in the high-level abstractions.
Instruction following in LLMs has gotten better over time, so ReAct is a good place to start.
So the only two things that remain:
The interesting parts are in the:
Everything else, even RLMs, sounds like a fancy fad to me.
Using a Python environment to run code to pull information out of context: if it gets the code wrong, does it retry? How many tokens per round trip? How do you run it securely? What about when the LLM lieeeeessssss in the code?
ReAct should be evolved into ParallelReAct, goal planning, and other mechanisms that are already well established in gaming systems. Why replace a simple, predictable agent and retrieval system with a token vacuum and a security nightmare, rather than have a proper data structure and ways to query it? SQL is for insane people, apparently.
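A minimal sketch of what a ParallelReAct step could mean: when the planner emits independent actions, execute them concurrently with `asyncio.gather` instead of the strict think-act-observe serialization. The tool names are hypothetical:

```python
import asyncio

# Hypothetical tools; in a real agent these hit search, databases, instruments.
async def search_papers(q):
    await asyncio.sleep(0)
    return f"papers({q})"

async def query_table(q):
    await asyncio.sleep(0)
    return f"rows({q})"

async def parallel_react_step(actions):
    """One 'ParallelReAct' step: independent actions run concurrently, and all
    observations are handed back to the model in a single batch."""
    observations = await asyncio.gather(*(tool(arg) for tool, arg in actions))
    return dict(zip((tool.__name__ for tool, _ in actions), observations))

obs = asyncio.run(parallel_react_step([
    (search_papers, "formation energy"),
    (query_table, "lowest dHf"),
]))
```

One observation batch per step also cuts round trips, which is where most of the latency in a serialized ReAct loop lives.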
Foundational models are trained on a lot of data. The assumption is that, because of how embedding spaces work, the LLM would be the model that sees all the patterns, only more powerful.
Maybe it does. But what it does and doesn’t capture depends on how it was trained. Combining different domains means mapping them into the same embedding space, and general LLMs are not an answer to this.
In video generation, the underlying models have to be tweaked enough to guard how the LLM generates media. The nuances are real:
| Level | What it determines | Example |
|---|---|---|
| Macro | Physics, environment, scene composition | Bags are not displayed the same way as watches |
| Micro | Audience, messaging, tone | A wrist watch is not advertised the same way as a wall clock. A mechanical watch is not advertised the same way as a quartz. |
| Cultural | Aesthetics, conventions, expectations | A Japanese website looks nothing like a US-built website. Japanese fashion magazines are structured for data extraction — what to wear, how to pair it, what it’s for. |
LLMs tend to take the path of least work. Hence the need for detailed planning, guardrails, and extensive instruction following. If these foundational models had all this world knowledge baked in, why would anyone need custom embedding models?
This sounds more like a Mixture of Experts, but constrained to a domain.
When working with non-textual data, it’s nearly impossible for an LLM to predict real-world outcomes. As of very late 2025, none of the LLMs were consistently good at:
The failure has multiple layers, and it’s well-documented:
The precise term from the literature: statistical co-occurrence without compositional constraint satisfaction. The models know what things look like together, but can’t enforce constraints between them.
LLMs are probabilistic systems.
If you expect a probabilistic system to reliably (do X), it will probably, reliably (do X)
GPT has a lot of information about how good hot chocolate is made. It can tell you how different countries and regions like theirs, if you ask for it. It has access to all kinds of recipes, ratings, discussions — enough for a fair idea.
But an LLM has never seen or tasted hot chocolate. It relies on what I call “collective truth”. So it can’t accurately predict how changes in quantities lead to different taste.
This is the same with materials research — CHGNet and other tools provide contextual models with computation baked in, rather than an LLM trying to predict outcomes.
When I asked the LLM to adjust quantities for 4 people, the amount of water and sugar was way off from reality. What actually worked:
For automation, this means: as long as the LLM has access to eyes, ears and other senses into the real world, foundational models can actively guide towards real-life usable outcomes.
The same principles apply to sequential or batched image and video generation — with a corrective feedback loop. But costs shoot up.
With enough effort (YOLO for vision, an RPi Zero or ESP32 capturing images in batches, actual instruments and sensors providing continuous feedback), most real-life applications of AI will arrive once we add sensory elements. The LLM continuously validates against the desired outcome at each stage. The easier version is to just let an app do the compute.
This is still a very interesting field, related to one of my projects, adaptui. I think one day UI will be built on the fly.

Cost is an inherent problem: not just token cost, but time taken. It will obviously take less time to generate data for a static UI than to let an LLM decide the UI. Like any other system design, we can divide the UI into two parts, static and dynamic, as well.

Many recent LLMs also understand React components, Tailwind, icons, and so on, so it will be interesting to see where that goes. Maybe most apps will be replaced by a SuperApp, with some UI protocol allowing different brands to have consistency and control over parts of the app.
A good middle ground right now is somewhere in between: a response schema can be used to populate the data in the corresponding React component. The second option requires some upfront thinking, like a React component builder: which parts are allowed to change, and what the expected data structure is. Again, the problem here is the same: the more we leave to the LLM, the more latency and error rates increase, and hence the more error-recovery machinery we need.
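One way to keep the LLM on the safe side of that split: it emits only data against a response schema, and the client maps that data onto a pre-built component. The component names and schema fields here are hypothetical:

```python
# Sketch of the static/dynamic split: the LLM never emits markup, only data
# matching a response schema; the client maps it onto a pre-built component.
# Component names and schema fields are hypothetical.
COMPONENT_SCHEMAS = {
    "ComparisonTable": {"columns": list, "rows": list},
    "MetricCard": {"label": str, "value": str},
}

def render_plan(llm_output):
    component = llm_output["component"]
    schema = COMPONENT_SCHEMAS[component]   # unknown component -> KeyError, recoverable
    props = llm_output["props"]
    for field, typ in schema.items():
        if not isinstance(props.get(field), typ):
            raise ValueError(f"{component}.{field} must be {typ.__name__}")
    return {"component": component, "props": props}  # safe to hand to the React side

plan = render_plan({
    "component": "MetricCard",
    "props": {"label": "Lowest formation energy", "value": "-1.82 eV/atom"},
})
```

Validation failures feed the same correction loop as any other structured output, so a bad UI plan degrades to a retry rather than broken markup.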
On a bad day, you will really have a bad day.
Not being a materials science expert meant there was no way to eyeball whether the system was producing good hypotheses. So the build happened in two directions simultaneously.
Same approach as building a compiler — lexer, parser, codegen, each independently testable:
| Subsystem | What it does | Testable in isolation? |
|---|---|---|
| Document pipeline | Parse, chunk, index scientific PDFs | Yes — output is structured text, verifiable |
| Retrieval | Hybrid search across indexed papers | Yes — relevance scoring against known queries |
| Domain tools | CHGNet, GPAW, materials databases | Yes — known inputs, known outputs |
| Agents | Orchestrate tools, maintain context | Yes — given fixed retrieval, does the plan make sense |
| Experiment runner | Execute computational experiments | Yes — scripts produce reproducible results |
Each piece could be validated without domain expertise. The document pipeline either extracts tables correctly or it doesn’t. CHGNet either returns a valid energy prediction or it doesn’t.
The harder question: does the whole system, end-to-end, produce hypotheses that are actually good?
The approach:
The conversational pattern matters. A direct request — “what caused the LK-99 results?” — would test whether the model memorized the answer. A conversational exploration tests whether the system architecture can guide reasoning through literature, contradictions, and evidence toward a defensible hypothesis.
The hypothesis engine was essentially reverse-engineered from this evaluation. The question “how do you know if it’s any good?” shaped every architectural decision — what agents exist, how they communicate, what tools they call, how hypotheses get ranked.
The obvious first move is a web app. Upload papers, ask questions, get answers. Fastest to ship, easiest to demo. For a lot of teams, it’s the right choice. But once the architecture is model-agnostic, it doesn’t actually need to be centralized.
The workspace, the agents, the retrieval layer — all of it can run on a researcher’s machine or a lab’s own infrastructure. That opens up delivery options worth considering.
| | Web / SaaS | Editor (JetBrains / Cursor-style) | VSCode / Codium plugin |
|---|---|---|---|
| Where the workspace lives | Cloud | Researcher’s machine or lab infra | Researcher’s machine |
| Model inference | Hosted endpoints | Local, lab cluster, or hosted — their choice | Same |
| Data residency | Provider-managed | User-managed | User-managed |
| Git-backed tracking | Possible but uncommon | Natural fit — queries, experiments, hypotheses all versioned | Same |
| Lab equipment access | Via API tunnels | Direct — instruments register as tools | Direct |
| Distribution | URL | Installer / package manager | Marketplace (shared across Codium-based editors) |
| Trade-off | Fastest to ship. Data residency is a conversation with every enterprise customer. | Most integrated, most opinionated. Bigger upfront investment. | Lowest barrier to adoption. Less integrated long-term. |
No single right answer:
The interesting observation is that a model-agnostic design (the cage and the wind) is what makes these options possible at all.
Git-based tracking. When the workspace is local, versioning research artifacts in git becomes natural. Every query, experiment config, hypothesis — versioned. Branch a research direction. Diff two experimental setups. Revert a dead end. Researchers already think in version control for code; extending it to research is a small step.
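A sketch of what “research artifacts as files” looks like, assuming a hypothetical workspace layout; the git step is left as a comment since the wiring is routine:

```python
import json
import tempfile
from pathlib import Path

# Sketch: every research artifact is a file in the workspace, so plain git
# gives branching, diffing, and revert for free. Paths and fields are hypothetical.
def save_artifact(workspace, kind, name, payload):
    path = workspace / kind / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys keeps diffs stable across runs, which is what makes
    # `git diff` between two experimental setups readable.
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path  # follow with: git add <path> && git commit -m "query: <name>"

workspace = Path(tempfile.mkdtemp())
artifact = save_artifact(workspace, "queries", "formation-energy-q1",
                         {"question": "lowest formation energy?", "retrieval_tier": 2})
```

Branching a research direction is then literally `git branch`, and reverting a dead end is `git revert`; no bespoke versioning layer needed.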
The device-driver pattern. With local or lab-hosted infrastructure, physical equipment can connect directly. The framework defines the tool interface; labs implement the connector for their instruments — a synthesis furnace, an XRD machine, a spectrophotometer. Same interface as any other tool in the registry. This is harder to pull off through a cloud intermediary, though not impossible.
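The connector interface can be sketched as a Python `Protocol`; the class and registry names are illustrative, not a real framework API:

```python
from typing import Protocol

# Device-driver pattern sketch: the framework defines one tool interface;
# each lab implements a connector for its instrument. Names are illustrative.
class InstrumentTool(Protocol):
    name: str
    def run(self, params: dict) -> dict: ...

class XRDMachine:
    name = "xrd"
    def run(self, params):
        # A real connector would talk to the instrument; stubbed here.
        return {"sample": params["sample"], "pattern": [21.3, 34.7, 44.1]}

class ToolRegistry:
    def __init__(self):
        self._tools = {}
    def register(self, tool):
        self._tools[tool.name] = tool
    def call(self, name, params):
        return self._tools[name].run(params)

registry = ToolRegistry()
registry.register(XRDMachine())   # same interface as any software tool
result = registry.call("xrd", {"sample": "LK-99-batch-3"})
```

To the agent, a synthesis furnace and a retrieval function are the same thing: a named tool with a `run` signature, which is what makes the local deployment story tractable.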
Once inference is swappable — hosted, self-hosted, open-source, Model Garden, whatever — the delivery question becomes about where the data lives, not where the model lives.
In terms of business models, at the time of writing this, I see three prevalent ones.
The first one, however, probably won’t survive unless it sits in a niche the big players don’t want to focus on and the public data is unavailable. Without a proper moat, the fear of being invalidated, or of competitor proliferation, is much higher.
The DocuSign incident is a glaring example — OpenAI launched DocuGPT, and DocuSign’s stock dropped 17% overnight. Open-source alternatives like OpenSign and Documenso were already circling.
When everyone has a gun, you need a bigger gun (leverage).
Overall, as I understand it, using LLMs on textual data is a pretty limited use of this technology. Yes, one can convert natural language to a SQL query. The system you might have built, using agents and tools, to produce a valid grounded query might as well be one-shottable with a long enough context window.
So there isn’t much inherent value in building such systems. Looking back, I realize that if an MVP or bootstrapped product goes head-on against features already available on existing SaaS platforms and is presented to users, the inevitable, unavoidable, unanswerable question will come: how is this different from GPT?
In all honesty, it probably isn’t, because building a production-grade ChatGPT-like interface, with all the functionality and edge cases, takes time.
It would be similar to asking Build me Twitter. Sure, but why?
The users do not and should not care about such engineering challenges.
Chat becomes just another interface for communicating with the system, much like REST. And then the comparison against GPT or other models never comes up.
Overall, I don’t feel much has changed in terms of how one does business. When the internet came along, there were loads of websites built with no actual substance; as time went by, the web became a medium, an enabler, for actual labour.
I think the same will happen with AI: most companies whose identities are built around an LLM will not survive in the end. Anything that was generated, or is one-shottable, by an LLM will eventually get replicated and saturated.
So one way, I guess, could be to build wrappers very fast across multiple domains. I mean, there are hundreds and thousands of n8n workflows, skills, sub-agents, entire repositories of “leaked” prompts for agentic coding tools.
The only leverage these tools have is funding. If funding, networking, and marketing were readily available, everyone could sell their own spinoff of opencode.
Which again is business as usual. There are a lot of twitter clones, and reddit alternatives, but the rate of success doesn’t depend on just code execution.
Code execution at the base level has always been cheap, especially in India, where the culture is mostly managerial-driven. CxOs all over the world earn disproportionately more, and not because they can write code.
And it’s getting cheaper. Cross-provider deployment can be automated pretty easily now — Terraform, Pulumi, SST, whatever your flavor. Code generation is commoditized. Porting concepts from one language’s ecosystem to another is a weekend project with a coding agent. Elixir’s supervision trees in Go, Python’s Ray actors in Rust, Ruby’s convention-over-configuration patterns anywhere — the implementation barrier between “I know this pattern exists” and “it’s running in my stack” has collapsed. Which means the moat is never in the code.
I don’t think this is new practice now, but it is still worth acknowledging: AI has caused some cultural shifts in orgs. It has also somehow set the wrong expectations in a lot of people. In a way, I feel the bigger question is missed.
It begins with educating everyone alike, juniors, seniors, management, stakeholders.
This is both in terms of work volume and work culture. Both of them can be solved to a good extent using code and conventions.
Maybe in the near future, instead of full-stack engineers, we will have actual product engineers, who can understand the product and build it as well.
Until then, platforms in the org need to bridge this gap. In such a style of development, a developer needs to be able to identify these parts of the system at the architecture-planning stage: building the abstractions, configurations, and the right interfaces and pipelines that allow developers to hand off some or most of the work to the specific department.
A backend dev building a service layer doesn’t need to know what prompts are being used or what ML models are running. They would instead build a system where:
If you have a look at claude-skills-repo, I can always go and create a few sub-agents, prompt a financial-analysis workflow to life (looking at geographical data, charts, etc.), and write a few scripts that can be used directly to run backtesting and quant analysis.
There is this other thing - if something can be built with ai, it can be replicated with ai.
So the question still is whether it’s possible to let things happen automatically:
I look back at my experience building Cinestar, and I had no choice but to keep vibe coding, like a druggie. What started off as, at most, a week-long hackathon turned into a 2-3 month tussle with the LLM:
In hindsight, if I had used assist mode instead of vibe mode, it would have been a much better experience.
So even if you know what needs to be done, seeing the entire codebase play out from inception through delivery and refactor cycles still needs human suffering behind it.
You are, or should be, objectively faster than an LLM at writing a binary search without mistakes, and you can do it multiple times without losing money.
And I am not entirely sure the people higher up in the org are using AI to write and deliver code as soon as they get back from a client meeting. Nor do I think they should, at a bigger org where more complex things are at play.
I am also not entirely sure why a doctor would be trusted with a vibe-coded application for anything substantial in their work.
As we move closer to the real world, we see that, apart from LLMs, there are more focused models and maths that do not hallucinate. Here, probably, the LLM becomes a driver. Eventually the predicted outcomes have to be tested in the real world, which was quite evident from our hot chocolate stint.
I could argue not only that Cinestar can be replicated, and replicated faster, but also that this whole project could be a collection of skills and sub-agents. For the terminal-ly challenged, we could expose a website, an app, or a plugin.
The replication problem has always existed, but the time to replicate has come down by a lot (arguably). Let’s assume that it improves over time. So what’s left?
What actually holds value:
The UX is also quite scattered:
Overall, even if such systems are built, maintaining them will require engineers, so the tasks, and the way of execution, in the lower layers change. As long as there is a core product or problem that requires more work than just prompting it out, the finer integrations are the ones that end up surviving and keeping the influx of engineers.
**I wonder: is it always going to be biased towards people who have better networking, capital, and influence?**
I guess yes. But that was always true. The industry genuinely hates software engineers. There is quite some science to this, but it rarely reflects in the way the IT system functions.
Consider that people always saw software engineering as a temporary anomaly, a means to an end: a 30-year window where a person with a laptop and skill could create outsized value without capital or connections.
What’s replacing it is closer to how every other industry works:
A person building AI-driven quant tools in their bedroom can absolutely beat a bank’s trading desk on idea generation. But the bank has the capital to act on those ideas at scale, the regulatory access to trade, the data feeds, and a lot more compute, brains, and speed (lol, so much more speed it’s diabolical).
So the bedroom builder either joins the bank, sells to the bank, or finds a niche the bank doesn’t care about.
Not to mention, this shit is not for consumers, at all. All Windows users running Claude Code through PowerShell? Yeah, sure. So whatever consumer adoption comes will come through the businesses and products they consume.
So computer science as a discipline will still be valid; the stuff in between might eventually become production-line work, like Amazon factory workers, where someone just monitors metrics while the LLM crashes systems, causes a CrowdStrike-like incident, and then blames it on the CTO’s family.
Some non-negotiables:
C (changeability) at the end of CAP. Systems now need to worry not only about read and write patterns, but also about which parts of the whole system can be replaced. This is the Strategy Pattern applied at the org and infra level.
Once you think you have a problem, you should explore with existing systems first. A path could look like so:
- chatgpt.com, codex, claude.
- Simulation: let the LLM figure out test, train, and target datasets.
- section_covers.

Remember, LLMs and neural nets at the end of the day are high on patterns, and on how humans have interacted with the system. This realisation should help you organise the data access patterns and tool prompts for them.
This post is the map. The territory is in the implementation.