My previous employment was at a company in the digital asset management space, working, toward the end, on short-form commercial videos.
One of the USPs was consistent character generation, which held until Nano Banana, and soon other mainstream models, caught up. So the switch was to generating short-form commercial videos. Once the shop closed, I spent time reflecting on how things went, and on how the competitive landscape changes with AI in the picture. One of those thoughts gave birth to krearts
During the time I was also:
Beyond code, this blog documents some observations from building a research-oriented framework, and my understanding of this space.
When people see ChatGPT, Claude, Grok — the assumption is that building an LLM product is just API calls with a nice UI. Upload a PDF, call the model, return the answer. Ship it.
The gap between using these tools and building with them is massive.
Attachment parsing alone is a project. Scientific PDFs have multi-column layouts, inline equations, tables that span pages, figures with captions that reference other figures. Getting clean text out of that isn’t pdf2text. It’s a pipeline — layout detection, table extraction, figure-caption association, section boundary identification. And every format is different. A Nature paper looks nothing like an arXiv preprint.
Then there’s retrieval. Without preprocessing and indexing:
The first prototype worked exactly this way: no tagging of sections, no embeddings, only enriched extractions. It was slow, and it had to be tweaked continuously to support new types of queries.
Corrective RAG, structured extraction, citation grounding — each turned out to be its own subsystem with its own failure modes. The “wrapper” ended up being the smallest part of the system.
A user types: “What compositions showed the lowest formation energy across these three papers?”
The naive assumption is “send to LLM, get answer.” In practice, there’s a whole pipeline before the model ever sees the question:
Query expansion. The raw question gets rewritten. “Formation energy” might need to expand to “enthalpy of formation,” “DFT-computed stability,” or specific notation like ΔHf. A single user question becomes multiple retrieval queries.
Intent classification. Is this a comparison across papers? A lookup in a single table? A synthesis question that needs reasoning? The retrieval strategy changes depending on the answer.
Hybrid retrieval. Full-text search (FTS), n-gram matching, and semantic embeddings each catch different things. FTS finds exact terms. N-grams handle partial matches and chemical formulas that embedding models mangle. Embeddings capture semantic similarity — “thermal stability” matching “resistance to decomposition.”
None of this retrieval composition is new. Google has done this for decades. The difference is composing these with an LLM reasoning loop instead of hand-tuned ranking signals. Reciprocal Rank Fusion (RRF) merges results from different retrieval methods into a single ranked list. The LLM then reasons over the top results instead of just returning links.
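RRF itself is small enough to sketch. A minimal version, assuming each retriever returns a ranked list of chunk IDs (the IDs and result lists below are made up for illustration):

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of chunk IDs.

    Each item's score is the sum of 1/(k + rank) over every list it
    appears in; k=60 is the constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from three retrievers for the same query:
fts      = ["c3", "c1", "c7"]
ngram    = ["c1", "c9"]
semantic = ["c1", "c3", "c2"]

merged = rrf_merge([fts, ngram, semantic])
# "c1" ranks first: it appears at or near the top of all three lists.
```

The appeal of RRF is exactly this simplicity: no score normalization across retrievers, only ranks, so FTS scores and cosine similarities never have to be made comparable.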
What this replaces. Traditionally, doing this well meant deploying Elasticsearch or Solr — heavy infrastructure with its own operational cost, query DSLs, analyzers, synonym dictionaries, spell-check configs, and tokenizer tuning.
With an LLM and vector search, a lot of that goes away:
Vector search (pgvector) plus FTS on Postgres replaces what used to require a dedicated search deployment. The infrastructure footprint shrinks dramatically: what used to be a separate cluster with its own ops burden becomes a Postgres extension and an API call.
Reranking. The initial retrieval casts a wide net. A reranker (cross-encoder or LLM-based) scores each chunk against the original question for fine-grained relevance. This is where you go from “related passages” to “the actual answer is in these three paragraphs.”
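A sketch of the reranking stage. In production the scorer would be a cross-encoder or an LLM judge; here a toy token-overlap scorer stands in so the shape is runnable (the chunks are invented examples):

```python
def rerank(question, chunks, scorer, top_n=3):
    """Score every candidate chunk against the original question and
    keep the top_n. The scorer is pluggable: a cross-encoder, an LLM
    judge, or (here) a toy lexical stand-in."""
    return sorted(chunks, key=lambda c: scorer(question, c), reverse=True)[:top_n]

def overlap_scorer(question, chunk):
    # Toy stand-in: fraction of question tokens present in the chunk.
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

chunks = [
    "formation energy of Li2FePO4 computed with DFT",
    "synthesis furnace temperature profiles",
    "lowest formation energy across candidate compositions",
]
top = rerank("lowest formation energy", chunks, overlap_scorer, top_n=1)
```

The wide-net/fine-scoring split is the design choice: cheap retrieval over everything, expensive scoring only over the survivors.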
Without an embedding and indexing pipeline, every query pays the full cost. Parse the PDF. Chunk it. Embed the chunks. Search. Answer. For a 40-page paper, that’s 30+ seconds before the user sees anything.
The alternative is progressive indexing: make the paper useful immediately and build deeper indexes in the background.
This works in tiers:
A query that arrives at tier 1 gets FTS-only retrieval. Not perfect, but fast and useful. By the time the researcher has read the first answer and typed a follow-up, tier 2 or 3 is ready.
The key insight: researchers don’t upload a paper and immediately ask their hardest question. They start with “what’s this about?” and work their way to specifics. Progressive indexing matches the system’s readiness to the user’s actual behavior.
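The tier bookkeeping can be sketched roughly like this (the tier names and the tier-to-strategy mapping are illustrative, not the system's actual ones):

```python
from enum import IntEnum

class Tier(IntEnum):
    RAW = 0        # file stored, nothing parsed yet
    FTS = 1        # text extracted, full-text searchable
    SECTIONED = 2  # chunks classified by section
    EMBEDDED = 3   # vectors built, hybrid retrieval available

class DocumentIndex:
    """Tracks each document's index tier; retrieval asks what's ready."""
    def __init__(self):
        self.tiers = {}

    def ingest(self, doc_id):
        # Ingest returns immediately; background jobs upgrade later.
        self.tiers[doc_id] = Tier.FTS

    def upgrade(self, doc_id):
        t = self.tiers[doc_id]
        if t < Tier.EMBEDDED:
            self.tiers[doc_id] = Tier(t + 1)

    def strategy_for(self, doc_id):
        # Pick the best retrieval strategy the current tier supports.
        return {Tier.RAW: "none", Tier.FTS: "fts",
                Tier.SECTIONED: "fts+sections",
                Tier.EMBEDDED: "hybrid"}[self.tiers[doc_id]]

idx = DocumentIndex()
idx.ingest("paper-42")                    # usable immediately at tier 1
first = idx.strategy_for("paper-42")      # "fts"
idx.upgrade("paper-42"); idx.upgrade("paper-42")
later = idx.strategy_for("paper-42")      # "hybrid"
```

The retriever never waits on an indexer; it dispatches on whatever tier exists right now.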
This is the same pattern behind Cinestar’s five-phase video indexing pipeline — make a video searchable the moment it’s uploaded (phase 0, basic metadata), then progressively refine with multi-modal enrichment, coarse segmentation, fine segmentation, and cross-reference passes.
The domain is different but the architecture is identical: immediate utility, background refinement, each tier unlocking better search quality.
Graph RAG has real value. But the costs are real too.
Building a knowledge graph over one document requires:
For a single document, this is expensive relative to the payoff. A well-chunked document with good metadata gets roughly 80% of the way there.
The better alternative — section_covers.
No matter how unstructured a PDF layout looks, the domain and the humans in it have a structure. Every scientific paper has an implicit hierarchy — title, abstract, hypothesis, methods, results, conclusion. The sections might be named differently, merged together, or split across pages, but the structure is always there. Researchers read papers this way instinctively.
The idea: teach the LLM this structure through the prompt, and have it classify each chunk during ingestion. The classification is an array — ["methods", "results", "datasets"] — not a single label, because sections overflow. A “methods” section often contains datasets and preliminary results too.
How the LLM knows the structure. The agent prompt is hierarchically organized with custom tags — identity, capabilities, workflows, security, output rules — each scoped and nested. Within this, the paper-reading workflows define phased strategies:
The prompt doesn’t just say “read the paper.” It encodes how a researcher reads — which sections to check first for which kind of question, when to fall back to broader reading, how to cross-reference across documents.
At query time, filtering by section type is a simple indexed array lookup. “Show me just the methods across these three papers” — no graph traversal needed. The LLM can then drive a ReAct loop to compare across sections, navigate the hierarchy, and synthesize — all without an entity graph.
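A sketch of the query-time filter over in-memory chunks; in Postgres the same thing is an indexed array containment test. The chunks below are invented:

```python
chunks = [
    {"doc": "paper-1", "text": "We synthesized samples at 700 C...",
     "section_covers": ["methods", "datasets"]},
    {"doc": "paper-1", "text": "Formation energies are listed in Table 2.",
     "section_covers": ["results"]},
    {"doc": "paper-2", "text": "Samples were annealed for 12 h...",
     "section_covers": ["methods", "results"]},
]

def covering(chunks, section):
    """All chunks whose section_covers array includes the section.
    The array label means overflow is free: a methods chunk that also
    reports preliminary results shows up under both filters."""
    return [c for c in chunks if section in c["section_covers"]]

methods = covering(chunks, "methods")   # two chunks, across both papers
```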
Researchers don’t work with one paper. They work with a workspace — dozens of papers, experimental notes, simulation results, reviewer feedback. Over weeks and months, connections accumulate:
That’s where a knowledge graph becomes valuable. Not blind LLM extraction — asking a model to “extract all entities” from a paper produces confident garbage. What works is:
Graph RAG for single-document retrieval is usually overkill. Graph RAG as an epistemic knowledge web built over months of research — that’s where it becomes worth the investment.
Models get switched. Pricing changes, a new model drops with better structured output, an open-source option gets good enough for a subtask. Each time, it means rewriting prompts and fixing output parsing — unless the system is built for it.
The prompt is the cage geometry — it shapes the output. But real portability comes from treating LLM output as untrusted input.
When validation fails, the system retries with a corrective prompt that includes the validation error. Self-correcting loops. In practice, most queries resolve in one pass. Some need two. When it takes three, the problem is almost always in the prompt design or the retrieval, not the model.
This leads to a useful model-agnostic metric: not “which model is best” but “how many correction cycles does this model need for this task.” GPT-4 might need one cycle where Claude needs two, or vice versa, depending on the task class. The system handles both.
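The correction loop and the cycle metric can be sketched together. The call_model stand-in below is hypothetical; a real implementation would call a provider API:

```python
import json

def validate(raw):
    """Treat LLM output as untrusted input: parse, then check shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"invalid JSON: {e}"
    if "answer" not in data:
        return None, "missing required field: answer"
    return data, None

def with_correction(call_model, prompt, budget=3):
    """Retry with the validation error appended as corrective context.
    Returns (result, cycles); the cycle count is the model-agnostic
    metric: how many passes did this model need for this task?"""
    for cycle in range(1, budget + 1):
        raw = call_model(prompt)
        data, error = validate(raw)
        if error is None:
            return data, cycle
        prompt += f"\n\nYour last output failed validation: {error}. Return valid JSON."
    raise RuntimeError(f"no valid output within {budget} attempts")

# Stand-in model that fails once, then complies:
outputs = iter(['{"anwser": 42}', '{"answer": 42}'])
result, cycles = with_correction(lambda p: next(outputs), "Q: ...")
# cycles == 2: one correction pass was needed
```

Running the same suite against two models and comparing the cycle distributions is the whole benchmark.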
One observation: most open-source models share failure modes — they’re fine-tuned from the same bases. A correction loop that handles Llama’s JSON formatting quirks tends to handle Mistral’s too. Build the cage right and the wind can change direction.
Foundation models are trained on a lot of data. The assumption is that, given how embedding spaces work, the LLM would be the one model that sees all the patterns, only more powerful.
Maybe it does. But what it does and doesn't capture depends on how it was trained. Combining different domains means mapping them into the same embedding space, and general LLMs are not an answer to that.
In video generation, the underlying models have to be constrained enough to guard how the LLM generates media. The nuances are real, at multiple levels:
| Level | What it determines | Example |
|---|---|---|
| Macro | Physics, environment, scene composition | Bags are not displayed the same way as watches |
| Micro | Audience, messaging, tone | A wrist watch is not advertised the same way as a wall clock. A mechanical watch is not advertised the same way as a quartz. |
| Cultural | Aesthetics, conventions, expectations | A Japanese website looks nothing like a US-built website. Japanese fashion magazines are structured for data extraction — what to wear, how to pair it, what it’s for. |
LLMs tend to take the path of least work. Hence the need for detailed planning, guardrails, and extensive instruction following. If these foundational models had all this world knowledge baked in, why would anyone need custom embedding models?
This sounds more like a Mixture of Experts, but constrained to a domain.
When working with non-textual data, it's nearly impossible for an LLM to predict real-life outcomes. Back in very late 2025, none of the LLMs were consistently good at:
The failure has multiple layers, and it’s well-documented:
The precise term from the literature: statistical co-occurrence without compositional constraint satisfaction. The models know what things look like together, but can’t enforce constraints between them.
LLMs are probabilistic systems.
If you expect a probabilistic system to reliably (do X), it will probably, reliably (do X)
GPT has a lot of information about how good hot chocolate is made. It can tell you how different countries and regions like theirs, if you ask for it. It has access to all kinds of recipes, ratings, discussions — enough for a fair idea.
But an LLM has never seen or tasted hot chocolate. It relies on what I call "collective truth", so it can't accurately predict how changes in quantities lead to differences in taste.
This is the same with materials research — CHGNet and other tools provide contextual models with computation baked in, rather than an LLM trying to predict outcomes.
When I asked the LLM to adjust quantities for 4 people, the amounts of water and sugar were way off from reality. What actually worked:
For automation, this means: as long as the LLM has access to eyes, ears, and other senses into the real world, foundation models can actively guide toward real-life usable outcomes.
The same principles apply to sequential or batched image and video generation — with a corrective feedback loop. But costs shoot up.
With enough effort — YOLO for vision, RPi Pico or ESP32 to capture images in batches, actual instruments and sensors providing continuous feedback — most real-life applications of AI will come when we add sensory elements to it. The LLM continuously validates against the desired outcome at each stage.
The obvious first move is a web app. Upload papers, ask questions, get answers. Fastest to ship, easiest to demo. For a lot of teams, it’s the right choice. But once the architecture is model-agnostic, it doesn’t actually need to be centralized.
The workspace, the agents, the retrieval layer — all of it can run on a researcher’s machine or a lab’s own infrastructure. That opens up delivery options worth considering.
| Web / SaaS | Editor (JetBrains / Cursor-style) | VSCode / Codium plugin | |
|---|---|---|---|
| Where the workspace lives | Cloud | Researcher’s machine or lab infra | Researcher’s machine |
| Model inference | Hosted endpoints | Local, lab cluster, or hosted — their choice | Same |
| Data residency | Provider-managed | User-managed | User-managed |
| Git-backed tracking | Possible but uncommon | Natural fit — queries, experiments, hypotheses all versioned | Same |
| Lab equipment access | Via API tunnels | Direct — instruments register as tools | Direct |
| Distribution | URL | Installer / package manager | Marketplace (shared across Codium-based editors) |
| Trade-off | Fastest to ship. Data residency is a conversation with every enterprise customer. | Most integrated, most opinionated. Bigger upfront investment. | Lowest barrier to adoption. Less integrated long-term. |
No single right answer:
The interesting observation is that model-agnostic design (the cage and the wind) is what makes these options possible at all.
Once inference is swappable — hosted, self-hosted, open-source, Model Garden, whatever — the delivery question becomes about where the data lives, not where the model lives.
In terms of business models, at the time of writing this, I see three prevalent ones.
moondream gives us a pretty good idea of what a focused model can do.
The materials hypothesis engine, which uses CHGNet, GPAW, etc., to run actual predictions and DFT relaxation.

The first one, however, would probably not survive unless it sits in a niche the big players don't want to focus on and public data is unavailable. Without a proper moat, the fear of being invalidated, or of competitor proliferation, is much higher. The DocuSign incident is a glaring example: OpenAI launched DocuGPT, and DocuSign's stock dropped 17% overnight. Open-source alternatives like OpenSign and Documenso were already circling.
When everyone has a gun, you need a bigger gun (leverage).
Git-based tracking. When the workspace is local, versioning research artifacts in git becomes natural. Every query, experiment config, hypothesis — versioned. Branch a research direction. Diff two experimental setups. Revert a dead end. Researchers already think in version control for code; extending it to research is a small step.
The device-driver pattern. With local or lab-hosted infrastructure, physical equipment can connect directly. The framework defines the tool interface; labs implement the connector for their instruments — a synthesis furnace, an XRD machine, a spectrophotometer. Same interface as any other tool in the registry. This is harder to pull off through a cloud intermediary, though not impossible.
Every research domain already has validated tools:
Researchers know and trust these tools. Building an LLM system that tries to replace them means reimplementing domain logic badly and asking an LLM to do math it will get wrong.
The better pattern: give the agent access to these tools. The LLM understands the question, picks the right tool, formats the input, interprets the output. The computation stays with code that’s been validated by the domain for decades. Same for existing ML models — if a trained classifier or prediction model exists for a subtask, use it. The agent orchestrates; the specialists compute.
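A rough sketch of the pattern. The tool name, description, and lookup-table handler below are all stand-ins; the point is that the number comes from registered, validated code, never from the LLM:

```python
def formation_energy(formula):
    # Stand-in for a real CHGNet/Pymatgen computation. Hypothetical
    # values; in the real system this dispatches to domain code.
    table = {"Li2FePO4": -2.41}
    return table[formula]

# Registry: the agent reads names + descriptions and picks a tool.
TOOLS = {
    "formation_energy": {
        "description": "Formation energy for a composition (eV/atom)",
        "handler": formation_energy,
    },
}

def run_tool(name, **kwargs):
    """The LLM selects the tool and formats the input; the handler,
    which is decades-validated domain code, does the computation."""
    return TOOLS[name]["handler"](**kwargs)

e = run_tool("formation_energy", formula="Li2FePO4")
```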
Overall, as I understand it, using LLMs only on textual data is a pretty limited application of this technology. Yes, one can convert natural language to an SQL query. But the system you might have built, using agents and tools to produce a valid, grounded query, might as well be one-shottable with a long enough context window.
So there isn't much inherent value in building such systems. Looking back, I realize that if an MVP or bootstrapped product goes head-on against features already available on an existing SaaS platform, the inevitable, unavoidable, unanswerable question will be presented: how is that different from GPT?
In all honesty, it's probably not, because building a production-grade ChatGPT-like interface, with all the functionality and edge cases, takes time.
It would be similar to asking Build me Twitter. Sure, but why?
The users do not and should not care about such engineering challenges.
Chat becomes just another interface for communicating with the system, much like REST. And then the comparison against GPT or other models never comes up.
Overall, I don't feel much has changed in terms of how one does business. When the internet came along, there were loads of websites built with no actual substance; as time went by, the web became a medium, an enabler, for actual labour.
I think the same will happen with AI: most companies whose identities are built around LLMs will not survive in the end. Anything that was generated by an LLM, or is one-shottable by one, will eventually get replicated and saturated.
So one way, I guess, could be to build wrappers very fast, across multiple domains. I mean, there are hundreds and thousands of n8n workflows, skills, subagents, entire repositories of "leaked" prompts for agentic coding tools.
The only leverage these tools have is funding. If funding, networking, and marketing were readily available, everyone could sell their own spinoff of opencode.
Which, again, is business as usual. There are a lot of Twitter clones and Reddit alternatives, but the rate of success doesn't depend on code execution alone.
Code execution at the base level has always been cheap, especially in India, where the culture is mostly managerially driven. CxOs all over the world earn disproportionately more, and not because they can write code.
And it’s getting cheaper. Cross-provider deployment can be automated pretty easily now — Terraform, Pulumi, SST, whatever your flavor. Code generation is commoditized. Porting concepts from one language’s ecosystem to another is a weekend project with a coding agent. Elixir’s supervision trees in Go, Python’s Ray actors in Rust, Ruby’s convention-over-configuration patterns anywhere — the implementation barrier between “I know this pattern exists” and “it’s running in my stack” has collapsed. Which means the moat is never in the code.
I don't think this is a new observation, but it's still worth acknowledging: AI has caused some cultural shifts in organizations. It has also somehow set the wrong expectations for a lot of people.
Osho once said, “Democracy basically means government by the people, of the people, for the people — but the people are retarded.” I think to some extent it's terribly true, and it's more prevalent and visible today because of social media.
Only a handful of companies have presented their experiments as-is to the real world. The rest are marketing gimmicks like "LLM wrote a C compiler from scratch". Yeah, I mean, great that your LLM could do it, but:
I don't blame anyone; everyone is doing their part to survive. Let's not forget that such waves of data science have already hit us twice before, and both times the companies were all in losses.
So building a sustained customer-facing model is a lot of pressure.
But you don't need to XD. It begins with educating everyone alike: juniors, seniors, management, stakeholders.
An org should first take the time to do planned research on what's possible, to set the record straight. Some basic expectations on either end:
For developers it can be:
For non-developers, to understand:
An LLM doesn't reduce the complexity of a business problem; if anything, it adds workload, because now one has to think about how to bridge the probabilistic and deterministic parts.
An LLM does make you faster, but the quality of the code, the philosophy, the foresight: none of that comes from the LLM. A novice coder now produces more bad code, faster, and vice versa.
It's medically stupid to make an LLM learn a very well established set of rules and turn it into a probabilistic model in real life. Maybe it can be used to find a better, faster, alternate way, but that exploration can't be the production process.
There is no point in making an LLM add two numbers, or add two numbers on a GPU drawing 16 amps.
The pattern that fell out of building this is domain-agnostic. What follows is the architecture — contracts, conventions, build order, and acceptance criteria. Pick your language, pick your domain. The shape stays the same.
Before building anything, define these for the target domain. Everything downstream depends on them.
| What | Example (materials science) | Example (biotech) | Example (ML research) |
|---|---|---|---|
| Document formats | Scientific PDFs, CIF files, VASP output | PDB files, FASTA, clinical trial PDFs | arXiv PDFs, Jupyter notebooks, model cards |
| Domain tools | Pymatgen, ASE, Materials Project API | BLAST, UniProt API, RDKit | scikit-learn, HuggingFace model hub, W&B API |
| Domain ML models | Crystal system classifiers, GNNs for molecular dynamics | Protein structure predictors, toxicity models | Benchmark evaluators, dataset quality scorers |
| Lab equipment | Furnaces, XRD, spectrophotometers | Sequencers, PCR machines, plate readers | GPU clusters, training pipelines, eval harnesses |
| Domain-specific query patterns | Chemical formulas (Li₂FePO₄), crystal notation | Gene names (BRCA1), protein IDs (P53_HUMAN) | Model identifiers, dataset names, metric names |
| Validation rules | Formation energy ∈ [-10, +10] eV/atom; temperature > 0K | Gene names match HGNC; dosage within safe range | Accuracy ∈ [0, 1]; loss is non-negative |
| Experiment schema | Hypothesis → synthesis params → characterization → result | Hypothesis → protocol → assay → measurement | Hypothesis → hyperparams → training run → eval metrics |
workspace/
├── contracts/ ← message types, shared by all components
│ ├── messages ← every inter-component interaction is a typed message
│ └── schemas ← domain entity schemas (documents, experiments, graph nodes)
│
├── document_store/ ← ingest, parse, chunk, index
│ ├── parsers/ ← one parser per document format, registered by MIME type
│ ├── chunkers/ ← section-aware splitting (not blind fixed-size)
│ └── indexers/ ← progressive: raw → sectioned → embedded → fully indexed
│
├── retriever/ ← hybrid search across whatever indexes exist
│ ├── strategies/ ← fts, semantic, hybrid (RRF fusion)
│ └── domain_matcher ← scores chunks using domain-specific logic
│
├── tool_registry/ ← domain tools, ML models, lab connectors — one interface
│ ├── tools/ ← each tool: name, description, input/output schema, handler
│ └── connectors/ ← lab equipment drivers (same tool interface)
│
├── experiment_tracker/ ← runs, parameters, results, lineage — git-backed
│
├── agents/ ← long-running stateful processes
│ ├── supervisor ← spawns, monitors, restarts, checkpoints
│ ├── research_agent ← handles user queries, invokes tools
│ ├── indexer_agent ← background progressive indexing
│ ├── experiment_agent ← designs experiments, monitors runs, logs results
│ └── watcher_agent ← monitors external sources for new documents
│
├── workspace_state/ ← shared blackboard
│ ├── documents ← registry of ingested docs and their index tier
│ ├── history ← query/response log with tool call traces
│ ├── experiments ← run registry with full lineage
│ └── graph ← knowledge graph: entities, relations, evidence chains
│
├── validation/ ← pure functions, no LLM calls, deterministic
│ ├── schema ← does output match expected structure
│ ├── citations ← does every claim trace to a source passage
│ ├── domain_rules ← is output physically/logically plausible
│ └── experiment_safety ← are params within safe bounds before reaching equipment
│
└── domains/
└── {domain_name}/ ← all domain-specific config in one place
├── parsers ← document format implementations
├── tools ← tool registrations + connector configs
├── matcher ← domain query pattern scorer
├── rules ← validation rules + safety bounds
└── experiment ← what constitutes hypothesis, run, result
Components have dependencies. Build in this order — each layer only depends on layers above it.
| Phase | Component | Depends on | Done when |
|---|---|---|---|
| 1 | contracts/ | Nothing | Message types defined for: tool_call/tool_result, retrieve/results, ingest/indexed, experiment_create/experiment_result, state_read/state_write. All components will import these. |
| 2 | workspace_state/ | contracts | Can store and retrieve documents, history entries, experiment runs, and graph nodes/edges. Supports concurrent reads. |
| 3 | validation/ | contracts | Each validator accepts output + rules, returns ok or error with details. No network calls, no LLM calls. Experiment safety validator rejects out-of-bounds parameters. |
| 4 | document_store/ | contracts, workspace_state | Can ingest a file, run it through a parser, chunk it, and register it in workspace state. Progressive indexing: ingest returns immediately at tier 1, background jobs upgrade tiers. Search works against whatever tiers exist. |
| 5 | retriever/ | contracts, document_store, workspace_state | Accepts a query and strategy (fts/semantic/hybrid). Returns ranked chunks with scores and source references. Domain matcher plugs in as a scoring function. |
| 6 | tool_registry/ | contracts, validation | Tools register with typed schemas. Call dispatches to handler, validates output against schema. Lab connectors implement the same interface. Agent can list available tools and their descriptions. |
| 7 | experiment_tracker/ | contracts, workspace_state, validation | Can create a run from a hypothesis + params, log results, trace lineage back to source hypothesis/papers/queries. Git-backed: each run is a commit, params are diffable. |
| 8 | agents/ | Everything above | Each agent is a long-running process with its own lifecycle. Supervisor manages spawn/monitor/restart/checkpoint. Agents communicate through workspace_state, not direct calls. |
Everything is a message. Components never call each other directly. Every interaction — tool invocation, retrieval request, state update, experiment result — is a typed message on a bus. This maps to whatever concurrency model the language provides (actors, channels, async queues). The constraint: no shared mutable state between components, only messages. This is what makes the system distributable without a rewrite.
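A minimal in-process version of the convention, with dataclass message types and a plain queue standing in for the bus (the message names are illustrative; swapping the queue for NATS or Redis Streams keeps the same contract):

```python
from dataclasses import dataclass
from queue import Queue

@dataclass(frozen=True)
class ToolCall:        # contracts/messages: typed and immutable
    tool: str
    args: dict

@dataclass(frozen=True)
class ToolResult:
    tool: str
    output: object

bus = Queue()  # dev: in-process; prod: NATS / RabbitMQ / Redis Streams

# Producer side: an agent requests a tool call; no direct function call.
bus.put(ToolCall(tool="fts_search", args={"query": "formation energy"}))

# Consumer side: a worker dispatches on the message type it receives.
msg = bus.get()
if isinstance(msg, ToolCall):
    reply = ToolResult(tool=msg.tool, output=["chunk-1", "chunk-7"])
```

Because no component holds a reference to another, moving a consumer to a different process, or a different machine, changes the bus implementation and nothing else.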
Tools are the domain extension point. A tool has: name, description, input schema, output schema, handler. The agent reads the tool catalog at runtime and picks tools based on descriptions and schemas — not hardcoded dispatch. Domain computation, ML models, and lab equipment all enter the system as tools. If a validated domain tool or model exists for a subtask, register it. The LLM orchestrates; specialists compute.
Lab equipment follows the device-driver model. The framework defines the tool interface. Labs implement the connector for their specific equipment — communication protocol, safety interlocks, data formatting. From the agent’s perspective, measuring an XRD pattern and computing a phase diagram are the same operation: call a tool, get a result.
Agents are stateful, long-running processes. Not request handlers. Each agent maintains context across interactions — loaded documents, active hypotheses, running experiments. The agent loop: receive → classify intent → plan steps → execute (with validation at each step) → on failure, retry with corrective context (budget: N attempts) → accumulate results → update workspace state → respond. If an agent crashes, the supervisor restarts it from its last checkpoint. If an indexer dies mid-document, the research agent keeps serving from whatever tiers are already built.
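A toy sketch of the checkpoint/restart half of that lifecycle (the state shape is invented; a real agent would checkpoint to workspace_state):

```python
import pickle

class Agent:
    """Long-running and stateful: context survives across interactions."""
    def __init__(self, state=None):
        self.state = state or {"loaded_docs": [], "hypotheses": []}

    def handle(self, doc_id):
        # Each interaction mutates context, then checkpoints it.
        self.state["loaded_docs"].append(doc_id)
        return self.checkpoint()

    def checkpoint(self):
        return pickle.dumps(self.state)

agent = Agent()
snap = agent.handle("paper-42")

# Supervisor side: the process died; restart from the last checkpoint
# instead of losing the accumulated context.
restarted = Agent(state=pickle.loads(snap))
```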
Experiments are first-class. A run records: triggering hypothesis, parameters, tools called, data in, results out. Full lineage — traceable back through the graph to the papers and queries that generated the hypothesis. Git-backed: each run is a commit, params are diffable, branching an experiment direction keeps the history clean.
The epistemic loop. The knowledge graph grows through a cycle:
The graph isn’t a retrieval index. It’s accumulated understanding — which claims have been tested, which hypotheses failed, which things were actually verified vs. only predicted.
Validation is not optional. Every agent step passes through validation. Validators are pure functions — deterministic, no LLM calls. Four categories: schema (structure), citations (grounding), domain rules (plausibility), experiment safety (bounds before reaching equipment). The correction loop feeds validation errors back as retry context. Experiment safety validation is the guardrail between an AI system and real equipment. No exceptions.
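A sketch of a domain-rules validator as a pure function, using the example bounds from the contract table (formation energy within [-10, +10] eV/atom, temperature above 0 K):

```python
def validate_domain_rules(result):
    """Pure function: deterministic, no LLM calls, no network.
    Returns (ok, errors); errors feed the correction loop as context."""
    errors = []
    e = result.get("formation_energy_ev")
    if e is None or not -10 <= e <= 10:
        errors.append(f"formation energy out of range: {e}")
    t = result.get("temperature_k")
    if t is None or t <= 0:
        errors.append(f"non-physical temperature: {t}")
    return (not errors), errors

ok, errs = validate_domain_rules(
    {"formation_energy_ev": -2.4, "temperature_k": 700})
bad, errs2 = validate_domain_rules(
    {"formation_energy_ev": -99, "temperature_k": 0})
```

The experiment-safety validator has the same shape, with the stricter rule that a rejection blocks the tool call entirely rather than triggering a retry.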
The framework needs backing services. Start embedded, graduate to distributed.
| Concern | Dev (laptop) | Production |
|---|---|---|
| Message bus | In-process (language’s native concurrency) | NATS, RabbitMQ, or Redis Streams |
| Document storage | Filesystem + SQLite | Object storage + Postgres |
| Search indexes | SQLite FTS5 + sqlite-vss (or pgvector) | Postgres tsvector + dedicated vector store |
| Agent runtime | Single process, multiple actors/goroutines/tasks | Distributed nodes (OTP, Ray, K8s pods) |
| Job queue | In-process task queue | Durable job queue (language-appropriate) |
| Experiment storage | Git + SQLite | Git + Postgres + object storage for artifacts |
| Model inference | API calls to foundation model providers | Self-hosted (vLLM, TGI), Model Garden, Azure ML, or local GPUs |
The message bus convention means moving from dev to production is configuration, not a rewrite.
A new domain plugs in under domains/{domain_name}/.

Not being a materials science expert meant there was no way to eyeball whether the system was producing good hypotheses. So the build happened in two directions simultaneously.
Same approach as building a compiler — lexer, parser, codegen, each independently testable:
| Subsystem | What it does | Testable in isolation? |
|---|---|---|
| Document pipeline | Parse, chunk, index scientific PDFs | Yes — output is structured text, verifiable |
| Retrieval | Hybrid search across indexed papers | Yes — relevance scoring against known queries |
| Domain tools | CHGNet, GPAW, materials databases | Yes — known inputs, known outputs |
| Agents | Orchestrate tools, maintain context | Yes — given fixed retrieval, does the plan make sense |
| Experiment runner | Execute computational experiments | Yes — scripts produce reproducible results |
Each piece could be validated without domain expertise. The document pipeline either extracts tables correctly or it doesn’t. CHGNet either returns a valid energy prediction or it doesn’t.
The harder question: does the whole system, end-to-end, produce hypotheses that are actually good?
The approach:
The conversational pattern matters. A direct request — “what caused the LK-99 results?” — would test whether the model memorized the answer. A conversational exploration tests whether the system architecture can guide reasoning through literature, contradictions, and evidence toward a defensible hypothesis.
The hypothesis engine was essentially reverse-engineered from this evaluation. The question “how do you know if it’s any good?” shaped every architectural decision — what agents exist, how they communicate, what tools they call, how hypotheses get ranked.
This evaluation approach generalizes. Once tests exist for LLM-driven workflows, they become a model selection ground — run the same test suite against different models, collect golden results, compare.
But testing probabilistic systems is fundamentally different from testing deterministic code. Three patterns that worked:
1. Black box / outcome-only testing
Treat the LLM like a private method. Don’t assert on internal reasoning. Only check:
This is the most robust approach — it survives model swaps without rewriting tests.
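Sketched as a test (the pipeline stub and its output shape are hypothetical; a real test would call the actual system):

```python
def answer_pipeline(question):
    # Stand-in for the full LLM pipeline under test.
    return {
        "answer": "Li2FePO4 showed the lowest formation energy.",
        "citations": [{"doc": "paper-1", "chunk": "c7"}],
    }

def test_outcome_only():
    """Black box: assert on outcome properties, never on reasoning."""
    out = answer_pipeline("Which composition had the lowest formation energy?")
    assert isinstance(out["answer"], str) and out["answer"]   # non-empty
    assert out["citations"], "every answer must be grounded"
    for c in out["citations"]:
        assert {"doc", "chunk"} <= c.keys()                   # traceable

test_outcome_only()
```

Nothing here asserts on a chain of thought or an intermediate plan, which is why the same test runs unchanged against a different model.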
2. Value-in-collection assertions
When the output should contain specific elements but order doesn’t matter:
Not “the answer is X” but “the answer contains X, Y, Z.”
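A small helper makes the pattern concrete (the answer text is invented):

```python
def assert_contains(answer_text, required):
    """Pass if every required element appears somewhere in the answer,
    regardless of order or surrounding prose."""
    missing = [r for r in required if r not in answer_text]
    assert not missing, f"answer missing: {missing}"

answer = ("Across the three papers, Li2FePO4 and NaFePO4 were stable; "
          "LiCoO2 was not.")
assert_contains(answer, ["Li2FePO4", "NaFePO4", "LiCoO2"])
```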
3. Workflow / DAG testing
A goal can have multiple valid pathways, even with cycles. But:
| What stays the same | What can vary |
|---|---|
| The set of nodes visited | The order of traversal |
| The final memory state | The number of correction cycles |
| The tools invoked | Which tool was called first |
| The types of intermediate results | The exact values |
Test the DAG shape, not the exact path. If “retrieve → extract → validate → synthesize” is the expected workflow, assert that all four steps happened and the memory state after each step contains what downstream steps need. The path between them — whether the agent took one cycle or three, whether it backtracked — is the probabilistic part. The nodes and final state are the deterministic contract.
This turns model comparison from “which one feels better” into “which one reaches the same nodes in fewer cycles, with fewer validation failures.”
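Sketched as a test over a recorded trace (the trace and node names are illustrative):

```python
def run_agent_trace():
    # Stand-in for the recorded tool-call trace of a real run. Two runs
    # may differ in ordering and in the number of correction cycles;
    # here "extract" happened to take one retry.
    return ["retrieve", "extract", "extract", "validate", "synthesize"]

EXPECTED_NODES = {"retrieve", "extract", "validate", "synthesize"}

def test_dag_shape():
    trace = run_agent_trace()
    # Assert the node set, not the exact path:
    assert set(trace) == EXPECTED_NODES
    # An ordering constraint that IS part of the contract:
    assert trace.index("retrieve") < trace.index("synthesize")

test_dag_shape()
```

Counting repeated nodes in the trace is also where the cycles-per-task metric falls out for free.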
This post is the map. The territory is in the implementation.