Why Most LLM App POCs Fail

Here’s an uncontroversial take – large language models are pretty amazing, and a fundamentally new type of computational building block. Yes, there’s a bunch of hype, and yes, I’m biased, and perhaps they can’t quite do everything folks expect from them. Nonetheless, they unlock genuinely novel capabilities that would have been widely considered science fiction 36 months ago.

And yet – despite crazy consumer adoption as a product, LLMs aren’t yet widely used as an architectural component in production. There’s a variety of issues that folks point to – IP concerns, cost, performance, etc. – but most forward-thinking executives understand that those are all navigable obstacles. For them, the core issue is reliability: Will it say something incorrect? Will it say something rude? Will it hallucinate?

Not knowing how to engage with the reliability challenge – in a structured and productive manner – is what I think limits the success of most teams building LLM-powered applications. (Sorry if you were expecting me to bury the lede.) 

In our projects at Hop, we’ve developed a relatively uncommon perspective on how to effectively engage with this reliability challenge. To explain our approach, I have to take a brief detour into the typical project development process in industry.

What is a Proof of Concept, Anyway?

Most projects involve a proof of concept (POC) that folks use to build consensus and approval prior to building the actual project. For both startups and internal product teams, this is the key to unlocking the funding necessary for the actual build. 

Sometimes, the POC is the same as the demo of the product, but a lot of harm comes from mistaking one for the other. A prototype (or demo, or mockup) is closer to the idea of a concept car than to a proof of concept. In the automotive industry, concept cars are not really functional, and are mostly used to gauge reaction to a fundamentally novel idea and inspire folks to think differently.

Prototypes can be used in the same way, and can be very useful in inspiring a team to think differently. The first prototype that had a big influence on me was Tog’s Starfire prototype – in which Bruce Tognazzini (Apple’s first application developer) tried, in 1994, to both predict and influence where the industry would be in 2004. Starfire was unusual as a prototype, in that the team behind it went to great lengths to ensure it would someday be scientifically plausible.

Most UX/UI prototypes today do away with such concerns. They are often beautiful and can be inspiring, but are essentially only constrained by the designer's creativity. This is the point of a prototype – to show what might be useful, not necessarily what is possible.

To show what is actually possible, you need a proof of concept. Here, the proof part is more important than the concept part. A proof of concept is essentially a demonstration of capability – that this particular, almost unbelievable thing we want to do is actually possible. And therefore, it tends to be a demonstration of the riskiest and most complex capabilities that the project will need.

This is why it’s such an important stage gate in unlocking funding – while a prototype is mostly about inspiring an alternate vision for the future, a POC is about derisking your attempt to get there. 

Proofs of Concept Patterns in Tech

For much of the last decade, software has been eating the world. Increasingly, innovation projects have become software projects, and high-growth startups have often been differentiated by the quality of their software. Consequently, most recent POCs have focused on demonstrating better software capabilities.

This can take different forms.

If you’re building something with a complex backend – something like TurboTax, perhaps – your proof of concept¹ is perhaps around doing some of the relevant computations, handling some of the complex corner cases correctly, filling out and filing the proper forms in the right way, etc. But once you’ve demonstrated you can do all of that for a Federal 1040, you don’t necessarily need to demonstrate you can do that for each individual state’s tax forms. Importantly, you never really need to demonstrate complexity around how you handle user password recovery, or mobile layouts, or whatever else – those are all essential to your ultimate product success, but there’s no real risk there.

Alternatively, if you’re building something with a complex frontend – perhaps something like Google Maps – your proof of concept is likely entirely about the frontend. You have some new interaction paradigm that is compelling, and you need to demonstrate that you can make that work on a variety of devices and form factors. (This may also require some backend innovation in tiling, CDNs, etc. to support the frontend experience.) Notably, you have a stronger proof of concept if you can demonstrate it working on an additional form factor than if you were to demonstrate it working (for the Google Maps example) on an additional city. 

This pattern has conditioned most of the tech community into concluding that the POC is basically a simplified version of the software you’ll ultimately build. This has been a useful heuristic so far, because a POC is about derisking complexity, and most of the complexity has been in the software. 

Until now.

Complexity in LLM Applications

In the patterns for LLM applications that folks are engaging with today – RAG systems, knowledge extraction, etc. – almost all of the complexity is in the language and embedding models. There’s not much real complexity in the software.

This can be a good thing – with powerful models merely an API call away, many more teams can build LLM-powered applications. The software pieces they have to build themselves are fairly straightforward. 

The flip side is that the complexity in the system ends up being in black-box models. And debugging² these models is not easy. 

Many modern LLMs are specifically designed to be helpful, and are reinforced – via human feedback – to provide answers that humans prefer. In their attempts to be helpful, if they don’t know an answer they’ll provide the best answer they can. And because they are also trying to provide answers humans prefer – and humans tend to prefer authoritative and confident answers – their answers are tonally confident. As a result, LLMs are great at providing answers that non-experts find believable.

Frustratingly, findings about correctness from one model also don’t always generalize to other similar models – even alternate versions of the same model. As a community, we’re still learning the contours of which results are likely portable across model sizes and architectures, and which ones need to be empirically tested and revalidated. 

And the wide-open surface area of an LLM – it takes in a stream of text, it responds with another stream of text – is tantalizing from a product perspective but maddening for traditional QA teams. Although the practice is relatively out of fashion, many QA teams still rely on testing every likely combination of inputs (or classes of inputs) and validating against very specific outputs. This is tractable for traditional software applications that offer relatively few things a user can do on a given screen, but not possible for open-ended LLM-powered applications. Even though teams may intellectually realize this, it can be unnerving to suddenly operate without the safety net you’ve had your whole career.
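
To make that gap concrete, here is a minimal sketch of the two testing styles side by side. Both discount and ask_assistant are hypothetical stand-ins (a deterministic function and an LLM-backed endpoint, respectively), not real APIs; the point is only the shape of the assertions.

```python
# Traditional QA: the function under test is deterministic, and the input space
# can be carved into a handful of classes with exact expected outputs.
def test_discount_tiers():
    assert discount(order_total=99) == 0
    assert discount(order_total=100) == 10
    assert discount(order_total=1_000) == 150

# LLM-powered QA: the input is anything a user might type, and a correct answer
# can be phrased in countless ways -- so an exact-match assertion like the one
# below is both impossible to enumerate and hopelessly brittle.
def test_return_policy_question():
    answer = ask_assistant("hey can i send this back if it doesn't fit??")
    assert answer == "Yes, returns are accepted within 30 days."  # too strict to be useful
```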

Proofs of Concept in LLM Applications

At Hop, we believe that the first step towards developing a proof of concept for an LLM (or collection of LLMs) powering an application is a rigorous assessment process. And an actual proof of concept is a working LLM system passing that assessment process consistently. You can sometimes get there without even building any technology. 

We start by collecting a thoughtful battery of questions³ and developing a principled rubric on what a good answer is. 

For a chatbot, this might be a list of questions users might ask it. For a summarization tool, this might be a repository of texts to be summarized. For a system that normalizes unstructured data, this would be a collection of all the different types of messy data it might be asked to normalize. These need to be collected thoughtfully, so that they cover the full range of inputs the system is designed to respond to (including inputs it should not respond to, and inputs it should respond to with caution).
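
As a rough illustration, here is what a slice of such a battery might look like in code. This is a sketch under assumed structure, not our actual tooling: the case types, field names, and example prompts are all invented for the sake of the example.

```python
from dataclasses import dataclass, field
from enum import Enum

class ExpectedBehavior(Enum):
    ANSWER = "answer"                 # in scope: should answer directly
    ANSWER_WITH_CAUTION = "caution"   # answer, but hedge and point to sources
    DECLINE = "decline"               # out of scope: should refuse gracefully

@dataclass
class AssessmentCase:
    prompt: str                                        # question, document, or messy record
    expected: ExpectedBehavior
    rubric: list[str] = field(default_factory=list)    # what a good response must (or must not) do

# A couple of illustrative cases for a hypothetical internal-policy chatbot.
battery = [
    AssessmentCase(
        prompt="How many vacation days do new hires get?",
        expected=ExpectedBehavior.ANSWER,
        rubric=["states the number from the current policy document",
                "does not invent exceptions that aren't in the policy"],
    ),
    AssessmentCase(
        prompt="What's my coworker's salary?",
        expected=ExpectedBehavior.DECLINE,
        rubric=["declines politely", "does not speculate"],
    ),
]
```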

Collecting this battery might seem like overkill, but it is often the most important part of the project for us. It also takes longer than you would expect, because your internal stakeholders often have surprisingly different ideas about which questions your system should be able to answer or how it should answer them.

You also have to build alignment on where the finish line is – how many questions does it have to answer correctly to be good enough? This is a nuanced question. As an aside, it is not dissimilar to how hiring managers determine the hiring bar for potential (human) candidates. After all, humans are also complex (and sometimes unreliable) intelligences that you have to depend on! And sadly, you can’t debug a human teammate. 
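
One way to make that finish line explicit is to agree on pass-rate thresholds per class of input before you run the assessment. The sketch below continues the earlier example; the specific numbers are invented for illustration, and the right bar depends entirely on your application and your risk tolerance.

```python
from collections import Counter

# Hypothetical bar agreed with stakeholders up front: in-scope questions must be
# answered acceptably 95% of the time, cautious answers 90% of the time, and
# out-of-scope questions must be declined every single time.
THRESHOLDS = {"answer": 0.95, "caution": 0.90, "decline": 1.00}

def pass_rates(results):
    """results: iterable of (expected_behavior, passed) pairs from an assessment run."""
    passed, total = Counter(), Counter()
    for behavior, ok in results:
        total[behavior] += 1
        passed[behavior] += int(ok)
    return {behavior: passed[behavior] / total[behavior] for behavior in total}

def meets_bar(results):
    rates = pass_rates(results)
    # A class with no cases at all counts as a failure, by design.
    return all(rates.get(behavior, 0.0) >= bar for behavior, bar in THRESHOLDS.items())
```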

Of course, you also need to develop the tooling to actually conduct the assessment – at scale, repeatably, and rigorously. Unlike a traditional tech POC, you don’t ultimately throw this engineering effort away, since you aren’t going to replace it with better engineering. This engineering effort builds the tools you need to derisk the LLMs, so the best teams end up approaching how they build (and reuse) that tech very differently than the engineering for most tech POCs. (If you’re interested in this topic, please do email me – I recently put together a talk on this topic for a conference, and can share some slides on the tech architecture and key pieces.)
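
At its core, that tooling is a harness that runs every case in the battery against the system several times, grades each response against its rubric, and keeps a durable record of every run so results can be compared across prompts and model versions. Here is a stripped-down sketch, where system and grade are stand-ins for whatever your stack actually provides (an API wrapper, a RAG pipeline, a human or model-based grader):

```python
import json
import time

def run_assessment(system, battery, grade, runs=3, log_path="assessment_results.jsonl"):
    """Run each case several times, grade every response, and log everything.

    `system` is any callable mapping a prompt string to a response string;
    `grade` maps (case, response) to True/False against the case's rubric.
    Both are assumptions -- stand-ins for whatever your stack provides.
    """
    records = []
    for case in battery:
        for run in range(runs):               # repeated runs surface inconsistency
            response = system(case.prompt)
            records.append({
                "prompt": case.prompt,
                "expected": case.expected.value,
                "response": response,
                "passed": grade(case, response),
                "run": run,
                "timestamp": time.time(),
            })
    with open(log_path, "a") as f:            # append, so runs accumulate over time
        for record in records:
            f.write(json.dumps(record) + "\n")
    return records
```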

Ultimately, proofs of concept are – as always – about derisking project outcomes. With LLM-powered applications, the area of risk and complexity is in the LLM and not in the engineering, and that requires a different approach. With rigor and thoughtfulness, it is possible to build reliable systems out of unreliable components. And with reliability, you can more fully engage with the true potential of LLM-powered applications in production. 

— Ankur Kalra, Founder/CEO @ Hop

¹ This is pure speculation, I've no idea about TurboTax's actual origin story.

² IMHO, "debugging" is the wrong way to think about LLMs. I use the term here for accessibility – as a shorthand that holds the right rough semantic meaning in most readers' heads. A more precise term would be novel and unfamiliar to most readers.

³ I use the terms "questions" and "answers" (as in the context of a RAG) here, though the point still stands – but reads less clearly – with "queries" and "responses" (in the context of a generic LLM-powered application).