Engineer Better Research Results From a Solid Workbench

Treating the process of your work as just as important as the outcome will improve the quality of your results. All of the most successful projects I’ve seen share a common factor: they are a delight to work on. When your workspace is organized, your tools are sharp, and the goals are clear, it’s easier to stay in a flow state and do your best work. Projects that are mired in tedium, lack a good feedback loop, and don’t have a solid pattern of delivery can easily get into trouble. Without enough institutional momentum to make up for a poor engineering environment, they can fail. A lot of focus gets put on building the right thing for customers, and rightfully so, but it’s important to remember that before we can ship anything, we first have to build our workbench. Whether we do that haphazardly or intentionally can have an enormous impact on the quality of our results.

Read more

Evaluating the Evaluators: LLM Assessments in Practice

While an afternoon can be enough to get an LLM app demo working, it can take much longer to characterize and curtail unexpected LLM behavior. The process of making an LLM app reliable is mostly trial and error, involving spot-checking by the developer, reviews by product owners, and auto-evaluation. Auto-evaluation was introduced with the GPTScore paper in 2023, and by now people appreciate the need to evaluate this middle layer of LLM evaluators. At Hop, we’ve spent much of the past year working with auto-evaluation and feel that there’s a rich set of design decisions that aren’t regularly discussed. Here are some of the things we’ve been thinking about along the way.
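To make that middle layer concrete, here is a minimal sketch of one auto-evaluation step in the LLM-as-judge style, assuming an OpenAI-style client; the judge model, rubric, and 1-5 scale are illustrative assumptions, not Hop’s actual setup.

```python
# Minimal sketch of auto-evaluation (LLM-as-judge), assuming the openai Python client.
# The rubric, judge model, and 1-5 scale are illustrative choices, not a fixed recipe.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's factual accuracy from 1 (wrong) to 5 (fully correct).
Respond with only the integer."""

def auto_evaluate(question: str, answer: str) -> int:
    """Ask a judge model to score an answer; returns an integer score from 1 to 5."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    # Spot-check the judge itself on a few hand-labeled examples before trusting it.
    print(auto_evaluate("What is the capital of France?", "Paris"))
```

Checking a judge like this against a handful of hand-labeled examples is what evaluating the evaluator amounts to in practice.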

Read more

Could You Be Talking to an AI Doctor?

Think back to your last telehealth visit with a doctor. Perhaps your kid had a persistently high fever, or you had worrying chest pain. Are you sure you were interacting with a human? What makes you sure? Perhaps the doctor listened attentively to your symptoms, asked pertinent questions, and even picked up on subtle cues in your language that hinted at the severity of your condition. 

Read more

Testing Research Code: Sometimes Worth It

Machine learning researchers often don’t write tests for their code. They’re not software engineers, and their code only needs to train a model or prove out an experiment. Plus, their code changes rapidly, and in that context it’s hard to write tests that don’t immediately need rewriting. However, at Hop we’ve found that adding certain kinds of tests can actually accelerate research and increase confidence in results by improving code quality and encouraging reuse.
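As one hedged illustration of the kind of test that tends to survive rapid change, here is a small pytest-style sketch that checks invariants (an output shape, and a loss that goes down after one optimization step) rather than exact values; the toy model and names are hypothetical, not code from a Hop project.

```python
# Sketch of lightweight research-code tests: check invariants, not exact values.
import torch
import torch.nn as nn

def make_model(dim: int = 8) -> nn.Module:
    """Hypothetical toy model standing in for whatever the experiment trains."""
    return nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, 1))

def test_forward_shape():
    # The forward pass should produce one prediction per example.
    model = make_model()
    x = torch.randn(4, 8)
    assert model(x).shape == (4, 1)

def test_one_step_reduces_loss():
    # A single full-batch gradient step should move the loss downhill.
    torch.manual_seed(0)
    model = make_model()
    x, y = torch.randn(32, 8), torch.randn(32, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    loss_before = loss_fn(model(x), y)
    loss_before.backward()
    opt.step()
    with torch.no_grad():
        loss_after = loss_fn(model(x), y)

    assert loss_after.item() < loss_before.item()
```

Tests like these take minutes to write, rarely need rewriting when the architecture changes, and catch the silent shape and training bugs that cost researchers the most time.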

Read more

Why Most LLM App POCs Fail

LLMs aren’t yet widely used as an architectural component in production systems; the core issue is reliability. Not knowing how to engage with that reliability challenge in a structured, productive way is, I think, what limits the success of most teams building LLM-powered applications. In our projects at Hop, we’ve developed a relatively uncommon perspective on how to engage with it effectively.

Read more

Machine Learning Is About Statistics After All: A Series of Vignettes, Part 1

Over the past decade, the importance of traditional statistics in data science has steadily diminished. It’s now possible to train complicated models while understanding very little about how they work. There’s a widespread attitude among practitioners that it’s enough to know how to code up architectures in PyTorch and correct obscure bugs, and that the math is someone else’s problem. We at Hop put ML models into production, and we’re here to tell you that the math is not someone else’s problem.

Read more

Code Quality for Research

I view research (and especially applied research of the type that Hop does) as a kind of multi-armed bandit problem: one that balances trying new approaches (exploration) against sticking with approaches that already work (exploitation). The code quality/technical debt conversation is usually muddled these days, but it becomes easier to think about once you articulate where on the exploration/exploitation spectrum you currently sit.
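To make the bandit framing concrete, below is a minimal epsilon-greedy sketch; the arms stand in for research directions, and the reward and epsilon value are placeholder assumptions about how you might measure progress and budget exploration.

```python
# Minimal epsilon-greedy bandit: a stand-in for choosing between research directions.
# The arms, rewards, and epsilon=0.2 are illustrative, not a prescription.
import random

arms = ["new architecture", "known-good baseline", "data cleanup"]
counts = {a: 0 for a in arms}
totals = {a: 0.0 for a in arms}
epsilon = 0.2  # fraction of time deliberately spent exploring unproven approaches

def choose_arm() -> str:
    untried = [a for a in arms if counts[a] == 0]
    if untried:
        return random.choice(untried)                         # try everything once first
    if random.random() < epsilon:
        return random.choice(arms)                             # explore
    return max(arms, key=lambda a: totals[a] / counts[a])      # exploit best average payoff

def update(arm: str, reward: float) -> None:
    counts[arm] += 1
    totals[arm] += reward

# In research terms: each "pull" is time spent on an approach, the reward is whatever
# progress metric you care about, and epsilon is how much time you protect for new ideas.
```

Where you set epsilon, in effect, is a statement about how much exploration the project can afford, which is exactly the question the code quality conversation keeps circling.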

Read more

Hiring Your Minimum Viable Machine Learning Team

A question we often get from executives exploring machine learning for their organizations is: "What is the minimum viable machine learning team?" There are likely many right answers, and some industries have unique constraints. However, in our experience, a minimal-but-effective ML team requires a few specific roles to be filled. In prioritized order, we believe these to be…

Read more