On Tuesday, June 3rd, Distributional hosted a dinner and discussion with AI platform and engineering leaders responsible for scaling AI applications. The buzz of bleeding-edge conversation was a mismatch for the old-school Delmonico environs. Guido Appenzeller from a16z and Venkat Varadachary teed up the discussion with Scott Clark. Then engineering leaders from American Express, BlackRock, C3, Citi, Cox, DeepMind, EY, Google, Guy Carpenter, IBM, Morgan Stanley, Mozilla, Nasdaq, Netflix, PayPal, Protego, Shopify, Two Sigma, and Walmart provided diverse perspectives on similar problems across a variety of use cases. When I finally looked down at my watch, I couldn’t believe three hours had flown by. Here are my takeaways on a few consistent themes I heard through the evening.
There isn’t a standard stack, but we are moving quickly toward one. One AI leader described relying heavily on services and consulting for now to build out multiple types of platforms and ways of scaling, planning to pay down this “tech debt” later once they had more proven, scaled use cases. Another described investing heavily in just a few critical components related to infrastructure and observability to reduce the long-term tech debt of running these systems, while letting the team iterate on the rest. And those more mature with AI described having a relatively standard deployment stack, paired with significant ongoing support from a large engineering team to protect their investment in the technology.
The best advice I heard through the evening was to find ways to standardize areas that will be relatively consistent regardless of the ultimate product direction—core infrastructure and core software supporting it—even during this time of relative uncertainty regarding every component of the stack.
A significant part of the discussion revolved around observability, testing, monitoring, evaluation, and analysis of these non-deterministic systems (I promise we didn’t bias the conversation). The first part of this discussion tended to focus on the need for continuous testing in production, rather than one-time testing in development. One engineer described using statistical analysis to select rows of data (rather than random sampling) for review by a team of analysts, who would then determine whether to pass them on to the modeling team for further development. Another described computing distributions of properties from their unstructured data, at which point I pitched them Distributional. A third described running more rudimentary checks that mimicked classic synthetic testing to understand when these LLMs were down and what was driving it.
In all cases, this type of “monitoring” was more robust than a classic threshold on a statistic, and typically involved deeper analysis with heavier computation to understand what was actually happening with the use or availability of their GenAI products.
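To make that distinction concrete, here is a minimal sketch of what distribution-based monitoring could look like, as opposed to a fixed threshold on a single summary statistic. Everything here is hypothetical: it assumes you extract scalar properties (response length, digit ratio, and so on) from unstructured LLM outputs and compare a production window against a baseline with a two-sample test.

```python
# Hypothetical sketch: distribution-based monitoring of LLM responses,
# rather than a fixed threshold on a single summary statistic.
from scipy.stats import ks_2samp


def extract_properties(responses: list[str]) -> dict[str, list[float]]:
    """Compute scalar properties from unstructured text responses."""
    return {
        "length": [float(len(r)) for r in responses],
        "digit_ratio": [
            sum(c.isdigit() for c in r) / max(len(r), 1) for r in responses
        ],
    }


def drift_report(baseline: list[str], production: list[str], alpha: float = 0.01):
    """Flag properties whose production distribution departs from baseline."""
    base_props = extract_properties(baseline)
    prod_props = extract_properties(production)
    flagged = {}
    for name in base_props:
        stat, p_value = ks_2samp(base_props[name], prod_props[name])
        if p_value < alpha:  # shift in distribution shape, not just the mean
            flagged[name] = {"ks_statistic": stat, "p_value": p_value}
    return flagged
```

The point of comparing whole distributions is that it can catch changes a mean-plus-threshold alert misses entirely, such as a response population splitting into two modes while the average stays flat.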
The second part of this broader discussion around testing focused on the need for evals to go beyond response quality. When something goes wrong, analysis of response quality can often reveal that there is an issue, but it is rarely enough to discover what happened or to do any robust root cause analysis. This relates to a subtheme that most issues stem from the components surrounding the LLM rather than the LLM itself. One AI engineer described spending two days debugging a degradation in response quality for an agent performing tasks on policy documents, only to find the issue was in the router. Another engineer described finding an issue in the retrieval mechanism of their knowledge RAG application. In these and related cases, the theme was the need to eval all components of an LLM system, not just the responses, and to do so by looking at the underlying data from each of them.
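As an illustration of what component-level evals might look like, here is a hypothetical sketch that records the input, output, and latency of each stage of a pipeline (router, retriever, generator), so a regression in response quality can be traced back to the stage that caused it. The component names and structure are assumptions for illustration, not anyone’s actual stack.

```python
# Hypothetical sketch: trace every component of an LLM pipeline,
# not just the final response, so root cause analysis is possible.
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Trace:
    steps: list[dict[str, Any]] = field(default_factory=list)

    def run(self, name: str, fn: Callable[[Any], Any], payload: Any) -> Any:
        """Execute one component and record its input/output for later evals."""
        start = time.perf_counter()
        output = fn(payload)
        self.steps.append({
            "component": name,  # e.g. "router", "retriever", "generator"
            "input": payload,
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output


def answer(question: str, route, retrieve, generate) -> tuple[str, Trace]:
    trace = Trace()
    index = trace.run("router", route, question)
    docs = trace.run("retriever", retrieve, (index, question))
    response = trace.run("generator", generate, (docs, question))
    return response, trace  # evaluate trace.steps, not just response
```

With per-component data in hand, the same distributional checks described above can be applied to router decisions or retrieval scores, rather than only to final responses.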
A few engineers in the room bemoaned the death of modeling with the rise of LLMs as APIs. Most work is now focused on engineering rather than the more artful data science tasks of creating features and getting them to work on a given problem. Many of the former data scientists present said most of their work revolved around solving engineering scaling problems in one form or another. One example was the aforementioned use case of applying statistical analysis to identify the right sample of data to send to a human analyst for review as part of a continuous testing pipeline. Another involved deployment-related scaling problems, where simply running an LLM reliably at scale turned out to be more challenging than expected. The consistent theme was that in the age of GenAI, scaling problems have replaced modeling problems.
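For the sampling use case, here is a hedged sketch of one way the selection could work: score each row by how unusual its properties are relative to the batch and send only the most anomalous rows to analysts, instead of a random sample. The scoring scheme is an assumption for illustration; the engineer’s actual analysis was not described in detail.

```python
# Hypothetical sketch: pick the most anomalous rows for human review
# instead of sampling at random.
import numpy as np


def select_for_review(features: np.ndarray, k: int = 20) -> np.ndarray:
    """Return indices of the k rows farthest from typical behavior.

    `features` is an (n_rows, n_properties) matrix of scalar properties
    computed per response (length, retrieval score, toxicity proxy, ...).
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-9           # avoid division by zero
    z = np.abs((features - mean) / std)         # per-property z-scores
    anomaly_score = z.max(axis=1)               # worst property per row
    return np.argsort(anomaly_score)[::-1][:k]  # most anomalous first
```

Rows that analysts confirm as real issues would then flow to the modeling team, closing the continuous-testing loop described earlier.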
Attendees spanned a wide range of the AI maturity curve, from some who had invented the technology underpinning modern AI to others who hadn’t yet used LLMs on any meaningful problem. I expected those early in their journey to progress through all the same steps as those further along. Instead, the consistent theme I heard from them was a plan to leapfrog directly to agents.
One AI leader in financial services said that LLM performance on many of the predictive tasks at the heart of their bread-and-butter ML work was underwhelming and expensive, so those efforts haven’t gone anywhere. But early tests of agents for data pipelines were promising, so their plan is to leapfrog directly to this technology for at-scale use cases. As earlier adopters struggle with ROI from basic use cases like multi-turn chat, I suspect they’ll go down this agentic data pipeline path as well in search of net-positive ROI business cases that scale.
Finally, Anthropic CEO Dario Amodei’s prediction that “unemployment would hit 10-20% in 5 years” fueled discussion around how we could stay on the cutting edge. One AI leader described encouraging his child, a recent computer science graduate from a top university, to lean into these tools and become a “superpowered” employee with an edge in a tough job market for entry-level workers. Another AI consulting leader described using AI daily to scope projects for customers and produce detailed execution plans, which he has used with minimal refinement to win multiple customer projects. Many others discussed coding agents as a ubiquitous and essential part of their daily work. In all cases, the conclusion was to embrace AI and find ways to leverage it to become a superpowered employee.
It was a fun evening, and I appreciate everyone who brought their unique insights to the discussion. I look forward to hearing how these themes progress at the next event, whether that is progress along the agentic journey or something entirely new that nobody predicted. We are already planning more of the same for San Francisco Tech Week in October; follow us on LinkedIn to learn when our next event is. I’ll see you there.