What's AI Ops?
I've always had a soft spot for Ops in general. Ops teams and initiatives are often the "center of activity" for company transformations -- especially when new software categories are involved.
What have been the primary gaps in workflow for technical people trying to use AI?
What are the software categories within AI Ops?
In which categories do startups have an advantage versus older companies that have operated in adjacent categories?
You can listen to the conversation or else read the lightly edited transcript below. Enjoy!
Allison: Sarmad, thank you so much for joining us on the podcast today to talk about everything related to AI Ops.
Sarmad: Thank you for having me, Allison.
A: To get started, what have been the primary gaps in workflow for technical people trying to use AI?
S: Just that it's such a dynamic space right now and there are moving parts at every layer of the stack. There are new foundation models that are coming out with different capabilities. It's not just LLMs anymore. There's increasingly video, audio, text and image models. Simply figuring out what's even possible is the first step that anyone has to take today. And then you get started thinking about the use case or problem you want to solve.
Even if you’re technical, let's say you’re a software engineer, you may not have used AI before in this way. It's actually new even for ML engineers, because the traditional development journey has always been to start with data, train a model for a specific use case, deploy that and then the cycle continues from there. But now you can get started with these very sophisticated models from scratch without having to do any of that pre-work. And yet there's no tooling to support this new kind of workflow. So you have to chain together a bunch of different tools that are in their infancy today. Everybody's trying to figure out what that development workflow looks like.
Even if you're technical, there are parallels to this old world of how AI happened or how software development happened. But it's a drastic shift. Software engineers are used to writing code in a way that's very deterministic. But now you can actually use an LLM almost as a compute engine and give it prompts.
People are now writing functions that are described in English about what they want to accomplish. How do you think through that or manage that work? How do you version control it? How do you evaluate the performance of it? It's very different from traditional software engineering. There are parallels to the old world, but we don't have enough of the tools built out yet across this entire workflow to be able to go end-to-end.
A: That’s great context for the rest of this discussion. I want to talk more about the people involved in AI Ops. I find it interesting that with the advent of new technology, new roles can be created for the people who use that technology. But in turn, those people influence the technology that they are using. I think understanding that dynamic and how innovation happens is really important. What new roles are being created in the realm of AI Ops, and then how are those specialized roles in turn influencing what kind of software is being developed?
S: On the one hand, when you think of AI, you see the word “copilot” everywhere. There's this amazing idea that I can have copilots for different tasks that I'm already doing that make me more productive. Marketers, writers, and artists are using AI for the first draft, for example. So, in that case, the roles haven't changed. But your capabilities and your workflow have evolved.
Then there are people who are actually developing AI applications, and there are software engineers involved in that. But it's actually a team effort now. Traditional ML has often been pretty lonely because you had an ML research scientist who was trying to build a model (end-to-end from scratch usually) and then that fed into a product and there was some kind of loop.
But now with generative AI, that loop is a lot more collaborative even at its infancy. Product teams might have requirements that they want to see performed by AI agents or AI workflows. Then you might have reviewers who are trying to evaluate whether the application is actually doing those things. What will these roles look like in the future? Is it purely software engineers who are AI engineers going forward, or is it a hybrid of a product team? It’s unclear.
We can think about producers vs. consumers. It might be a data scientist on my team who creates the growth metrics dashboard for me. But then the consumers of that dashboard may be anyone in the organization.
The people actually building AI applications will still be fairly specialized. It won't necessarily be ML researchers doing that, but I think it'll still be product teams, engineers, and designers who are really adding AI into their applications.
A: What professional opportunities do you think there are for folks in this field?
S: The hard stuff that still exists beyond just AI Ops is how to extract better performance out of these models. For example, we've talked about fine-tuning LLMs. But we can also think about fine-tuning embedding models to create better embeddings that result in better retrieval. So there are opportunities in specialization to really understand LLMs’ transformer architecture and advance some of the core competencies of the models themselves.
Then there’s a new class of engineers that is being created. I don't think it's just prompt engineering. Software engineering is changing more holistically.
Software engineers are not going away anytime soon. Their roles will become more interesting because they will have a few more tools to play with. The people who are really able to specialize in understanding the capabilities and the limitations of these models will be in short supply for the foreseeable future.
If you're looking to figure out what to study in college, learning computer science and software engineering is still probably the best thing you can start with. Try to learn how LLMs actually work, how foundation models function, how you can extract better performance from them, and what are the steps in the workflow that you can actually tweak and control.
A: How would you describe the emerging world of AI Ops?
S: In a single sentence, it's about filling these gaps in the workflow. It's not a complete revolution of tooling, but it's essentially trying to bridge that gap between software engineering, what used to be ML engineering, and this new AI engineering that's emerging. Concretely, that means the traditional ML Ops tools that have existed so far were all for ML engineers who were trying to train models. None of them were geared towards software engineers. So the first thing is that the target persona of AI Ops is different. It's not core ML research scientists. It's software engineers and product teams that are trying to productionize AI.
The second aspect of AI Ops is: How do you fit AI development into the wider workflow and toolkit that software engineers already have and that has been built up over decades?
What that concretely means is, figuring out how to start from prompt engineering as the very first thing that everybody does, and then tune the model itself. There are still those hyper-parameter-style sweeps that you can do to tweak the settings of a model.
Then, how do you figure out what model to use? How do you connect that to the data? Every company has some kind of specific use case or custom data that they want to leverage. So, how do you connect these powerful models to those data sources? Do you need to create a new model with new model weights, which people are calling fine-tuning? Or can you just do some prompt engineering with vector databases? How do you evaluate the performance of this? How do you do version control? All of these are moving parts, so how do you track all of that? How do you cache things? LLM performance is another thing that people are thinking about.
There are parallels to the old world of monitoring, observability, caching, but we need to rethink those concepts for this new space.
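To make the caching question concrete, here is a minimal Python sketch. It is an illustration rather than anything described in the conversation: `CachedLLM`, `cache_key`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real model call. The idea is that the cache key has to cover everything that can change the output (model name, prompt, and settings), not just the prompt text.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(model: str, prompt: str, params: dict) -> str:
    """Build a stable key from everything that affects the completion."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedLLM:
    """Wraps a model-calling function with an in-memory cache (hypothetical helper)."""

    def __init__(self, call_model: Callable[[str, str, dict], str]):
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, prompt: str, **params) -> str:
        key = cache_key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(model, prompt, params)
        self._store[key] = result
        return result


def fake_model(model: str, prompt: str, params: dict) -> str:
    # Stand-in for a real API call; an assumption made for this example only.
    return f"[{model}] echo: {prompt}"


llm = CachedLLM(fake_model)
first = llm.complete("model-a", "Hello", temperature=0.0)
second = llm.complete("model-a", "Hello", temperature=0.0)  # served from cache
third = llm.complete("model-a", "Hello", temperature=0.7)   # new settings, new call
```

A cache like this mostly makes sense when a repeated answer is acceptable, for example at low temperature or during offline test runs.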
A: It's helpful to hear you disaggregate AI Ops into its components. I'm also wondering, what AI Ops software categories do you expect to emerge?
S: Traditional developer workflows are generally split up into the inner loop, the outer loop, and monitoring.
The inner loop involves rapid iteration and experimentation. You’re seeking immediate feedback and have a fast feedback loop. Jupyter Notebooks are an example of that for Python development.
Then there's the outer loop of deployment. You might have integration tests set up for your application. Or you might have evaluators that are running in the background. Or you have data pipelines that are ingesting things that you need. The outer dev loop takes a little bit longer. Maybe it runs nightly, for example.
Then there's the monitoring part where you’ve deployed something in production and now you want to observe it, monitor it, maybe even have a feedback loop back into your inner dev loop.
Those categories of inner dev loop, outer loop deployment and monitoring are going to remain the same, but we'll see a lot more cohesion of how they develop for AI Ops specifically. You can imagine prompt engineering and parameter tuning being part of the inner dev loop, when you’re iterating on the system prompt, adding different kinds of vector databases or embedding models to retrieve custom data. Then there's caching. These core categories of AI Ops will look very familiar to someone who's built developer workflows before.
A: Are you bullish on a couple of those categories in particular?
S: That inner dev loop is probably the first thing we'll see rapid improvements in. We're underestimating how much we can do with just prompt engineering right now. Everyone we've talked to is interested in fine-tuning. They look at that as a North Star of eventually being able to own their own model or extract the most performance they can out of it.
But if you look at the tools and applications that have really succeeded so far, a lot of them were able to go a long way just by doing chain-of-thought prompting, or injecting facts into a context that you then used the LLM to analyze. So figuring out how to make the inner dev loop efficient will probably be the first and most visible part of this. It’s also something that's achievable today.
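As a rough illustration of injecting facts into a context and asking for step-by-step reasoning, here is a small Python sketch. It rests on several assumptions: the function names are hypothetical, the facts are made-up sample data, and simple word overlap stands in for the embedding-based retrieval a real system would more likely use.

```python
import re
from typing import List


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_facts(question: str, facts: List[str], k: int = 2) -> List[str]:
    """Rank facts by word overlap with the question (stand-in for embedding retrieval)."""
    question_tokens = tokenize(question)
    ranked = sorted(
        facts, key=lambda fact: len(question_tokens & tokenize(fact)), reverse=True
    )
    return ranked[:k]


def build_prompt(question: str, facts: List[str], k: int = 2) -> str:
    """Assemble a prompt that injects the selected facts and asks for stepwise reasoning."""
    context = "\n".join(f"- {fact}" for fact in top_facts(question, facts, k))
    return (
        "Use only the facts below to answer.\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}\n"
        "Think through the problem step by step, then give the final answer."
    )


# Made-up sample facts, for illustration only.
facts = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9am to 5pm.",
    "Shipping to Canada takes 5 to 7 business days.",
]
prompt = build_prompt("How many days is the refund window?", facts, k=1)
print(prompt)
```

The resulting string would then be sent to whichever model is in use; that call is left out here because it depends on the provider.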
As the adoption curve continues to mature, we'll see more tools for fine-tuning and more complex workflows build up. But right now, if you think about what everybody is doing when they explore LLMs and foundation models, they're playing with them and experimenting with them and trying a bunch of stuff out. We will see a lot more tooling that enables that to be done in a systematic way.
A: What do you think is holding back companies from adopting AI more significantly within their workflows?
S: The tools, libraries, frameworks that exist today make it very easy for you to prototype something really quickly. In less than 50 lines of code, you can take a YouTube video, extract the transcript from it, and summarize it with an LLM.
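For a sense of what such a prototype looks like, here is a hedged Python sketch of the transcript-summarization flow. It does not use any real YouTube or LLM library: `fetch_transcript` and `complete` are hypothetical callables that would be supplied by whatever transcript and model clients are in use, and the `fake_*` functions below are stubs so the example runs on its own. The chunk-then-combine approach is one common way to handle long transcripts, not necessarily what any particular framework does.

```python
from typing import Callable, List


def chunk_text(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_video(
    video_id: str,
    fetch_transcript: Callable[[str], str],
    complete: Callable[[str], str],
    max_words: int = 200,
) -> str:
    """Summarize each transcript chunk, then combine the partial summaries."""
    chunks = chunk_text(fetch_transcript(video_id), max_words)
    if not chunks:
        return ""
    partials = [complete(f"Summarize this transcript excerpt:\n{chunk}") for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    return complete("Combine these partial summaries into one:\n" + "\n".join(partials))


# Stubs so the sketch is self-contained; real clients would replace these.
calls: List[str] = []


def fake_fetch_transcript(video_id: str) -> str:
    return "word " * 450  # pretend transcript of 450 words


def fake_complete(prompt: str) -> str:
    calls.append(prompt)
    return f"summary {len(calls)}"


summary = summarize_video("example-id", fake_fetch_transcript, fake_complete)
```

With a 450-word stub transcript and 200-word chunks, this makes three chunk-level calls and one combining call.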
But today you can't take that from the prototype phase to production, where you have entire pipelines or workflows built, without a lot of customization. Customization means you have to be an expert at every step of this developer workflow.
Organizations are struggling to evaluate the performance of things that are in production. It's been eight and a half months since ChatGPT came out, but I haven’t really seen many companies with ChatGPT-powered chatbots out there in production yet.
There are a few plugins that were added for ChatGPT with some integrations with external partners, but nothing's really taken off. That shows you that it's not that easy to just call the OpenAI API and call it a day. There are all of these steps you have to take and evaluate, and then you need rules-based engines to guarantee some guardrails around model behavior. That makes it really hard to actually put the model into production.
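One way to picture a rules-based guardrail is a set of checks that run on every model output before it reaches a user. The Python sketch below is an assumption-laden illustration: the `Rule` class, the three example rules, and the fallback message are all hypothetical, and real guardrails would likely be more extensive.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rule:
    name: str
    violates: Callable[[str], bool]  # returns True when the text breaks the rule


# Example rules, invented for illustration.
RULES: Sequence[Rule] = (
    Rule("no_email_addresses", lambda t: re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", t) is not None),
    Rule("max_length", lambda t: len(t) > 500),
    Rule("no_refund_promises", lambda t: "guaranteed refund" in t.lower()),
)

FALLBACK = "Sorry, I can't help with that. Let me connect you with a person."


def violated_rules(text: str, rules: Sequence[Rule] = RULES) -> List[str]:
    """Return the names of all rules the text violates, in rule order."""
    return [rule.name for rule in rules if rule.violates(text)]


def guarded_reply(model_output: str, rules: Sequence[Rule] = RULES) -> str:
    """Pass the model output through only if no rule is violated."""
    return FALLBACK if violated_rules(model_output, rules) else model_output


print(guarded_reply("Your order ships tomorrow."))
print(guarded_reply("Email me at a@b.com for a guaranteed refund."))
```

Keeping the rules as plain data makes it possible to log which rule fired, which helps when evaluating behavior in production.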
There's value in the prototype phase because that’s how you get a whole new generation of engineers to experiment with this. But then to take those prototypes into production, you need another phase of tools that currently don't exist.
A: I want to tap into your inner contrarian. What do you think are the areas of AI Ops that are not talked about as much as they should be?
S: Offline workflows. Everybody's very excited by chatbots. When ChatGPT came out, the obvious use case that stood out was to enable customer support orgs to be more efficient by having AI-powered chatbots to talk directly to customers. That's fine. But I actually think the value of these models goes way beyond the customer-facing side of things. The workflows that most organizations run today that are powered by traditional models are actually offline workflows. So, for example, even on the customer support side, you can use LLMs to evaluate the quality of calls that happen with human operators and have that as a feedback loop built into your organization. That's an offline workflow where you might evaluate millions of transcripts, for example.
People are already using LLMs in production for data annotation. An LLM could review the first draft, then you could have human reviewers involved. An example of an offline workflow on the developer tool side is running 5,000 different versions of a prompt with five different models. You can then have that as almost an integration test suite for the AI part of your applications.
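The prompt-versions-times-models idea can be sketched as a small grid runner. This Python example is illustrative only: `run_grid` and `pass_rate_by_model` are hypothetical helpers, and the two "models" are trivial stand-in functions rather than real LLM clients, so the whole thing runs offline.

```python
from itertools import product
from typing import Callable, Dict, Tuple


def run_grid(
    prompts: Dict[str, str],
    models: Dict[str, Callable[[str], str]],
    check: Callable[[str], bool],
) -> Dict[Tuple[str, str], bool]:
    """Run every prompt version against every model and record pass/fail."""
    results = {}
    for (model_name, model_fn), (prompt_name, prompt) in product(
        models.items(), prompts.items()
    ):
        results[(model_name, prompt_name)] = check(model_fn(prompt))
    return results


def pass_rate_by_model(results: Dict[Tuple[str, str], bool]) -> Dict[str, float]:
    """Aggregate the grid into a pass rate per model."""
    totals: Dict[str, Tuple[int, int]] = {}
    for (model_name, _), passed in results.items():
        passed_count, total = totals.get(model_name, (0, 0))
        totals[model_name] = (passed_count + int(passed), total + 1)
    return {name: p / n for name, (p, n) in totals.items()}


# Stand-in models and prompts, invented for this example.
models = {
    "upper-model": lambda prompt: prompt.upper(),
    "echo-model": lambda prompt: prompt,
}
prompts = {
    "v1": "Reply with the word OK.",
    "v2": "reply with the word ok.",
}


def check(output: str) -> bool:
    return "OK" in output


results = run_grid(prompts, models, check)
rates = pass_rate_by_model(results)
```

Scaled up to thousands of prompt versions, the same structure would typically run as a batch job, with the pass rates tracked over time much like an integration test report.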
A: Within AI Ops, are there certain categories where you think startups will have an advantage versus companies and incumbents that are in adjacent categories?
S: Startups have agility and speed. They can do things that require rapid experimentation, rapid change of direction, completely overhauling and trying really crazy revolutionary things. In a larger organization, you'd have to get buy-in from five different stakeholders just to try that.
Large companies excel at things that require scale. If you're trying to build a foundation model from scratch, you’d better have a lot of capital or else be a large company. Open source will continue to build better and better models. But some of the best open source models came from large companies. Llama 2 came from Meta, for example.
But I actually think startups will be the ones defining developer workflows. That’s because it requires a lot of rapid iteration and experimentation to win the community’s buy-in, which startups are better positioned to do.
A: On this train of thought about how much room there is for startups to build big companies in this area: how many startups do you think could become big companies in the realm of AI Ops? And by big, I mean $100 million ARR companies growing at 70 plus percent per year?
S: It’s too early to say. But my hypothesis is that there won’t be too many. This will be consolidated over time and maybe even quickly. If I was an organization looking to leverage an AI Ops solution, I probably wouldn’t want to chain together five or six different solutions to build my own AI Ops workflow. I'd much rather have something that can help me go end-to-end and customize pieces of that.
So far, what we've seen is a lot of companies looking specifically into vector databases, embedding retrieval, or prompt engineering. It's very, very specific parts of the workflow, which makes sense because startups have limited resources, limited time. They want to focus.
But the companies that are able to provide that end-to-end cohesive workflow for developers are the ones that will probably succeed at growing ARR and other metrics more than just GitHub Stars, because that stuff is hard. Companies talk about moats quite a bit, and being end-to-end creates a moat. That's really hard to do. That's why we haven't seen this moat emerge even from large companies yet.
A: I often find that in the early days, certain startups are trying to evangelize about categories that they want to create, but that don't necessarily deserve to be created. Are there any “fake” categories that have set off red flags for you?
S: It's not so much about fake categories, but more about things that are perhaps too early to invest in right now. I alluded to one: Fine-tuning is valuable, but it's early in the maturity curve. So, if you’re a startup that's working on that, be prepared to wait it out. Or invest now but realize that adoption will probably accelerate later.
The other one that I see a lot of hype around is agents. It's really cool to see LLMs be orchestrators of something complex. You ask the LLM to break something down into discrete tasks and then it orchestrates the execution and resolution of those tasks. That's a really interesting area. But the way people have been looking at it is predominantly to try to build these general purpose agents that can do everything. I don’t think that's possible to actually productionize.
A: Now that we've got the lay of the land in terms of AI Ops as this emerging space, I want to talk more specifically about what you are doing. You are building a company that can help shape the future of AI Ops. So, can you tell us a little bit about what LastMile is, what inspired you to build it, and what's the perspective on the world that your company represents?
S: I'll start with our motivation for building it. A little bit of background on me and the rest of the LastMile team: we all came from Meta. We were involved in building most of the developer tool stack that ML engineers and data scientists used at Meta. That includes the Jupyter Notebook platform that was built from the ground up, experimentation, model management, and a bunch of other things that thousands of ML engineers at the company use. We understood the ML developer workflow and ML Ops really well from that experience.
With generative AI coming on the scene, we realized that developer workflows needed to evolve and would require a new set of tools, catered toward software engineers instead of ML researchers. We’re building those at LastMile.
The reason we named it LastMile is we want to bridge that last mile gap between these amazing foundation models and your application. We want to help you experiment with, personalize, and integrate foundation models into your application.
Right now we are focused on the playground side, or the inner dev loop side, of AI Ops. We built what we’re calling “AI workbooks.” There’s a notebook interface where you can try out any kind of model, tweak it, chain it together, build a workflow out of it, and iterate on it rapidly. It looks like a notebook and has cells, so there are parallels to Jupyter Notebooks. But instead of a compute kernel, you have an ML kernel of sorts, because you're not writing code, you're prompting with words, images, or audio.
A: Sarmad, it was wonderful to interview you. Thank you so much for sharing all your insights on this space.
S: Thank you, Allison. It was great to meet you, great to chat, and hope you have a great rest of your day.