Generative AI in the Real World: Jay Alammar on Building AI for the Enterprise




Jay Alammar, director and Engineering Fellow at Cohere, joins Ben Lorica to talk about building AI applications for the enterprise, using RAG effectively, and the evolution of RAG into agents. Listen in to find out what kinds of metadata you need when you’re onboarding a new model or agent; discover how an emphasis on evaluation helps an organization improve its processes; and learn how to take advantage of the latest code-generation tools.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Check out other episodes of this podcast on the O’Reilly learning platform.

Timestamps

  • 0:00: Introduction to Jay Alammar, director at Cohere. He’s also the author of Hands-On Large Language Models.
  • 0:30: What has changed in how you think about teaching and building with LLMs?
  • 0:45: This is my fourth year with Cohere. I really love the opportunity because it was a chance to join the team early (around the time of GPT-3). Aidan Gomez, one of the cofounders, was one of the coauthors of the transformer paper. I’m a student of how this technology moved out of the lab and into practice. Being able to work at a company that’s doing that has been very educational for me. That feeds into how I teach: I use my writing to learn in public.
  • 2:20: I assume there’s a big difference between learning in public and teaching teams within companies. What’s the big difference?
  • 2:36: If you’re learning on your own, you have to run through so much content and news, and you have to mute a lot of it as well. This industry moves extremely fast. Everyone is overwhelmed by the pace. For adoption, the important thing is to filter a lot of that and see what actually works, what patterns work across use cases and industries, and write about those. 
  • 3:25: That’s why something like RAG proved itself as one application paradigm for how people should be able to use language models. A lot of it is helping people cut through the hype and get to what’s actually useful, and raise AI awareness. There’s a level of AI literacy that people need to come to grips with. 
  • 4:10: People in companies want to learn things that are contextually relevant. For example, if you’re in finance, you want material that will help deal with Bloomberg and those types of data sources, and material aware of the regulatory environment. 
  • 4:38: When people started being able to understand what this kind of technology was capable of doing, there were multiple lessons the industry needed to understand. Don’t think of chat as the first thing you should deploy. Think of simpler use cases, like summarization or extraction. Think about these as building blocks for an application. 
  • 5:28: It’s unfortunate that the name “generative AI” came to be used, because the most important things AI can do aren’t generative; they’re representational. Embeddings enable better categorization and better clustering, and help companies make sense of large amounts of data. The next lesson was to not rely on a model’s internal information. In the beginning of 2023, there were so many news stories about models being used as search engines. People expected the model to be truthful, and they were surprised when it wasn’t. One of the first solutions was RAG, which tries to retrieve the context that will hopefully contain the answer. The next question was data security and data privacy: Companies didn’t want data to leave their networks. That’s where private deployment of models becomes a priority, where the model comes to the data. With that, they started to deploy their initial use cases.
  • 8:04: Then that system can answer questions up to a certain level of difficulty; for harder questions, the system needs to be more advanced. Maybe it needs to search with multiple queries or do things over multiple steps.
  • 8:31: One thing we learned about RAG was that just because something is in the context window doesn’t mean the machine won’t hallucinate. And people have developed more appreciation of applying even more context: GraphRAG, context engineering. Are there specific trends that people are doing more of? I got excited about GraphRAG, but this is hard for companies. What are some of the trends within the RAG world that you’re seeing?
  • 9:42: Yes, if you provide the context, the model might still hallucinate. The answers are probabilistic in nature. The same model that can answer your questions 99% of the time correctly might…
  • 10:10: Or the models are black boxes and they’re opinionated. The model may have seen something in its pretraining data. 
  • 10:25: True. And if you’re training a model, there’s that trade-off; how much do you want to force the model to answer from the context versus general common sense?
  • 10:55: That’s a good point. You might be feeding conspiracy theories into the context window.
  • 11:04: As a model creator, you always think about generalization and how the model can be the best model across the many use cases.
  • 11:15: The evolution of RAG: There are multiple levels of difficulty that can be built into a RAG system. The first is to search one data source, get the top few documents, and add them to the context. Then RAG systems can be improved by saying, “Don’t search for the user query itself, but give the question to a language model and ask, ‘What query should I search for to answer this question?’” That became query rewriting. Then, to let the model improve its information gathering, give it the ability to search for multiple things at the same time: for example, comparing NVIDIA’s results in 2023 and 2024. A more advanced system would ask multiple queries and retrieve documents for each. (A minimal sketch of this query-rewriting, multi-query pattern appears after the timestamps.)
  • 13:15: Then there are models that ask multiple queries in sequence. For example, what are the top car manufacturers in 2024, and do they each make EVs? The best process is to answer the first question, get that list, and then send a query for each one. Does Toyota make an EV? Then you see agentic behavior emerging. Some of the top features are the ones we’ve described: query rewriting, using search engines, deciding when it has enough information, and doing things sequentially.
  • 14:38: Earlier in the pipeline, as you ingest your PDF files, you analyze them and take advantage of their structure. Nirvana would be a knowledge graph. I’m hearing about teams taking advantage of that earlier part of the pipeline.
  • 15:33: This is a design pattern we’re seeing more and more of. When you’re onboarding, give the model an onboarding phase where it can collect information and store it somewhere it can use in later interactions. We see a lot of metadata for agents that deal with databases. When you onboard to a database system, it makes sense to give the model a sense of what the tables are and what columns they have. You see that also with a repository, with products like Cursor: when you onboard the model to a new codebase, it makes sense to give it a Markdown page that tells it the tech stack and the test frameworks. Maybe after implementing a large enough chunk, do a check-in after running the tests. Regardless of having models that can fit a million tokens, managing that context is very important. (A sketch of collecting database metadata at onboarding time appears after the timestamps.)
  • 17:23: And if your retrieval gives you the right information, why would you stick a million tokens in the context? That’s expensive. And people are noticing that LLMs behave like us: They read the beginning of the context and the end. They miss things in the middle. 
  • 17:52: Are you hearing people doing GraphRAG, or is it a thing that people write about but few are going down this road?
  • 18:18: I don’t have direct experience with it.
  • 18:24: Are people asking for it?
  • 18:27: I can’t cite much clamor. I’ve heard of lots of interesting developments, but there are lots of interesting developments in other areas. 
  • 18:45: The people talking about it are the graph people. One of the patterns I see is that you get excited, and a year in you realize that the only people talking about it are the vendors.
  • 19:16: Evaluation: You’re talking to a lot of companies. I’m telling people “Your eval is IP.” So if I send you to a company, what are the first few things they should be doing?
  • 19:48: That’s one of the areas where companies should really develop internal knowledge and capabilities. It’s how you’re able to tell which vendor is better for your use case. In the realm of software, it’s akin to unit tests. You need to differentiate and understand what use cases you’re after. If you haven’t defined those, you aren’t going to be successful. 
  • 20:30: You set yourself up for success if you define the use cases that you want. You gather internal examples with your exact internal data, and that can be a small dataset. But that will give you so much direction. (A minimal eval-harness sketch along these lines appears after the timestamps.)
  • 20:50: That might force you to develop your process too. When do you send something to a person? When do you send it to another model?
  • 21:04: That grounds people’s experience and expectations. And you get all the benefits of unit tests. 
  • 21:33: What’s the level of sophistication of a regular enterprise in this area?
  • 21:40: I see people developing quite quickly because the pickup in language models is tremendous. It’s an area where companies are catching up and investing. We’re seeing a lot of adoption of tool use and RAG and companies defining their own tools. But it’s always a good thing to continue to advocate.
  • 22:24: What are some of the patterns or use cases that are common now that people are happy about, that are delivering on ROI?
  • 22:40: RAG and grounding it on internal company data is one area where people can really see a type of product that was not possible a few years ago. Once a company deploys a RAG model, other things come to mind like multimodality: images, audio, video. Multimodality is the next horizon.
  • 23:21: Where are we on multimodality in the enterprise?
  • 23:27: It’s very important, specifically if you are looking at companies that rely on PDFs. There are charts and images in there. In the medical field, there are a lot of images. We’ve seen that embedding models can also support images.
  • 24:02: Video and audio are always the orphans.
  • 24:07: Video is difficult. Only specific media companies are leading the charge. Audio, I’m anticipating lots of developments this year. It hasn’t caught up to text, but I’m expecting a lot of audio products to come to market. 
  • 24:41: One of the earliest use cases was software development and coding. Is that an area that you folks are working in?
  • 24:51: Yes, that is my focus area. I think a lot about code-generation agents.
  • 25:01: At this point, I would say that most developers are open to using code-generation tools. What’s your sense of the level of acceptance or resistance?
  • 25:26: I advocate for people to try out the tools and understand where they’re strong and where they’re lacking. I’ve found the tools very useful, but you need to assert ownership and understand how LLMs evolved from being writers of functions (which is how evaluation benchmarks were written a year ago) to more advanced software engineering, where the model needs to solve larger problems across multiple steps and stages. Models are now evaluated on SWE-bench, where the input is a GitHub issue: go and solve the GitHub issue, and the result is judged by whether the unit tests pass.
  • 26:57: Claude Code is quite good at this, but it will burn through a lot of tokens. If you’re working in a company and it solves a problem, that’s fine. But it can get expensive. That’s one of my pet peeves—but we’re getting to the point where I can only write software when I’m connected to the internet. I’m assuming that the smaller models are also improving and we’ll be able to work offline.
  • 27:45: 100%. I’m really excited about smaller models. They’re catching up so quickly. What we could only do with the bigger models two years ago, now you can do with a model that’s 2B or 4B parameters.
  • 28:17: One of the buzzwords is agents. I assume most people are in the early phases—they’re doing simple, task-specific agents, maybe multiple agents working in parallel. But I think multi-agents aren’t quite there yet. What are you seeing?
  • 28:51: Maturity is still evolving. We’re still in the early days for LLMs as a whole. People are seeing that if you deploy them in the right contexts, under the right user expectations, they can solve many problems. When built in the right context with access to the right tools, they can be quite useful. But the end user remains the final expert. The model should show the user its work and its reasons for saying something and its sources for the information, so the end user becomes the final arbiter.
  • 30:09: I tell nontech users that you’re already using agents if you’re using one of these deep research tools.
  • 30:20: Advanced RAG systems have become agents, and deep research is maybe one of the more mature systems. It’s really advanced RAG that’s really deep.
  • 30:40: There are finance startups that are building deep research tools for analysts in the finance industry. They’re essentially agents because they’re specialized. Maybe one agent is going for earnings. You can imagine an agent for knowledge work.
  • 31:15: And that’s the pattern that maybe grows most organically out of the single agent.
  • 31:29: And I know developers who have multiple instances of Claude Code doing something that they will bring together. 
  • 31:41: We’re at the beginning of discovering and exploring. We don’t really have the user interfaces and systems that have evolved enough to make the best out of this. For code, it started out in the IDE. Some of the earlier systems that I saw used the command line, like Aider, which I assumed was the inspiration for Claude Code. It’s definitely a good way to augment AI in the IDE.
  • 32:25: There are even new generations of the terminal, like Warp and marimo, that are incorporating many of these developments.
  • 32:39: Code extends beyond what software engineers are using. The general user requires some level of code ability in the agent, even if they’re not reading the code. If you tell the model to give you a bar chart, the model is writing Matplotlib code. Those are agents with access to a run environment where they can write and execute code on behalf of the user, who’s an analyst, not a software engineer. Code is the most interesting area of focus. (A sketch of this code-execution pattern appears after the timestamps.)
  • 33:33: When it comes to agents or RAG, it’s a pipeline that starts from the source documents to the information extraction strategy—it becomes a system that you have to optimize end to end. When RAG came out, it was just a bunch of blog posts saying that we should focus on chunking. But now people realize this is an end-to-end system. Does this make it a much more formidable challenge for an enterprise team? Should they go with a RAG provider like Cohere or experiment themselves?
  • 34:40: It depends on the company and the capacity they have to throw at this. In a company that needs a database, they can build one from scratch, but maybe that’s not the best approach. They can outsource or acquire it from a vendor. 
  • 35:05: Each of those steps has 20 choices, so there’s a combinatorial explosion.
  • 35:16: Companies are under pressure to show ROI quickly and realize the value of their investment. That’s an area where using a vendor that specializes is helpful. There are a lot of options: the right search systems, the right connectors, the workflows and the pipelines and the prompts, query rewriting, and so on. In our education content, we describe all of those. But if you’re going to build a system like this, it will take a year or two. Most companies don’t have that kind of time.
  • 36:17: Then you realize you need other enterprise features like security and access control. In closing: Most companies aren’t going to train their own foundation models. It’s all about MCP, RAG, and posttraining. Do you think companies should have a basic AI platform that will allow them to do some posttraining?
  • 37:02: I don’t think it’s necessary for most companies. You can go far with a state-of-the-art model if you interact with it on the level of prompt engineering and context management. That can get you so far. And you benefit from the rising tide of the models improving. You don’t even need to change your API. That rising tide will continue to be helpful and beneficial. 
  • 37:39: For companies that have that capacity and capability, and where this is closer to the core of their product, things like fine-tuning are where they can distinguish themselves a little bit, especially if they’ve already tried things like RAG and prompt engineering.
  • 38:12: The superadvanced companies are even doing reinforcement fine-tuning.
  • 38:22: The recent developments in foundation models are multimodality and reasoning. What are you looking forward to on the foundation model front that is still below the radar?
  • 38:48: I’m really excited to see more of these text diffusion models. Diffusion is a different type of system where you’re not generating your output token by token. We’ve seen it in image and video generation. The output in the beginning is just static noise. But then the model generates another image, refining the output so it becomes more and more clear. For text, that takes another format. If you’re emitting output token by token, you’re already committed to the first two or three words. 
  • 39:57: With text diffusion models, you have a general idea you want to express. You make an attempt at expressing it, then another attempt where you refine all the tokens at once, not one by one. Their output speed is absolutely incredible. That increases speed, but it could also enable new paradigms or behaviors.
  • 40:38: Can they reason?
  • 40:40: I haven’t seen demos of them doing reasoning. But that’s one area that could be promising.
  • 40:51: What should companies think about the smaller models? Most people on the consumer side are interacting with the large models. What’s the general sense for the smaller models moving forward? My sense is that they will prove sufficient for most enterprise tasks.
  • 41:33: True. If the companies have defined the use cases they want and have found a smaller model that can satisfy this, they can deploy or assign that task to a small model. It will be smaller, faster, lower latency, and cheaper to deploy.
  • 42:02: The more you identify the individual tasks, the more you’ll be able to say that a small model can do the tasks reliably enough. I’m very excited about small models. I’m more excited about small models that are capable than large models.
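
The query-rewriting and multi-query retrieval pattern described around 11:15 can be sketched in a few lines. This is a minimal illustration, not Cohere’s implementation: `call_llm` and `search_index` are hypothetical placeholders for whatever model client and search system you actually use.

```python
# Minimal sketch of query rewriting + multi-query RAG.
# call_llm and search_index are hypothetical stand-ins for your model client
# and your search system (BM25, vector store, etc.).
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a call to whatever LLM API you use."""
    raise NotImplementedError

def search_index(query: str, k: int = 3) -> list[str]:
    """Placeholder for your document search."""
    raise NotImplementedError

def rewrite_queries(question: str) -> list[str]:
    # Ask the model what to search for instead of searching the raw question.
    # "Compare NVIDIA's results in 2023 and 2024" should yield two queries.
    prompt = (
        "Return a JSON list of search queries needed to answer this question.\n"
        f"Question: {question}"
    )
    return json.loads(call_llm(prompt))

def answer_with_rag(question: str) -> str:
    docs: list[str] = []
    for q in rewrite_queries(question):          # one search per rewritten query
        docs.extend(search_index(q))
    context = "\n\n".join(dict.fromkeys(docs))   # dedupe while keeping order
    prompt = (
        "Answer using only the context below, and cite the passages you used.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```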
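
The onboarding pattern described around 15:33 (collect schema metadata once, cache it, and reuse it as context) might look like the following. This is a sketch using SQLite from the Python standard library; the cache path and prompt framing are assumptions.

```python
# Sketch of the onboarding pattern for a database agent: collect table and
# column metadata once, cache it, and prepend it to the model's context so it
# doesn't rediscover the schema on every request.
import json
import sqlite3

def collect_schema(db_path: str) -> dict:
    conn = sqlite3.connect(db_path)
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    schema = {}
    for table in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        schema[table] = [{"column": c[1], "type": c[2]} for c in cols]
    conn.close()
    return schema

def onboarding_context(db_path: str, cache_path: str = "schema.json") -> str:
    # Run once at onboarding time; reuse the cached summary afterward.
    schema = collect_schema(db_path)
    with open(cache_path, "w") as f:
        json.dump(schema, f, indent=2)
    return "You are querying a database with these tables:\n" + json.dumps(schema, indent=2)
```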
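
The “your eval is IP” advice around 19:48–20:30 amounts to treating a small file of internal examples like a unit-test suite. Here is one minimal way that could look; `generate_answer`, the JSONL format, and the string-match grading criterion are all assumptions to adapt to your own use case.

```python
# Sketch of treating a small internal eval set like a unit-test suite.
# generate_answer is the system under test (vendor A, vendor B, your own
# RAG pipeline); the file format and grading rule are assumptions.
import json

def generate_answer(question: str, context: str) -> str:
    """Placeholder for the system you're evaluating."""
    raise NotImplementedError

def load_eval_set(path: str = "evals.jsonl") -> list[dict]:
    # Each line: {"question": ..., "context": ..., "must_contain": [...]}
    with open(path) as f:
        return [json.loads(line) for line in f]

def run_evals(examples: list[dict]) -> float:
    passed = 0
    for ex in examples:
        answer = generate_answer(ex["question"], ex["context"]).lower()
        # Simple string matching; many teams also add an LLM-as-judge step.
        if all(term.lower() in answer for term in ex["must_contain"]):
            passed += 1
    return passed / len(examples)

if __name__ == "__main__":
    examples = load_eval_set()
    print(f"Pass rate: {run_evals(examples):.0%} on {len(examples)} internal examples")
```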
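
The analyst-facing pattern described around 32:39, where the model writes Matplotlib code and the application runs it, could be sketched like this. `call_llm` is a placeholder, and the `exec` call on model output is deliberately naive; a real system would isolate the run environment far more carefully.

```python
# Sketch of an analyst-facing agent: the model writes plotting code, and the
# application executes it and returns the image to the user.
import matplotlib
matplotlib.use("Agg")              # render to a file; no display needed
import matplotlib.pyplot as plt

def call_llm(prompt: str) -> str:
    """Placeholder for the code-writing model."""
    raise NotImplementedError

def chart_for(request: str, data: dict[str, float], out_path: str = "chart.png") -> str:
    prompt = (
        "Write Python that uses the dict `data` and matplotlib's `plt` to satisfy "
        f"this request, saving the figure to the path in `out_path`:\n{request}"
    )
    code = call_llm(prompt)
    # Execute the generated code with only the names it needs in scope.
    exec(code, {"plt": plt, "data": data, "out_path": out_path})
    return out_path                # the analyst sees the chart, not the code
```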

