To build customer support bots, internal knowledge graphs, or Q&A systems, customers often turn to Retrieval Augmented Generation (RAG) applications, which leverage pre-trained models together with their proprietary data. However, the lack of guardrails for secure credential management and abuse prevention prevents customers from democratizing access to and development of these applications. We recently announced the MLflow AI Gateway, a highly scalable, enterprise-grade API gateway that enables organizations to manage their LLMs and make them available for experimentation and production. Today we are excited to announce that we are extending the AI Gateway to better support RAG applications. Organizations can now centralize the governance of privately-hosted model APIs, proprietary SaaS APIs (OpenAI, Cohere, Anthropic), and now open model APIs via MosaicML to develop and deploy RAG applications with confidence.
In this blog post, we’ll walk through how you can build and deploy a RAG application on the Databricks Lakehouse AI platform using the Llama2-70b-Chat model for text generation and the Instructor-XL model for text embeddings, both hosted and optimized through MosaicML Inference. Using hosted models allows us to get started quickly and gives us a cost-effective way to experiment at low throughput.
The RAG application we’re building in this blog answers gardening questions and gives plant care recommendations.
What is RAG?
RAG is a popular architecture that allows customers to improve model performance by leveraging their own data. This is done by retrieving relevant data/documents and providing them as context for the LLM. RAG has shown success in chatbots and Q&A systems that need to maintain up-to-date information or access domain-specific knowledge.
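The retrieve-then-generate flow described above can be sketched in a few lines. The `embed`, `vector_search`, and `generate` callables below are placeholders for real embedding, retrieval, and LLM components, not part of any specific library:

```python
# A minimal sketch of the RAG flow: embed the question, retrieve the
# most relevant documents, and pass them to the LLM as context.
# `embed`, `vector_search`, and `generate` are hypothetical helpers.

def answer_with_rag(question, documents, embed, vector_search, generate, k=3):
    """Retrieve the k most relevant documents and hand them to the LLM as context."""
    relevant = vector_search(embed(question), documents, k)
    context = "\n\n".join(relevant)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```

In a production system the retrieval step would query a vector index built from your proprietary documents, which is what keeps the model's answers current and domain-specific.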
Use the AI Gateway to put guardrails in place for calling model APIs
The recently announced MLflow AI Gateway allows organizations to centralize governance, credential management, and rate limits for their model APIs, including SaaS LLMs, via an object called a Route. Distributing Routes allows organizations to democratize access to LLMs while also ensuring user behavior doesn’t abuse or take down the system. The AI Gateway also provides a standard interface for querying LLMs to make it easy to upgrade models behind routes as new state-of-the-art models get released.
We typically see organizations create a Route per use case, and many Routes may point to the same model API endpoint to ensure it is fully utilized.
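As a sketch of that pattern, multiple Routes for different use cases can share one backing model configuration. The route and model names below are illustrative assumptions, not from the original post:

```python
# One shared model configuration, reused by multiple Routes so the
# underlying endpoint stays fully utilized. Names are illustrative.
shared_model = {
    "name": "llama2-70b-chat",
    "provider": "mosaicml",
}

# A Route per use case, each pointing at the same model endpoint.
routes = [
    {"name": "support-bot", "route_type": "llm/v1/completions", "model": shared_model},
    {"name": "gardening-qa", "route_type": "llm/v1/completions", "model": shared_model},
]
```

Because rate limits and credentials are attached per Route, each use case can be governed independently even though the traffic lands on a single endpoint.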
For this RAG application, we want to create two AI Gateway Routes: one for our embedding model and another for our text generation model. We are using open models for both because we want a supported path for fine-tuning or privately hosting in the future to avoid vendor lock-in. To do this, we will use MosaicML’s Inference API. These APIs provide fast and easy access to state-of-the-art open source models for rapid experimentation with token-based pricing. MosaicML supports leading open models for text completion as well as models for text embeddings. In this example, we will use Llama2-70b-Chat, which was trained on 2 trillion tokens and fine-tuned by Meta for dialogue, safety, and helpfulness, and Instructor-XL, a 1.2B parameter instruction fine-tuned embedding model from HKUNLP.
It’s easy to create a route for Llama2-70B-Chat using the new support for MosaicML Inference APIs on the AI Gateway:
from mlflow.gateway import create_route

mosaicml_api_key = "your key"

create_route(
    name="completion",
    route_type="llm/v1/completions",
    model={
        "name": "llama2-70b-chat",
        "provider": "mosaicml",
        "mosaicml_config": {"mosaicml_api_key": mosaicml_api_key},
    },
)