Our Concept in Practice
Introducing MeshRAG - unified data management for big data and GenAI
You've probably seen plenty of buzz about concepts like Retrieval-Augmented Generation (RAG) and other cutting-edge topics. But let’s cut to the chase: in my experience, no AI model or technique, no matter how advanced, can succeed without a solid foundation of high-quality data. Before optimizing machine learning models, we need to get the data right.
This article is based on a talk I presented at ODSC West on how data mesh can streamline and accelerate your GenAI initiatives. I wanted to share a summary with our Nextdata community.
Why machine learning and GenAI projects fail
Let’s rewind a bit. Machine learning projects, whether in GenAI or traditional AI, don’t typically fail because the model isn’t good enough. Often, the research team has crafted a brilliant prototype, with results that impress stakeholders. But when it’s time to deploy the model in production, cracks begin to show. Why? More often than not, the issue lies in data quality and how that data is managed in corporate environments.
This challenge is particularly pronounced in industries like finance and healthcare, where data must not only be current and high-quality but also secure, compliant, and usable within strict regulatory frameworks. The ability to get the right data at the right quality, and to ensure you’re allowed to use it, is what makes or breaks projects, whether it’s analytics for humans or RAG for AI.
RAG: unlocking the power of your data
Enter RAG, or Retrieval-Augmented Generation, a technique for combining large language models (LLMs) with your organization’s data. While LLMs are powerful on their own, their real value shows when they are augmented with custom, enterprise-specific data.
The simplified RAG pipeline
Here’s what a basic RAG pipeline looks like:
- Data Sources: It starts with your data: structured, unstructured, or multimodal. This could include PDF documents, relational databases, or data lakes like S3 or Snowflake. We often end up with a mix of these.
- Embeddings: An embedding model converts your data into vectors. Interestingly, even the embedding model itself is a form of data, and needs to be a versioned artifact to support reproducibility. Specialized embedding models can tailor this process for your use case.
- Vector Store: Embeddings are usually stored in a vector database like Pinecone, often supplemented by a metadata store (e.g. in MongoDB) to support filtering and efficient queries.
- LLMs: Finally, the LLM interacts with this ecosystem, answering user queries based on the data and embeddings provided. Fine-tuning the LLM can further improve task-specific performance.
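To make the four stages above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `InMemoryVectorStore` stands in for a vector database plus metadata store, and the final LLM call is left out, so the sketch stops at the prompt that would be sent to the model.

```python
import math
import re
from collections import Counter

# Hypothetical version label: the embedding model is itself a versioned artifact.
EMBEDDING_MODEL_VERSION = "toy-bow-v1"


def embed(text: str) -> Counter:
    """Toy embedding: term counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Stand-in for a vector database with a metadata store beside it."""

    def __init__(self):
        self.rows = []  # each row: (vector, text, metadata)

    def add(self, text: str, metadata: dict) -> None:
        self.rows.append((embed(text), text, metadata))

    def search(self, query: str, k: int = 2, where: dict = None):
        # Metadata filter first, then rank the survivors by similarity.
        candidates = [
            row for row in self.rows
            if where is None or all(row[2].get(key) == val for key, val in where.items())
        ]
        query_vec = embed(query)
        ranked = sorted(candidates, key=lambda row: cosine(query_vec, row[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:k]]


def build_prompt(question: str, store: InMemoryVectorStore, k: int = 2) -> str:
    """Assemble the retrieved context into the prompt an LLM would receive."""
    context = "\n".join(f"- {text}" for text, _ in store.search(question, k=k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


store = InMemoryVectorStore()
store.add("Data mesh decentralizes data ownership across domain teams.", {"source": "wiki"})
store.add("Quarterly revenue grew in the retail segment.", {"source": "finance"})
store.add("Embeddings must be refreshed when source data changes.", {"source": "wiki"})

prompt = build_prompt("Who owns data in a data mesh?", store, k=1)
print(prompt)
```

Swapping the toy pieces for a real embedding model, vector database, and LLM client changes the internals but not the shape: embed, store with metadata, filter and retrieve, then prompt.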
It’s worth taking a minute to clarify the difference between RAG and fine-tuning:
- RAG: Use RAG when you need up-to-date and independent knowledge from your data. The LLM combines its foundational training with real-time data updates through APIs or embeddings.
- Fine-Tuning: Fine-tuning improves the LLM’s performance for specific tasks. For example, an LLM answering data mesh queries might benefit from fine-tuning to enhance its logic while still relying on RAG for the latest data.
In most enterprise scenarios, RAG and fine-tuning are complementary, coming together to deliver optimal outcomes.
Challenges of scaling RAG
Building a single RAG pipeline is straightforward—there’s no shortage of tutorials for that. But scaling it across a large organization with diverse datasets is another matter entirely. Here are some key challenges:
- Out-of-Sync Embeddings: Data changes constantly—new data is added, old data is updated, and some data may need to be removed due to regulatory changes. Keeping embeddings synchronized with the latest data is critical.
- Data Quality: Poor data quality leads to unreliable answers. For instance, a semantic change to a field in the source data could ripple through the pipeline, compromising results.
- Compliance: Handling sensitive data, such as personally identifiable information (PII) or GDPR-relevant data, requires robust governance to ensure privacy and compliance.
- Scalability: As new data sources are added or pipelines grow, maintaining centralized control while empowering decentralized teams becomes increasingly complex.
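The out-of-sync problem in particular lends itself to a small illustration. One common approach, sketched here under assumptions rather than as a prescribed method, is to store a content fingerprint next to each embedding and compare it with the current source on every sync. The record layout and the `fingerprint` and `plan_sync` helpers are hypothetical names.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable hash of a source record's content (key order does not matter)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_sync(source: dict, index: dict) -> dict:
    """Work out which embeddings are stale.

    source: record_id -> current source record
    index:  record_id -> fingerprint recorded when the record was last embedded
    """
    to_embed = sorted(
        rid for rid, record in source.items() if index.get(rid) != fingerprint(record)
    )
    to_delete = sorted(rid for rid in index if rid not in source)
    return {"embed": to_embed, "delete": to_delete}


# State at the time of the last embedding run.
source_v1 = {
    "t1": {"title": "Blue in Green", "genre": "jazz"},
    "t2": {"title": "So What", "genre": "jazz"},
    "t3": {"title": "Removed Track", "genre": "pop"},
}
index = {rid: fingerprint(record) for rid, record in source_v1.items()}

# Later: t2 was updated, t3 was removed (say, for regulatory reasons), t4 is new.
source_v2 = {
    "t1": {"title": "Blue in Green", "genre": "jazz"},
    "t2": {"title": "So What", "genre": "modal jazz"},
    "t4": {"title": "Freddie Freeloader", "genre": "jazz"},
}

plan = plan_sync(source_v2, index)
print(plan)
```

Only changed and new records are re-embedded, and removed records are deleted from the index explicitly, which matters when the removal is a compliance requirement.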
The role of data mesh
Some background on data mesh, in case you’re not already familiar with it: it’s a new data management paradigm invented by our CEO, Zhamak, to break open the bottleneck of centralized data teams and data management while still preserving governance and the ability to scale. She wrote a 400-page book about it for O’Reilly that you can buy, or you can learn more here in one of her early blog posts on the subject.
Scaling RAG pipelines effectively requires rethinking data management. This is where the data mesh comes into play. By decentralizing data ownership while maintaining centralized governance, the data mesh ensures:
- Teams can add new data sources independently.
- Data quality and compliance are upheld.
- Pipelines remain scalable and cost-effective, and changes don’t break the system.
For example, say you’re building a song recommendation system. Two inputs—listener play events and metadata about music tracks—need embeddings. But how often should embeddings update? With orchestrated policies, these decisions can be automated, ensuring high-quality results while reducing manual effort.
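As a rough sketch of what such a policy could look like for the two inputs in this example, the snippet below expresses "how often should embeddings update?" as data instead of as a manual decision. The `RefreshPolicy` type, the `needs_refresh` helper, and every threshold are hypothetical values chosen for illustration, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RefreshPolicy:
    max_age: timedelta            # re-embed when embeddings are older than this
    max_changed_fraction: float   # or when more than this share of records changed


# Hypothetical thresholds: play events move fast, track metadata rarely changes.
POLICIES = {
    "play_events": RefreshPolicy(max_age=timedelta(hours=1), max_changed_fraction=0.01),
    "track_metadata": RefreshPolicy(max_age=timedelta(days=7), max_changed_fraction=0.05),
}


def needs_refresh(policy: RefreshPolicy, last_embedded_at: datetime,
                  changed_records: int, total_records: int, now: datetime) -> bool:
    """Return True when either the age limit or the change limit is exceeded."""
    too_old = now - last_embedded_at > policy.max_age
    changed_fraction = changed_records / total_records if total_records else 0.0
    return too_old or changed_fraction > policy.max_changed_fraction


now = datetime(2025, 1, 1, 12, 0)
last_run = now - timedelta(hours=2)

for name, policy in POLICIES.items():
    refresh = needs_refresh(policy, last_run, changed_records=10, total_records=1000, now=now)
    print(name, "refresh" if refresh else "skip")
```

An orchestrator can evaluate a check like this on a schedule for every input, so the refresh cadence follows the declared policy and not someone’s memory.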
In a fully functioning data mesh, “orchestrated policies” are implemented as an inherent capability of every data product. In the data mesh paradigm, data products aren’t just data. Nor are they just items in a catalog pointing to storage, with metadata attached. In the data mesh world, data products include not just the data but also the code that processes it for different use cases, the policy that governs it, the metadata that explains it, all executing as a long running service throughout the lifecycle of the data product. This means the data is continuously governed per your quality expectations, and your organization’s policies.
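One way to picture that bundle of data, code, policy, and metadata is the simplified sketch below. It is an illustration only, not how any particular data mesh platform implements data products: the `DataProduct` class, its fields, and the example policies are hypothetical, and the long-running service is reduced to a policy check on every ingested row.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class DataProduct:
    name: str
    metadata: Dict[str, str]                  # explains the data
    transform: Callable[[Row], Row]           # code that processes it
    policies: List[Callable[[Row], bool]]     # rules that govern it
    rows: List[Row] = field(default_factory=list)
    rejected: List[Row] = field(default_factory=list)

    def ingest(self, raw: Row) -> bool:
        """Transform a row, then accept it only if every policy passes."""
        row = self.transform(raw)
        if all(policy(row) for policy in self.policies):
            self.rows.append(row)
            return True
        self.rejected.append(row)
        return False


def normalize(raw: Row) -> Row:
    return {**raw, "title": str(raw.get("title", "")).strip()}


def has_title(row: Row) -> bool:       # data quality expectation
    return bool(row["title"])


def no_email_field(row: Row) -> bool:  # toy stand-in for a PII policy
    return "email" not in row


tracks = DataProduct(
    name="tracks",
    metadata={"owner": "catalog-team", "description": "Music track metadata"},
    transform=normalize,
    policies=[has_title, no_email_field],
)

results = [
    tracks.ingest({"id": 1, "title": "  Blue in Green "}),
    tracks.ingest({"id": 2, "title": ""}),
    tracks.ingest({"id": 3, "title": "So What", "email": "a@example.com"}),
]
print(results, len(tracks.rows), len(tracks.rejected))
```

Because the policies travel with the product, any consumer of `tracks`, including an embedding job, only ever sees rows that passed them.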
Single data product and pipeline challenge
Because of the intensive focus on GenAI in most organizations, project owners often take a shortcut and provision specialized data ops and lifecycle management tooling specific to that project. What seems different about these tools is their support for different modes and formats of data (e.g., vector stores), but they still need to implement data quality, safety, and accuracy standards. In large organizations exploring many AI initiatives, this means multiple special-purpose data stacks.
The result is multiple YADS (Yet Another Data Silo), each of which needs to be maintained and provisioned with new data sources as new GenAI use cases are discovered and explored, and all of which eventually need to be brought under enterprise governance.
Data mesh data products are inherently multimodal, allowing the same data to be served in multiple formats (tabular, unstructured, vector) and in streaming or batch mode. This means that organizations can apply the same governance regime across all data sources and use cases, without slowing down experimentation and innovation.
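A small sketch can show why serving several formats from one product keeps governance in one place. The `TrackProduct` class, its region-based policy, and the hashed toy vectors are all assumptions made for illustration. The point is only that every output passes through the same policy checkpoint.

```python
import csv
import io
import zlib


class TrackProduct:
    """One governed dataset exposed as tabular, document, and vector outputs."""

    def __init__(self, rows, allowed_regions):
        self._rows = rows
        self._allowed = set(allowed_regions)  # hypothetical policy: region allow-list

    def _governed(self):
        # Single policy checkpoint shared by every output format.
        return [row for row in self._rows if row["region"] in self._allowed]

    def as_table(self) -> str:
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["id", "title", "region"],
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(self._governed())
        return buffer.getvalue()

    def as_documents(self) -> list:
        return [f"{row['title']} (track {row['id']})" for row in self._governed()]

    def as_vectors(self, dim: int = 8) -> dict:
        # Toy hashed bag-of-words; a real product would call an embedding model.
        vectors = {}
        for row in self._governed():
            vec = [0] * dim
            for token in row["title"].lower().split():
                vec[zlib.crc32(token.encode("utf-8")) % dim] += 1
            vectors[row["id"]] = vec
        return vectors


product = TrackProduct(
    rows=[
        {"id": 1, "title": "Blue in Green", "region": "us"},
        {"id": 2, "title": "So What", "region": "eu"},
        {"id": 3, "title": "Freddie Freeloader", "region": "xx"},
    ],
    allowed_regions={"us", "eu"},
)

print(product.as_table())
print(product.as_documents())
print(product.as_vectors())
```

Track 3 is excluded from the table, the documents, and the vectors alike, without a separate governance implementation for each format.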
Introducing MeshRAG
To scale RAG pipelines enterprise-wide, we need dynamic solutions. MeshRAG leverages metadata to automate pipeline creation. Instead of manually connecting data sources, the process is dynamic and query-driven. Here’s how it works:
- Pre-Filtering: By selecting only relevant data, MeshRAG reduces costs and improves accuracy.
- Dynamic Pipelines: Automated systems dynamically adapt to changes in data, ensuring embeddings stay updated and compliant.
- Decentralized Flexibility: Teams can create and manage their pipelines without sacrificing centralized oversight.
The result? High-quality answers, enhanced compliance, and enterprise-level scalability.
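To illustrate the idea of metadata-driven pre-filtering, here is a deliberately simple sketch. It is not the MeshRAG implementation: the catalog entries, the tag-matching rule, and the `select_products` and `build_pipeline` helpers are hypothetical, and the real selection logic would be richer. What it shows is a pipeline assembled per query from data product metadata, with no hand-wired sources.

```python
import re

# Hypothetical catalog: each data product publishes descriptive metadata.
CATALOG = [
    {"name": "play_events", "tags": {"listening", "plays", "events"}, "contains_pii": True},
    {"name": "track_metadata", "tags": {"tracks", "artists", "genres"}, "contains_pii": False},
    {"name": "invoices", "tags": {"billing", "payments"}, "contains_pii": True},
]


def select_products(query: str, catalog: list, pii_allowed: bool) -> list:
    """Pre-filter: keep products that are relevant to the query and permitted."""
    terms = set(re.findall(r"[a-z]+", query.lower()))
    selected = []
    for product in catalog:
        if product["contains_pii"] and not pii_allowed:
            continue  # compliance filter comes before relevance
        if terms & product["tags"]:
            selected.append(product["name"])
    return selected


def build_pipeline(query: str, catalog: list, pii_allowed: bool = False) -> list:
    """Assemble pipeline steps only for the selected products."""
    return [
        {"source": name, "steps": ["load", "embed", "index"]}
        for name in select_products(query, catalog, pii_allowed)
    ]


query = "Which genres get the most plays?"
restricted = build_pipeline(query, CATALOG, pii_allowed=False)
full = build_pipeline(query, CATALOG, pii_allowed=True)
print([stage["source"] for stage in restricted])
print([stage["source"] for stage in full])
```

The same question yields a different pipeline depending on what the caller is allowed to see, and irrelevant products such as `invoices` are never embedded or searched at all.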
The takeaway
Whether it’s model training, fine-tuning, or RAG pipelines, your outcomes are only as good as your data. Scaling AI solutions in an enterprise setting demands a strategy that prioritizes data quality, governance, and flexibility. By combining data mesh principles with advanced AI techniques like RAG, organizations can unlock the full potential of their data while staying compliant and efficient.
At the heart of it all is this simple truth: data is everything. If you get that right, the possibilities for machine learning and GenAI are endless.
Update: I’ll be hosting a webinar on this topic next week, on Thursday, January 16th. You can register to join here.