Before You Deploy an LLM, Fix Your Data Foundation
Why clean, accessible, and well-governed data is a prerequisite for Large Language Model success.
Chandra Rau
Founder & CEO
LLMs are only as good as the data they are grounded in. Without a strong data foundation, you are simply automating misinformation at scale. Before a single line of model code is written, organisations must confront an uncomfortable truth: their data estate is almost certainly not ready.
Why Data Quality Outweighs Model Choice
In our experience advising APAC enterprises, the decision of which LLM to deploy -- GPT-4, Gemini, Claude, or an open-source alternative -- consumes far more executive bandwidth than it deserves. A world-class model grounded in poorly structured, duplicated, or outdated enterprise data will consistently underperform a simpler model grounded in clean, semantically rich data. Model selection is a week-long decision; data remediation is a six- to twelve-month programme.
The Data Governance Prerequisites
- Data ownership: Every dataset must have a named business owner accountable for its accuracy and timeliness.
- Access control taxonomy: Role-based access policies must be defined before AI systems ingest sensitive records.
- Data lineage documentation: Know where every data point originates, how it has been transformed, and where it flows.
- Retention and deletion policies: AI training data must comply with PDPA and sector-specific regulations before model ingestion.
- Golden record strategy: Resolve entity duplication across CRM, ERP, and operational systems before constructing RAG pipelines.
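To make the golden-record point concrete, here is a minimal sketch of survivorship-based entity resolution. The field names (`name`, `updated`) and the match rule (normalised name, keep the most recently updated row) are illustrative assumptions -- real programmes use richer matching and survivorship rules.

```python
def normalise(name: str) -> str:
    """Illustrative match key: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())

def build_golden_records(records: list[dict]) -> list[dict]:
    """Merge duplicate rows into one golden record per normalised name.

    Survivorship rule (an assumption for this sketch): the most
    recently updated record wins. Dates are ISO strings, so plain
    string comparison orders them correctly.
    """
    golden: dict[str, dict] = {}
    for rec in records:
        key = normalise(rec["name"])
        current = golden.get(key)
        if current is None or rec["updated"] > current["updated"]:
            golden[key] = rec
    return list(golden.values())
```

The same idea scales up with fuzzy matching and per-field survivorship, but the principle holds: resolve duplicates before the RAG pipeline ever sees the data.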
Metadata Management and Data Catalogs
A data catalog is not a luxury for organisations deploying LLMs -- it is a prerequisite. When an agentic system queries your enterprise knowledge base, it relies entirely on metadata to determine relevance, recency, and authority. Without a catalog like Apache Atlas, Collibra, or Alation, your LLM cannot distinguish between a superseded pricing policy from 2021 and the current approved version. The result is confident, plausible, and dangerously wrong outputs.
Data Quality Metrics That Matter
- Completeness: What percentage of required fields are populated across critical datasets?
- Accuracy: What proportion of records have been validated against a source of truth within the last 90 days?
- Consistency: Are the same entities represented identically across systems -- same naming conventions, same identifiers?
- Timeliness: What is the average lag between a real-world event and its reflection in your data systems?
- Uniqueness: What is your duplicate record rate across customer, product, and transactional tables?
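Two of these metrics -- completeness and uniqueness -- reduce to simple counting, which makes them a good first baseline. A sketch over rows represented as dicts (field names are illustrative):

```python
def completeness(rows: list[dict], required: list[str]) -> float:
    """Fraction of required fields that are populated across all rows."""
    filled = sum(
        1 for row in rows for field in required
        if row.get(field) not in (None, "")
    )
    return filled / (len(rows) * len(required))

def duplicate_rate(rows: list[dict], key: str) -> float:
    """Fraction of rows whose key value repeats an earlier row's."""
    keys = [row[key] for row in rows]
    return 1 - len(set(keys)) / len(keys)
```

Accuracy and timeliness need external reference data and event timestamps, so they cannot be computed from the table alone -- but these two can run as automated checks on every ingest.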
Cleaning Pipelines for LLM Readiness
Preparing data for LLM consumption is substantively different from preparing it for traditional BI reporting. The pipeline must handle unstructured content -- PDFs, email threads, meeting transcripts, scanned documents -- and convert it into coherent, chunk-sized text units with rich metadata tags. We recommend a four-stage pipeline: ingest and normalise, deduplicate and resolve entities, enrich with metadata, and version-control the resulting knowledge corpus.
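The four stages above can be sketched as composable functions. This is a deliberately minimal illustration -- real pipelines normalise far more than whitespace and deduplicate on more than exact content hashes -- but the shape is the point:

```python
import hashlib

def ingest_and_normalise(raw_docs: list[str]) -> list[str]:
    # Stage 1: collapse whitespace so identical content hashes identically.
    return [" ".join(doc.split()) for doc in raw_docs]

def deduplicate(docs: list[str]) -> list[str]:
    # Stage 2: drop exact duplicates, preserving first-seen order.
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def enrich(docs: list[str], source: str) -> list[dict]:
    # Stage 3: attach metadata to every text unit (minimal here).
    return [{"text": doc, "source": source} for doc in docs]

def version_corpus(chunks: list[dict]) -> dict:
    # Stage 4: content-addressed version tag for the whole corpus.
    digest = hashlib.sha256(repr(chunks).encode("utf-8")).hexdigest()[:12]
    return {"version": digest, "chunks": chunks}

def run_pipeline(raw_docs: list[str], source: str) -> dict:
    """Ingest/normalise -> deduplicate -> enrich -> version-control."""
    return version_corpus(enrich(deduplicate(ingest_and_normalise(raw_docs)), source))
```

The content-addressed version tag in stage 4 means any downstream index or embedding set can be traced back to the exact corpus snapshot it was built from.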
"You cannot build a trustworthy AI system on untrustworthy data. The model will faithfully reproduce every error, bias, and gap you have failed to address."
— Chandra Rau
LLM-Specific Data Requirements
Retrieval-Augmented Generation (RAG)
RAG pipelines require your data to be chunked intelligently -- not split arbitrarily at token boundaries, but divided at semantic boundaries that preserve meaning. Each chunk must carry metadata including source system, author, last-modified date, and document type. Vector embeddings must be regenerated whenever the source document is updated. For Malaysian enterprises in regulated sectors such as banking and healthcare, this pipeline must also enforce document-level access control so that an agent cannot surface data to a user who lacks clearance.
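A minimal sketch of semantic-boundary chunking: paragraphs are never split mid-thought, and each chunk carries its source metadata forward. The paragraph delimiter and size cap are assumptions for illustration; production chunkers also handle headings, tables, and overlap.

```python
def chunk_by_paragraph(text: str, meta: dict, max_chars: int = 800) -> list[dict]:
    """Split text at paragraph boundaries (blank lines), packing
    paragraphs into chunks of at most ~max_chars, and stamping every
    chunk with the document's metadata (source, author, date, ...)."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Flush the current chunk before it would exceed the cap.
        if current and len(current) + len(para) > max_chars:
            chunks.append({"text": current.strip(), **meta})
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append({"text": current.strip(), **meta})
    return chunks
```

Because the metadata travels with each chunk, a retriever can filter on last-modified date or enforce document-level access control before anything reaches the model.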
Fine-Tuning Requirements
- Volume: Effective fine-tuning typically requires a minimum of 1,000 to 10,000 high-quality, domain-specific examples.
- Consistency: Training data must reflect the exact tone, format, and terminology you expect from the model in production.
- Bias audit: Before fine-tuning, datasets must be audited for demographic, linguistic, and operational bias.
- Versioning: Every fine-tuning dataset version must be stored with its corresponding model checkpoint for reproducibility and compliance.
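The versioning requirement can be anchored with a deterministic dataset fingerprint, stored alongside each model checkpoint. A minimal sketch:

```python
import hashlib
import json

def dataset_fingerprint(examples: list[dict]) -> str:
    """Deterministic SHA-256 fingerprint of a fine-tuning dataset.

    sort_keys makes the hash independent of dict key order, so the
    same examples always yield the same fingerprint -- any trained
    checkpoint can then be traced to its exact training data.
    """
    payload = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Recording this fingerprint in the model registry next to the checkpoint gives auditors a verifiable link between a deployed model and the data it was tuned on.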