
Your LLM follows instructions in the demo and ignores them by the third real query

Banao fine-tunes language models on your labelled data — teaching the model the exact format, tone, terminology, and judgment calls your product requires, rather than hoping a prompt convinces it to behave.

We run the data curation, the training run, and the evaluation harness that proves the fine-tune actually moved the number you care about. The model we hand back follows your instructions by construction, not by persuasion.

Banao: we fine-tuned model behaviour on our own hiring and demand-gen workflows before we recommended it to any client.

What a Banao fine-tuning engagement includes

A fine-tuned model is a trained artefact, not a configured product. The work is the data, the training discipline, and the evaluation that proves the model changed in the direction you needed.

Task scoping and training objective design

We pin the fine-tune to a single, measurable task with a clear definition of done — because a model trained on an unclear objective just learns to be confident in the wrong direction.

Training data curation and quality filtering

We audit, clean, and structure your labelled examples, remove contradictions, and balance the set, because mislabelled or contradictory examples in the training corpus resurface as model errors that are hard to trace back to their source.

Supervised fine-tuning (SFT)

Full fine-tuning and instruction fine-tuning on open-weight models, using your examples to teach the model the output shape, tone, and behaviour your product needs — not a generic approximation of it.

Parameter-efficient fine-tuning (LoRA / QLoRA)

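Low-rank adaptation that freezes the base model and trains a small set of added low-rank matrices: a fraction of the trainable parameters at a fraction of the compute cost. QLoRA does the same on top of a quantised base model to cut memory further. Both are practical for large models where full fine-tuning would require infrastructure you don't need for any other purpose.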

Domain terminology and vocabulary adaptation

Fine-tuning the model's vocabulary and phrasing on your domain's real language — so medical, legal, financial, or industrial terminology comes out the way your reviewers expect, not the way a general model approximates it.

Output format and schema enforcement

Teaching the model to always return the structure your application expects — JSON with specific fields, a consistent report format, or a classification from a fixed set — so downstream code doesn't have to defensively parse every response.
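One way to check whether a fine-tune actually holds the output format is to score schema adherence over a batch of raw responses. The sketch below is a minimal illustration, not a production harness: the field names (`label`, `confidence`), the allowed label set, and the function names are all hypothetical.

```python
import json

# Hypothetical schema: required fields and the Python type each must decode to.
REQUIRED_FIELDS = {"label": str, "confidence": float}
# Hypothetical fixed label set for a classification task.
ALLOWED_LABELS = {"low_risk", "medium_risk", "high_risk"}


def is_valid(raw: str) -> bool:
    """Return True if raw parses as JSON and matches the expected structure."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(field), expected_type):
            return False
    return obj["label"] in ALLOWED_LABELS


def adherence_rate(responses: list[str]) -> float:
    """Fraction of responses that satisfy the schema."""
    if not responses:
        return 0.0
    return sum(is_valid(r) for r in responses) / len(responses)
```

Running `adherence_rate` on the base model's and the fine-tuned model's responses to the same prompts gives a before and after figure for format compliance.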

Preference alignment (RLHF / DPO)

When SFT alone doesn't shape the model's judgment precisely enough — such as ranking responses by quality, avoiding a specific class of error, or producing outputs your reviewers consistently prefer — we apply preference optimisation on your own feedback data.
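For reference, the per-pair DPO objective fits in a few lines. This is a sketch of the standard DPO loss computed on pre-computed sequence log-probabilities, not a training loop; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair.

    Each *_logp is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    # How much more the policy favours each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as softplus(-logits) for numerical stability.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls below log 2 when the policy prefers the reviewer-chosen response more strongly than the reference model does, and rises above it when the preference points the wrong way.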

Evaluation harness and regression testing

A held-out test set built from your real cases, scored before and after training, run on every change — so you know whether the fine-tune improved the metric you care about or quietly broke something it used to handle.

When fine-tuning earns its cost — and when it does not

Most requests we get for fine-tuning turn out to be retrieval problems in disguise. If the answer depends on a document the model wasn't trained on, no fine-tune fixes that — you need RAG. Fine-tuning teaches a model a behaviour, a format, a skill; it does not give it facts it has never seen. Reaching for a training run to fix a knowledge gap is the most common and expensive mistake in this space.

Where fine-tuning genuinely earns its keep is narrower: consistent output format the model otherwise violates, domain tone or vocabulary that prompt engineering can't hold reliably, a classification task where the base model gets the edge cases wrong. We will tell you on the first call which bucket your problem falls into — and we will not run a training job on a problem that prompt engineering or retrieval would solve for less.

Fine-tuning vs prompting

A well-written system prompt changes a model's default behaviour without training. Fine-tuning changes it by construction. The test: if the behaviour you need breaks when the model is under load or receives an off-script input, prompting is not holding it and fine-tuning is the right next step.

Fine-tuning vs retrieval

RAG is for knowledge — facts from documents the model wasn't trained on. Fine-tuning is for behaviour — format, tone, judgment. Most problems that look like the model doesn't know your domain are retrieval problems; most problems that look like the model can't follow your output rules are fine-tuning problems.

Fine-tuning vs a larger model

A fine-tuned small model frequently beats a general large model on a narrow task, at a fraction of the inference cost. We scope the task tightly before choosing a base model, so you are not paying for capability you will never use.

The data bar is the real bar

A fine-tune can only learn what the training data actually shows. If your examples contradict each other, include edge cases without labels, or cover only the easy portion of the distribution, the model will learn those gaps faithfully. Data quality is not a pre-step — it is the deliverable.
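A first-pass contradiction check can be as simple as grouping examples by input and flagging any input that carries more than one label. The sketch below assumes examples arrive as hypothetical `(text, label)` pairs and only catches duplicates that match after whitespace and case normalisation; near-duplicates need fuzzier matching.

```python
from collections import defaultdict


def find_contradictions(examples: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each normalised input to its labels, keeping only inputs with more than one."""
    labels_by_input: dict[str, set[str]] = defaultdict(set)
    for text, label in examples:
        # Light normalisation so trivial whitespace and case differences still collide.
        key = " ".join(text.lower().split())
        labels_by_input[key].add(label)
    return {k: v for k, v in labels_by_input.items() if len(v) > 1}
```

Every input this returns needs a human decision before training: the model cannot learn both labels for the same text.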

What the fine-tuning process looks like in practice

Fine-tuning is often described as a training run. In practice, the training run is the short part. The work before it — mapping the task, sourcing and cleaning examples, defining what a correct output actually is, and building the evaluation set — takes longer than the GPU hours and decides whether the resulting model is useful.

We treat data curation and evaluation design as the critical path, not the setup. A training run that starts with a clean, well-labelled set and ends with a scored evaluation is repeatable and improvable. One that starts with a bulk export and ends with a vibe check is not.

Map the task before labelling

We work with your domain experts to define what a good output looks like — including the ambiguous cases where a labeller could go either way — before any data is processed. Disagreement between labellers on a portion of examples is normal; leaving that disagreement unlabelled teaches the model the wrong thing.
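Labeller disagreement can be measured before anything is trained. The sketch below computes Cohen's kappa for two labellers and lists the items they split on; it assumes both labelled the same items in the same order, and the function names are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two labellers over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance if each labeller kept their own label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(labels_a: list[str], labels_b: list[str]) -> list[int]:
    """Indices of items the two labellers labelled differently."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
```

The indices from `disagreements` are the examples to take back to the domain experts for an explicit ruling.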

Hold out the hard cases

The evaluation set is built before training begins, from the cases the model is most likely to get wrong — not a random slice of the training distribution. A model that scores well on easy cases and fails the hard ones is not a model you can trust in production.

Measure the delta, not the score

The only number that matters is how much the fine-tune moved the metric relative to the base model on the same eval set. An absolute accuracy figure without a baseline tells you nothing — the base model may have already scored similarly.
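In code, the comparison is small. The sketch below assumes each model is wrapped as a plain function from input text to output text and scored by exact match; the names `accuracy` and `delta_report` are hypothetical, and a real task may need a different scoring rule.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (input text, expected output)


def accuracy(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Exact-match accuracy of a model function over the eval cases."""
    if not cases:
        raise ValueError("eval set is empty")
    return sum(model(x) == expected for x, expected in cases) / len(cases)


def delta_report(
    base: Callable[[str], str],
    tuned: Callable[[str], str],
    cases: list[EvalCase],
) -> dict[str, float]:
    """Score both models on the same cases and report the difference."""
    base_acc = accuracy(base, cases)
    tuned_acc = accuracy(tuned, cases)
    return {"base": base_acc, "tuned": tuned_acc, "delta": tuned_acc - base_acc}
```

Both models see exactly the same cases, so the `delta` field is attributable to the fine-tune rather than to a change in the test set.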

Re-run evals on every change

A fine-tuned model that gets an update — new training data, a prompt change, a base model version bump — must re-run the full evaluation set before it ships. We build the eval harness as a deliverable so your team can run it without re-engaging us.
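A release gate on top of the eval set can be sketched as a per-case comparison between the shipped version and the candidate. The data shape here (a hypothetical mapping from case ID to pass or fail) and the `max_regressions` threshold are assumptions for illustration.

```python
def regressions(previous: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case IDs that passed under the previous version and fail under the candidate."""
    return sorted(
        case_id
        for case_id, passed_before in previous.items()
        # A case missing from the candidate run is treated as a failure.
        if passed_before and not candidate.get(case_id, False)
    )


def can_ship(
    previous: dict[str, bool],
    candidate: dict[str, bool],
    max_regressions: int = 0,
) -> bool:
    """Allow the release only if regressions stay within the agreed budget."""
    return len(regressions(previous, candidate)) <= max_regressions
```

Comparing per case rather than by aggregate score matters: a candidate can keep the same overall accuracy while breaking a case the previous version handled.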

Fine-tuned models already in production

Metrics shown dotted (··) are being finalised in our case-study metrics pack — published only once verified. The deployments are live.

Legal-tech platform (anonymized)

A contract analysis model fine-tuned on clause-level labels

  • ··% clause classification accuracy on held-out test set
  • ··× faster than in-house manual review

We fine-tuned an open-weight model on labelled contract clauses — risk ratings, clause type, and required redlines — so the platform could surface the same judgment a senior reviewer would reach, on every document, in seconds. The evaluation harness ran on the same held-out clause set before and after training so the improvement was a measured number, not a demo impression.

Healthcare communications firm (anonymized)

Clinical-tone fine-tune for patient-facing content generation

  • ··% outputs accepted by clinical reviewers without edits
  • ·· min median review time per document

The base model produced fluent content that violated clinical tone guidelines on roughly one in three outputs. We fine-tuned on approved examples curated with the clinical team, then scored the model on a blind review set before it reached any patient-facing content — measuring reviewer acceptance rate rather than a perplexity score.

We fine-tune on our own operations before we recommend it

Banao is a ~300-person engineering company, and the models that screen our applicants and generate our outreach are models we fine-tuned or adapted on our own labelled data. InterviewGod evaluates candidates against role-specific criteria we defined and labelled ourselves; Vikaas generates demand-gen content in the Banao voice, held to a tone standard we built from real approved examples.

That means every fine-tuning decision we make for a client — how much data is enough, where SFT ends and preference alignment begins, how to design an evaluation set that will catch the failures that matter — is a decision we have already made for ourselves and had to live with.

  • InterviewGod: evaluates Banao's own applicants against fine-tuned, role-specific criteria.
  • Vikaas: generates Banao's own demand-gen content in a fine-tuned, consistent voice.

When fine-tuning is the wrong call

Fine-tuning is the right answer less often than the hype implies, and we will say so before you allocate a budget to a training run:

  • The problem is a knowledge gap: if the model answers incorrectly because it wasn't trained on your documents, a fine-tune won't help — you need retrieval over those documents, not a training run.
  • You don't have labelled data: fine-tuning on unlabelled text or on examples you haven't reviewed makes the model more confident, not more correct. Labelling is a prerequisite, not a shortcut.
  • A prompt change would fix it: if the behaviour you need is achievable with a clear system prompt and doesn't break under load, the training cost is not worth the marginal improvement.
  • The volume doesn't exist: if the task is rare enough that you can't build a meaningful held-out test set, you can't evaluate the fine-tune — and an unevaluated fine-tune is one you can't trust.
  • The base model just changed: if your provider updated the model you're running on, retraining is often faster than debugging why the old fine-tune no longer holds on the new base.

How we start — define the task before the training run

A fine-tune that starts without a clear task definition and evaluation set is a training run that produces an opaque, hard-to-improve artefact. We fix the task first.

  1. AI Discovery Sprint (2 weeks · fixed price)

    We scope the fine-tuning task, audit your candidate training data, build a small evaluation set, and run a baseline against the unmodified model — so you know the accuracy gap the fine-tune needs to close before you pay for the training run. Sprint cost is credited against the build if you proceed.

  2. Build

    Data curation, training run, evaluation harness, and a handover package — the trained model, the eval suite, and the documentation your team needs to extend it without re-hiring us.

  3. Production & ongoing improvement

    Deployment behind your own infrastructure, monitoring on the metrics we defined in the Sprint, and a process for incorporating new labelled examples so the model improves on real failures rather than drifting.

Frequently asked questions

Fine-tuning updates a pre-trained model's weights on a curated set of labelled examples from your specific task. It changes the model's behaviour — its output format, tone, domain vocabulary, or a classification skill — by construction, rather than by instruction at inference time. The result is a model that applies the learned behaviour consistently, including on inputs a prompt might not handle well.

RAG (retrieval-augmented generation) gives the model facts it doesn't have by retrieving your documents at answer time. Fine-tuning changes how the model behaves — its format, tone, or a domain skill. If the answer depends on a document the model hasn't seen, use RAG. If the model gets the output wrong even when it has the right information, fine-tuning is the right step. Most mature systems use both.

How much data you need depends on the task complexity, the base model, and the technique. Parameter-efficient methods like LoRA can show measurable improvement with a few hundred clean examples on a narrow task; a full fine-tune for complex behaviour may need tens of thousands. The quality filter matters more than the count: in our experience, 500 carefully reviewed examples outperform 5,000 that include contradictions. We audit your candidate data and give you a realistic estimate before any training run.

LoRA (low-rank adaptation) fine-tunes a small set of added weight matrices rather than the full model, achieving comparable results at a fraction of the compute cost. For most enterprise tasks it is the practical choice: you get a task-specific model without GPU infrastructure sized for a full training run, and the adapter is small enough to swap quickly when the task changes.
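The compute saving follows from simple arithmetic: for one weight matrix, LoRA trains two thin matrices instead of the full one. The sketch below counts trainable parameters for a single square layer; the hidden size and rank are hypothetical values chosen for illustration, not a recommendation.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in); the original matrix stays frozen.
    """
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole matrix is updated."""
    return d_out * d_in


d = 4096  # hypothetical hidden size
r = 8     # hypothetical LoRA rank
lora = lora_trainable_params(d, d, r)
full = full_params(d, d)
print(f"{lora:,} vs {full:,} trainable parameters ({lora / full:.2%})")
# prints: 65,536 vs 16,777,216 trainable parameters (0.39%)
```

The same ratio explains why adapters are cheap to store and swap: only the two thin matrices per adapted layer need to be saved.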

Fine-tuning does not fix hallucination by itself, and not reliably. Hallucination often gets worse after fine-tuning if the training examples reward fluent-sounding outputs rather than accurate ones. The controls that reduce hallucination are: grounding the model's answers in retrieved sources (RAG), building an evaluation set that scores faithfulness, and adding an abstention path so the model says it doesn't know rather than inventing. Fine-tuning can reduce certain error patterns on the narrow task it was trained for, but it is not a hallucination fix.

We build a held-out evaluation set — drawn from real cases, including the edge cases the model is most likely to get wrong — before training begins. The base model is scored on that set; the fine-tuned model is scored on the same set; the delta is the result. An absolute score without a baseline tells you nothing. We hand over the eval harness as a deliverable so your team can re-run it on every future change.

We work with both open-weight and API-hosted models. With open-weight models (Llama, Mistral, Phi, Qwen, and others) we control the full training process and the resulting weights are yours. For API-only providers that expose a fine-tuning endpoint, we can manage the training run through that API. We will tell you honestly when an open-weight fine-tune is likely to outperform an API-provider fine-tune for your task, which is often the case for narrow, high-frequency enterprise applications.

A two-week Discovery Sprint covers task scoping, data audit, and baseline evaluation. The build — data curation, training, eval harness, and handover — is typically 4–8 weeks depending on data volume, task complexity, and the technique chosen. Cost drivers are: size of the base model, training method (LoRA vs full), data curation effort, and the breadth of the evaluation set. The Sprint pins all of these before you commit to the build.

Tell us the output your model keeps getting wrong

In 45 minutes we will tell you whether fine-tuning is the right intervention, or whether retrieval, a prompt change, or a different base model would fix it faster. If fine-tuning is the answer, we scope the data requirement and the evaluation plan before you commit.

Book a 45-min scoping call