Technology Apr 18, 2026 · 16 min read

RAG in Practice — Part 5: Build a RAG System in Practice


DEV Community
by Gursharan Singh

Part 5 of 8 — RAG Article Series

← Part 4: Chunking, Retrieval, and the Decisions That Break RAG · Part 6 (publishing soon)

Why This Article Is Different

By now, you already know what a RAG pipeline is.

Part 3 gave you the full pipeline. Part 4 showed how chunking and retrieval decisions break that pipeline in practice. This article does something different: it shows what that pipeline does when it meets real documents.

The code is in the repo. You can read it in a few minutes, run it, and even generate your own version with modern tools. What is harder to see — and what this article is for — is what actually happens when a pipeline processes documents with different shapes.

That is the real skill.

A return policy is not a changelog. A numbered troubleshooting guide is not an HTML table. If your documents have different shapes, they stress different parts of the pipeline. Some pass through almost untouched. Some break at chunk boundaries. Some retrieve the wrong thing even when chunking looks reasonable. Some fail before chunking even starts because parsing already lost the structure.

So this article is not organized around functions like load, chunk, embed, and retrieve. It is organized around document categories.

We will walk through four document types from a small TechNova support corpus. For each one, we will look at what kind of document it is, what the pipeline does to it, what works, what breaks, and what decision that teaches for your own documents.

If you want to see the code run first, do that. Then come back here. The rest of this article is designed to make sense of what you saw.

The Corpus and How to Run It

We are still using the same TechNova corpus from earlier parts, but now the important thing is not just that it exists. The important thing is that each file represents a different document shape.

| Document category | Example file | Approx. size | What it represents |
| --- | --- | --- | --- |
| Short policy-style docs | return-policy.md, warranty-terms.md | ~249–350 words | Short markdown documents with self-contained business rules |
| Procedural docs | troubleshooting-guide.md | ~1,089 words | Step-by-step support instructions under headings |
| Versioned updates | firmware-changelog.md | 3 version entries | Near-duplicate release notes that are semantically distinct |
| Structured content | product-specs.html | HTML table | Product specs stored as structured markup, not prose |

The baseline implementation uses Python, the OpenAI embeddings API, and ChromaDB. The full working code is in the companion repo. Run part5_rag.py to see the same behaviors described below.

The baseline is intentionally simple — recursive chunking, vector-only retrieval, no reranking — so that the failure modes stay visible rather than hidden behind optimizations.
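The recursive chunking the baseline uses can be sketched in a few lines. This is a simplified sketch, not the repo's actual code: the chunk size and separator order here are illustrative defaults.

```python
# Minimal sketch of recursive chunking: try the coarsest separator first,
# recurse into pieces that are still too large, and fall back to a hard
# character cut when no separator is left.
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", " ")):
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator not present; try a finer one
        chunks, current = [], ""
        for part in parts:
            candidate = (current + sep + part) if current else part
            if len(candidate) <= chunk_size:
                current = candidate  # keep merging small pieces
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        # A single part may itself exceed chunk_size, so recurse.
        return [piece for chunk in chunks
                for piece in recursive_split(chunk, chunk_size, separators)]
    # No separator appears at all: cut at the character level.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Note what this strategy knows about: separators and sizes. It knows nothing about procedures, version entries, or tables, which is exactly why the failure modes below appear.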

RAG Pipeline: The Baseline You Are Running

Watch the output: how many chunks each file creates, what gets retrieved for each question, and where the answers feel solid or strange.

If you have already done that, the rest of this article should feel like retroactive explanation. If you have not, the examples below still show the important parts.

Short Policy-Style Documents

Start with the easiest category.

TechNova's return policy and warranty terms are short, clean markdown files. They have headings, short paragraphs, and business rules that mostly stay together. This is the kind of content many teams start with, and it is also the kind of content that makes naive RAG look better than it really is.

From return-policy.md:

# TechNova Return Policy

TechNova offers a 15-day return window on all products purchased
directly from TechNova or through authorized retailers. The return
period begins on the date of delivery, not the date of purchase.

## Eligibility

To be eligible for a return, the product must be in its original
packaging with all included accessories, cables, and documentation.

From warranty-terms.md — notice the similar shape:

# TechNova Warranty Terms

TechNova products are covered by a limited warranty from the date
of original purchase. This warranty applies to products purchased
from TechNova directly or through authorized retailers.

When the baseline pipeline sees documents like these, very little happens. Even when they are split across multiple chunks, the content stays self-contained. Each chunk is a complete policy rule or section — headings, bullet points, or short paragraphs that already carry their own meaning. Embeddings capture them cleanly. Retrieval is straightforward. Generation usually has enough context to answer correctly.

That is why these documents feel easy.

If a user asks about TechNova's return policy, the retriever surfaces a chunk — or a couple of adjacent chunks — that together contain the full rule. The model does not have to reconstruct a scattered answer from fragments. The document's natural structure did most of the work.

This is the class of document where naive RAG mostly behaves.

The caution is smaller here. If you have several short policy-style documents that overlap in vocabulary and intent, retrieval can still surface a chunk from the neighboring document, such as a warranty clause for a returns question. But that is a secondary concern, not the main lesson of this section.

The lesson from short policy-style documents is simple: not every document needs aggressive chunking. Sometimes the right design decision is to do less.

Takeaway: For short self-contained documents, chunking barely matters — but duplication across them can still confuse retrieval.

Procedural Troubleshooting Documents

This is where things get more interesting.

The troubleshooting guide is long enough to force multiple chunks, and its meaning depends on order. That makes it a very different shape from a short policy file.

From troubleshooting-guide.md — the Bluetooth reset procedure:

## Bluetooth Connection Issues

If your TechNova headphones will not connect or keep disconnecting
from your device, follow these steps:

1. Open Settings → Bluetooth on your device.
2. Forget "WH-1000" from saved devices.
3. On the WH-1000, hold the power button for 7 seconds until the
   LED flashes blue.
4. Select "WH-1000" when it appears in your device's Bluetooth list.
5. Wait for "Connected" confirmation before playing audio.

If the headphones still disconnect intermittently, check that you
are within 10 meters of the connected device with no major
obstructions.

A troubleshooting guide is not just support text. It is a sequence. Step 1 exists because Step 2 comes after it. Step 4 only makes sense if the reader already completed Step 3.

That is why procedural content stresses chunking differently.

With the baseline pipeline, the file is split into multiple chunks. On paper, that sounds reasonable. The file is too long, so chunk it. But the question is not whether to chunk. The question is whether the chunk boundaries respect the procedure.

If the split happens in the middle of a five-step fix, the reader may retrieve only part of the instructions.

Here is what that looks like concretely:

Chunk 1 ends with:

2. Forget "WH-1000" from saved devices.

Chunk 2 begins with:

3. On the WH-1000, hold the power button for 7 seconds until
   the LED flashes blue.
4. Select "WH-1000" when it appears in your device's Bluetooth list.

Chunks include some overlap with the previous chunk, so in the code's output each chunk begins with a short repeat of earlier text. The boundary that matters for retrieval is where the chunk's new content begins.
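What overlap does can be sketched with a hypothetical add_overlap helper (not the repo's actual implementation): it repeats trailing context at each boundary, but it cannot restore a procedure that was split structurally.

```python
def add_overlap(chunks, overlap_chars=50):
    """Prefix each chunk with the tail of the previous chunk.
    overlap_chars is illustrative; real splitters make it configurable."""
    out = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        out.append(prev[-overlap_chars:] + cur)
    return out
```

If the split landed between step 2 and step 3, a 50-character overlap copies the end of step 2 into the next chunk. It does not put step 3 back into the chunk that retrieval actually surfaced.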

If retrieval surfaces only Chunk 1, the user gets steps 1 and 2 — enough to feel like an answer. But step 3 is the actual reset action. Without holding the power button for 7 seconds, the headphones do not enter pairing mode. The user forgets the device, never re-pairs it, and concludes the troubleshooting did not work.

Each chunk carries the source filename as metadata, so the retriever knows which document a chunk came from — but it does not know whether the chunk represents a complete unit within that document.

That is the real danger.

Imagine a question like: "My WH-1000 keeps disconnecting from Bluetooth. What should I do?"

The retriever might bring back a chunk that contains only the first part of the reset procedure and miss the rest. The answer still sounds useful. It still sounds plausible. But it becomes a partial procedure — a half-fix.

That is worse than a clearly wrong answer because it feels complete.

This is the key decision: for procedural content, chunk boundaries matter more than chunk size.

A common instinct is to make chunks bigger. Sometimes that helps a little. But it does not solve the real issue. The real issue is that the splitting strategy is not aware that a procedure is a unit.

If your pipeline treats paragraph boundaries as good-enough structure, but the document's real structure is procedure blocks, you will eventually hand your user half-instructions.

What works here: the retriever can still find the right topic, and the guide is rich enough to answer support questions.

What breaks: procedures can split across chunks, generation can sound correct while returning incomplete steps, and overlap does not fully solve a bad structural split.

What this teaches for your own documents: if your content depends on sequence, your chunking has to respect sequence. Headings, numbered lists, procedure blocks, and task units matter more than arbitrary size ceilings.
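One structure-aware option is to split on the document's own headings so that each procedure stays whole. A minimal sketch, assuming markdown input with ## section headings (the split_on_headings helper is hypothetical, not from the repo):

```python
import re

def split_on_headings(markdown_text):
    """Split a markdown document at each '## ' heading, so every
    section (e.g. one troubleshooting procedure) is a single chunk."""
    # Zero-width lookahead split: the heading stays with its section.
    sections = re.split(r"(?m)^(?=## )", markdown_text)
    return [s.strip() for s in sections if s.strip()]
```

A real implementation would still need a size cap for very long sections, but the default unit becomes the procedure, not an arbitrary character count.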

Takeaway: For procedural content, chunking has to respect the structure the content depends on — or the pipeline hands your reader half-instructions.

Versioned Changelogs

At first glance, changelogs look simple.

They are short. They are structured. Each version is clearly labeled. Compared to a long troubleshooting guide, they seem much easier.

That appearance is misleading.

From firmware-changelog.md — two adjacent version entries:

## Version 3.2.1 — Released 2026-02-15

Bug fixes and stability improvements.

- Fixed an issue where ANC would occasionally produce a brief
  clicking sound when toggling between High and Low modes.
- Improved Bluetooth reconnection speed after the headphones exit
  sleep mode.

## Version 3.1.0 — Released 2025-11-01

Performance improvements and new features.

- Added Bluetooth multipoint support: the WH-1000 can now maintain
  simultaneous connections with two devices.
- Fixed a Bluetooth stability issue where the headphones would
  disconnect from certain Android 14 devices after exactly 30
  minutes of continuous playback.

This is one of the most dangerous document shapes in RAG because the entries are distinct in meaning but similar on the surface. Each version talks about updates, fixes, firmware, stability, and improvements. The retriever sees strong similarity across entries even when the versions should stay separate.

That makes questions like this tricky: "What changed in the latest firmware update?"

The user wants one thing: the latest version.

But the retriever may surface chunks from multiple versions because they all look relevant in embedding space. They all mention firmware. They all mention changes. They all sound like neighbors.

When retrieval returns two or three similar version entries together, the model has to sift signal from noise — and without reranking or metadata constraints, first-pass vector search is often too generous to be useful here.

Then generation does what generation often does with overlapping evidence: it blends.

Now the answer can quietly combine version 3.0.0, 3.1.0, and 3.2.1 into a single confident response that never existed in the source material.

That is the changelog trap.

What works: a query about a specific version number usually gives the retriever a stronger target, and versioned entries are compact and easy to isolate if chunked correctly.

What breaks: "latest update" is semantically broad, multiple similar version entries become embedding neighbors, and the model receives blended context and produces blended answers.

The important lesson here is not "make the embedding model better." It is: when documents are near-duplicates by design, retrieval needs help understanding the boundaries that matter.

That help can come from chunking each version as its own unit, preserving version numbers explicitly, using exact-match retrieval signals like BM25, and filtering or reranking by version metadata.
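Two of those fixes, per-version chunking and explicit version metadata, can be sketched together. The chunk_changelog and latest_entry helpers below are hypothetical, not the repo's code; the point is that "latest" is resolved from metadata, not from embedding similarity.

```python
import re

def chunk_changelog(markdown_text):
    """Split a changelog so each '## Version X.Y.Z' entry is one chunk,
    tagged with its version number as metadata."""
    entries = re.split(r"(?m)^(?=## Version )", markdown_text)
    chunks = []
    for entry in entries:
        m = re.match(r"## Version (\d+\.\d+\.\d+)", entry)
        if m:
            chunks.append({"version": m.group(1), "text": entry.strip()})
    return chunks

def latest_entry(chunks):
    """Answer 'latest' deterministically by comparing version tuples."""
    return max(chunks, key=lambda c: tuple(int(p) for p in c["version"].split(".")))
```

With metadata like this attached, a vector store that supports filters (Chroma's where clause, for instance) can restrict retrieval to one version before similarity scoring ever runs.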

The document shape itself is the issue. It looks neat and structured, but its surface similarity hides the boundaries the user actually cares about.

Takeaway: When documents are near-duplicates by design — versions, changelogs, revisions — naive retrieval blends them, and the answer the user gets may be confidently wrong.

Structured HTML and Tables

Now look at a very different failure mode.

The product specs file is not a prose document at all. It is structured content stored as HTML.

That matters immediately.

From product-specs.html — raw HTML as the pipeline receives it:

<table border="1">
  <thead>
    <tr>
      <th>Specification</th>
      <th>WH-1000 Premium Headphones</th>
      <th>WH-500 Sport Headphones</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Weight</td>
      <td>250g</td>
      <td>180g</td>
    </tr>
    <tr>
      <td>Battery Life</td>
      <td>30 hours (ANC off), 20 hours (ANC on)</td>
      <td>8 hours</td>
    </tr>
  </tbody>
</table>

If you read that file as plain text and pass it into a normal chunker, you are already in trouble.

Because a table is not meaningful as a sequence of words. A table works because rows and columns create relationships: this battery life belongs to this product, this weight belongs to that model, this number is only meaningful because of its label.

Semantic search is good at prose similarity — finding text that sounds like the query. But tables are relational structure, not prose. Once you flatten row and column relationships into a text stream, the embedding still captures the words, but it has lost the spreadsheet logic that made those words meaningful.

When you flatten the table into text too early, you lose the structure that makes the values interpretable.

So now the pipeline may retrieve a chunk containing "8 hours," but the model cannot easily tell whether that is battery life, charging time, or some other attribute. The number survived. The meaning did not.

That is not a chunking failure. It is a parsing failure.

And this is one of the most important lessons in the article: the pipeline can lose meaning before embeddings ever happen.

From html_table_to_text.py in the repo — the real fix:

pairs = [f"{headers[i]}: {row[i]}" for i in range(min(len(headers), len(row)))]
text_rows.append(" | ".join(pairs))

This is not interesting because of Python syntax. It is interesting because it expresses the real decision: turn structure into labeled text before chunking.

In practice, you would use an HTML parsing library like BeautifulSoup or lxml rather than parsing raw tags by hand — the important thing is not which tool you use, but that structure is preserved before chunking begins.

Once the table becomes something like Specification: Battery Life | WH-500 Sport Headphones: 8 hours, the rest of the pipeline has a fighting chance. The retriever sees self-contained facts. The generator can answer without guessing which number belongs to which product.
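Expanded into a self-contained helper, that pairing logic might look like this. This is a sketch, and it assumes headers and rows have already been parsed out of the HTML (by BeautifulSoup, lxml, or the stdlib html.parser):

```python
def table_to_labeled_rows(headers, rows):
    """Turn parsed table rows into 'label: value' lines, so each row
    reads as a self-contained fact after flattening."""
    text_rows = []
    for row in rows:
        # Pair every cell with its column header; min() guards ragged rows.
        pairs = [f"{headers[i]}: {row[i]}" for i in range(min(len(headers), len(row)))]
        text_rows.append(" | ".join(pairs))
    return text_rows
```

Each output line now carries its own labels, so even if a chunker later separates the lines, "8 hours" never drifts away from "Battery Life" and "WH-500".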

What works after structure-preserving preprocessing: retrieval becomes more precise, values stay attached to labels, and the answer can cite the right attribute.

What breaks without it: chunks contain raw HTML noise, values lose their relationships, and generation is forced to infer structure from flattened markup.

This is the clearest case where the right answer is not "better chunking." It is: teach the parser about the document's real shape.

Takeaway: When your documents have structure — tables, forms, code blocks — the pipeline needs to see that structure. Chunking a table as if it were prose discards the thing that makes the table useful.

Document Shapes: Where the Baseline Holds and Where It Strains

Three Questions, Three Retrievals

Now step back from the documents and look at the three questions the baseline script asks.

The important thing here is that retrieval behavior is downstream. By the time you ask the question, many decisions have already been made: how the file was parsed, how it was chunked, what boundaries were preserved, and what boundaries were lost.

Question 1: "What is TechNova's return policy?" This usually works because the underlying document is short, self-contained, and semantically direct. The upstream decision that helped: the document's natural structure kept each chunk as a complete policy unit.

Question 2: "My WH-1000 keeps disconnecting from Bluetooth. What should I do?" This strains because the quality of the answer depends on whether the troubleshooting procedure stayed intact during chunking. The upstream decision that matters: whether the chunker respected procedure boundaries.

Question 3: "What changed in the latest firmware update?" This strains because version boundaries are not automatically retrieval boundaries. The upstream decision that matters: whether each version was chunked and tagged as a distinct unit.

So the important lesson is not that retrieval succeeded or failed in isolation. The important lesson is which earlier decision made that outcome likely.

Takeaway: Retrieval is a downstream effect. The shape of your retrieval is decided when you decide how to parse and chunk.

Where This Baseline Breaks

At this point, the pattern should be visible.

The baseline pipeline does not fail randomly. It fails at the seams between document shape and pipeline assumptions.

Here are the four boundaries you just saw:

Chunking is structural, not statistical. Procedural content does not fail because your chunk size was a little off. It fails because the pipeline did not respect the structure the procedure depends on.

Similarity is a liability for near-duplicate content. Versioned documents look clean, but retrieval can still blend them because the system sees embedding neighbors, not the distinctions your user cares about.

Parsing is upstream of everything. If structure is lost during parsing, chunking and retrieval inherit that damage. HTML tables do not become trustworthy just because you embedded them.

Generation compounds upstream mistakes. Once retrieval hands generation bad evidence, the model often does not produce a visibly broken answer. It produces a fluent one. That is what makes these failures dangerous.

So what did this baseline actually give you?

Not a production-ready RAG system. Something more useful than that.

It gave you a visible pipeline. It gave you document-level failure modes. It gave you a baseline that can now be improved deliberately.

And that matters, because if you cannot see where the seams are, you cannot improve them.

Takeaway: The pipeline does not fail randomly. It fails at the seams between document shape and pipeline assumptions. Seeing those seams is the work.

What You've Seen

You already had the RAG pipeline in abstract form.

Now you have seen what it does to real documents.

You have seen when short policy-style documents pass through cleanly, when procedures break at chunk boundaries, when near-duplicate changelogs blend at retrieval time, and when structured HTML fails before chunking even starts.

That is the point of Part 5.

The code is in the companion repo. The baseline runs. But the main thing to carry forward is not the implementation. It is judgment.

For a document like this, what will the pipeline do? Where will it stress? What decision does that force?

But there is a bigger question underneath. We could keep optimizing this baseline — smarter chunking, structure-aware parsing, hybrid retrieval, reranking by metadata. Each of those would help. But the harder question is whether RAG was the right tool for every one of these cases in the first place.

That is the question that connects this article back to Part 4 — and forward to Part 6.

Because now that you have seen where RAG works and where it strains, the next question gets bigger: when is RAG the wrong tool entirely?

Next: RAG, Fine-Tuning, or Long Context? (Part 6 of 8)

Found this useful? Follow me on Dev.to for the rest of the series.

Source

This article was originally published by DEV Community and written by Gursharan Singh.