Technology Apr 26, 2026 · 5 min read

Part 4: Improving Retrieval Quality with Token-Aware Chunking and HyDE

DEV Community
by Sharath Kurup

[Figure: RAG architecture]

Making RAG Smarter with Token-Aware Chunking, HyDE, and Context-Aware Search

In Part 3, we improved chunking and optimized context. The system was faster and cleaner… but still not always correct.

What broke after Part 3?

By this point, the system looked solid:

  • Smarter chunking
  • Context compression
  • FAISS + re-ranking
  • Streaming responses

But when I started using it more realistically, a few problems showed up:

1. Token limits were still hurting quality

Even with better chunks, we were still not controlling how much context we send to the model.

2. Vague queries failed badly

Questions like:

  • “Explain this”
  • “What does it mean?”

…would often retrieve irrelevant chunks.

3. Follow-up questions felt disconnected

The system didn’t “remember” what we were talking about.

At this point, it stopped feeling like a "retrieval problem"
…and started feeling like a context-understanding problem.

So in Part 4, I focused on making RAG smarter.

What we’re building in this part

  • Token-aware chunking (based on actual LLM limits)
  • HyDE (Hypothetical Document Embeddings)
  • Early version of context-aware retrieval

Updated Pipeline

Before jumping into code, here’s how the pipeline evolved:

Before (Part 3):

Query → Embedding → FAISS → Re-rank → Context → LLM

Now (Part 4):

Query → (HyDE) → Better Query → Embedding  
     → FAISS → Re-rank → Token-aware Context  
     → LLM (with constraints)

[Figure: updated pipeline]

Problem 1: Token limits are real (and we were ignoring them)

Until now, chunking was based on:

  • characters
  • sentences
  • separators

But LLMs don’t think in characters.

They think in tokens.

Why this matters

You might send:

  • 2 chunks → fine
  • 5 chunks → maybe fine
  • 10 chunks → silently truncated or degraded

And the worst part?

You won’t even know it’s happening.
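To make the failure concrete, here's a tiny illustration (the function, names, and numbers are made up, not code from the repo) of how a fixed character budget drops context without raising any error:

```python
# Hypothetical illustration: a fixed character budget silently drops
# everything past the cutoff -- no exception, no warning.
def build_context(chunks, max_chars=100):
    context = " ".join(chunks)
    return context[:max_chars]  # overflow just disappears

chunks = ["chunk one " * 5, "chunk two " * 5, "chunk three " * 5]
context = build_context(chunks)
print(len(context))  # 100 -- the third chunk never reached the model
```

The model answers from whatever survived the cut, so the failure shows up as a vaguely wrong answer, not an error.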

Solution: Token-aware chunking

Instead of guessing chunk sizes, we measure them using a tokenizer.

import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")

def get_token_length(text):
    return len(tokenizer.encode(text))

Now chunking becomes token-driven instead of size-driven.

Token-based chunking strategy

Key idea:

  • Build chunks until a token limit
  • If exceeded → split intelligently
  • Maintain overlap using tokens, not characters

MAX_TOKENS = 250
OVERLAP_TOKENS = 50

Smarter chunk building

Instead of blindly splitting:

  • Prefer paragraphs
  • Then sentences
  • Then fallback

def generate_chunks_recursive_tokens(text, page_num):
    chunks = []
    paragraphs = text.split("\n\n")
    current_chunk = []
    current_tokens = 0

    for paragraph in paragraphs:
        paragraph_tokens = get_token_length(paragraph)

        # Close the current chunk before it exceeds the token limit
        if current_chunk and current_tokens + paragraph_tokens > MAX_TOKENS:
            chunks.append({
                "text": "\n\n".join(current_chunk),
                "page": page_num
            })

            # Carry trailing paragraphs over as token-based overlap
            current_chunk, current_tokens = _get_overlap(current_chunk)

        current_chunk.append(paragraph)
        current_tokens += paragraph_tokens

    # Flush the final partial chunk
    if current_chunk:
        chunks.append({
            "text": "\n\n".join(current_chunk),
            "page": page_num
        })

    return chunks

Why this works better

  • Matches actual LLM limits
  • Avoids hidden truncation
  • Improves context density
  • Makes responses more reliable
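The same token awareness applies when assembling the final context from re-ranked chunks. Here's a minimal sketch of a greedy packer (`pack_context` and `CONTEXT_BUDGET` are assumptions, not code from the repo); `token_len` again defaults to a whitespace count so it runs standalone:

```python
CONTEXT_BUDGET = 2000  # assumed token budget reserved for retrieved context

def pack_context(ranked_chunks, token_len=lambda t: len(t.split()),
                 budget=CONTEXT_BUDGET):
    """Greedily keep the highest-ranked chunks that fit the token budget."""
    picked, used = [], 0
    for chunk in ranked_chunks:
        tokens = token_len(chunk["text"])
        if used + tokens > budget:
            continue  # skip overflowing chunks instead of truncating silently
        picked.append(chunk)
        used += tokens
    return "\n\n".join(c["text"] for c in picked)
```

Because chunks arrive in rank order, anything skipped is the least relevant material, the opposite of silent tail truncation.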

Problem 2: RAG fails on vague queries

This was the bigger issue.

Even with good chunking, queries like:

“Explain this concept”

…don’t contain enough semantic signal.

So FAISS retrieves something… but often not the right thing.

Solution: HyDE (Hypothetical Document Embeddings)

This is one of the most interesting tricks in RAG.

Instead of embedding the raw query…

👉 We first generate a hypothetical answer
👉 Then embed that

Why this works

A vague query becomes a rich semantic representation.

Example:

User query:

Explain this

HyDE generates:

This concept refers to a method where...

Now embedding this gives:

  • More keywords
  • Better semantic alignment
  • Stronger retrieval

Implementation

Step 1: Generate hypothetical answer

def generate_hypothetical_answer(query, chat_history):
    recent_user = [m["content"] for m in chat_history[-4:] if m["role"] == "user"]
    history_text = " | ".join(recent_user[-2:])

    prompt = (
        f"Write a 2-sentence technical summary answering: {query}\n"
        f"Recent user context: {history_text}"
    )

    response = ollama.generate(
        model=HYDE_MODEL,
        prompt=prompt,
        stream=False
    )

    return response['response']
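The history slicing above is easy to sanity-check in isolation (the messages here are made up): take the last four messages, keep only the user turns, then join the last two of those.

```python
chat_history = [
    {"role": "user", "content": "Explain attention mechanism"},
    {"role": "assistant", "content": "Attention weighs how much each token..."},
    {"role": "user", "content": "How does it work?"},
    {"role": "assistant", "content": "Each query is compared against keys..."},
]

# Same slicing as in generate_hypothetical_answer
recent_user = [m["content"] for m in chat_history[-4:] if m["role"] == "user"]
history_text = " | ".join(recent_user[-2:])
print(history_text)  # Explain attention mechanism | How does it work?
```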

Step 2: Augment the query

hypothetical_answer = generate_hypothetical_answer(query, chat_history)
search_query = f"{query} {hypothetical_answer}"

Step 3: Embed the enriched query

response = ollama.embed(model=EMBED_MODEL, input=search_query)

Impact

  • Better retrieval for vague queries
  • Improved relevance
  • More stable responses

At the cost of:

  • ~1–2 seconds extra latency

Worth it? In most cases — yes.

Early Step: Context-aware retrieval

Another issue we started addressing:

Follow-up questions were treated as completely new queries.

Example:

User: Explain attention mechanism
User: How does it work?

Second query loses context.

What we added

A simple but effective improvement:

  • Detect vague follow-ups
  • Inject previous page context

if not target_page and vague_followups.match(query.strip()):
    last_pages = get_last_referenced_pages(chat_history)

    if last_pages:
        query = f"{query} page {last_pages[0]}"
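The `vague_followups` pattern isn't defined in this snippet; a minimal sketch of what it might look like (the exact phrases are assumptions):

```python
import re

# Hypothetical pattern for short, context-free follow-up questions
vague_followups = re.compile(
    r"^(how does (it|that) work|what does (it|that) mean"
    r"|explain (this|that|it)|why)\??$",
    re.IGNORECASE,
)
```

Anchoring with `^...$` keeps it from firing on specific questions that merely contain these phrases.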

Result

  • Follow-ups become meaningful
  • Retrieval stays anchored
  • Conversation feels connected

Putting it all together

Now the system:

  • Understands token limits
  • Improves weak queries
  • Handles basic conversation flow

[Figure: full system flow]

From learning project → real system

This is where things started to change.

Earlier:

  • It worked
  • It demonstrated RAG

Now:

  • It behaves more like a real assistant
  • Handles imperfect queries
  • Works under constraints

What’s next?

We’ve improved:

  • Data representation (chunks)
  • Query understanding (HyDE)

But one big gap still remains:

We still don’t know why RAG fails when it fails.

In Part 5, we’ll go deeper into:

  • Debugging the RAG pipeline
  • Visualizing FAISS vs re-ranking
  • Understanding retrieval quality
  • Making the system more transparent

Code

Full implementation available here:

👉 https://github.com/SharathKurup/chatPDF/blob/token_aware_rag/

Final thoughts

At a high level, RAG seems simple:

Retrieve → augment → generate

But in practice, most of the work is here:

  • How you represent data
  • How you interpret queries
  • How you control context

This part was about tightening those pieces.

And it makes a noticeable difference.

If you’ve been building with RAG, I’d recommend trying:

  • Token-aware chunking
  • HyDE

They’re relatively small changes — but high impact.

Let me know what you think or what you’d improve.

Source

This article was originally published by DEV Community and written by Sharath Kurup.
