Technology May 01, 2026 · 14 min read

RAG Series (2): Building Your First RAG Pipeline with LangChain

by WonderLab

From 100 Lines to a Production Pipeline

In the last article, we built a minimal RAG system in 100 lines of pure Python. It worked. It demonstrated the core idea. But if you tried to productionize that code, you'd quickly run into a wall.

Loading a PDF? You need PyPDF2 or pdfplumber, and you'll discover that tables, headers, and footers are a nightmare to parse cleanly.

Splitting text? Your naive text.split("\n\n") will cut sentences in half, break code blocks, or create chunks so large they blow past the token limit.

Swapping the vector database? Say goodbye to your afternoon — every database has a different API, different distance metrics, different metadata handling.

Switching LLM providers? OpenAI's client, Anthropic's client, local models via llama.cpp — each has its own message format, its own token counting, its own error handling.

This is exactly why LangChain exists. It doesn't do anything magical. It doesn't replace the underlying models or databases. What it does is simple and valuable: it gives you a uniform interface for plugging components together, so you can focus on the logic of your RAG system instead of the plumbing.
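
To make the "uniform interface" point concrete, here is a minimal sketch (model names are illustrative, and the Anthropic variant assumes the langchain-anthropic package is installed): every chat model exposes the same .invoke() call, so swapping providers is a one-line change.

from langchain_openai import ChatOpenAI
# from langchain_anthropic import ChatAnthropic  # requires: pip install langchain-anthropic

# Hosted OpenAI model (assumes OPENAI_API_KEY is set in the environment)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Swap providers without touching the rest of the pipeline:
# llm = ChatAnthropic(model="claude-3-5-sonnet-latest", temperature=0)

response = llm.invoke("Explain what a retriever does in one sentence.")
print(response.content)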

In this article, we'll rebuild our RAG pipeline using LangChain's modern API. By the end, you'll have a complete, runnable project that loads PDFs, splits them intelligently, stores them in ChromaDB, and answers questions using a multi-provider LLM — with about 30 lines of actual pipeline code.

LangChain Version Note: The code in this article is based on langchain 1.x (current stable). LangChain 1.x reorganized the 0.3.x APIs in a breaking way: high-level helpers such as create_retrieval_chain and create_stuff_documents_chain were removed. We use native LCEL syntax (the | pipe operator) instead, which is functionally equivalent and works on both 0.3.x and 1.x. Full source code: https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic

The Six Moving Parts of a RAG Pipeline

LangChain decomposes RAG into six components. Understanding what each one does — and where the quality risks hide — is the key to debugging RAG systems later.

| Component | Role | The Quality Risk |
|---|---|---|
| Document Loader | Reads raw files (PDF, Word, Markdown, HTML) and extracts text | Tables, images, and weird formatting get mangled |
| Text Splitter | Cuts long documents into semantically coherent chunks | Chunks too large = low precision; too small = lost context |
| Embedding Model | Converts text chunks into high-dimensional vectors | Poor model choice = semantically unrelated texts cluster together |
| Vector Store | Persists vectors and enables fast similarity search | Wrong distance metric or no metadata filtering = bad retrieval |
| Retriever | Accepts a query, searches the vector store, returns relevant chunks | Top-K too low = missing information; too high = noisy context |
| Chain | Orchestrates the full flow: query → retrieve → prompt → LLM → answer | Prompt design and context assembly determine answer quality |

Think of these six components as an assembly line in a factory. The Document Loader is the raw material intake. The Text Splitter is the precision cutting station. The Embedding Model and Vector Store form the warehouse and inventory system. The Retriever is the picker who fetches the right parts. The Chain is the foreman who coordinates everything and delivers the final product.

If any station on the line is misconfigured, the final product suffers — and the tricky part is that the failure often looks like an LLM problem when it's actually a retrieval problem.

Hands-On: A Complete LangChain RAG Project

Let's build it. We'll create a RAG system that reads a directory of PDF files, indexes them, and lets you ask questions in natural language.

Project Structure

rag-project/
├── requirements.txt
├── data/
│   └── sample.pdf          # Your PDF documents go here
└── rag_pipeline.py

Step 0: Dependencies

langchain>=0.3.0
langchain-text-splitters>=0.3.0
langchain-openai>=0.2.0
langchain-chroma>=0.1.0
langchain-community>=0.3.0
pypdf>=4.0.0
python-dotenv>=1.0.0

Full source code (ready to run): https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic
Supports Zhipu AI / OpenAI / Ollama for LLM, SiliconFlow and local Ollama for Embedding.

Install them:

pip install -r requirements.txt

Configure your API keys (copy .env.example to .env):

cp .env.example .env
# Edit .env to fill in LLM_API_KEY and EMBEDDING_API_KEY

Supported Providers:

  • LLM: Zhipu AI (default), OpenAI, SiliconFlow, Ollama, Azure
  • Embedding: SiliconFlow (default), OpenAI, Ollama

Step 1: Load Documents

The PyPDFLoader handles PDF parsing for us. It extracts text page by page and returns a list of Document objects, each containing the page content and metadata (page number, source file, etc.).

from langchain_community.document_loaders import PyPDFLoader
from pathlib import Path

def load_pdfs(data_dir: str = "./data"):
    """Load all PDF files from the data directory."""
    documents = []
    pdf_paths = list(Path(data_dir).glob("*.pdf"))

    for pdf_path in pdf_paths:
        loader = PyPDFLoader(str(pdf_path))
        pages = loader.load()
        documents.extend(pages)
        print(f"Loaded '{pdf_path.name}': {len(pages)} pages")

    print(f"Total documents loaded: {len(documents)}")
    return documents

Real-world note: PDFs are the wild west of document formats. If your PDFs contain scanned images, you'll need OCR (via pdfplumber with Tesseract or Azure Document Intelligence). If they contain tables, consider UnstructuredPDFLoader which preserves table structure better than raw text extraction.
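
If you want to try the element-aware route, here is a minimal sketch of UnstructuredPDFLoader in "elements" mode (it assumes the unstructured package with its PDF extras is installed; the category names come from unstructured's element metadata).

# Hypothetical alternative loader for table-heavy PDFs.
# Assumes: pip install "unstructured[pdf]"
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("./data/sample.pdf", mode="elements")
elements = loader.load()  # one Document per detected element (title, paragraph, table, ...)

tables = [el for el in elements if el.metadata.get("category") == "Table"]
print(f"{len(elements)} elements extracted, {len(tables)} tables detected")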

Step 2: Split Documents into Chunks

Remember the chunking problem from Part 1? LangChain's RecursiveCharacterTextSplitter is the industry default for good reason. It tries to split on natural boundaries — paragraphs first, then newlines, then sentences, then words — so it avoids cutting mid-sentence whenever possible.

from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents, chunk_size=200, chunk_overlap=30):
    """
    Split documents into overlapping chunks.
    chunk_overlap ensures context continuity between adjacent chunks.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""]
    )

    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks (chunk_size={chunk_size}, overlap={chunk_overlap})")
    return chunks

Why chunk_overlap matters: If a key concept spans two chunks — say, "The API rate limit is 100 requests per minute. Exceeding this limit returns a 429 status code" — an overlap of 50 characters ensures the second chunk still contains "100 requests per minute" as context. Without overlap, the retriever might fetch only one chunk and miss the causal relationship.

Chunk size trade-offs:

| Chunk Size | Precision | Context | Best For |
|---|---|---|---|
| 256 tokens | High | Minimal | Fact lookup, Q&A over structured docs |
| 512 tokens | Balanced | Moderate | General-purpose RAG (good default) |
| 1024 tokens | Lower | Rich | Long-form summarization, narrative documents |
| 2048+ tokens | Low | Very rich | Only when the LLM context window is large and queries are broad |

For most use cases, 512 tokens with 50-token overlap is a safe starting point.
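
One caveat: the splitter above measures chunk_size in characters (length_function=len), while this table talks in tokens. If you want to size chunks by token count instead, RecursiveCharacterTextSplitter has a tiktoken-backed constructor; a minimal sketch (assumes the tiktoken package is installed):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Size chunks by token count rather than character count.
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # encoding used by recent OpenAI models
    chunk_size=512,
    chunk_overlap=50,
)
chunks = token_splitter.split_documents(documents)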

Step 3: Embed and Store in ChromaDB

Now we convert each chunk into a vector and store them. ChromaDB is our choice here because it's persistent (data survives restarts), supports metadata filtering, and requires zero setup — it runs locally as an embedded database.

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
import os

persist_directory = "./chroma_db"

def build_vector_store(chunks):
    """Create embeddings and store in ChromaDB."""
    embeddings = OpenAIEmbeddings(
        model="BAAI/bge-large-zh-v1.5",  # SiliconFlow Chinese model
        api_key=os.getenv("EMBEDDING_API_KEY"),
        base_url="https://api.siliconflow.cn/v1",
        dimensions=1024,
        chunk_size=32  # SiliconFlow limit: max 32 per batch
    )

    if os.path.exists(persist_directory):
        import shutil
        shutil.rmtree(persist_directory)

    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory
    )
    print(f"Vector store built: {vector_store._collection.count()} vectors persisted")
    return vector_store

Embedding model note: We use BAAI/bge-large-zh-v1.5 on SiliconFlow (excellent for Chinese). Access via the OpenAI-compatible interface in langchain_openai. Switch to text-embedding-3-small for OpenAI. The chunk_size=32 parameter is critical — it's SiliconFlow's batch limit (max 32 per request), while most other providers default to 1000.
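
Swapping the embedding backend really is a one-line change. A sketch of two alternatives (model names are illustrative; the Ollama variant assumes the langchain-ollama package and a locally pulled model):

from langchain_openai import OpenAIEmbeddings

# OpenAI-hosted embeddings (assumes OPENAI_API_KEY is set)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Fully local embeddings via Ollama
# Assumes: pip install langchain-ollama && ollama pull bge-m3
# from langchain_ollama import OllamaEmbeddings
# embeddings = OllamaEmbeddings(model="bge-m3")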

Step 4: Build the Retriever

The retriever is a thin wrapper around the vector store that handles the search logic. By default, it performs similarity search and returns the top-K most relevant chunks.

def get_retriever(vector_store, search_k=4):
    """Configure retriever with similarity search."""
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": search_k}
    )
    return retriever
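
Plain similarity search with a fixed k is only one option. as_retriever also supports MMR (maximal marginal relevance, which trades a little raw similarity for diversity) and score-threshold filtering. A sketch, reusing the vector_store built in Step 3 (the threshold value is illustrative):

# MMR: fetch 20 candidates, return the 4 most diverse ones
mmr_retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20},
)

# Score threshold: only return chunks above a similarity cutoff
threshold_retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5},
)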

Step 5: Build the RAG Chain

Here's where LangChain's modern API shines. Instead of the older RetrievalQA class, we use LCEL (LangChain Expression Language) to compose a chain that is explicit, debuggable, and easy to modify.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def build_rag_chain(retriever):
    """
    Build the full RAG chain using LCEL (langchain 1.x compatible).
    retrieve → format_docs → assemble prompt → LLM → StrOutputParser
    """
    llm = ChatOpenAI(
        model="glm-4-flash",  # Zhipu AI, via SiliconFlow or direct
        api_key=os.getenv("LLM_API_KEY"),
        base_url="https://open.bigmodel.cn/api/paas/v4",
        temperature=0
    )

    # System prompt. {context} filled by format_docs, {question} by the user's input.
    system_prompt = (
        "You are a precise knowledge assistant. Answer the user's question "
        "based solely on the provided reference content below. "
        "If the reference content does not contain the answer, say so clearly "
        "— do not make anything up.\n\n"
        "Reference content:\n{context}"
    )

    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{question}")
    ])

    # Helper: convert Document list to a single string for {context}
    def format_docs(docs: list) -> str:
        return "\n\n".join(doc.page_content for doc in docs)

    # LCEL Chain: compose components with the pipe operator
    # 1. {"context": retriever | format_docs, "question": RunnablePassthrough()}
    #    → retriever fetches docs, format_docs converts them to a string
    # 2. | prompt → assembles into a full prompt
    # 3. | llm → generates the answer
    # 4. | StrOutputParser() → returns plain text (not an AIMessage object)
    rag_chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough()
        }
        | prompt
        | llm
        | StrOutputParser()
    )

    return rag_chain

What's happening here? We're using LCEL (LangChain Expression Language) native syntax — the pipe | operator — instead of the high-level create_retrieval_chain (which was removed in langchain 1.x). The key line is retriever | format_docs: the retriever outputs a list of Document objects, format_docs converts them to a string that fills the {context} placeholder. RunnablePassthrough() passes the user's raw question through to the {question} placeholder. Three lines of declarative code that are functionally equivalent to the 50 lines of imperative Python from Part 1.

Step 6: Query the Pipeline

def query(rag_chain, question: str, retriever):
    """Run a question through the RAG pipeline, print answer and sources."""
    print(f"\nQuestion: {question}")

    answer = rag_chain.invoke(question)  # LCEL chain returns plain text directly
    print(f"\nAnswer:\n{answer}")

    # Retrieve docs separately to show sources (rag_chain doesn't expose them)
    docs = retriever.invoke(question)
    print("\nRetrieved sources:")
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        preview = doc.page_content[:120].replace("\n", " ")
        print(f"  [{i}] {source} (page {page}): {preview}...")

    return answer
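
The helper above calls the retriever a second time just to display sources. If you would rather have a single chain that returns both the answer and the retrieved documents, one common LCEL pattern is to wrap the generation step with RunnableParallel and assign. A sketch (it takes prompt and llm as parameters because this article builds them inside build_rag_chain):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

def build_rag_chain_with_sources(retriever, prompt, llm):
    """Variant chain returning {"answer": str, "context": list[Document]}."""
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    # Generation step: stringify the docs, fill the prompt, call the LLM
    answer_chain = (
        RunnablePassthrough.assign(context=lambda x: format_docs(x["context"]))
        | prompt
        | llm
        | StrOutputParser()
    )
    # Keep the raw docs in the output and add the answer alongside them
    return RunnableParallel(
        {"context": retriever, "question": RunnablePassthrough()}
    ).assign(answer=answer_chain)

# result = build_rag_chain_with_sources(retriever, prompt, llm).invoke("What is Automotive SPICE?")
# result["answer"] -> str, result["context"] -> list of source Documents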

Putting It All Together

if __name__ == "__main__":
    # 1. Load
    docs = load_pdfs("./data")

    # 2. Split (smaller chunk_size for short PDF pages)
    chunks = split_documents(docs, chunk_size=200, chunk_overlap=30)

    # 3. Embed & Store
    vector_store = build_vector_store(chunks)

    # 4. Retrieve
    retriever = get_retriever(vector_store, search_k=4)

    # 5. Build Chain (LCEL)
    rag_chain = build_rag_chain(retriever)

    # 6. Interactive Q&A
    while True:
        user_input = input("\nYour question (quit to exit): ").strip()
        if user_input.lower() in ("quit", "exit", "q"):
            break
        if user_input:
            query(rag_chain, user_input, retriever)

Running the Pipeline

Place a PDF in data/sample.pdf and run:

python rag_pipeline.py

Sample output:

RAG Pipeline starting
  LLM Provider    : zhipu
  LLM Model       : glm-4-flash
  Embedding       : openai / BAAI/bge-large-zh-v1.5
  Data directory  : ./data
  Vector store    : ./chroma_db
==================================================
Loaded 'Automotive-SPICE-PAM-v40.pdf': 153 pages
Total documents loaded: 153
Split into 2000 chunks (chunk_size=200, overlap=30)
[Embedding] Provider: openai | Model: BAAI/bge-large-zh-v1.5 | Base: https://api.siliconflow.cn/v1
Cleared old vector store: ./chroma_db
Vector store built: 2000 vectors persisted
==================================================
RAG Pipeline ready! Type a question to start (enter 'quit' to exit)
==================================================

Your question: What is Automotive SPICE?

==================================================
Question: What is Automotive SPICE?
==================================================

Answer:
Automotive SPICE (Automotive Software Process Improvement and
Capability Determination) is a framework for assessing and improving
the capability of automotive software development processes. It defines
the key process areas of the software development lifecycle and
establishes graded levels for assessing process capability...

Retrieved sources:
  [1] ./data/Automotive-SPICE-PAM-v40.pdf (page 5): Automotive
      SPICE Process Assessment Model The Process Assessment Model
      (PAM) defines the processes...

Notice how the output includes source citations: we know exactly which pages the information came from. This traceability is critical for production RAG systems where users need to verify claims.

What Changed From Our 100-Line Version?

Let's compare the hand-written RAG from Part 1 with our LangChain pipeline:

| Aspect | Hand-written (Part 1) | LangChain (Part 2) |
|---|---|---|
| PDF loading | Not supported | One-line PyPDFLoader |
| Text splitting | None (whole docs) | RecursiveCharacterTextSplitter with smart boundaries |
| Vector persistence | In-memory only, lost on restart | ChromaDB persists to disk |
| Embedding model swap | Rewrite API calls | One-line parameter change |
| LLM swap | Rewrite client code | One-line parameter change |
| Vector DB swap | Rewrite storage + search | Swap Chroma for Qdrant / Pinecone |
| Prompt engineering | Raw string formatting | Templated prompts with ChatPromptTemplate |
| Source citations | Manual tracking | Automatic metadata propagation |
| Chain construction | Manual retrieve + generate | LCEL pipe composition |
| Lines of pipeline code | ~80 | ~25 |

The abstraction doesn't hide complexity — it isolates it.

When you need to debug retrieval quality, you know exactly which component to tune. When you need to swap the embedding model for a cheaper alternative, you change one line. When your data grows beyond ChromaDB's capabilities, you switch to Qdrant without touching the rest of the pipeline.

Common Pitfalls at Each Stage

Loader Pitfall: "My PDF has tables and they come out garbled"

Raw PDF text extraction flattens tables into a stream of numbers. For table-heavy documents, use UnstructuredPDFLoader or AzureAIDocumentIntelligenceLoader which preserves structural relationships.

Splitter Pitfall: "The answer is split across two chunks and the model only sees half"

Increase chunk_overlap to 100-150 tokens, or reduce chunk_size so that key concepts fit within a single chunk. Better yet, use Parent-Document Retrieval (covered in a later article) which retrieves small chunks but returns the full parent document for context.
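
If you want to experiment before that article, LangChain ships a ParentDocumentRetriever. A minimal sketch against the langchain 0.3.x import paths (they may differ in 1.x; chunk sizes are illustrative, and ideally you would point it at a fresh Chroma collection rather than the one built above):

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)

parent_retriever = ParentDocumentRetriever(
    vectorstore=vector_store,   # small child chunks are embedded and searched here
    docstore=InMemoryStore(),   # full parent chunks are stored and returned from here
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
parent_retriever.add_documents(docs)  # docs = output of load_pdfs()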

Embedding Pitfall: "Questions and documents don't match even though they should"

This is the "asymmetric retrieval" problem. A user asks "How do I reset my password?" but the document says "To reset your password, navigate to Settings → Security." The question and the answer embed to different vectors because their surface text differs. Solutions: use a model fine-tuned for Q&A retrieval (like BGE-M3), or generate hypothetical answers for retrieval (HyDE — also covered later).
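
A bare-bones sketch of the HyDE idea using only the components we already have: ask the LLM for a hypothetical answer first, then retrieve with that answer instead of the raw question (the prompt wording is illustrative).

def hyde_retrieve(question: str, llm, retriever):
    """Retrieve with a hypothetical answer instead of the raw question."""
    hypothetical = llm.invoke(
        "Write a short, plausible answer to the question below. "
        "It does not need to be correct; it is only used for retrieval.\n\n"
        f"Question: {question}"
    ).content
    return retriever.invoke(hypothetical)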

Retriever Pitfall: "Top-K=4 isn't enough for complex questions"

If a question requires synthesizing information from five different sections of a document, k=4 will miss one. But increasing k blindly adds noise. A better approach: use Multi-Query Retrieval (generate 3 variants of the question, retrieve for each, deduplicate) or Reranking (retrieve 20, then use a cross-encoder to pick the best 5).
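
LangChain ships a ready-made MultiQueryRetriever that wraps the multi-query pattern. A sketch using the langchain 0.3.x import path (it needs an LLM instance to generate the query variants; in this article the llm is built inside build_rag_chain, so treat this as illustrative):

from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    llm=llm,  # any chat model; used to rewrite the question into several variants
)
docs = multi_query_retriever.invoke("What process areas does Automotive SPICE define?")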

Chain Pitfall: "The model ignores the context and hallucinates"

Your prompt matters. The system prompt must explicitly instruct the model to use only the provided context. Adding "If the reference content does not contain the answer, say so clearly — do not make anything up" dramatically improves faithfulness. We'll measure this quantitatively with RAGAS in the evaluation articles.

Summary

In this article, we took the raw RAG concept from Part 1 and wrapped it in a production-ready framework. Here's what we covered:

  1. The six components of a LangChain RAG pipeline — Loader, Splitter, Embedding, Vector Store, Retriever, and Chain — and what quality risk hides in each one.
  2. A complete, runnable project that loads PDFs, splits them with RecursiveCharacterTextSplitter, embeds them with OpenAI, stores them in ChromaDB, and answers questions via a LangChain LCEL chain.
  3. The chunk size trade-off: in real projects, PDF pages can be very short (e.g., around 200 characters), so a chunk_size=512 setting barely splits anything further; chunk_size=200 with a 30-character overlap is the safer default for this kind of corpus.
  4. Common pitfalls at each pipeline stage, from mangled PDF tables to asymmetric retrieval mismatches.

The code in this article is a solid foundation. It handles real PDFs, persists data, and gives you source citations. But it's still a naive RAG pipeline — one query, one retrieval pass, one answer. In the next articles, we'll add the components that separate toy demos from production systems: hybrid search, reranking, query optimization, and evaluation frameworks.

Source

This article was originally published by DEV Community and written by WonderLab.
