From 100 Lines to a Production Pipeline
In the last article, we built a minimal RAG system in 100 lines of pure Python. It worked. It demonstrated the core idea. But if you tried to productionize that code, you'd quickly run into a wall.
Loading a PDF? You need PyPDF2 or pdfplumber, and you'll discover that tables, headers, and footers are a nightmare to parse cleanly.
Splitting text? Your naive text.split("\n\n") will cut sentences in half, break code blocks, or create chunks so large they blow past the token limit.
Swapping the vector database? Say goodbye to your afternoon — every database has a different API, different distance metrics, different metadata handling.
Switching LLM providers? OpenAI's client, Anthropic's client, local models via llama.cpp — each has its own message format, its own token counting, its own error handling.
This is exactly why LangChain exists. It doesn't do anything magical. It doesn't replace the underlying models or databases. What it does is simple and valuable: it gives you a uniform interface for plugging components together, so you can focus on the logic of your RAG system instead of the plumbing.
In this article, we'll rebuild our RAG pipeline using LangChain's modern API. By the end, you'll have a complete, runnable project that loads PDFs, splits them intelligently, stores them in ChromaDB, and answers questions using a multi-provider LLM — with about 30 lines of actual pipeline code.
LangChain Version Note: The code in this article is based on langchain 1.x (current stable). LangChain 1.x was a breaking reorganization of 0.3.x — high-level APIs like create_retrieval_chain were removed. We use native LCEL syntax (the | pipe operator) instead, which is functionally equivalent and version-agnostic. Full source code: https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic
The Six Moving Parts of a RAG Pipeline
LangChain decomposes RAG into six components. Understanding what each one does — and where the quality risks hide — is the key to debugging RAG systems later.
| Component | Role | The Quality Risk |
|---|---|---|
| Document Loader | Reads raw files (PDF, Word, Markdown, HTML) and extracts text | Tables, images, and weird formatting get mangled |
| Text Splitter | Cuts long documents into semantically coherent chunks | Chunks too large = low precision; too small = lost context |
| Embedding Model | Converts text chunks into high-dimensional vectors | Poor model choice = semantically unrelated texts cluster together |
| Vector Store | Persists vectors and enables fast similarity search | Wrong distance metric or no metadata filtering = bad retrieval |
| Retriever | Accepts a query, searches the vector store, returns relevant chunks | Top-K too low = missing information; too high = noisy context |
| Chain | Orchestrates the full flow: query → retrieve → prompt → LLM → answer | Prompt design and context assembly determine answer quality |
Think of these six components as an assembly line in a factory. The Document Loader is the raw material intake. The Text Splitter is the precision cutting station. The Embedding Model and Vector Store form the warehouse and inventory system. The Retriever is the picker who fetches the right parts. The Chain is the foreman who coordinates everything and delivers the final product.
If any station on the line is misconfigured, the final product suffers — and the tricky part is that the failure often looks like an LLM problem when it's actually a retrieval problem.
Hands-On: A Complete LangChain RAG Project
Let's build it. We'll create a RAG system that reads a directory of PDF files, indexes them, and lets you ask questions in natural language.
Project Structure
```
rag-project/
├── requirements.txt
├── data/
│   └── sample.pdf        # Your PDF documents go here
└── rag_pipeline.py
```
Step 0: Dependencies
```
langchain>=0.3.0
langchain-text-splitters>=0.3.0
langchain-openai>=0.2.0
langchain-chroma>=0.1.0
langchain-community>=0.3.0
pypdf>=4.0.0
python-dotenv>=1.0.0
```
Full source code (ready to run): https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic
Supports Zhipu AI / OpenAI / Ollama for LLM, SiliconFlow and local Ollama for Embedding.
Install them:
```bash
pip install -r requirements.txt
```
Configure your API keys (copy .env.example to .env):
```bash
cp .env.example .env
# Edit .env to fill in LLM_API_KEY and EMBEDDING_API_KEY
```
Supported Providers:
- LLM: Zhipu AI (default), OpenAI, SiliconFlow, Ollama, Azure
- Embedding: SiliconFlow (default), OpenAI, Ollama
Step 1: Load Documents
The PyPDFLoader handles PDF parsing for us. It extracts text page by page and returns a list of Document objects, each containing the page content and metadata (page number, source file, etc.).
```python
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader


def load_pdfs(data_dir: str = "./data"):
    """Load all PDF files from the data directory."""
    documents = []
    pdf_paths = list(Path(data_dir).glob("*.pdf"))
    for pdf_path in pdf_paths:
        loader = PyPDFLoader(str(pdf_path))
        pages = loader.load()
        documents.extend(pages)
        print(f"Loaded '{pdf_path.name}': {len(pages)} pages")
    print(f"Total documents loaded: {len(documents)}")
    return documents
```
Real-world note: PDFs are the wild west of document formats. If your PDFs contain scanned images, you'll need OCR (via pdfplumber with Tesseract, or Azure Document Intelligence). If they contain tables, consider UnstructuredPDFLoader, which preserves table structure better than raw text extraction.
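A cheap sanity check before indexing is to flag pages whose extracted text is suspiciously short, a common symptom of scanned pages that need OCR. This is a plain-Python sketch, not a LangChain API; the `min_chars` threshold is an illustrative assumption:

```python
def find_suspect_pages(pages, min_chars=50):
    """Return indices of pages whose extracted text is too short to be useful.

    `pages` is any sequence of objects with a `page_content` string, such as
    the Documents returned by PyPDFLoader. Pages below `min_chars` characters
    of real text are likely scanned images that need OCR before indexing.
    """
    return [
        i for i, page in enumerate(pages)
        if len(page.page_content.strip()) < min_chars
    ]
```

Running this right after `load_pdfs` tells you early whether your corpus needs an OCR pass, instead of discovering empty chunks at query time.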
Step 2: Split Documents into Chunks
Remember the chunking problem from Part 1? LangChain's RecursiveCharacterTextSplitter is the industry default for good reason. It tries to split on natural boundaries — paragraphs first, then newlines, then sentences, then words — so it avoids cutting mid-sentence whenever possible.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_documents(documents, chunk_size=200, chunk_overlap=30):
    """
    Split documents into overlapping chunks.
    chunk_overlap ensures context continuity between adjacent chunks.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks (chunk_size={chunk_size}, overlap={chunk_overlap})")
    return chunks
```
Why chunk_overlap matters: If a key concept spans two chunks — say, "The API rate limit is 100 requests per minute. Exceeding this limit returns a 429 status code" — an overlap of 50 characters ensures the second chunk still contains "100 requests per minute" as context. Without overlap, the retriever might fetch only one chunk and miss the causal relationship.
Chunk size trade-offs:
| Chunk Size | Precision | Context | Best For |
|---|---|---|---|
| 256 tokens | High | Minimal | Fact lookup, Q&A over structured docs |
| 512 tokens | Balanced | Moderate | General-purpose RAG (good default) |
| 1024 tokens | Lower | Rich | Long-form summarization, narrative documents |
| 2048+ tokens | Low | Very rich | Only when the LLM context window is large and queries are broad |
For most use cases, 512 tokens with 50-token overlap is a safe starting point.
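To see the sliding-window effect that chunk_overlap produces, here's a deliberately naive character-based chunker. This is a toy illustration, not LangChain's recursive algorithm, which additionally respects separator boundaries:

```python
def naive_chunks(text: str, chunk_size: int, overlap: int):
    """Split `text` into fixed-size windows. Each window starts
    `chunk_size - overlap` characters after the previous one, so
    adjacent chunks share their last/first `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


text = ("The API rate limit is 100 requests per minute. "
        "Exceeding this limit returns a 429 status code.")
for chunk in naive_chunks(text, chunk_size=60, overlap=20):
    print(repr(chunk))
```

With overlap=20, the second chunk repeats the tail of the first, so "100 requests per minute" survives in both and the causal link to the 429 error stays retrievable from a single chunk.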
Step 3: Embed and Store in ChromaDB
Now we convert each chunk into a vector and store them. ChromaDB is our choice here because it's persistent (data survives restarts), supports metadata filtering, and requires zero setup — it runs locally as an embedded database.
```python
import os
import shutil

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

persist_directory = "./chroma_db"


def build_vector_store(chunks):
    """Create embeddings and store in ChromaDB."""
    embeddings = OpenAIEmbeddings(
        model="BAAI/bge-large-zh-v1.5",  # SiliconFlow Chinese model
        api_key=os.getenv("EMBEDDING_API_KEY"),
        base_url="https://api.siliconflow.cn/v1",
        dimensions=1024,
        chunk_size=32,  # SiliconFlow limit: max 32 texts per batch
    )
    # Rebuild from scratch: clear any previous index
    if os.path.exists(persist_directory):
        shutil.rmtree(persist_directory)
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
    )
    print(f"Vector store built: {vector_store._collection.count()} vectors persisted")
    return vector_store
```
Embedding model note: We use BAAI/bge-large-zh-v1.5 on SiliconFlow (excellent for Chinese). Access via the OpenAI-compatible interface in langchain_openai. Switch to text-embedding-3-small for OpenAI. The chunk_size=32 parameter is critical — it's SiliconFlow's batch limit (max 32 per request), while most other providers default to 1000.
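If you ever call a batch-capped provider through a client that doesn't batch for you, splitting the input yourself is simple bookkeeping. A sketch, where `embed_fn` stands in for whatever embeddings call your client exposes (it is a placeholder, not a real library function):

```python
def embed_in_batches(texts, embed_fn, batch_size=32):
    """Embed `texts` in batches of at most `batch_size`, concatenating results.

    `embed_fn` takes a list of strings and returns a list of vectors;
    it stands in for a provider call such as an embeddings endpoint
    with a per-request item limit.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors
```

With langchain_openai this is unnecessary, since the `chunk_size` parameter does the same batching internally; the sketch just shows what that parameter is saving you from writing.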
Step 4: Build the Retriever
The retriever is a thin wrapper around the vector store that handles the search logic. By default, it performs similarity search and returns the top-K most relevant chunks.
```python
def get_retriever(vector_store, search_k=4):
    """Configure a retriever that performs similarity search."""
    return vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": search_k},
    )
```
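Under the hood, similarity search is just "score every stored vector against the query vector, return the k closest". A minimal pure-Python sketch of cosine-based top-k retrieval; real vector stores replace this brute-force scan with approximate indexes such as HNSW:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, store, k=4):
    """`store` is a list of (chunk_text, vector) pairs.
    Returns the texts of the k most similar chunks."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Seeing the mechanism laid bare also explains the Top-K trade-off in the table above: k only controls how far down this ranked list you read.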
Step 5: Build the RAG Chain
Here's where LangChain's modern API shines. Instead of the older RetrievalQA class, we use LCEL (LangChain Expression Language) to compose a chain that is explicit, debuggable, and easy to modify.
```python
import os

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI


def build_rag_chain(retriever):
    """
    Build the full RAG chain using LCEL (langchain 1.x compatible).
    retrieve → format_docs → assemble prompt → LLM → StrOutputParser
    """
    llm = ChatOpenAI(
        model="glm-4-flash",  # Zhipu AI, via SiliconFlow or direct
        api_key=os.getenv("LLM_API_KEY"),
        base_url="https://open.bigmodel.cn/api/paas/v4",
        temperature=0,
    )

    # System prompt. {context} is filled by format_docs, {question} by the user's input.
    system_prompt = (
        "You are a precise knowledge assistant. Answer the user's question "
        "based solely on the provided reference content below. "
        "If the reference content does not contain the answer, say so clearly "
        "— do not make anything up.\n\n"
        "Reference content:\n{context}"
    )
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{question}"),
    ])

    # Helper: convert a list of Documents to a single string for {context}
    def format_docs(docs: list) -> str:
        return "\n\n".join(doc.page_content for doc in docs)

    # LCEL chain: compose components with the pipe operator
    # 1. {"context": retriever | format_docs, "question": RunnablePassthrough()}
    #    → retriever fetches docs, format_docs converts them to a string
    # 2. | prompt             → assembles the full prompt
    # 3. | llm                → generates the answer
    # 4. | StrOutputParser()  → returns plain text (not an AIMessage object)
    rag_chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_chain
```
What's happening here? We're using LCEL (LangChain Expression Language) native syntax — the pipe | operator — instead of the high-level create_retrieval_chain (which was removed in langchain 1.x). The key line is retriever | format_docs: the retriever outputs a list of Document objects, format_docs converts them to a string that fills the {context} placeholder. RunnablePassthrough() passes the user's raw question through to the {question} placeholder. Three lines of declarative code that are functionally equivalent to the 50 lines of imperative Python from Part 1.
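If the pipe operator feels magical, it isn't: LCEL components overload Python's `__or__` so that `a | b` builds a new runnable that feeds a's output into b. Here is a stripped-down imitation of that mechanism; this is an illustration only, not LangChain's actual Runnable class:

```python
class Step:
    """Wraps a function and overloads | so steps compose left to right."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # a | b  →  a new Step that runs a, then feeds the result to b
        other_fn = other.fn if isinstance(other, Step) else other
        return Step(lambda x: other_fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)


# A toy "chain": retrieve → format → prompt
retrieve = Step(lambda q: [f"doc about {q}"])
fmt = Step(lambda docs: "\n".join(docs))
prompt = Step(lambda ctx: f"Context:\n{ctx}")

chain = retrieve | fmt | prompt
print(chain.invoke("rate limits"))
```

That is the whole trick: `|` is function composition with a nicer spelling, which is why an LCEL chain stays debuggable; you can `invoke` any prefix of it in isolation.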
Step 6: Query the Pipeline
```python
def query(rag_chain, question: str, retriever):
    """Run a question through the RAG pipeline, print the answer and sources."""
    print(f"\nQuestion: {question}")
    answer = rag_chain.invoke(question)  # LCEL chain returns plain text directly
    print(f"\nAnswer:\n{answer}")
    # Retrieve docs separately to show sources (rag_chain doesn't expose them)
    docs = retriever.invoke(question)
    print("\nRetrieved sources:")
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        preview = doc.page_content[:120].replace("\n", " ")
        print(f"  [{i}] {source} (page {page}): {preview}...")
    return answer
```
Putting It All Together
```python
if __name__ == "__main__":
    # 1. Load
    docs = load_pdfs("./data")
    # 2. Split (smaller chunk_size for short PDF pages)
    chunks = split_documents(docs, chunk_size=200, chunk_overlap=30)
    # 3. Embed & store
    vector_store = build_vector_store(chunks)
    # 4. Retrieve
    retriever = get_retriever(vector_store, search_k=4)
    # 5. Build chain (LCEL)
    rag_chain = build_rag_chain(retriever)
    # 6. Interactive Q&A
    while True:
        user_input = input("\nYour question (quit to exit): ").strip()
        if user_input.lower() in ("quit", "exit", "q"):
            break
        if user_input:
            query(rag_chain, user_input, retriever)
```
Running the Pipeline
Place a PDF in data/sample.pdf and run:
```bash
python rag_pipeline.py
```
Sample output:
```
RAG Pipeline starting
LLM Provider : zhipu
LLM Model    : glm-4-flash
Embedding    : openai / BAAI/bge-large-zh-v1.5
Data dir     : ./data
Vector store : ./chroma_db
==================================================
Loaded 'Automotive-SPICE-PAM-v40.pdf': 153 pages
Total documents loaded: 153
Split into 2000 chunks (chunk_size=200, overlap=30)
[Embedding] Provider: openai | Model: BAAI/bge-large-zh-v1.5 | Base: https://api.siliconflow.cn/v1
Cleared old vector store: ./chroma_db
Vector store built: 2000 vectors persisted
==================================================
RAG Pipeline ready! Type a question to start ('quit' to exit)
==================================================
Your question: What is Automotive SPICE?
==================================================
Question: What is Automotive SPICE?
==================================================
Answer:
Automotive SPICE (Automotive Software Process Improvement and
Capability Determination) is a framework for assessing and improving
the capability of automotive software development processes. It
defines key process areas across the software development lifecycle
and establishes capability-level assessment criteria...

Retrieved sources:
  [1] ./data/Automotive-SPICE-PAM-v40.pdf (page 5): Automotive
      SPICE Process Assessment Model The Process Assessment Model
      (PAM) defines the processes...
```

(Output translated from the original Chinese console log.)
Notice how the answer includes citations — we know exactly which pages the information came from. This traceability is critical for production RAG systems where users need to verify claims.
What Changed From Our 100-Line Version?
Let's compare the hand-written RAG from Part 1 with our LangChain pipeline:
| Aspect | Hand-written (Part 1) | LangChain (Part 2) |
|---|---|---|
| PDF loading | Not supported | One-line PyPDFLoader |
| Text splitting | None (whole docs) | RecursiveCharacterTextSplitter with smart boundaries |
| Vector persistence | In-memory only, lost on restart | ChromaDB persists to disk |
| Embedding model swap | Rewrite API calls | One-line parameter change |
| LLM swap | Rewrite client code | One-line parameter change |
| Vector DB swap | Rewrite storage + search | Swap Chroma for Qdrant / Pinecone |
| Prompt engineering | Raw string formatting | Templated prompts with ChatPromptTemplate |
| Source citations | Manual tracking | Automatic metadata propagation |
| Chain construction | Manual retrieve + generate | LCEL pipe composition |
| Lines of pipeline code | ~80 | ~25 |
The abstraction doesn't hide complexity — it isolates it.
When you need to debug retrieval quality, you know exactly which component to tune. When you need to swap the embedding model for a cheaper alternative, you change one line. When your data grows beyond ChromaDB's capabilities, you switch to Qdrant without touching the rest of the pipeline.
Common Pitfalls at Each Stage
Loader Pitfall: "My PDF has tables and they come out garbled"
Raw PDF text extraction flattens tables into a stream of numbers. For table-heavy documents, use UnstructuredPDFLoader or AzureAIDocumentIntelligenceLoader, which preserve structural relationships.
Splitter Pitfall: "The answer is split across two chunks and the model only sees half"
Increase chunk_overlap to 100-150 tokens, or reduce chunk_size so that key concepts fit within a single chunk. Better yet, use Parent-Document Retrieval (covered in a later article) which retrieves small chunks but returns the full parent document for context.
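The core idea behind Parent-Document Retrieval is simple enough to sketch in a few lines: index small chunks for precise matching, but keep a mapping back to the larger parent so generation sees full context. A pure-Python illustration; the dict-based store is just for demonstration, not how LangChain's ParentDocumentRetriever is implemented:

```python
def build_parent_index(parents, chunker):
    """Map each small chunk back to the id of the parent document it came from.

    `parents` is a dict of {parent_id: full_text}; `chunker` splits a text
    into small chunks (these are what you would embed and search over).
    """
    chunk_to_parent = {}
    for pid, parent_text in parents.items():
        for chunk in chunker(parent_text):
            chunk_to_parent[chunk] = pid
    return chunk_to_parent


def retrieve_parents(matched_chunks, chunk_to_parent, parents):
    """Given chunks matched by vector search, return full parent docs, deduped."""
    seen, results = set(), []
    for chunk in matched_chunks:
        pid = chunk_to_parent[chunk]
        if pid not in seen:
            seen.add(pid)
            results.append(parents[pid])
    return results
```

The payoff: the retriever matches on tight, precise chunks, but the LLM receives the whole parent, so an answer split across two chunks is no longer split at all.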
Embedding Pitfall: "Questions and documents don't match even though they should"
This is the "asymmetric retrieval" problem. A user asks "How do I reset my password?" but the document says "To reset your password, navigate to Settings → Security." The question and the answer embed to different vectors because their surface text differs. Solutions: use a model fine-tuned for Q&A retrieval (like BGE-M3), or generate hypothetical answers for retrieval (HyDE — also covered later).
Retriever Pitfall: "Top-K=4 isn't enough for complex questions"
If a question requires synthesizing information from five different sections of a document, k=4 will miss one. But increasing k blindly adds noise. A better approach: use Multi-Query Retrieval (generate 3 variants of the question, retrieve for each, deduplicate) or Reranking (retrieve 20, then use a cross-encoder to pick the best 5).
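The retrieve-per-variant-then-deduplicate step of Multi-Query Retrieval is mostly bookkeeping. A hedged sketch, where `rewrite` and `search` are placeholders for the LLM-based question rewriter and the vector-search call:

```python
def multi_query_retrieve(question, rewrite, search, k_per_query=4):
    """Retrieve for the original question plus each rewritten variant,
    merging results in order and dropping duplicates by content.

    `rewrite(question)` returns a list of alternative phrasings;
    `search(query, k)` returns up to k retrieved chunks for one query.
    """
    seen, merged = set(), []
    for q in [question, *rewrite(question)]:
        for doc in search(q, k=k_per_query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```

Because duplicates across variants collapse, the merged context grows with genuinely new evidence rather than with k; the reranking alternative mentioned above then becomes a drop-in post-processing step on `merged`.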
Chain Pitfall: "The model ignores the context and hallucinates"
Your prompt matters. The system prompt must explicitly instruct the model to use only the provided context. Adding "If the reference content does not contain the answer, say so clearly — do not make anything up" dramatically improves faithfulness. We'll measure this quantitatively with RAGAS in the evaluation articles.
Summary
In this article, we took the raw RAG concept from Part 1 and wrapped it in a production-ready framework. Here's what we covered:
- The six components of a LangChain RAG pipeline — Loader, Splitter, Embedding, Vector Store, Retriever, and Chain — and what quality risk hides in each one.
- A complete, runnable project that loads PDFs, splits them with RecursiveCharacterTextSplitter, embeds them via the OpenAI-compatible embeddings client, stores them in ChromaDB, and answers questions through a LangChain LCEL chain.
- The chunk-size trade-off — in real projects, PDF pages may be very short (e.g., 200 characters), and a chunk_size=512 default can produce 0 chunks; 200 with 30 overlap is the safe default here.
- Common pitfalls at each pipeline stage, from mangled PDF tables to asymmetric retrieval mismatches.
The code in this article is a solid foundation. It handles real PDFs, persists data, and gives you source citations. But it's still a naive RAG pipeline — one query, one retrieval pass, one answer. In the next articles, we'll add the components that separate toy demos from production systems: hybrid search, reranking, query optimization, and evaluation frameworks.
References
- LangChain RAG Tutorial — Official LangChain RAG quickstart
- LangChain Expression Language (LCEL) — Why and how to use LCEL for composable chains
- ChromaDB Documentation — Vector store setup, persistence, and querying
This article was originally published by DEV Community and written by WonderLab.