Let’s be honest: our medical history is usually a chaotic mess of scattered PDFs, blurry smartphone photos of prescriptions, and "I think I had a fever in 2019" memories. When you're dealing with long-term health tracking, traditional search fails. You don't just need to find a keyword; you need to understand the relationship between a medication you took three years ago and a lab result from last week.
In this tutorial, we are going to solve this by building a Personal Lifelong EHR Analysis System. We will transform "dirty" unstructured medical reports into a structured Knowledge Graph using Neo4j, then leverage GraphRAG (Graph Retrieval-Augmented Generation) to answer complex health queries with every answer traceable back to its source records. 🚀
The Architecture: From Chaos to Context
To build a robust medical knowledge system, we need more than just a vector database. We need to preserve the relational nature of medical data. Here is how the data flows from a messy PDF to a structured response:
```mermaid
graph TD
    A[Unstructured PDF/Images] -->|Unstructured.io| B(Clean Text & Tables)
    B -->|LangChain + LLM| C{Entity & Relation Extraction}
    C -->|Cypher Queries| D[(Neo4j Graph Database)]
    D -->|LlamaIndex GraphStore| E[GraphRAG Engine]
    F[User Query: 'How has my fasting blood sugar trended?'] --> E
    E -->|Contextual Retrieval| G[LLM Final Answer + Source Citation]
```
Prerequisites
Before we dive in, make sure you have the following tools in your kit:
- Neo4j: Our graph database (AuraDB is great for a quick start).
- LangChain: For orchestrating the extraction chain.
- Unstructured.io: For parsing those pesky medical PDFs.
- LlamaIndex: To implement the GraphRAG retrieval logic.
Step 1: Parsing the "Dirty" Data
Medical reports are notorious for having complex layouts—tables, multi-column text, and signatures. We'll use unstructured to handle the heavy lifting.
```python
from unstructured.partition.pdf import partition_pdf

# Extract elements from a medical report PDF.
# "hi_res" runs a layout-detection model (requires the unstructured[pdf] extras)
# and is what allows infer_table_structure to recover table layouts.
elements = partition_pdf(
    filename="report_2023_checkup.pdf",
    infer_table_structure=True,
    strategy="hi_res",
)

# Join every extracted element (narrative text, titles, tables) into one text blob
raw_text = "\n".join([str(el) for el in elements])
print(f"Successfully extracted {len(raw_text)} characters from the report.")
```
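Real reports can run to many thousands of characters, and the LLM extraction in the next step works best on smaller chunks. Here is a minimal sliding-window splitter in plain Python (the 1,000-character window and 100-character overlap are illustrative defaults, not values from any library):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows so entity mentions near
    chunk boundaries are not cut in half."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Example: 2,500 characters split with a 900-character stride
chunks = chunk_text("x" * 2500, chunk_size=1000, overlap=100)
print(len(chunks))  # 3
```

Each chunk can then be wrapped in its own LangChain `Document` before extraction, so the transformer sees coherent, bounded pieces of the report.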
Step 2: Defining the Medical Schema
A Knowledge Graph is only as good as its schema. For EHR data, we want to capture entities like Patient, Condition, Medication, and LabResult.
Using LangChain and an LLM (like GPT-4o), we can extract these nodes and their relationships (e.g., PATIENT -> DIAGNOSED_WITH -> CONDITION).
```python
from langchain_community.graphs import Neo4jGraph
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

# Initialize the Graph Transformer with a constrained schema
llm = ChatOpenAI(temperature=0, model="gpt-4o")
transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Patient", "Condition", "Medication", "LabResult", "Date"],
    allowed_relationships=["HAS_CONDITION", "PRESCRIBED", "RESULTS_IN", "OCCURRED_ON"],
)

# Wrap the raw text from Step 1 as LangChain documents
documents = [Document(page_content=raw_text)]

# Convert text to graph documents
graph_documents = transformer.convert_to_graph_documents(documents)

# Push to Neo4j (reads NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD from the environment)
graph = Neo4jGraph()
graph.add_graph_documents(graph_documents)
```
Step 3: Implementing GraphRAG for Deep Insights
Traditional RAG might find a "Lab Result" chunk, but GraphRAG allows us to traverse the graph. If you ask, "How did my medication change after my 2022 blood work?", the system follows the path: LabResult -> Date -> Condition -> Medication.
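To make that traversal concrete, here is roughly what the multi-hop lookup could look like in Cypher, as a sketch assuming the node labels and relationship types from Step 2. The exact relationship directions and property names depend on what the LLM actually extracted (LLMGraphTransformer stores entity names in an `id` property), so inspect your graph before relying on this; against a live instance you could run it with `graph.query(cypher)`.

```python
# Multi-hop traversal sketch: which medications are connected, via a shared
# condition, to lab results recorded in 2022? Schema assumed from Step 2.
cypher = """
MATCH (lr:LabResult)-[:OCCURRED_ON]->(d:Date),
      (lr)-[:RESULTS_IN]->(c:Condition)<-[:HAS_CONDITION]-(p:Patient),
      (p)-[:PRESCRIBED]->(m:Medication)
WHERE d.id STARTS WITH '2022'
RETURN lr.id AS lab_result, c.id AS condition, m.id AS medication
"""
print(cypher.strip().splitlines()[0])  # first MATCH clause of the traversal
```

This is exactly the kind of query a vector-only RAG system cannot express: it reasons over edges, not over text similarity.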
We use LlamaIndex to create a query engine over our Neo4j instance.
```python
from llama_index.core import PropertyGraphIndex
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

# Link LlamaIndex to our existing Neo4j DB
graph_store = Neo4jPropertyGraphStore(
    username="neo4j",
    password="your_password",
    url="bolt://localhost:7687",
)

# Build an index over the existing graph (reusing the llm from Step 2)
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    llm=llm,
)

query_engine = index.as_query_engine(include_text=True)
response = query_engine.query(
    "Analyze the correlation between my Vitamin D levels and energy complaints."
)
print(response)
```
🥑 Pro-Tip: The "Official" Way to Production
Building a prototype is easy, but handling medical data in production requires strict adherence to data-privacy regulations and more robust entity resolution (ensuring "Vitamin D" and "Vit D3" are mapped to the same node).
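A lightweight first pass at that entity resolution is a normalization step applied to entity names before they are written to the graph. This is a minimal pure-Python sketch; the alias table is invented for illustration, and production systems typically map terms to a coding system such as RxNorm or SNOMED CT instead:

```python
import re

# Hand-curated aliases -> canonical node names (illustrative only)
ALIASES = {
    "vit d": "Vitamin D",
    "vit d3": "Vitamin D",
    "vitamin d3": "Vitamin D",
    "cholecalciferol": "Vitamin D",
}

def normalize_entity(name: str) -> str:
    """Collapse whitespace, lowercase, and map known aliases
    to a single canonical node name."""
    key = re.sub(r"\s+", " ", name.strip().lower()).rstrip(".")
    return ALIASES.get(key, name.strip())

print(normalize_entity("Vit  D3"))    # Vitamin D
print(normalize_entity("Metformin"))  # Metformin (unknown names pass through)
```

Running this over extracted node names before `add_graph_documents` keeps "Vitamin D" and "Vit D3" from becoming two disconnected nodes.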
For more advanced patterns in healthcare AI, complex entity linking strategies, and production-ready RAG architectures, I highly recommend checking out the technical deep dives at WellAlly Blog. It's a goldmine for developers looking to move beyond "Hello World" in the medical AI space.
Why This Matters
By moving from a Vector-only approach to GraphRAG, you gain:
- Explainability: You can literally see the nodes and edges that led to an answer.
- Long-term Memory: The graph naturally links a record from 10 years ago to today if they share the same `Condition` node.
- Data Integrity: No more hallucinating lab values; the LLM reads directly from the structured graph properties.
Conclusion
We’ve just turned a pile of messy PDFs into a high-functioning, structured medical brain. Using Neo4j for storage and LlamaIndex for GraphRAG, you can now query your health history like a pro.
Are you building something in the Medical AI space? Let's chat in the comments! And don't forget to star the repo if this helped you. 🌟
This article was originally published by DEV Community and written by wellallyTech.