Technology Apr 17, 2026 · 4 min read

Build a RAG Chatbot Without Pinecone: pgvector + Next.js in Under 100 Lines

DEV Community
by Atlas Whoff

Everyone reaches for Pinecone or Weaviate when they need a vector database. For most production workloads at under a million documents, you already have everything you need: Postgres + pgvector.

Here's the full RAG setup in Next.js with zero new infrastructure.

What pgvector Gives You

pgvector adds vector storage and similarity search to Postgres. After a one-line extension install, you can:

  • Store embeddings as vector columns
  • Run approximate nearest-neighbor search with <-> (L2), <#> (inner product), or <=> (cosine) operators
  • Speed up search at scale with pgvector's IVFFlat and HNSW index types

No new service, no new bill, no data leaving your existing infrastructure.
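To make the cosine operator concrete: `<=>` returns cosine *distance*, i.e. 1 minus cosine similarity, which is why the retrieval query later computes `1 - (embedding <=> query)` to get a similarity score. A small illustrative helper (not part of pgvector, just the same math in TypeScript):

```typescript
// Computes the same value pgvector's <=> operator returns:
// cosine distance = 1 - cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

Identical vectors get distance 0; orthogonal vectors get distance 1. Ordering by `<=>` ascending therefore ranks most-similar first.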

Setup

Enable pgvector on Supabase (or any Postgres):

create extension if not exists vector;

Create the documents table:

create table documents (
  id uuid primary key default gen_random_uuid(),
  content text not null,
  embedding vector(1536),
  metadata jsonb default '{}',
  created_at timestamptz default now()
);

-- HNSW index for fast approximate search
create index on documents using hnsw (embedding vector_cosine_ops)
  with (m = 16, ef_construction = 64);
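If recall matters more than raw latency, pgvector also lets you widen the HNSW search beam per session (a tuning sketch; the default `hnsw.ef_search` is 40, and higher values trade speed for recall):

```sql
set hnsw.ef_search = 100;
```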

Ingestion Pipeline

// lib/rag/ingest.ts
import { db } from '@/db'
import { documents } from '@/db/schema'

export async function embed(text: string): Promise<number[]> {
  const response = await fetch('https://api.voyageai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.VOYAGE_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ input: text, model: 'voyage-3' }),
  })
  if (!response.ok) {
    throw new Error(`Voyage embeddings request failed: ${response.status}`)
  }
  const data = await response.json()
  return data.data[0].embedding
}

export async function ingestDocument(content: string, metadata = {}) {
  const embedding = await embed(content)
  await db.insert(documents).values({ content, embedding, metadata })
}

export async function ingestBatch(docs: { content: string; metadata?: object }[]) {
  for (const doc of docs) {
    await ingestDocument(doc.content, doc.metadata)
    await new Promise(r => setTimeout(r, 100)) // crude pacing to stay under embedding-API rate limits
  }
}
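The pipeline above embeds each document whole; long documents usually need chunking first so each embedding covers a focused span of text. A minimal sketch (the chunk size, overlap, and `chunkText` name are illustrative, not part of the article's module):

```typescript
// Hypothetical chunker: splits text into overlapping fixed-size windows
// before embedding. Overlap keeps context that straddles a boundary.
export function chunkText(text: string, size = 800, overlap = 100): string[] {
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size))
    if (start + size >= text.length) break // last window reached the end
  }
  return chunks
}
```

Each chunk would then go through `ingestDocument` individually, with shared metadata tying the chunks back to their source document.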

Retrieval

// lib/rag/retrieve.ts
import { sql } from 'drizzle-orm'
import { db } from '@/db'
import { embed } from './ingest'

export async function retrieveContext(
  query: string,
  limit = 5,
  threshold = 0.7
): Promise<string[]> {
  const queryEmbedding = await embed(query)
  const embeddingStr = `[${queryEmbedding.join(',')}]`

  const results = await db.execute(sql`
    select content, 1 - (embedding <=> ${embeddingStr}::vector) as similarity
    from documents
    where 1 - (embedding <=> ${embeddingStr}::vector) > ${threshold}
    order by embedding <=> ${embeddingStr}::vector
    limit ${limit}
  `)

  return results.rows.map((r: any) => r.content)
}

The RAG API Route

// app/api/chat/route.ts
import Anthropic from '@anthropic-ai/sdk'
import { retrieveContext } from '@/lib/rag/retrieve'

const client = new Anthropic()

export async function POST(req: Request) {
  const { messages } = await req.json()
  const lastMessage = messages[messages.length - 1]

  const context = await retrieveContext(lastMessage.content)

  const systemPrompt = context.length > 0
    ? `You are a helpful assistant. Use the following context to answer questions accurately.\n\n<context>\n${context.join('\n\n---\n\n')}\n</context>\n\nIf the context doesn't contain the answer, say so clearly.`
    : 'You are a helpful assistant.'

  const stream = client.messages.stream({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: systemPrompt,
    messages,
  })

  const encoder = new TextEncoder()
  return new Response(
    new ReadableStream({
      async start(controller) {
        for await (const chunk of stream) {
          if (chunk.type === 'content_block_delta' && chunk.delta.type === 'text_delta') {
            controller.enqueue(encoder.encode(chunk.delta.text))
          }
        }
        controller.close()
      },
    }),
    { headers: { 'Content-Type': 'text/plain; charset=utf-8' } }
  )
}
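On the client, the plain-text stream can be consumed incrementally with a stream reader. A minimal sketch (the `streamChat` helper and `onToken` callback are assumptions, not part of the kit; it assumes the route above is mounted at `/api/chat`):

```typescript
// Hypothetical client helper: POSTs the conversation and reads the
// plain-text response stream chunk by chunk, invoking onToken per chunk.
export async function streamChat(
  messages: { role: string; content: string }[],
  onToken: (text: string) => void
): Promise<string> {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
  })
  if (!res.ok || !res.body) {
    throw new Error(`chat request failed: ${res.status}`)
  }
  const reader = res.body.getReader()
  const decoder = new TextDecoder()
  let full = ''
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    const text = decoder.decode(value, { stream: true })
    full += text
    onToken(text) // e.g. append to React state for a typing effect
  }
  return full
}
```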

Performance at Scale

pgvector's HNSW index keeps retrieval in the single-digit to low-teens millisecond range up to roughly 1M vectors on standard Postgres hardware.

| Metric                   | pgvector (HNSW)   | Pinecone             |
|--------------------------|-------------------|----------------------|
| Query latency (10K docs) | 2-5ms             | 20-50ms              |
| Query latency (1M docs)  | 5-15ms            | 15-40ms              |
| Setup complexity         | Low (extension)   | Medium (new service) |
| Cost (10K docs/day)      | ~$0 (existing DB) | ~$70/mo              |
| Exact search available   | Yes               | No (ANN only)        |
| SQL joins on results     | Yes               | No                   |
The SQL joins point is underrated. You can filter by metadata, join to users, restrict by permissions — all in one query. Pinecone requires a round-trip to Postgres after retrieval to do any of that.
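A sketch of that one-query pattern (the users table, the metadata keys, and the $1 query-embedding parameter are illustrative, not from the article's schema):

```sql
select d.content, u.email,
       1 - (d.embedding <=> $1::vector) as similarity
from documents d
join users u on u.id = (d.metadata->>'owner_id')::uuid
where d.metadata->>'source' = 'handbook'
  and u.org_id = $2
order by d.embedding <=> $1::vector
limit 5;
```

Permission filtering, metadata filtering, and similarity ranking happen in a single round-trip to the database.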

When to Move to a Dedicated Vector DB

Switch when:

  • You're above 5M vectors and seeing latency >50ms
  • You need multi-modal embeddings (image + text in same index)
  • Your embedding dimensions exceed 2,000 (the limit for pgvector's HNSW and IVFFlat indexes)
  • You need real-time index updates at >1K writes/second

For 95% of SaaS products, you won't hit any of these before you're well-funded enough to care.

What I'm Building This Into

The AI SaaS Starter Kit includes a pre-built RAG module: pgvector schema, Drizzle queries, ingest pipeline, and the streaming chat route above.

If you want the full production-ready stack without starting from scratch:

  • AI SaaS Starter Kit — $99 one-time — includes RAG module, Claude API integration, auth, Stripe billing, and deployment config
  • Workflow Automator MCP — $15/mo — gives Claude access to your document store directly from the IDE

whoffagents.com
