We built TripMind: an AI travel planner that generates a full day-by-day itinerary from a destination, budget, and travel style, then immediately scores it with a second AI pass. The whole thing is built with Next.js 15, Supabase, and the Anthropic Claude API.
But the more interesting story is how we built it. We used Claude Code as our primary development tool throughout the project, and we want to share what actually worked, what didn't, and what we'd do differently.
The Core Architecture: One Claude Call, Not Many
The first design decision was how to structure the AI calls. A naive approach would be to make separate calls for the itinerary, the budget breakdown, the food recommendations, and the attractions list. We went a different direction: one Claude call returns everything at once using tool_use.
```typescript
const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 4096,
  messages: [{ role: "user", content: prompt }], // prompt built from destination, budget, style
  tools: [{
    name: "submit_itinerary",
    input_schema: {
      type: "object",
      properties: {
        itinerary: { type: "array", items: { ... } },
        budgetItems: { type: "array", items: { ... } },
        attractionItems: { type: "array", items: { ... } },
        foodItems: { type: "array", items: { ... } },
      },
    },
  }],
  tool_choice: { type: "tool", name: "submit_itinerary" },
});
```
Setting `tool_choice` to `{ type: "tool", name: "submit_itinerary" }` forces Claude to respond with structured JSON via that tool rather than prose. We then parse the result with Zod to validate the shape before touching any of it.
This pattern of tool_use, tool_choice, and Zod gave us reliable structured output without any prompt hacking.
LLM-as-Judge: Evaluating Your Own Output
After generation, a second Claude call scores the itinerary across three dimensions: cost accuracy, diversity, and feasibility. Each dimension gets a score (0–100), a reasoning paragraph, and an overall verdict. We display this as a score card with expandable reasoning.
Why evaluate at all? Because raw generation is easy to make look impressive but hard to trust. By forcing Claude to critique its own output, we surface weaknesses: an itinerary that's beautiful but unrealistic on a $500 budget, or one that crams 12 activities into a single day. The judge scores don't change the output, but they help the user calibrate trust.
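To make that shape concrete, here is a hypothetical helper in the spirit of our score card. The real verdict text comes from Claude itself; the dimension names match ours, but the aggregation and thresholds below are invented for illustration:

```typescript
// One judged dimension: a 0-100 score plus a reasoning paragraph.
type JudgeDimension = { name: string; score: number; reasoning: string };

// Average the per-dimension scores into a single headline number.
function overallScore(dims: JudgeDimension[]): number {
  const total = dims.reduce((sum, d) => sum + d.score, 0);
  return Math.round(total / dims.length);
}

// Map a score to a label the user can calibrate trust against.
// (Thresholds here are made up for the example.)
function verdictLabel(score: number): string {
  if (score >= 80) return "trustworthy";
  if (score >= 60) return "usable with caveats";
  return "needs revision";
}

const dims: JudgeDimension[] = [
  { name: "cost accuracy", score: 72, reasoning: "..." },
  { name: "diversity", score: 88, reasoning: "..." },
  { name: "feasibility", score: 65, reasoning: "..." },
];
const overall = overallScore(dims); // (72 + 88 + 65) / 3 = 75
```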
What We Actually Learned About Claude Code
We used Claude Code for almost every feature. Here's the honest breakdown.
What worked well:
Scaffolding new features from a skill template was fast. We wrote a /add-feature skill that describes the service + API route + Zod pattern for our project. Claude followed it reliably: every new route came out with proper input validation and try/catch error handling.
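Skills are just markdown files Claude reads before acting; ours looked roughly like this (the wording and paths are paraphrased from memory, not the exact file):

```markdown
---
name: add-feature
description: Scaffold a service + API route + Zod schema for a new feature
---

1. Create `src/services/<feature>.ts` with a typed service function.
2. Create `src/app/api/<feature>/route.ts` that parses the request body
   with a Zod schema, wraps the service call in try/catch, and returns
   400 on validation failure and 500 on unexpected errors.
3. Add a test file asserting both error paths before wiring up the UI.
```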
Hooks changed how we worked. A PostToolUse hook runs tsc --noEmit after every file edit. A Stop hook runs the full test suite at end of session. We caught type errors and test failures that would have silently slid into CI. The PreToolUse hook blocking .env edits prevented at least one accidental secret exposure.
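For context, hooks are declared in Claude Code's settings file; ours looked roughly like this (matchers and commands are paraphrased, not our exact config):

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "npx tsc --noEmit" }]
      }
    ],
    "Stop": [
      {
        "hooks": [{ "type": "command", "command": "npm test" }]
      }
    ]
  }
}
```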
The GitHub MCP integration made PR creation seamless. We created both PRs with C.L.E.A.R. checklists and AI disclosure metadata without leaving the terminal.
What didn't work:
Claude Code is sometimes confidently wrong. Early on it suggested generateObject, a Vercel AI SDK function that doesn't exist in the @anthropic-ai/sdk package we actually use. The code compiled, the types passed, and it failed at runtime. We now review every service-layer suggestion against the actual SDK docs before accepting it.
Infrastructure decisions need human ownership. The ESLint flat config error took two hours to debug. Claude proposed three solutions; the first two made it worse. The fix (FlatCompat from @eslint/eslintrc) is the standard Next.js 15 pattern, but Claude didn't know it. For anything involving build tooling or CI config, we treat Claude's suggestions as starting points, not answers.
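For anyone hitting the same wall, the fix looks roughly like this eslint.config.mjs, sketched from the standard Next.js 15 pattern (assumes Node 20.11+ for `import.meta.dirname`):

```js
// eslint.config.mjs — bridge the legacy eslint-config-next shareable
// config into ESLint's flat config format via FlatCompat
import { FlatCompat } from "@eslint/eslintrc";

const compat = new FlatCompat({ baseDirectory: import.meta.dirname });

export default [...compat.extends("next/core-web-vitals")];
```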
TDD With AI Assistance
The most disciplined part of our process was TDD on the judge components. We committed failing tests before writing a single line of implementation:
```
$ git log --oneline
9a7c949 test(judge): add failing tests for JudgeScoreCard   # RED
2ea5197 feat(judge): implement JudgeScoreCard               # GREEN
```
We ran npm test on the failing-test commit to confirm it actually failed before moving to implementation. This sounds obvious but it's easy to skip when you're moving fast. Having the Stop hook force a test run at session end made the red-green cycle feel natural rather than like extra work.
Claude Code was useful here in a specific way: it wrote the test cases faster than we would have by hand, but we reviewed every assertion before committing. The combination of AI speed on the boilerplate and human review on the semantics was genuinely better than either alone.
Parallel Development With Worktrees
Late in Sprint 2 we needed to work on coverage config and sprint documentation simultaneously. Git worktrees let us check out two branches into two different folders and commit independently:
```
git worktree add -b feat/coverage-config ../tripmind-coverage
git worktree add -b feat/sprint-docs ../tripmind-sprints
```
Two terminals, two branches, no stashing, no context switching. The interleaved commits in the git log are real evidence of parallel work. It's a small thing but it made the last week of the project noticeably less chaotic.
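The cleanup side is symmetric: remove the worktree, then delete its branch. A self-contained demo in a throwaway repo (paths and branch names here are invented):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git -c user.email=demo@example.com -c user.name=demo \
  commit -q --allow-empty -m "init"

# check a second branch out into a sibling directory
git worktree add -q -b feat/demo ../repo-demo
before=$(git worktree list | wc -l)   # main checkout + 1 worktree

# once the branch is merged, tidy up
git worktree remove ../repo-demo
git branch -q -D feat/demo
after=$(git worktree list | wc -l)    # back to just the main checkout
```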
Security Without Slowing Down
Five of the eight security pipeline gates are active:
- A PreToolUse hook blocks `.env` file edits in Claude Code sessions
- Gitleaks scans the full git history for leaked secrets on every CI run
- `npm audit --audit-level=high` catches vulnerable dependencies
- ESLint runs as SAST on every PR
- `claude-code-action` posts an AI PR review with security acceptance criteria on every PR
The security-reviewer sub-agent in .claude/agents/ checks every API route for Zod validation, RLS bypass, and secret exposure before we open a PR. It caught a missing 401 check on the trips endpoint before it hit code review.
What We'd Do Differently
Write acceptance criteria before starting features. We opened GitHub Issues retroactively. Writing testable specs first would have made the TDD workflow feel more natural and caught scope creep earlier.
Set up CI on day one. We added the CI pipeline in Sprint 2. Discovering that eslint-config-next doesn't support flat config natively would have been less painful two weeks earlier.
Trust Claude Code for structure, verify Claude Code for semantics. It's very good at producing code that looks right. It's less reliable at producing code that is right for your specific SDK, toolchain, or architecture. The skill logs we kept (docs/skills/task1-log.md) were worth every minute: they made v2 of the skill genuinely better than v1.
TripMind is live at https://ai-travel-planner-wkm7.vercel.app/. The full source is at github.com/arinaa77/ai-travel-planner.
This article was originally published by DEV Community and written by Kelson Qu.