Personal Agent Case Study

What I learned building a personal AI agent

My AI journey at work didn't really take off until I connected the models to my own context and data. The personal AI agent was the project where I pushed that idea as far as it would go. The test I set for myself: could I securely connect every part of my digital life and build a real version of Jarvis from Iron Man?

This isn't a pure success story. Some of it went better than I expected, some of it is still half-built, and a couple of the questions it raised I still can't answer.


Business translation

Plenty of businesses are asking a version of the same question. If connecting AI to a couple of tools is already producing results, what happens when you connect it to everything?

The Plan

The setup was four steps. The fourth one ended up being pretty hard.

Design a secure architecture that could hold all of it.

Connect every part of my digital life: Gmail, notes and PDFs, wearable health data, calendar, spending, browser and YouTube history, and location.

Use that data to build a profile of me — my preferences, my routines, what I believe, and how all of it shifts over time.

Use that profile as context to give me answers that were actually personalized.

Building the profile was the part I cared about most. One fact pulled from one source at a time wasn't interesting to me. What I was after was an agent that could take everything it knew about me and assemble the bigger picture, the way someone who knew me well would.

The Interesting Part

I didn't want a lookup tool. I wanted something that could read the situation.

No single source tells the whole story. My calendar might have the event on it, but it takes my texts to know who's actually coming, and only my location data knows where I really am right now. Any one of those is a thin slice on its own. The interesting part was getting the agent to assemble them into one picture without me having to spell it out.

A sample use case

There's a “Breckenridge trip” on my calendar this weekend.
My profile says I'm married, no kids.
I have a text thread with Jim about the trip. He's bringing his wife and kids.
My location data says I'm now in Breckenridge.

I ask the agent for “a good place for dinner tonight.”

To answer well, it has to work out on its own: Chris is in Breckenridge with his wife and Jim's family, six people total, and he'll want somewhere close to the hotel that takes a reservation and is fine with kids.

Business translation

Pulling one coherent picture out of sources that live in completely different formats is something AI can do now that was close to impossible before. If you can make it reliable, the payoff is large.

Here's what surprised me: building that bigger picture was easier than I expected. The hard part was getting the agent to actually use it, and to do it fast enough to be worth using. That's what pushed me into the architecture, and then into memory.

What I Built

Memory and retrieval were the main event. But I had to get the architecture right first.

The architecture took a while to settle. What I landed on is modular, which is the part that mattered: I can open up any one piece and fix or improve it without disturbing the rest.

Portal surface

The web and mobile portal is the front door: chat, quick actions, dashboards, diagnostics, and a view into what the agent is doing. The cockpit.

Conversation + Codex engine

Where a chat turn becomes real work: thread management, streaming, and the Codex calls that actually run the agent.

Memory + retrieval

Canonical memory, temporal context, source routing, a Qdrant vector store, and human-readable memory views. Durable memory plus semantic recall.

Background tasks + jobs

The agent isn't only reactive. It runs scheduled work and longer user-triggered tasks on its own.

Integration + ingest

Gmail, Apple, Oura, Monarch, Withings, Home Assistant, finance, browser state, and local exports. The agent's senses.

Ops + telemetry

Run records, task and conversation links, token and resource metrics, alerts, and logs. How I see what happened and what needs attention.

The Heart Of It

Good memory turned out to be four different stores, each with one job.

Memory was the most fascinating part of this project, and the part where my data background finally lined up with what I was building. Performance matters for any retrieval system, and knowing the trade-offs between storage technologies shaped the whole design. From there I refined it the way you refine any data system: throw real use cases at it and watch where it breaks.

The job of memory is continuity. It's the difference between a chatbot that forgets you the moment you close the tab and something that behaves like it actually knows you. The mistake would be to treat that as “just a vector database.” What I ended up with is four stores, and the most important distinction is between the one that holds what's true and the ones that help it remember.

.memory.db: Canonical memory

Durable facts, preferences, standing instructions, relationships, open loops, and the lessons it's learned — each with evidence behind it and a confidence level attached.

When it's used. Read when the agent assembles context for a turn. This is the layer it's allowed to treat as true.
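To make "evidence behind it and a confidence level attached" concrete, here is a minimal sketch of what one canonical record could look like. It is an illustration, not the actual schema: the class names, the 0-to-1 confidence scale, the 0.8 threshold, and the sample evidence (including its date) are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Evidence:
    source: str        # illustrative label, e.g. "chat" or "gmail"
    excerpt: str       # the snippet that supports the statement
    observed_on: date


@dataclass
class MemoryRecord:
    kind: str          # "fact", "preference", "instruction", ...
    statement: str
    confidence: float  # hypothetical scale from 0.0 to 1.0
    evidence: list[Evidence] = field(default_factory=list)

    def is_trusted(self, threshold: float = 0.8) -> bool:
        # Only treat a record as true if it has evidence AND enough confidence.
        return bool(self.evidence) and self.confidence >= threshold


# Made-up example record; the excerpt and date are placeholders.
record = MemoryRecord(
    kind="preference",
    statement="Prefers concise, direct summaries",
    confidence=0.9,
    evidence=[Evidence("chat", "Keep it short next time.", date(2025, 1, 1))],
)
unsupported = MemoryRecord(kind="fact", statement="No evidence yet", confidence=0.95)
```

The design choice worth noticing is that high confidence alone is not enough: a record with no evidence attached never counts as trusted in this sketch.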

.chat.db / .ops.db: Operational state

Conversations, message history, background tasks, job runs, timing, token usage, and the links between them.

When it's used. To resume a conversation, track a task, or answer “what ran, and did it work?”

Qdrant: Semantic recall

Vectorized personal data — email, calendar, messages, notes, finance, health, documents, locations, recaps — plus searchable projections of canonical memory.

When it's used. When exact stored facts aren't enough and the agent needs to search by meaning.

Obsidian / Markdown: Human-readable

Generated memory files, profile notes, daily recaps, the nightly “dream” logs, and archived chats.

When it's used. Mostly so I can inspect what the agent thinks it knows. It also feeds the nightly consolidation.

Qdrant is recall. .memory.db is truth. Keeping those two ideas separate was the most important design decision I made.

Two pieces tie it together. A routing step reads every question and decides which store should answer it; recalling a preference takes a different path than checking where I am right now. The other piece runs while I'm asleep.
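The routing idea can be sketched in a few lines. This is a deliberately naive keyword version, assuming hypothetical cue lists and a `LIVE_STATE` destination that are invented for illustration; a real router would more likely use a model to classify the question.

```python
from enum import Enum


class Store(Enum):
    CANONICAL = "memory.db"
    OPERATIONAL = "ops.db"
    SEMANTIC = "qdrant"
    LIVE_STATE = "live"   # hypothetical "where am I right now" path


# Hypothetical cue phrases, chosen only for this example.
SITUATION_CUES = ("right now", "tonight", "my hotel", "near me")
OPS_CUES = ("what ran", "did it work", "task status")
PREFERENCE_CUES = ("do i like", "do i prefer", "my favorite")


def route(question: str) -> Store:
    q = question.lower()
    # Current-situation phrases are checked first because they change
    # what every other part of the question means.
    if any(cue in q for cue in SITUATION_CUES):
        return Store.LIVE_STATE
    if any(cue in q for cue in OPS_CUES):
        return Store.OPERATIONAL
    if any(cue in q for cue in PREFERENCE_CUES):
        return Store.CANONICAL
    # Nothing matched: fall back to searching by meaning.
    return Store.SEMANTIC
```

Even in this toy form, the order of the checks is a design decision: putting the situation check first is what sends "near my hotel" and "tonight" to a current-state lookup before anything else.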

One question, start to finish

Here's a real example, run against my actual data. I ask: “What's a good steak restaurant near my hotel for tonight?” Nothing in that sentence says where I am or where I'm staying. The agent has to work it out.

01 Classify the ask

Not a memory lookup. This is a current-state question about places, so it needs to know where I am right now before anything else.

02 Resolve “my hotel”

Temporal memory points to an active trip: a Cabo birthday trip, May 15–20, staying at the Hyatt Vacation Club at Sirena del Mar. “My hotel” becomes a real address.

03 Pull what's already been said

Semantic recall across messages and recaps surfaces a steakhouse a friend had already recommended for the trip, Churrasco Argentino, plus an on-property buffet that isn't really steak.

04 Check the live world

Restaurant hours change, so it verifies against current public data: open daily 5–11 PM, a short drive from the hotel.

05 Answer

Churrasco Argentino tonight, with the on-site restaurant as the no-logistics fallback. Short, specific, and grounded in where I actually am.

The point isn't the restaurant. It's that the agent resolved “my hotel” and “tonight” from memory before it ever searched, instead of stopping to ask me what I meant.
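Resolving those references before searching can be sketched like this. It is a simplified illustration, assuming a hypothetical `Trip` record and plain string substitution; the year in the usage example is a placeholder, since only the May 15–20 range is known.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Trip:
    name: str
    start: date
    end: date
    lodging: str


def active_trip(trips: list[Trip], today: date) -> Optional[Trip]:
    # A trip is active when today falls inside its date range (inclusive).
    for trip in trips:
        if trip.start <= today <= trip.end:
            return trip
    return None


def resolve_references(question: str, trips: list[Trip], today: date) -> str:
    # Pin "tonight" to a concrete date, then swap "my hotel" for the
    # lodging of whichever trip is active. With no active trip, the
    # phrase is left alone so the agent knows it still has to ask.
    resolved = question.replace("tonight", f"tonight ({today.isoformat()})")
    trip = active_trip(trips, today)
    if trip is not None:
        resolved = resolved.replace("my hotel", trip.lodging)
    return resolved


# Placeholder year; the dates and lodging come from the example above.
trips = [Trip("Cabo birthday trip", date(2025, 5, 15), date(2025, 5, 20),
              "Hyatt Vacation Club at Sirena del Mar")]
during = resolve_references("steak restaurant near my hotel for tonight",
                            trips, date(2025, 5, 17))
after = resolve_references("steak restaurant near my hotel for tonight",
                           trips, date(2025, 6, 1))
```

During the trip the question comes back with a real lodging name in it; after the trip, "my hotel" is left unresolved rather than guessed.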

The Dream

The most human thing it does is forget.

Every night, once the day's data has landed, the agent runs a pass I call the dream. It goes back over the day — new messages, recaps, feedback, whatever the integrations pulled in — and sorts out what's worth keeping. The useful facts and lessons get promoted into canonical memory, the loose signals get clustered into a read on what's currently going on in my life, and anything that doesn't earn its place starts to fade.

The fading is the piece I keep thinking about. Not every memory should matter forever. Where I live or who's in my family should stay strong. A weekend trip or a one-off craving shouldn't. So memories carry a strength: the dream reinforces the ones that keep proving useful and lets the rest lose weight over time, until the agent has something like an attention span.

Durable: My vault lives at ~/Obsidian/ClawCode Vault
Durable: Prefers concise, direct summaries
Active: Cabo birthday trip, May 15–20
Fading: Wanted Thai takeout one night back in March

Fading isn't deleting. Durable facts hold their strength, the active trip stays prominent for now, and the March craving has nearly receded. None of it is gone, though: the underlying evidence stays on record, and a faded memory comes back into context the moment it's relevant again.
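One way to picture the strength mechanic is exponential decay with reinforcement. This is a sketch under my own assumptions, not the formula the dream actually uses: the half-life idea, the 0.2 boost, and the field names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    strength: float        # hypothetical scale from 0.0 to 1.0
    half_life_days: float  # durable facts would get a very long half-life


def decay(memory: Memory, days_elapsed: float) -> float:
    # Strength halves every half_life_days.
    return memory.strength * 0.5 ** (days_elapsed / memory.half_life_days)


def reinforce(memory: Memory, boost: float = 0.2) -> float:
    # A memory that proved useful moves part of the way back toward 1.0.
    return min(1.0, memory.strength + boost * (1.0 - memory.strength))


def nightly_pass(memories: list[Memory], used_today: set[str]) -> list[Memory]:
    updated = []
    for m in memories:
        if m.text in used_today:
            new_strength = reinforce(m)
        else:
            new_strength = decay(m, days_elapsed=1.0)
        # Nothing is deleted: a weak memory is kept, just ranked lower.
        updated.append(Memory(m.text, new_strength, m.half_life_days))
    return updated
```

The list that comes out of `nightly_pass` is always the same length as the list that went in, which is the "fading isn't deleting" rule expressed in code.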

Memory that only piles up becomes clutter. The real design problem was teaching it what to let go of.

Still Hard

It works. I wouldn't trust it at scale yet.

For every clean example like the steakhouse, there's one like this. A few days earlier I asked where to walk near our hotel. The agent gave me a generic Cabo answer, then asked me which hotel I was staying at, even though it had that exact fact in memory. The data was there. The judgment about when to use it wasn't.

That gap is where the real lessons are, and most of them are the ones a business would hit too.

Reading “my hotel” and “tonight”

Phrases that point at the current situation should trigger a situation lookup first. When the agent nails this, it feels personal. When it misses, it feels dumb.

It still hunts

Performance isn't where I want it. The agent often queries source after source before it lands on the answer, when a compact “here's what's going on right now” packet should have answered it on the first step.
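A "what's going on right now" packet could be as small as the sketch below. The fields, the 30-minute freshness window, and the JSON serialization are assumptions chosen for illustration; the idea is only that it gets built once and reused instead of re-querying every source per question.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SituationPacket:
    built_at: datetime
    location: Optional[str]
    active_trip: Optional[str]
    lodging: Optional[str]
    companions: list[str]

    def is_fresh(self, now: datetime,
                 max_age: timedelta = timedelta(minutes=30)) -> bool:
        # A stale packet should be rebuilt rather than trusted.
        return now - self.built_at <= max_age

    def to_context(self) -> str:
        # Serialize to a compact string that can be prepended to a prompt.
        data = asdict(self)
        data["built_at"] = self.built_at.isoformat()
        return json.dumps(data, sort_keys=True)


# Placeholder timestamp; trip details reuse the Cabo example.
built = datetime(2025, 5, 17, 18, 0)
packet = SituationPacket(
    built_at=built,
    location="Cabo",
    active_trip="Cabo birthday trip",
    lodging="Hyatt Vacation Club at Sirena del Mar",
    companions=[],
)
```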

Routing sends it down the wrong hallway

Sometimes the right memory exists but the question gets classified wrong, so the agent never looks in the right place. The gatekeeper is the weak link more often than the data.

Measuring accuracy at scale

I can grade every answer because it's all my own data. A business assistant serving many people can't lean on that. You'd need golden test questions, source-grounded grading, and a real way to ask “did it retrieve the right evidence?” rather than “did the answer sound good?”
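Grading retrieval rather than prose can start very simply. The sketch below assumes each golden question lists the evidence IDs a correct answer must rest on; the ID format and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    question: str
    expected_evidence_ids: set[str]  # records a correct answer must use


def grade_retrieval(case: GoldenCase, retrieved_ids: list[str]) -> dict[str, float]:
    retrieved = set(retrieved_ids)
    hits = case.expected_evidence_ids & retrieved
    # Recall: how much of the required evidence was found.
    recall = (len(hits) / len(case.expected_evidence_ids)
              if case.expected_evidence_ids else 1.0)
    # Precision: how much of what was pulled was actually needed.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return {"recall": recall, "precision": precision}


# Made-up IDs for illustration.
case = GoldenCase("Which hotel am I staying at?", {"trip:cabo", "lodging:hyatt"})
scores = grade_retrieval(case, ["trip:cabo", "email:123"])
```

A fluent answer built on the wrong evidence would still score poorly here, which is the point of asking "did it retrieve the right evidence?" first.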

An audit trail you can replay

When something goes wrong, I want to see exactly why: what intent it detected, which stores it checked, what it skipped, and whether it leaned on stale data. Today I can mostly reconstruct that by hand. It needs to be built in.
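A built-in trail could be one structured record per turn, written as a JSON line so it can be replayed later. The field names below mirror the four things listed above but are otherwise my own invention, as is the sample data.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TurnTrace:
    question: str
    detected_intent: str
    stores_checked: list[str] = field(default_factory=list)
    stores_skipped: list[str] = field(default_factory=list)
    stale_sources: list[str] = field(default_factory=list)

    def to_json_line(self) -> str:
        # One line per turn keeps the log append-only and easy to grep.
        return json.dumps(asdict(self), sort_keys=True)


def replay(line: str) -> TurnTrace:
    # Rebuild the trace exactly as it was recorded.
    return TurnTrace(**json.loads(line))


# Illustrative trace of the "where to walk" miss described earlier.
trace = TurnTrace(
    question="Where should we walk near our hotel?",
    detected_intent="generic_place_search",
    stores_checked=["qdrant"],
    stores_skipped=["memory.db"],
)
restored = replay(trace.to_json_line())
```

With a record like this, the hotel miss would show up directly as a skipped store instead of something to reconstruct by hand.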

The fine line between security and functionality

The portal intentionally includes guardrails that keep it from doing whatever it wants. But those same guardrails can also stop it from fully completing a legitimate request. Getting that balance right is an open challenge I keep having to come back to.

Business translation

This is the part most AI projects hit after the demo works. Proving something can work is one thing. Being able to operate it, inspect it, and trust it next month is the harder, and more valuable, problem.

I built the raw memory muscles, and they're real. The work that's left — making retrieval fast, auditable, measurable, and reliable outside my own feedback loop — is exactly the bridge from a personal experiment to something a business could actually depend on. That's the part I find most interesting, and it's the work I want to do next.

Where to go from here

A few good next stops.


The thesis behind the build.

Where I can help.

Want to talk?