April 2026

I built a personal AI assistant that lives in my SMS inbox

No app. No login. Just a phone number that thinks with me — and a few opinions about why that interface matters.

Natan Yagudayev

Engineering Manager – New York, NY

I already have an iMessage thread with myself titled "todo."

It is where most of my errands go to die.

Random thoughts at midnight. Reminders I swear I will come back to. Things my wife asks me to remember. A gym I was thinking of joining. Some code idea I had while walking around. A journal entry I wanted to write but never actually did.

I have tried the usual stuff. Notes apps. Reminder apps. Todo apps. Journaling apps. They all work in theory, but the problem is always the same: I have to open another app.

I did not want another app.

I wanted a phone number.

Something I could text the same way I text myself or other people.

The idea first clicked for me in Las Vegas during AWS re:Invent. I remember thinking: what if I could just text my assistant? Not open a dashboard. Not talk to some voice bot. Just send a message and trust that it would remember, remind me, or find the thing later.

I actually had a rough concept working back then, but it was not really there yet. It did not use proper tool calling. It did not have embeddings. It was more of a clever prototype than something I could actually rely on.

This time, I wanted to do it properly.

There was also a quieter reason. I really wanted to explore embeddings and build my own vector store from scratch. I have spent years around teams shipping search and AI at scale, reading the papers, listening to people I respect talk about how their team built something amazing. That is not the same as doing it. It is one thing to nod along in a design review. It is another to sit at your kitchen table on a Saturday, pick a model, write the schema, tune the index, and feel for yourself why a cosine score of 0.82 is a hit and 0.34 is noise. That gap is what pushed me to actually build this.

Structured tools. Semantic memory. Real reminders. A real schema. A system where the LLM helps, but does not own the truth.

So now I can text it things like:

  • "remind me to call mom tomorrow at 10"
  • "the garage keypad code is saved under home stuff"
  • "journal: today was rough but the demo landed in the end"
  • "what was that gym I was thinking of joining?"

And it texts back.

A real conversation with my assistant. Memory recall, a reminder being created, and the same todo getting marked complete the next day — all from one SMS thread.

It captures the todo. It saves the memory. It logs the journal entry with an inferred mood. It searches across things I have told it before. When a reminder fires, I can text back "done" or "blocked," and the Kanban board updates itself.

This is not a startup pitch or a polished SaaS launch. It is a personal tool I built around the place I already leave breadcrumbs: my SMS inbox.

This post is about how I built it, what broke along the way, and what I learned about building with tool-calling LLMs.

Why SMS?

Because I can just text it.

That sounds almost too simple, but that is the whole point.

I did not want to think about where the tool lives. I did not want to open a web app. I did not want to log in. I did not want to wait for a page to load just to save a thought that took two seconds to have.

SMS is already on my phone. It works when my data is spotty. It works when Wi-Fi is bad. It works when I am walking around, sitting in the car, in the middle of a conversation, or half-asleep at night.

It is the lowest-effort input I already use every day.

Voice was a stretch goal at first, but honestly, I realized pretty quickly that I would never call my own assistant. I do not want to have a conversation with it. I want to send a quick message and move on.

The constraints I started with

A few constraints shaped the whole build:

  • Single-user. This is for me. No multi-tenancy. No onboarding. No real auth flow beyond my own allowlists.
  • Backed by Postgres. I want to own the data. I do not want my memories, todos, and journal entries locked inside some proprietary AI memory layer.
  • Real reminders, not AI promises. The model can decide when a reminder should be created, but the reminder itself has to be deterministic. If it says I will get a text Tuesday at 9 AM, I need to actually get that text Tuesday at 9 AM.
  • The LLM is not the database. The model can help interpret what I mean, but Postgres is the source of truth.

I gave myself a weekend.

It took a weekend to get the first version working, and then a few months of small fixes to make it something I actually trust.

"You work at Algolia. Why didn't you use Algolia?"

Fair question. I get it a lot.

Short answer: Algolia is absolutely capable of powering this — and at production scale, I would trust it first.

If you care about search quality, ranking, relevance, semantic retrieval, and not babysitting infra, Algolia is world-class at all of it. It is managed, battle-tested, and built for the exact problems teams eventually run into. Agent Studio pushes that even further.

So this was not a "could Algolia do this?" decision. It can.

This project is single-user, low-volume, and personal. For this build, I intentionally used Postgres + pgvector because I wanted to experience the constraints a person has when they are not using Algolia out of the box.

I wanted to see the rough edges directly:

  • relational data for todos, reminders, and conversations
  • where ranking starts to drift
  • how semantic recall behaves with noisy input
  • what index and threshold tuning actually feels like
  • which operational problems show up first

Part of me wanted to understand what a non-Algolia user would run into. Another part of me wanted the challenge for myself.

Building it this way made the tradeoffs real in my hands. But the punchline is still the same: Algolia can do all of this — exceptionally well — and it removes a lot of complexity the moment you care about scale, consistency, and relevance as a product surface.

The architecture

The flow is pretty simple.

Twilio receives an SMS and calls a Supabase Edge Function. The function checks that the message came from my phone number, opens or resumes a 30-minute conversation window, loads recent message history, and calls OpenAI's Responses API with a set of tools.

Those tools include things like create_todo, update_todo, bulk_create_todos, create_memory, recall_memories, find_similar_todos, OpenAI's hosted web_search, and a few others.

The model reads the message, decides what needs to happen, calls tools when needed, and returns a reply. The Edge Function sends that reply back through Twilio.

The data lives in Postgres. Todos and memories get embedded into pgvector columns at write time, so semantic search works without introducing a separate vector database. A pg_cron job runs every minute, checks the reminder queue, and sends SMS reminders for anything due.

There is also a React dashboard for the times I actually want to look at everything directly.

Phone (SMS)
   ↓
Twilio number  ──webhook──▶  personal-assistant-sms (Edge Function)
                                 │
                                 ├──▶ OpenAI Responses API
                                 │     (web_search + function tools)
                                 │
                                 └──▶ Postgres + pgvector
                                       (todos, memories, conversations, queue)

pg_cron (every minute) ──▶ personal-assistant-reminders ──▶ Twilio ──▶ Phone

Dashboard ──JWT──▶ personal-assistant-api ──▶ Postgres

That is basically the whole thing.

The important part is not that an LLM can create todos. The important part is that the LLM is only allowed to act through tools backed by a schema I trust.

The schema is the product

This was probably the biggest lesson.

Most of the product is not the prompt. It is not the model. It is not even the SMS interface.

It is the schema.

Once the schema felt right, the tool-calling layer became much easier to reason about.

personal_assistant_todos

Todos have the obvious fields: title, notes, due date, reminder date, priority, and status.

The status is Kanban-shaped: pending, in_progress, blocked, done, and cancelled. Those values map directly to the columns in my dashboard.

Todos can also have a parent_id, which lets a todo become a project.

For example, "Move apartments" could have subtasks like "buy boxes," "book movers," and "change address." Each child can have its own status, due date, and reminder.
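As a rough sketch of that shape (not the actual table definition), the row and the parent/child relationship could be modeled like this. The field names `due_at` and the helper names `subtasksOf` and `progress` are assumptions made for illustration; only the status values, `parent_id`, and `reminder_at` come from this post.

```typescript
// Hypothetical row shape. Column names other than parent_id and reminder_at are assumptions.
type TodoStatus = "pending" | "in_progress" | "blocked" | "done" | "cancelled";

interface Todo {
  id: string;
  parent_id: string | null; // null for a top-level todo or project
  title: string;
  status: TodoStatus;
  due_at: string | null; // ISO 8601 timestamp
  reminder_at: string | null; // ISO 8601 timestamp
}

// Collect the direct subtasks of a project.
function subtasksOf(todos: Todo[], parentId: string): Todo[] {
  return todos.filter((t) => t.parent_id === parentId);
}

// Summarize child progress without touching the parent's own status.
function progress(todos: Todo[], parentId: string): { done: number; total: number } {
  const children = subtasksOf(todos, parentId);
  return {
    done: children.filter((t) => t.status === "done").length,
    total: children.length,
  };
}
```

Because the status is a closed union, a tool call that tries to write any other value can be rejected before it reaches the table.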

Every todo also gets an embedding, which means I can search by meaning later instead of remembering the exact words I used.

personal_assistant_memories

Memories are things I want to save, but not necessarily do.

They have content, an optional title, a kind, tags, mood metadata, and an embedding.

The kind can be fact, note, or journal.

A fact might be something short, like a code, a name, or a detail I know I will forget. A note is more like a thought or longer-form idea. A journal entry is a dated reflection with an inferred mood label and score.

Same table. Different rendering.

personal_assistant_reminder_queue

This is a row per SMS reminder.

The todo says, "I want to be reminded at this time." The queue is the thing the system actually sends.

That distinction matters. The reminder queue is derived state. The cron job should not ask the model what to do. It should just look for due reminders and send them.
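A minimal sketch of that split, assuming hypothetical column names (`send_at`, `sent_at`) and helper names (`selectDue`, `rebuildQueue`). The real job runs against Postgres; this in-memory version only shows the logic: picking due rows needs no model, and the unsent part of the queue can be recomputed from the todos.

```typescript
// Hypothetical queue row. Column names are assumptions.
interface QueuedReminder {
  id: string;
  todo_id: string;
  send_at: string; // ISO 8601 timestamp
  sent_at: string | null; // null until the SMS has gone out
}

// Pick every unsent reminder whose time has come, oldest first.
function selectDue(queue: QueuedReminder[], now: Date): QueuedReminder[] {
  return queue
    .filter((r) => r.sent_at === null && Date.parse(r.send_at) <= now.getTime())
    .sort((a, b) => Date.parse(a.send_at) - Date.parse(b.send_at));
}

interface TodoWithReminder {
  id: string;
  status: string;
  reminder_at: string | null;
}

// Derived state: unsent queue rows can be rebuilt from the todos alone.
function rebuildQueue(todos: TodoWithReminder[]): QueuedReminder[] {
  const closed = new Set(["done", "cancelled"]);
  const rows: QueuedReminder[] = [];
  for (const t of todos) {
    if (t.reminder_at === null || closed.has(t.status)) continue;
    rows.push({ id: `q-${t.id}`, todo_id: t.id, send_at: t.reminder_at, sent_at: null });
  }
  return rows;
}
```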

Conversations and messages

Every SMS exchange and tool call gets written to an append-only log.

For continuity, the assistant gets the last 20 messages inside a 30-minute rolling window. Tool rows are saved for debugging, but I do not replay old tool calls into the model. The assistant's text replies usually summarize the result well enough.
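The windowing rule can be sketched roughly as follows. This assumes the 30 minutes are measured back from the current time and that each logged message carries a `role` and a `created_at` timestamp; those field names, and the helpers `shouldResume` and `recentHistory`, are illustrative rather than taken from the project.

```typescript
interface LoggedMessage {
  role: "user" | "assistant" | "tool";
  content: string;
  created_at: string; // ISO 8601 timestamp
}

const WINDOW_MS = 30 * 60 * 1000; // 30-minute rolling window
const MAX_MESSAGES = 20;

// Resume the conversation if the last activity is recent enough, otherwise start a new one.
function shouldResume(lastActivityAt: string | null, now: Date): boolean {
  if (lastActivityAt === null) return false;
  return now.getTime() - Date.parse(lastActivityAt) <= WINDOW_MS;
}

// Model context: messages inside the window, no tool rows, newest 20, oldest first.
function recentHistory(log: LoggedMessage[], now: Date): LoggedMessage[] {
  const cutoff = now.getTime() - WINDOW_MS;
  return log
    .filter((m) => m.role !== "tool" && Date.parse(m.created_at) >= cutoff)
    .sort((a, b) => Date.parse(a.created_at) - Date.parse(b.created_at))
    .slice(-MAX_MESSAGES);
}
```

Tool rows stay in the log for debugging but are filtered out before the history is handed to the model.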

There are RLS policies on every table, with service-role access inside the Edge Functions.

Why pgvector, and what I actually did with it

I used pgvector because I already had Postgres, it is good enough at personal scale, and I did not want to run a separate vector database for a side project.

The setup is intentionally boring: two vector(1536) columns, two HNSW cosine indexes, two RPC functions for search, and a small embed() helper that calls OpenAI's text-embedding-3-small.

The more interesting part was how I used the embeddings.

1. Duplicate detection on todo creation

Before creating a todo, I embed the title and notes, search for the nearest pending todo, and check the cosine distance.

If the new todo looks too similar to an existing pending one, the tool does not immediately write it. Instead, it returns a duplicate warning to the model. Then the assistant can ask:

Looks like you already have "call mom this week." Add another one or skip this?

This catches the very real thing where I text the same task twice because I forgot I already saved it.
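Here is a small in-memory sketch of that check. In the real system the nearest-neighbor lookup happens in Postgres through pgvector; the threshold value `0.15`, the result shape, and the function names below are assumptions picked for illustration, not values from this project.

```typescript
// Cosine distance = 1 - cosine similarity. 0 means the vectors point the same way.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface PendingTodo {
  id: string;
  title: string;
  embedding: number[];
}

type CreateCheck =
  | { status: "ok_to_create" }
  | { status: "duplicate_warning"; existing: { id: string; title: string }; distance: number };

const DUPLICATE_DISTANCE = 0.15; // hypothetical threshold, to be tuned on real data

// Find the nearest pending todo and warn instead of writing when it is too close.
function checkDuplicate(candidate: number[], pending: PendingTodo[]): CreateCheck {
  let best: PendingTodo | null = null;
  let bestDistance = Infinity;
  for (const todo of pending) {
    const d = cosineDistance(candidate, todo.embedding);
    if (d < bestDistance) {
      best = todo;
      bestDistance = d;
    }
  }
  if (best !== null && bestDistance < DUPLICATE_DISTANCE) {
    return {
      status: "duplicate_warning",
      existing: { id: best.id, title: best.title },
      distance: bestDistance,
    };
  }
  return { status: "ok_to_create" };
}
```

Returning the warning as a tool result, rather than deciding silently, leaves the final call to the conversation.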

2. Semantic recall over memories

The recall_memories tool combines vector search and full-text search using Reciprocal Rank Fusion.

Vector search is good when I ask for something by meaning. For example, "what was that gym I was thinking about?"

Full-text search is better when the exact token matters, like a PR number, a code, a name, or something with numbers in it.

RRF lets me combine both rankings into one result list. It is simple and works surprisingly well.
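For reference, Reciprocal Rank Fusion scores each item as the sum of 1 / (k + rank) over every list it appears in. The sketch below is a generic version, not the project's RPC: it assumes each search returns an ordered list of ids, and it uses k = 60, the constant commonly used with RRF, which may not be the value used here.

```typescript
// Reciprocal Rank Fusion: score(id) = sum over rankings of 1 / (k + rank), rank starting at 1.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60, // conventional default, an assumption for this sketch
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort((a, b) => b.score - a.score);
}

// Hypothetical result ids from the two searches.
const vectorHits = ["gym-note", "yoga-note", "pool-note"];
const fullTextHits = ["pr-note", "gym-note"];
const fused = reciprocalRankFusion([vectorHits, fullTextHits]);
// "gym-note" comes first because it is the only id present in both rankings.
```

An item that shows up in both lists beats an item that tops only one of them, which is the behavior that makes the fusion useful without any score normalization.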

3. IVFFlat vs HNSW in normal human language

I started with IVFFlat indexes and the default Supabase-style RPC pattern.

IVFFlat (an inverted file index over flat, uncompressed vectors) is like sorting a huge pile of memories into buckets, then only searching the buckets that look most relevant. It is fast, but you have to pick the number of buckets up front, ideally after the table already has data in it. Too few and every bucket is large, so searches slow down. Too many and a search that only probes a few buckets can miss things, and you end up retuning it every time your data grows.

Mid-build, I switched to HNSW (Hierarchical Navigable Small World). HNSW is more like building a map of shortcuts between similar memories — a graph where each node points to its closest neighbors at multiple zoom levels. Instead of guessing which bucket to search, Postgres can hop from one nearby item to another until it lands on the closest matches.

For a personal assistant, that tradeoff felt better. I would rather accept a slower index build and more memory use than keep thinking about index tuning as my data grows.
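To make the bucket intuition concrete, here is a toy bucket search in two dimensions. It is not how pgvector is implemented: the centroids are hard-coded, the distance is Euclidean rather than cosine, and all names are made up for the example. It only shows how probing too few buckets can miss the true nearest neighbor when the query sits near a bucket boundary.

```typescript
type Vec = [number, number];

function dist(a: Vec, b: Vec): number {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Indexes of the centroids, sorted by distance to v, nearest first.
function nearestCentroids(v: Vec, centroids: Vec[]): number[] {
  return centroids
    .map((c, i) => ({ i, d: dist(v, c) }))
    .sort((a, b) => a.d - b.d)
    .map((x) => x.i);
}

// Build: drop every vector into the bucket of its nearest centroid.
function buildBuckets(vectors: Vec[], centroids: Vec[]): Vec[][] {
  const buckets: Vec[][] = centroids.map(() => []);
  for (const v of vectors) {
    buckets[nearestCentroids(v, centroids)[0]].push(v);
  }
  return buckets;
}

// Search: scan only the `probes` nearest buckets and return the closest vector found.
function bucketSearch(query: Vec, centroids: Vec[], buckets: Vec[][], probes: number): Vec | null {
  let best: Vec | null = null;
  for (const b of nearestCentroids(query, centroids).slice(0, probes)) {
    for (const v of buckets[b]) {
      if (best === null || dist(query, v) < dist(query, best)) best = v;
    }
  }
  return best;
}

const centroids: Vec[] = [[0, 0], [10, 0]];
const buckets = buildBuckets([[1, 0], [4.9, 0], [9, 0]], centroids);
const query: Vec = [5.2, 0];

// One probe scans only the right-hand bucket and returns [9, 0].
const oneProbe = bucketSearch(query, centroids, buckets, 1);
// Two probes scan both buckets and find the true nearest neighbor, [4.9, 0].
const twoProbes = bucketSearch(query, centroids, buckets, 2);
```

A graph index like HNSW avoids this particular knob because it walks from neighbor to neighbor instead of committing to a fixed set of buckets.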

The tool-calling loop is the whole product

This is where most of the magic happens, and also where most of the bugs came from.

One inbound SMS goes through roughly this flow:

  1. Twilio calls my Edge Function.
  2. The function validates that the From number is mine. If not, it returns empty TwiML and ignores the message.
  3. It resumes the current 30-minute conversation window or opens a new one.
  4. It persists my message immediately, before any model call, so the message is not lost if the LLM fails.
  5. It loads the last 20 messages.
  6. It calls OpenAI's Responses API with the system prompt, recent history, my new message, and the tool schema.
  7. If the model calls a tool, the function executes it, persists the tool row, and sends the result back using previous_response_id and function_call_output.
  8. When the model returns final text, the function sends that text through Twilio and saves the assistant turn.
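Steps 6 through 8 can be sketched as a loop like the one below. Everything here is a simplified stand-in: `ModelClient`, `ModelTurn`, and `runTurn` are hypothetical names, the shapes are much smaller than real Responses API payloads, and persistence of tool rows is left out. The `maxSteps` guard is an assumption added for the sketch, not something described in this post.

```typescript
// Hypothetical, simplified shapes. Real API payloads are richer than this.
type ModelTurn =
  | { kind: "tool_call"; responseId: string; callId: string; name: string; args: unknown }
  | { kind: "final"; text: string };

interface ToolOutput {
  callId: string;
  output: string;
}

interface ModelClient {
  // previousResponseId and toolOutput are set when continuing after a tool call.
  next(input: { userText: string; previousResponseId?: string; toolOutput?: ToolOutput }): Promise<ModelTurn>;
}

type ToolHandler = (args: unknown) => Promise<unknown>;

const FALLBACK_REPLY = "Sorry, I could not finish that. Try again?";

async function runTurn(
  userText: string,
  model: ModelClient,
  tools: Record<string, ToolHandler>,
  maxSteps = 8, // guard against a model that never stops calling tools
): Promise<string> {
  let turn = await model.next({ userText });
  for (let step = 0; step < maxSteps; step++) {
    if (turn.kind === "final") return turn.text;

    // Unknown tools and thrown errors go back to the model as data instead of crashing the loop.
    const handler = tools[turn.name];
    let result: unknown;
    try {
      result = handler ? await handler(turn.args) : { error: `unknown tool ${turn.name}` };
    } catch (err) {
      result = { error: String(err) };
    }

    turn = await model.next({
      userText,
      previousResponseId: turn.responseId,
      toolOutput: { callId: turn.callId, output: JSON.stringify(result) },
    });
  }
  return turn.kind === "final" ? turn.text : FALLBACK_REPLY;
}
```

Injecting the model client as an interface also makes the loop testable with a scripted fake, with no network calls involved.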

Two implementation choices ended up mattering a lot.

First, I switched from Chat Completions to the Responses API so I could use OpenAI's hosted web_search tool in the same loop. That makes turns like this possible:

What is the weather tomorrow? If it is raining, remind me to bring an umbrella.

The assistant can search, reason, create the reminder, and reply in one flow.

Second, the tool-call loop runs in EdgeRuntime.waitUntil. That lets me return empty TwiML to Twilio immediately and send the real reply when the model finishes. Twilio has timeouts, and LLMs do not always respond instantly.
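The respond-now, reply-later pattern might look roughly like this. The dependencies are injected so the sketch stays runnable anywhere: on Supabase, `waitUntil` would be backed by `EdgeRuntime.waitUntil`, while `saveInbound`, `replyInBackground`, and `handleInboundSms` are hypothetical names invented for the example.

```typescript
// An empty TwiML document tells Twilio there is nothing to send back right now.
const EMPTY_TWIML = '<?xml version="1.0" encoding="UTF-8"?><Response></Response>';

interface WebhookDeps {
  allowedNumbers: Set<string>; // my own phone number(s)
  saveInbound(from: string, body: string): Promise<void>;
  replyInBackground(from: string, body: string): Promise<void>; // model loop plus outbound SMS
  waitUntil(work: Promise<unknown>): void; // e.g. EdgeRuntime.waitUntil on Supabase
}

async function handleInboundSms(from: string, body: string, deps: WebhookDeps): Promise<string> {
  // Unknown senders get an empty response and nothing else happens.
  if (!deps.allowedNumbers.has(from)) return EMPTY_TWIML;

  // Persist first so the message survives a model failure.
  await deps.saveInbound(from, body);

  // Hand the slow part to the runtime. Errors are logged so they never become unhandled rejections.
  deps.waitUntil(
    deps.replyInBackground(from, body).catch((err) => console.error("reply failed", err)),
  );

  // Twilio gets its answer right away. The real reply goes out later as a separate SMS.
  return EMPTY_TWIML;
}
```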

One SMS in, one create_todo call out. The dashboard exposes the actual tool calls so I can see exactly what the model decided to do.

Later that morning the reminder fires, I tell it the task is done, and the assistant calls complete_todo on the same row — no IDs, no app switching, just plain text.

The reminder lands at 8:45, I reply "Alright that one is done," and the model resolves the todo by title query.

What I learned about tool-calling LLMs

A few patterns stuck with me.

Empty strings are not the same as "I did not mean it." Models will pad optional fields with empty strings. Do not let that mutate your data. Treat empty strings as no-ops, or create an explicit primitive like clear_fields for destructive changes.

Better tools beat longer prompts. Every time the model did something dumb, the fix was usually not "write a more emotional prompt begging it to behave." The fix was a better tool. bulk_create_todos happened because the model was bad at orchestrating multiple create_todo calls. clear_fields happened because the model was bad at distinguishing sentinel values.

Prompt rules need examples. "Use null to clear a field" got ignored. Showing a literal JSON snippet of clear_fields: ["reminder_at"] worked. The closer the instruction is to a concrete snippet the model can copy, the better.
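One way to encode the empty-string rule and the `clear_fields` primitive in an update handler is sketched below. The row shape, the argument shape, and the name `applyUpdate` are assumptions. The sketch also treats `null` as a no-op, so `clear_fields` is the only destructive path; that is a design choice made for the example.

```typescript
interface TodoRow {
  title: string;
  notes: string | null;
  due_at: string | null;
  reminder_at: string | null;
}

type ClearableField = "notes" | "due_at" | "reminder_at";

interface UpdateTodoArgs {
  title?: string | null;
  notes?: string | null;
  due_at?: string | null;
  reminder_at?: string | null;
  clear_fields?: ClearableField[]; // the only way to erase a value
}

function applyUpdate(todo: TodoRow, args: UpdateTodoArgs): TodoRow {
  const next: TodoRow = { ...todo };
  const fields = ["title", "notes", "due_at", "reminder_at"] as const;
  for (const field of fields) {
    const value = args[field];
    // Missing, null, empty, and whitespace-only values are padding, not intent.
    if (typeof value === "string" && value.trim() !== "") next[field] = value;
  }
  // Destructive changes only happen when the model names the field explicitly.
  for (const field of args.clear_fields ?? []) next[field] = null;
  return next;
}

const before: TodoRow = {
  title: "call mom",
  notes: "ask about the trip",
  due_at: null,
  reminder_at: "2026-04-08T14:00:00Z",
};

// Padded empty strings change nothing.
const padded = applyUpdate(before, { title: "", notes: "", reminder_at: "" });
// An explicit clear removes the reminder and leaves everything else alone.
const cleared = applyUpdate(before, { clear_fields: ["reminder_at"] });
```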

Keep deterministic things deterministic. The LLM can interpret intent. It should not be responsible for whether a reminder actually sends. That is the job of the database, the queue, and the cron.

The dashboard

The SMS path is great for capture, but it is not great for review. So I built a dashboard.

The Todos tab is a Kanban board with columns for pending, in progress, blocked, and done. Cards can have subtasks, and each subtask can have its own status, due date, and reminder.

Todos tab. Same statuses as the SMS verbs — "on it," "done," "blocked" — so the chat and the board never disagree.

The Memories tab is split into facts, notes, and journal entries because they all want different layouts. Facts are quick reference items. Notes are longer thoughts. Journal entries show up in a date-grouped timeline with mood metadata.

Memories. The model auto-classifies each capture as a fact, note, or journal entry and tags it on the way in.

Categories are the connective tissue. Todos and captures share the same category vocabulary, so a Work todo and a Work note end up in the same neighborhood without me having to think about it.

Categories are shared across todos, facts, notes, and journal — system defaults plus the ones I've added over time.

The Overview tab is what I open in the morning. It shows what is in progress, what is due today, what is overdue, what is coming up this week, recent captures, and a mood trend from journal entries.

SMS is for capture. The dashboard is for review.

The things I deliberately did not build

Knowing when to stop is half the work.

  • No voice integration. I tried it briefly. Then I admitted I would never use it. I deleted the code.
  • No auto-rollup on parent todos. When all subtasks are done, the parent does not automatically mark itself done. Felt too magical for v1.
  • No cross-conversation memory in the prompt. The model only sees the current 30-minute SMS conversation. Older context comes back through tools like recall_memories and find_similar_todos.
  • No multi-user support. This is for me. Phone allowlist. Email allowlist. Personal threat model.
  • No web UI for capture. The web UI is for viewing and editing, not capturing. Adding a chat box would dilute the whole point.

Cost

At my usage, this runs roughly $20–30 per month: about $1 for the Twilio number, around $10 for Twilio SMS at heavier usage, a few dollars for OpenAI tokens, cents for embeddings, and a few dollars for web search. Supabase is already covered by the parent project.

For something I use every day, that is a no-brainer.

What's next

I am trying not to build things just because they are cool. I want to build the things I actually find myself wanting twice in the same week.

MMS file uploads are probably the highest-leverage thing I have not built yet. Photo of a receipt → extract merchant, amount, and date. Photo of a whiteboard → summarize the diagram and extract action items. Screenshot of a confirmation → save the useful details and create follow-up reminders.

iMessage, WhatsApp, anywhere I already am. Twilio SMS works, but it has papercuts. Message length. Green bubbles. Awkward media support. The most interesting part of moving beyond SMS is group chats — being able to add my assistant to a thread the same way I would add another person. If my wife and I are coordinating something, I do not want to copy that into a separate todo app later. I want to say it right there in the thread. WhatsApp is in the same category — most of the world lives there, and the assistant should meet people where their conversations actually happen. The interesting part is the permission model: my private memories should stay private, and a group chat should only get access to what belongs to that group.

MCP integrations. Outbound MCP is probably the most useful direction — my assistant could talk to Linear, GitHub, Calendar, and other tools without me hand-writing every integration. That is where this starts to feel less like a clever todo app and more like a personal operating layer.

Smarter reminders. Right now every reminder is equal. Everything fires at the time you set, in the order you set it. The version I want has a sense of priority — what actually matters today, what can wait, what the model thinks I should be looking at first. The difference between "pick up milk" and "review the term sheet" should not be left to the timestamp.

A local model in the loop. Plenty of the calls this thing makes do not need a frontier model. Classification, memory lookups, simple tool calls — those could run on a local LLM, with the frontier model reserved for the harder reasoning. The economics get better. So does the privacy story.

The list could keep going. The discipline is the same one from the top of this section: build the thing I find myself wanting twice in the same week, and leave the rest in the notes file.

Closing

The thing I keep coming back to is the difference between an AI feature and a tool with AI in it.

The difference is where the determinism lives.

The model can be flaky. My reminder cron should not be. My database should not lose data because the LLM padded a field. My queue should not send ghost reminders forever because one update path missed a cleanup.

The schema is the contract. The model is just one writer.

This took a weekend to build, but it took months of tiny fixes to make it feel dependable. That is honestly the best part of building personal tools. Every time something annoys me, I can open my laptop for 20 minutes and fix it. The bugs are not just bugs. They are the iteration log.

If you want to build something like this, my advice is:

  1. Pick one input surface. SMS, voice, or web, but not all three.
  2. Make the schema solid before you obsess over the prompt.
  3. Treat tools like an API. Each tool should map to something the user actually means.
  4. Trust nothing the model emits. Validate it. Reconcile it. Make derived state rebuildable.
  5. Build the dashboard only as much as you need. The real value is capture.
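As an illustration of point 4, here is a sketch of validating raw tool arguments before anything touches the database. The argument names and the `parseCreateTodo` helper are hypothetical, and the specific rules (required title, reminder must parse and be in the future) are assumptions chosen for the example.

```typescript
type Parsed<T> = { ok: true; value: T } | { ok: false; error: string };

interface CreateTodoInput {
  title: string;
  reminder_at: string | null;
}

// Validate raw model output. Errors are returned as data so the model can ask a follow-up question.
function parseCreateTodo(raw: unknown, now: Date): Parsed<CreateTodoInput> {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, error: "arguments must be an object" };
  }
  const args = raw as Record<string, unknown>;

  const rawTitle = args["title"];
  const title = typeof rawTitle === "string" ? rawTitle.trim() : "";
  if (title === "") return { ok: false, error: "title is required" };

  let reminder_at: string | null = null;
  const rawReminder = args["reminder_at"];
  if (typeof rawReminder === "string" && rawReminder.trim() !== "") {
    const ms = Date.parse(rawReminder);
    if (Number.isNaN(ms)) return { ok: false, error: "reminder_at is not a valid timestamp" };
    if (ms <= now.getTime()) return { ok: false, error: "reminder_at is in the past" };
    reminder_at = new Date(ms).toISOString(); // normalize before storing
  }
  return { ok: true, value: { title, reminder_at } };
}
```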

The thing actually works. I have not lost a todo since I shipped it. My journal has more entries this month than the previous twelve combined. When my wife asked me what our top vacation spots were, I texted my assistant and had the answer back in three seconds.

That is the whole pitch.

— Natan

New York, NY