I wanted to get better at using AI to automate tasks. I already use ChatGPT for everyday chores, but I wanted to use it on something more challenging: receipts. They may look deceptively simple, but each squeezes a lot of information into a tiny space, making it surprisingly difficult to decode.
Receipts are a difficult problem to solve. To turn a crumpled piece of paper into structured data, you have to read three layers of data:

- the pixels: the raw image, whether a crisp scan or a hastily snapped photo
- the words: the text that recognition pulls out of those pixels
- the meaning: what those words actually represent (merchant, date, totals)
Since these layers arrive in messy, mixed formats, receipts are an ideal test case for learning how to use AI to automate real-world tasks. The rest of this post describes how I disentangle those three layers and stitch the answers back together.
Every receipt's journey begins as raw pixels, whether a crisp scan or a hastily snapped photo. But hidden within those pixels is structured information waiting to be extracted.
Text recognition finds words, but not context. Multiple receipts become one confusing jumble. The critical step: determining which words belong together.
This is where the difference between scanned receipts and photographed receipts becomes crucial. Scans are the easy case: flat, well-lit, and aligned. But photos? They're captured at odd angles, under harsh store lighting, with shadows and curves that confuse even the best text recognition. Each type needs its own approach to group related words and separate one receipt from another.
Scans are the easier of the two because the receipt lies flat on the scanner bed, parallel to the image sensor, so there is no perspective distortion to undo.
Grouping the scanned words is straightforward. Since the scanner captures everything flat and aligned, we can use simple rules: words that line up horizontally likely belong to the same line item, and everything on the page belongs to the same receipt. It's like reading a well-organized spreadsheet.
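As a rough illustration, here's a minimal sketch of that rule in Python (the `Word` structure and the pixel tolerance are assumptions for the example, not my exact pipeline):

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left edge of the word's bounding box, in pixels
    y: float  # top edge of the word's bounding box, in pixels

def group_into_lines(words: list[Word], tolerance: float = 5.0) -> list[list[Word]]:
    """Group scanned words into lines: words whose y-coordinates fall
    within `tolerance` pixels of each other belong to the same line item."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(word.y - lines[-1][0].y) <= tolerance:
            lines[-1].append(word)  # same horizontal band: same line item
        else:
            lines.append([word])    # new horizontal band: new line item
    return lines
```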
Photos are trickier. When you snap a picture of a receipt on a counter, the camera captures it from an angle. Words at the top might appear smaller than words at the bottom. Lines that should be straight look curved. And if there are multiple receipts in frame? Now you need to figure out which words belong to which piece of paper.
To solve this puzzle, I look for clusters of words that seem to move together—like finding constellations in a sky full of stars. Words that are close together and follow similar patterns likely belong to the same receipt. Once identified, I can digitally "flatten" each receipt, making it as clean as a scan.
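The clustering half of this can be sketched with scikit-learn's DBSCAN, which fits here because it doesn't need to know the number of receipts up front (the `eps` and `min_samples` values are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def split_into_receipts(centroids: np.ndarray) -> dict[int, np.ndarray]:
    """Cluster word centroids (an (n, 2) array of x, y pixel positions)
    so that words sitting close together land on the same receipt."""
    labels = DBSCAN(eps=50.0, min_samples=5).fit_predict(centroids)
    return {
        label: centroids[labels == label]
        for label in set(labels)
        if label != -1  # -1 is DBSCAN's noise label: words on no receipt
    }
```

Once a cluster's corner points are known, OpenCV's `cv2.getPerspectiveTransform` and `cv2.warpPerspective` can do the digital flattening.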
Now that we can group words correctly, we face a new challenge: processing thousands of receipts efficiently. My solution? My laptop handles the tricky word orientation while the cloud handles the visual processing (finding and flattening receipts).
This hybrid approach has processed hundreds of receipts, transforming messy photos and scans into organized, searchable data:
I've found that these “AI agents” are pretty dumb on their own; the meme is apt: a dumb intern that needs more information to figure out how to do the job. Retrieval-Augmented Generation (RAG) gives the intern a window of context through a set of tools, but the answers can still be non-deterministic. The fix is to encode the data so that retrieval is precise and learning is repeatable.
One of the best tools I've found is semantic search. The clearest way I know to explain it is this classic example:
King is to queen as man is to woman
This shows that king relates to queen the way man relates to woman: the difference is gender. Embeddings capture this kind of relationship by placing related words near each other in a vector space, so analogies become simple arithmetic on vectors.
I embed the words with OpenAI.
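A minimal sketch with the OpenAI Python client (the model name here is one of OpenAI's embedding models; any of them works for this):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    """Turn words into embedding vectors."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

vectors = embed(["king", "queen", "man", "woman"])
king, queen, man, woman = (np.array(v) for v in vectors)
# The analogy becomes arithmetic: king - man + woman lands near queen.
```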
The relationships can be queried using a database like Chroma, and I've been able to run it for less than a dollar a month using docker and Amazon's serverless service, Fargate.
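Reusing the `embed` helper from the sketch above, storing and querying those vectors in Chroma looks roughly like this (the collection name and metadata fields are illustrative):

```python
import chromadb

chroma_client = chromadb.PersistentClient(path="./chroma")
collection = chroma_client.get_or_create_collection("receipt_words")

# Store each word with its precomputed embedding and validation history.
collection.add(
    ids=["word-1", "word-2"],
    embeddings=embed(["06/17/2024", "GO"]),
    documents=["06/17/2024", "GO"],
    metadatas=[
        {"label": "DATE", "valid": True},
        {"label": "DATE", "valid": False},
    ],
)

# Find the stored words most similar to a new word.
results = collection.query(query_embeddings=embed(["06/27/2024"]), n_results=5)
```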
I've also learned how much Google Maps can tell you from very little information: a phone number or address fragment pulled from a receipt is often enough to identify the exact store.
Once the receipt has the place it came from and the words are semantically comparable, these AI Agents can start labeling the data.
The dumb intern still needs review. The data I get back from Google Maps is disorganized, so I clean it with entity resolution: build a small graph of merchants where an edge between two of them means “same phone,” “same address,” or “similar name.” Stronger signals (phone + address) outweigh weaker ones (name only). Within each resulting cluster I pick a “golden” merchant record, so every receipt from a given store carries the most correct data.
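Here's a sketch of that graph using networkx and made-up merchant records; the weights and threshold are illustrative, not my tuned values:

```python
from difflib import SequenceMatcher

import networkx as nx

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

merchants = [
    {"id": 1, "name": "Costco Wholesale", "phone": "555-0100", "address": "123 Main St"},
    {"id": 2, "name": "Costco Whse #117", "phone": "555-0100", "address": "123 Main St"},
    {"id": 3, "name": "COSTCO", "phone": None, "address": None},
]

graph = nx.Graph()
graph.add_nodes_from(m["id"] for m in merchants)

for i, a in enumerate(merchants):
    for b in merchants[i + 1:]:
        weight = 0.0
        if a["phone"] and a["phone"] == b["phone"]:
            weight += 0.5  # strong signal: same phone
        if a["address"] and a["address"] == b["address"]:
            weight += 0.5  # strong signal: same address
        if name_similarity(a["name"], b["name"]) > 0.8:
            weight += 0.2  # weak signal: name similarity alone
        if weight >= 0.5:  # a name match by itself can't link two merchants
            graph.add_edge(a["id"], b["id"], weight=weight)

# Each connected component is one real-world merchant; pick the most
# complete record in each component as the "golden" merchant.
clusters = list(nx.connected_components(graph))
```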
Next, I narrow the vocabulary to receipt words (totals, taxes, dates, phone numbers, addresses). The agent is given a definition for each label and makes an initial guess, then validates that guess using the tools described earlier:
```text
Word needing validation: 'GO'
Label being validated: DATE
Image: 8388d1f1...
Receipt context: Line 45, Word 1

📄 Receipt Context:
  42: Whse: 117 Trm:201 Trn: 266 OP:701
  43: Aga in
  44: Items Sold: 2
→ 45: →GO← 06/17/2024  ← TARGET LINE

📊 Evidence Analysis:
🎯 EXACT MATCHES:
  ❌ 'GO' was previously marked INVALID
🧠 SEMANTIC SIMILARITY:
  Similar words where 'DATE' was VALID: (10 found)
    ✓ '06/27/2024' (distance: 0.169) - Sprouts Farmers Market
    ✓ '06/20/2024' (distance: 0.170) - Sprouts Farmers Market
    ✓ '06/17/2024' (distance: 0.170) - Sprouts Farmers Market
  Similar words where 'DATE' was INVALID: (3 found)
    ✗ 'GO' (distance: 0.000) - Costco Wholesale
    ✗ '20:23' (distance: 0.176) - Costco Wholesale
    ✗ '20:23' (distance: 0.176) - Costco Wholesale

🎯 DECISION: ❌ REJECT this label
🔒 DEFINITIVE - Strong evidence
💡 Same text 'GO' was previously marked invalid
✨ Recommended action: Apply this decision automatically
```
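Reusing the Chroma collection from earlier, the decision in that log boils down to something like this sketch (the exact-match shortcut and the voting rule are illustrative):

```python
def validate_label(word: str, label: str, embedding: list[float]) -> bool:
    """Vote on whether `label` fits `word` using previously judged examples."""
    results = collection.query(
        query_embeddings=[embedding],
        n_results=10,
        where={"label": label},  # only look at words judged for this label
    )
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    dists = results["distances"][0]

    # Exact-match evidence is definitive: identical text was already judged.
    for doc, meta, dist in zip(docs, metas, dists):
        if doc == word or dist == 0.0:
            return bool(meta["valid"])

    # Otherwise, let the nearest validated neighbors vote.
    valid_votes = sum(1 for meta in metas if meta["valid"])
    return valid_votes > len(metas) / 2
```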
This technique not only gives the AI enough context to make the right decision but also helps it learn from its mistakes. This approach has allowed me to increase accuracy and use less compute and time.
I've been able to speed up the receipt labeling even further by using LayoutLM, a document understanding model that takes both text and layout into account. Trained on my agent-validated labels, it predicts a label for every token (address, date, total, etc.) in one forward pass, which gives me fast, consistent predictions in production.
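Inference looks roughly like this with Hugging Face's LayoutLMv3 variant (the checkpoint path is a placeholder for my fine-tuned weights, and the words and boxes come from the earlier steps):

```python
from PIL import Image
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

# Placeholder for a checkpoint fine-tuned on the agent-validated labels.
checkpoint = "my-receipt-layoutlmv3"
processor = LayoutLMv3Processor.from_pretrained(checkpoint, apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(checkpoint)

image = Image.open("receipt.png").convert("RGB")
words = ["GO", "06/17/2024"]                         # OCR text from earlier steps
boxes = [[48, 912, 110, 930], [130, 912, 320, 930]]  # boxes normalized to 0-1000

# One forward pass labels every token: address, date, total, and so on.
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
predictions = model(**encoding).logits.argmax(-1).squeeze().tolist()
labels = [model.config.id2label[p] for p in predictions]
```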
To improve accuracy, I can add more receipts, synthesize new ones by finding patterns within and across merchants, and add noise to existing receipts. The model learns from more data without poisoning the ground truth, giving me consistent, repeatable predictions of what a receipt says.
I'm still working on this part. My experience in data engineering gave me a great head start in structuring and organizing the data.
Optimizing this has been fun. I've learned a lot about open-source models, using Ollama to manage how I deploy them and LangChain to trace how the models use their tools.
AI has made typing cheap, but the bottlenecks remain: understanding, testing, reviewing, and trusting. By shifting my focus from typing code to testing, understanding, and architecting solutions, I prototype and experiment faster. I can talk to AI, try a new approach, and see my changes in the cloud. With a 10-second feedback loop and tests gating production deploys, I meet the no-downtime requirement while iterating quickly. I no longer spend most of my time researching how to accomplish a task; I just build it.
I've been able to iterate quickly using Pulumi's Infrastructure as Code (IaC).
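A minimal Pulumi program in Python looks like this (the resource names are illustrative, not my actual stack):

```python
import pulumi
import pulumi_aws as aws

# Bucket that holds uploaded receipt images.
bucket = aws.s3.Bucket("receipt-images")

# ECS cluster for the Fargate tasks that find and flatten receipts.
cluster = aws.ecs.Cluster("receipt-processing")

pulumi.export("bucket_name", bucket.id)
```

Running `pulumi up` diffs this program against what's deployed and applies only the changes, which is what makes each experiment cheap.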
This lets me try something new at scale quickly, and these cheap experiments let me “skill-max” my cloud expertise.
This rapid iteration with developer best practices allows me to prototype in a safe environment, review and test my changes, and ship with confidence using GitHub and Pulumi.
Understanding how software works, testing to make sure it works, and trusting that changes don't break things still take a long time. Prototyping and shipping to the cloud to test new features in seconds speeds up development, makes learning fun, and lets me try something new without being afraid of breaking things. I now optimize development-loop speed, not keystrokes.