I wanted to get better at using AI to automate tasks. I already use ChatGPT for everyday chores, but I wanted to use it on something more challenging: receipts. They may look deceptively simple, but each squeezes a lot of information into a tiny space, making it surprisingly difficult to decode.
Receipts are a hard problem. To turn a crumpled piece of paper into structured data, you have to read three layers of data: the raw pixels of the image, the words hidden inside those pixels, and the meaning those words carry.
Since these layers arrive in messy, mixed formats, receipts are an ideal test case for learning how to use AI to automate real-world tasks. The rest of this post describes how I disentangle those three layers and stitch the answers back together.
Every receipt's journey begins as raw pixels, whether a crisp scan or a hastily snapped photo. But hidden within those pixels is structured information waiting to be extracted.
Text recognition finds words, but not context. Multiple receipts become one confusing jumble. The critical step: determining which words belong together.
This is where the difference between scanned receipts and photographed receipts becomes crucial. Scans are the easy case: flat, well-lit, and aligned. But photos? They're captured at odd angles, under harsh store lighting, with shadows and curves that confuse even the best text recognition. Each type needs its own approach to group related words and separate one receipt from another.
Scans are the easier of the two because the receipt lies flat and parallel to the image sensor, so nothing is skewed or distorted.
Grouping the scanned words is straightforward. Since the scanner captures everything flat and aligned, we can use simple rules: words that line up horizontally likely belong to the same line item, and everything on the page belongs to the same receipt. It's like reading a well-organized spreadsheet.
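A simplified sketch of that grouping rule is below, assuming each OCR word arrives with a bounding box; the real pipeline tracks more than a single y-coordinate, but the idea is the same.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left edge of the word's bounding box
    y: float  # vertical center of the word's bounding box

def group_into_lines(words: list[Word], tolerance: float = 8.0) -> list[list[Word]]:
    """Group scanned words into line items by vertical position.

    This works because a flatbed scan keeps every line horizontal, so words
    whose vertical centers fall within `tolerance` pixels belong together.
    """
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(word.y - lines[-1][-1].y) <= tolerance:
            lines[-1].append(word)   # same line item
        else:
            lines.append([word])     # start a new line item
    return lines
```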
Photos are trickier. When you snap a picture of a receipt on a counter, the camera captures it from an angle. Words at the top might appear smaller than words at the bottom. Lines that should be straight look curved. And if there are multiple receipts in frame? Now you need to figure out which words belong to which piece of paper.
To solve this puzzle, I look for clusters of words that seem to move together—like finding constellations in a sky full of stars. Words that are close together and follow similar patterns likely belong to the same receipt. Once identified, I can digitally "flatten" each receipt, making it as clean as a scan.
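The sketch below shows the general idea, assuming word centroids have already been extracted: DBSCAN groups nearby words into one cluster per receipt, and a perspective warp flattens each cluster once its corners are estimated. The eps and min_samples values, and the corner ordering, are placeholders rather than the exact numbers I use.

```python
import numpy as np
import cv2
from sklearn.cluster import DBSCAN

def cluster_words_into_receipts(centroids: np.ndarray, eps: float = 60.0) -> np.ndarray:
    """Label each word centroid with a receipt id (-1 means noise).

    centroids: (N, 2) array of word centers in image pixels. Words packed
    closely together get the same cluster label, so each cluster roughly
    corresponds to one receipt in the photo.
    """
    return DBSCAN(eps=eps, min_samples=5).fit_predict(centroids)

def flatten_receipt(image: np.ndarray, corners: np.ndarray,
                    width: int = 600, height: int = 1600) -> np.ndarray:
    """Warp one receipt onto a flat, scan-like rectangle.

    corners: (4, 2) array ordered top-left, top-right, bottom-right, bottom-left.
    """
    target = np.array([[0, 0], [width, 0], [width, height], [0, height]], dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(corners.astype(np.float32), target)
    return cv2.warpPerspective(image, matrix, (width, height))
```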
Now that we can group words correctly, we face a new challenge: processing thousands of receipts efficiently. My solution? My laptop handles the tricky word orientation while the cloud handles the visual processing (finding and flattening receipts).
This hybrid approach has processed hundreds of receipts, transforming messy photos and scans into organized, searchable data:
Knowing which business a receipt comes from allows for faster processing. I wrote an agent that uses Google Maps to identify the business the receipt came from. The agent can tolerate OCR errors, such as "Mestlake" instead of "Westlake", and still identify the correct business.
The agent tries multiple strategies to identify the business. I give it tools to search by phone, by address, and by text. When one approach fails, it combines the others to reach a confident match.
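Conceptually, the fallback logic looks like the sketch below. The strategy callables are hypothetical stand-ins for the Google Maps lookup tools, and the voting rule is a simplification of what the agent actually reasons about.

```python
from collections import Counter
from typing import Callable, Optional

# Each strategy stands in for a Google Maps lookup tool (by phone, by
# address, or by text); it returns a place_id string or None.
Strategy = Callable[[dict], Optional[str]]

def identify_business(receipt_fields: dict, strategies: list[Strategy]) -> Optional[str]:
    """Run every lookup and let agreement between strategies break ties."""
    votes: Counter[str] = Counter()
    for strategy in strategies:
        place_id = strategy(receipt_fields)
        if place_id is not None:
            votes[place_id] += 1
    if not votes:
        return None
    # The place_id the most strategies agree on wins; a phone match plus a
    # fuzzy address match outweighs a single noisy text match.
    return votes.most_common(1)[0][0]
```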
Once I've identified the business, I can reuse that information for similar receipts. I use Chroma to find similar receipts by comparing addresses, phone numbers, and URLs.
Chroma stores text as vector embeddings: numerical representations of the text. Representing text as numbers makes comparisons across a large dataset cheap. When I search for '1012 Westlake Blvd', it finds similar addresses even when the wording is slightly different.
When Chroma finds another receipt with the same address, phone number, or URL, I can skip Google Maps and reuse the information from the previous receipt, making this process faster and cheaper.
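Here is roughly what that lookup looks like with Chroma's Python client. The collection name, metadata fields, and merchant are made up for the example.

```python
import chromadb

# Local, persistent Chroma collection keyed by receipt id.
client = chromadb.PersistentClient(path="./chroma")
receipts = client.get_or_create_collection("receipt_metadata")

# Index the fields worth matching on: address, phone, URL.
receipts.add(
    ids=["receipt-001"],
    documents=["1012 Westlake Blvd"],
    metadatas=[{"place_id": "example-place-id", "merchant": "Example Market"}],
)

# A later receipt with an OCR-garbled address still lands near the stored
# one, so the Google Maps call can be skipped and the metadata reused.
hits = receipts.query(query_texts=["1012 Mestlake Blvd"], n_results=1)
print(hits["metadatas"][0][0]["merchant"])
```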
After finding the business and similar receipts, I can use that context to label the words more accurately. Each label turns a raw word into structured data: a merchant name, a date, an address, an amount.
The agent verifies its label guesses using a chain of verification: it generates a series of verification questions, answers them, and uses those answers to confirm the label. If the label turns out to be wrong, the agent proposes a new label and verifies it again.
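Stripped down, the loop looks something like this; `ask_llm` is a placeholder for whichever model call the agent actually makes, and the three-question prompt is illustrative.

```python
def verify_label(word: str, label: str, context: str, ask_llm) -> bool:
    """Chain of verification: draft check questions, answer them, then judge."""
    questions = ask_llm(
        f"The word '{word}' in this receipt was labeled '{label}'.\n"
        f"Context: {context}\n"
        "List three short questions that would confirm or refute this label."
    )
    answers = ask_llm(
        f"Answer each question using only the receipt context:\n{context}\n{questions}"
    )
    verdict = ask_llm(
        f"Given these answers:\n{answers}\nIs the label '{label}' correct? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```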
This dataset is then used to train a model that labels the words faster and more cheaply. I found a model on Hugging Face, LayoutLM, that can label the words given the OCR data.
LayoutLM is a transformer model that understands both text and layout information. By training it on my labeled receipts, it learns to identify entities like merchant names, dates, addresses, and amounts with high accuracy.
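Loading LayoutLM for token classification is only a few lines with Hugging Face Transformers. The label set, example words, and bounding boxes below are illustrative, not the project's exact configuration.

```python
import torch
from transformers import LayoutLMTokenizerFast, LayoutLMForTokenClassification

# Illustrative label set; the real project's labels may differ.
labels = ["O", "MERCHANT_NAME", "DATE", "ADDRESS", "AMOUNT"]

tokenizer = LayoutLMTokenizerFast.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=len(labels)
)

words = ["Westlake", "Market", "Total", "$12.47"]
# LayoutLM expects each word's box normalized to a 0-1000 coordinate grid.
boxes = [[48, 30, 210, 58], [220, 30, 330, 58], [40, 880, 120, 910], [700, 880, 800, 910]]

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Repeat each word's box for every sub-token it was split into.
token_boxes = [boxes[i] if i is not None else [0, 0, 0, 0] for i in encoding.word_ids()]
encoding["bbox"] = torch.tensor([token_boxes])

with torch.no_grad():
    logits = model(**encoding).logits      # shape: (1, seq_len, num_labels)
predictions = logits.argmax(-1).squeeze().tolist()
```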
After reading the research paper, I learned that the model works best when the labels are evenly distributed. To get there, I had to work around labeled data I knew to be disproportionate: an average receipt has far more line-item prices than it has totals and taxes. Here is how I balanced the labels:
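One simple way to do it, shown here as an illustrative sketch rather than the exact scheme, is to downsample every class to the size of the rarest one.

```python
import random
from collections import defaultdict

def balance_labels(examples: list[dict], seed: int = 0) -> list[dict]:
    """Downsample each label class to the size of the rarest class.

    Each example is assumed to look like {"word": ..., "label": ...}.
    """
    by_label: dict[str, list[dict]] = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)

    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)

    balanced: list[dict] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))  # keep only `smallest` per class
    rng.shuffle(balanced)
    return balanced
```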
Training the model to produce the best results means finding the right settings. Instead of trying every possible setting, I use an LLM to review training results and suggest which settings to try next. It learns what works and what doesn't, helping me find better configurations faster.
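A bare-bones version of that loop looks like this, with `ask_llm` standing in for whichever chat model reviews the runs; the metric name and config fields are examples.

```python
import json

def suggest_next_config(history: list[dict], ask_llm) -> dict:
    """Ask an LLM to propose the next training configuration to try.

    `history` holds past runs as {"config": {...}, "eval_f1": float};
    `ask_llm` is whatever chat-completion call the project uses.
    """
    prompt = (
        "Here are past LayoutLM training runs and their F1 scores:\n"
        f"{json.dumps(history, indent=2)}\n"
        "Suggest the next learning_rate, batch_size, and num_epochs to try. "
        "Reply with JSON only."
    )
    return json.loads(ask_llm(prompt))
```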
The custom model processes receipts in about 5 seconds, compared to 30-60 seconds with the AI Agent. The tradeoff is coverage: the model focuses on 4 core labels, while the AI Agent provides comprehensive labeling including product names, quantities, and unit prices.
AI has made typing cheap, but the bottlenecks remain understanding, testing, reviewing, and trusting. By shifting my focus from typing code to testing, understanding, and architecting solutions, I prototype and experiment faster.
I've been able to iterate quickly using Pulumi's Infrastructure as Code (IaC). Before this project, I used Terraform exclusively.
Pulumi lets me write my infrastructure in Python. Not only am I more comfortable with Python, but I can hack Pulumi into exactly what I need. My Docker builds were killing me with their 7-minute deploy times.
After asking around, I learned that the real pros build and deploy their containers in the cloud. I looked into CodeBuild and CodePipeline, and wrote a new component that manages the container-based Lambda functions entirely in AWS. This lets me iterate on the infrastructure quickly without waiting on builds to deploy.
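A trimmed-down sketch of that component is below; the CodeBuild wiring is omitted, and the role ARN and image URI are assumed to come from elsewhere in the stack.

```python
import pulumi
import pulumi_aws as aws

class ContainerLambda(pulumi.ComponentResource):
    """A Lambda function backed by a container image built in the cloud."""

    def __init__(self, name: str, image_uri: pulumi.Input[str],
                 role_arn: pulumi.Input[str],
                 opts: pulumi.ResourceOptions | None = None):
        super().__init__("receipts:infra:ContainerLambda", name, None, opts)

        self.function = aws.lambda_.Function(
            f"{name}-fn",
            package_type="Image",      # container-based Lambda
            image_uri=image_uri,       # image produced by the cloud build
            role=role_arn,
            timeout=300,
            memory_size=1024,
            opts=pulumi.ResourceOptions(parent=self),
        )

        self.register_outputs({"function_name": self.function.name})
```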
This has allowed me to try new things at scale quickly. These cheap experiments let me “skill-max” my cloud expertise.
This was also my first time using vector embeddings. I started with Pinecone, but found it too expensive for my use case. They market their service as “serverless”, but there's no scale-to-zero option, which means you're always paying for it, even when you're not using it. After deciding Pinecone wasn't the right fit, I found Chroma, an open-source vector database.
Since Chroma is open-source, I was able to hack it to work with my existing DynamoDB infrastructure. I continue my serverless approach by using DynamoDB Streams to trigger compaction Lambda functions. This lets me scale to zero and pay only for what I use. The catch is that writes of new receipts have to be coordinated, so I developed a distributed locking mechanism that queues them.
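At its core, the lock is a conditional write. The sketch below shows the idea; the table and attribute names are made up for the example.

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def acquire_lock(lock_id: str, owner: str, ttl_seconds: int = 30) -> bool:
    """Take the lock only if no unexpired lock row exists (conditional put)."""
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName="ReceiptLocks",                   # assumed table name
            Item={
                "PK": {"S": f"LOCK#{lock_id}"},
                "Owner": {"S": owner},
                "ExpiresAt": {"N": str(now + ttl_seconds)},
            },
            # Succeed only if the lock is free or its previous holder expired.
            ConditionExpression="attribute_not_exists(PK) OR ExpiresAt < :now",
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # someone else holds the lock; queue the write
        raise
```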
This distributed system allows me to write 1,000+ receipts per second while not having to pay the $200+ Pinecone is asking for. Keeping the data in AWS also means that my query times are 10 times faster than querying from Pinecone's servers.
Running Large Language Models (LLMs) is incredibly expensive. I started with OpenAI's batch API, but it didn't provide me with the fast feedback I needed.
After asking around, I learned about graph RAG (Retrieval-Augmented Generation) and agents. More specifically, I learned about LangChain, a framework for composing LLM calls into chains and agents.
Not only was I able to start testing different agentic workflows, but I could save each agent's answers and compare them to other agents' answers. Again, this got expensive quickly, and I had to find a cheaper way to run LLMs. I found Ollama, an open-source LLM server that I could run locally.
Ollama is a great way to run LLMs locally: it's free and easy to use. I was able to run small models locally, but my MacBook was definitely a limiting factor. Thankfully, Ollama released a new cloud service that lets me run larger models remotely. I'm still writing new agents; this is the biggest area of growth for this project.
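Getting an answer out of a local model takes only a few lines with the ollama Python client; the model name and prompt here are just examples, assuming the server is already running and the model has been pulled.

```python
import ollama

# Assumes `ollama serve` is running locally and the model has been pulled.
response = ollama.chat(
    model="llama3.1",   # example model; swap in whatever is available locally
    messages=[{
        "role": "user",
        "content": "Label this receipt line: 'TOTAL  $12.47'",
    }],
)
print(response["message"]["content"])
```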
This rapid iteration with developer best practices allows me to prototype in a safe environment, review and test my changes, and ship with confidence using GitHub and Pulumi.
GitHub allows me to separate and structure my work into manageable chunks. I can ask AI "what if I do X?", try it out, debug it, and review it before adding it to what I know works. This is a great way to learn and iterate quickly. Action produces information: even when I'm unsure what to do, I just do something and see what happens. That tells me what I should actually be doing.
Building with AI isn't about finding the perfect tool. It's about moving fast enough to learn what actually works. The process is simple: build quickly, test, and iterate. The best way to learn is to build something. Please look at the GitHub repository for the full code and documentation.