LLMs and unstructured data

How LLMs are changing the way unstructured data is being processed and understood.

August 31, 2024

last update: 2025-06-25

Fresh off running a batch processing job over 1M documents, I've been thinking about how LLMs are changing the way we process unstructured data.

This is the first moment in history when we have the ability to process unstructured data at scale. The ability to understand and process unstructured data is a game changer. It is the difference between understanding our chaotic world and being lost in it.

One awesome side effect of the LLM price war is how cheap it has become to hit hosted APIs. It’s incredible how cheap gpt-4o and Sonnet 3.5 are. You can decode a billion tokens and not even worry about costs. Google is even providing a free tier of 1.5 billion tokens per day. That’s a lot of tokens for free.
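To put a number on "a billion tokens", here is a back-of-the-envelope sketch. The per-million-token price below is a made-up placeholder, not a quote from any provider, so plug in the current rate for whichever model you use.

```python
def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Return the API cost in USD for a given token count and per-million price."""
    return tokens / 1_000_000 * usd_per_million_tokens


# ASSUMPTION: placeholder price for illustration only, not a real price list entry.
PLACEHOLDER_USD_PER_MILLION = 0.50

one_billion_tokens = 1_000_000_000
cost = estimate_cost_usd(one_billion_tokens, PLACEHOLDER_USD_PER_MILLION)
print(f"${cost:,.2f}")  # with the placeholder price: $500.00
```

Input and output tokens are usually priced differently, so a real estimate would call this once for each and add the results.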

Identifying patterns in a boiling sea of data is super exciting to me, because a lot of the data that organizations are storing is still left untouched due to its sheer volume and complexity.

There is scope for a lot of tooling around this: going from a wall of text to a comprehensible structure, and then deriving insights from it.
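As a rough illustration of the "wall of text to structure" step, here is a minimal sketch, not the pipeline I actually run. The `ArticleRecord` fields, the prompt wording, and the `call_llm` function are all hypothetical: `call_llm` stands in for whatever hosted API client you use, taking a prompt string and returning the model's text reply.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ArticleRecord:
    """Hypothetical target schema for one piece of text."""
    headline: str
    date: Optional[str]
    people: List[str]
    summary: str


PROMPT_TEMPLATE = (
    "Extract the following fields from the text and reply with JSON only: "
    "headline (string), date (string or null), people (list of strings), "
    "summary (one sentence).\n\nTEXT:\n{text}"
)


def extract_record(text: str, call_llm: Callable[[str], str]) -> Optional[ArticleRecord]:
    """Ask the model for JSON and validate it; return None if the reply is unusable."""
    reply = call_llm(PROMPT_TEMPLATE.format(text=text))
    try:
        data = json.loads(reply)
        return ArticleRecord(
            headline=str(data["headline"]),
            date=data.get("date"),
            people=[str(p) for p in data.get("people", [])],
            summary=str(data["summary"]),
        )
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Malformed or incomplete replies are skipped rather than crashing a long batch.
        return None
```

Passing the model call in as a function keeps the parsing and validation testable without network access, and returning `None` on bad replies lets a batch job log the failure and move on.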

One of the datasets that I’ve been looking at is the Library of Congress Chronicling America dataset. Overall, it was 1 TB of pure text from newspapers published between 1789 and 1963. The dataset is a goldmine: a treasure trove of information that can be used to understand the history of the United States.

I’m working on making it structured so that it can be used for further analysis. I’m excited to see what insights we can derive from it.
