library of congress information
Looking through the library of congress for a project
August 17, 2024 · 1 min read · tech, programming
Library of Congress is really an incredible repository of content. Just scrolling through library of congress is a fascinating mental exercise.
One of my focus was creating news snippets across time from newspapers. Library of Congress has a sereis called Chronicling America . which has collected photocopied and OCRd newspapers from 1800s
It’s incredibly well managed and beautifully managed.
Every newspaper is identified using an LCCN and there’s pretty viewer as well a bulk downloader. which provides all content with the data. Based on my initial exploration the OCR is fairly on point and I like how it handles the column and row separations really well.
I’m working on converting this wall of text into a dataset of news content with date.
bulk downloading.
based on the task assignment each newspaper is bundled as tar files.
There’s a file count limit of 10 file for 10 minutes. But other than that everything is fine.
lccn
Library of Congress Control Number is a well thought system to identify individual resources. there’s an easy way to go from lccn to newspaper name and information.