Library of Congress information
Looking through the Library of Congress for a project
August 17, 2024
last update: 2024-11-19
The Library of Congress is an incredible repository of content. Just browsing through its collections is a fascinating mental exercise.
One of my focuses was creating news snippets across time from newspapers. The Library of Congress has a series called Chronicling America, which has collected scanned and OCR'd newspapers from the 1800s.
It’s incredibly well managed.
Every newspaper is identified by an LCCN, and there’s a pretty viewer as well as a bulk downloader, which provides all the content along with its data. Based on my initial exploration, the OCR is fairly on point, and I like how well it handles column and row separations.
I’m working on converting this wall of text into a dataset of news content with dates.
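A minimal sketch of what one row of such a dataset could look like. It assumes each OCR page sits at a path shaped like <lccn>/<yyyy>/<mm>/<dd>/ed-<n>/seq-<n>/ocr.txt, so the date can be read straight from the path. That layout, and the names PageRecord and page_record, are assumptions for illustration and should be checked against the files actually downloaded.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Assumed page layout inside a bulk OCR archive (verify against real files):
#   <lccn>/<yyyy>/<mm>/<dd>/ed-<n>/seq-<n>/ocr.txt
PAGE_PATH = re.compile(
    r"(?P<lccn>[a-z]{0,3}\d{8,10})/(?P<y>\d{4})/(?P<m>\d{2})/(?P<d>\d{2})"
    r"/ed-(?P<ed>\d+)/seq-(?P<seq>\d+)/ocr\.txt$"
)


class PageRecord(NamedTuple):
    """One dataset row: a single newspaper page and its issue date."""
    lccn: str
    issue_date: date
    edition: int
    page: int
    text: str


def page_record(path: str, text: str) -> Optional[PageRecord]:
    """Build a dated record from a page path, or None if the path does not match."""
    m = PAGE_PATH.search(path)
    if m is None:
        return None
    return PageRecord(
        lccn=m["lccn"],
        issue_date=date(int(m["y"]), int(m["m"]), int(m["d"])),
        edition=int(m["ed"]),
        page=int(m["seq"]),
        # Collapse line breaks and runs of spaces left over from the OCR layout.
        text=" ".join(text.split()),
    )
```

Keeping the record at page level is a deliberate simplification: splitting a page into individual articles would need the column and row information from the OCR, which this sketch does not attempt.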
Bulk downloading
Based on the task assignment, each newspaper is bundled as tar files.
There’s a limit of 10 files per 10 minutes, but other than that everything is fine.
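One way to stay under that limit is a small sliding-window budget that tracks recent download times. This is only a sketch: the class name DownloadBudget is made up for illustration, and it treats the limit as 10 downloads in any rolling 600-second window, which is an assumption about how the server counts.

```python
import time
from collections import deque
from typing import Callable, Deque


class DownloadBudget:
    """Sliding-window budget: at most `limit` downloads per `window` seconds."""

    def __init__(self, limit: int = 10, window: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.stamps: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next download is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Forget downloads that have already fallen out of the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            return 0.0
        # The oldest remembered download decides when a slot frees up.
        return self.window - (now - self.stamps[0])

    def record(self) -> None:
        """Note that a download has just been started."""
        self.stamps.append(self.clock())
```

In a download loop, the idea would be to call time.sleep(budget.wait_time()) before each tar file and budget.record() right after starting it.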
LCCN
The Library of Congress Control Number (LCCN) is a well-thought-out system for identifying individual resources. There’s an easy way to go from an LCCN to a newspaper’s name and information.
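A rough sketch of that lookup, assuming the title record for a newspaper is served as JSON at a URL of the form /lccn/<lccn>.json on the Chronicling America site. The URL pattern and the field names used here (name, place_of_publication, start_year, end_year) are assumptions to verify against the current API documentation, and title_url and summarize_title are made-up helper names.

```python
import json
from typing import Dict

# Assumed host and URL pattern; check the current Chronicling America docs.
BASE_URL = "https://chroniclingamerica.loc.gov"


def title_url(lccn: str) -> str:
    """Build the (assumed) JSON URL for a newspaper title from its LCCN."""
    return f"{BASE_URL}/lccn/{lccn.strip().lower()}.json"


def summarize_title(payload: str) -> Dict[str, str]:
    """Pull a few descriptive fields out of a title record's JSON text.

    The field names are assumptions; missing ones fall back to placeholders.
    """
    record = json.loads(payload)
    return {
        "lccn": record.get("lccn", ""),
        "name": record.get("name", ""),
        "place": record.get("place_of_publication", ""),
        "years": f'{record.get("start_year", "?")}-{record.get("end_year", "?")}',
    }
```

The JSON text itself could be fetched with urllib.request.urlopen(title_url(lccn)) from the standard library. Fetching is kept separate from parsing so the parsing can be checked offline.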