📚

Library of Congress information

Looking through the library of congress for a project

Library of Congress is really an incredible repository of content. Just scrolling through library of congress is a fascinating mental exercise.

One of my focus was creating news snippets across time from newspapers. Library of Congress has a sereis called Chronicling America. which has collected photocopied and OCRd newspapers from 1800s

It’s incredibly well managed and beautifully managed.

Every newspaper is identified using an LCCN and there’s pretty viewer as well a bulk downloader. which provides all content with the data. Based on my initial exploration the OCR is fairly on point and I like how it handles the column and row separations really well.

I’m working on converting this wall of text into a dataset of news content with date.

Bulk downloading.

based on the task assignment each newspaper is bundled as tar files.

There’s a file count limit of 10 file for 10 minutes. But other than that everything is fine.

LCCN

Library of Congress Control Number is a well thought system to identify individual resources. there’s an easy way to go from lccn to newspaper name and information.