Earlier this year, communications technology scholar Kalev Leetaru began culling more than 14 million images from the Internet Archive’s public domain ebooks and uploading them to the Internet Archive’s Flickr account. As of today, 2.6 million images are easily searchable and downloadable.
When the Internet Archive originally scanned the books, it used Optical Character Recognition (OCR), which made the book text searchable but was little help if you were looking for images. So Leetaru wrote software that takes advantage of the OCR program the Internet Archive had used to scan public domain works written and published between 1500 and 1922.
According to the BBC, the OCR program scanned the books and discarded the sections of each page that it recognized as images. Leetaru had his software go back and find those discarded sections, automatically converting them into JPEG images and uploading them to Flickr. “The software also copied the caption for each image and the text from the paragraphs immediately preceding and following it in the book,” the BBC wrote.
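To make the caption-and-context step concrete, here is a minimal sketch in Python. It is not Leetaru's code, and it does not use the Internet Archive's real OCR output format: the `Block` and `ImageRecord` structures, the `kind` labels, and the function names are all hypothetical, chosen only for illustration. It assumes the OCR output for a page is a list of regions in reading order, each marked as body text, caption, or image, and it collects what the BBC describes: the image's location, its caption, and the paragraphs immediately before and after it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    """One region of a scanned page, in reading order (hypothetical structure)."""
    kind: str                         # "text", "caption", or "image"
    bbox: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
    text: str = ""                    # empty for image regions


@dataclass
class ImageRecord:
    """Metadata that would accompany one extracted image."""
    bbox: Tuple[int, int, int, int]
    caption: Optional[str]
    text_before: Optional[str]
    text_after: Optional[str]


def nearest_text(blocks: List[Block], start: int, step: int) -> Optional[str]:
    """Walk from `start` in direction `step`; return the first body paragraph found."""
    i = start + step
    while 0 <= i < len(blocks):
        if blocks[i].kind == "text":
            return blocks[i].text
        i += step
    return None


def extract_image_records(blocks: List[Block]) -> List[ImageRecord]:
    """Find regions flagged as images and gather their caption and context."""
    records = []
    for i, block in enumerate(blocks):
        if block.kind != "image":
            continue
        # Assumption: a caption, if present, is the block directly after the image.
        nxt = blocks[i + 1] if i + 1 < len(blocks) else None
        caption = nxt.text if nxt is not None and nxt.kind == "caption" else None
        records.append(ImageRecord(
            bbox=block.bbox,
            caption=caption,
            text_before=nearest_text(blocks, i, -1),
            text_after=nearest_text(blocks, i, +1),
        ))
    return records


# A made-up page: paragraph, image, caption, paragraph.
page = [
    Block("text", (50, 40, 550, 200), "The river narrows above the mill."),
    Block("image", (80, 220, 520, 600)),
    Block("caption", (80, 610, 520, 640), "Fig. 3. The old mill."),
    Block("text", (50, 660, 550, 900), "Below the mill the valley widens."),
]

records = extract_image_records(page)
```

In a fuller pipeline, the `bbox` of each record would be used to crop the region out of the page scan and save it as a JPEG before upload. That step is omitted here because it depends on an imaging library and on the scan format.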
Although the tagging for the images is admittedly imprecise, the potential for such an easily accessible archive is massive. “Any library could repeat this process,” Leetaru told the BBC. “That’s actually my hope, that libraries around the world run this same process of their digitized books to constantly expand this universe of images.”