intern

From: Nate Hill 
------------------------------------------------------
Hi all,
Over at the library in our local history department we've got some pretty
neat oral histories.
The transcripts are all typed out on paper and the content is all burned to
CDs.
I'd love to find an intern, perhaps a student, who would be interested in
OCRing all of those transcripts and making everything accessible on the web.
If you have experience with this kind of thing and want to take on a
project, please drop me a note.
Thanks
Nate

-- 
Nate Hill
nathanielhill@gmail.com
http://4thfloor.chattlibrary.org/
http://www.natehill.net

=============================================================== From: Ed King ------------------------------------------------------ instead of OCR, I wonder if this project would be a good way to (ab)use Google's speech-to-text API. Are the original audio recordings available?

=============================================================== From: Nate Hill ------------------------------------------------------ Great idea! They are. Ed, are you volunteering? :)

=============================================================== From: Ed King ------------------------------------------------------ sure, I'll take a crack at it. Are the audio recordings already digitized? I can digitize the recordings if they are not already digitized.

I'm thinking I'd write a script to process the digitized file:

  - use sox to break the file into small chunks
  - feed the small audio chunks to Google speech-to-text
  - save the text to a file, append mode
  - repeat until the entire audio file is converted to text

Once I get a script that works, I'll share the code; perhaps other folks could jump in and we can turn this into a distributed, parallel processing project ;-)

What kind of "volume" are we looking at? (How many oral histories, how long is a typical recording, etc.)
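For anyone following along, the loop Ed describes could be sketched roughly like this in Python. This is only an illustration, not Ed's actual script: the function names, the 10-second chunk length, and the `chunk.wav` output name are all assumptions, and the speech-to-text call is deliberately left as a `transcribe` callable you supply, since the thread does not say how the Google service would be called.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def sox_split_command(src: str, out_pattern: str, seconds: int) -> List[str]:
    """Build the sox argument list that cuts `src` into fixed-length chunks."""
    # "trim 0 N : newfile : restart" is the usual sox idiom for splitting a
    # file into consecutive pieces; sox adds a sequence number to each output.
    return ["sox", src, out_pattern,
            "trim", "0", str(seconds), ":", "newfile", ":", "restart"]


def split_audio(src: str, out_dir: str, seconds: int = 10) -> List[Path]:
    """Run sox (must be installed) and return the chunk files in order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(sox_split_command(src, str(out / "chunk.wav"), seconds),
                   check=True)
    return sorted(out.glob("chunk*.wav"))


def transcribe_chunks(chunks: Iterable[Path],
                      transcribe: Callable[[Path], str],
                      out_path: str) -> int:
    """Feed each chunk to `transcribe` and append the text to `out_path`.

    `transcribe` is a placeholder (assumption) for whatever speech-to-text
    call ends up being used. Returns the number of chunks processed.
    """
    count = 0
    for chunk in chunks:
        text = transcribe(chunk).strip()
        # Append mode, so an interrupted run keeps what it already wrote.
        with open(out_path, "a", encoding="utf-8") as fh:
            fh.write(text + "\n")
        count += 1
    return count
```

Usage would be something like `transcribe_chunks(split_audio("interview.wav", "chunks"), my_transcriber, "interview.txt")`, where `interview.wav` and `my_transcriber` are made-up names. Keeping the transcriber as a plain function argument also means different volunteers could each take a batch of chunks if this turns into a shared effort.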

=============================================================== From: rdflowers ------------------------------------------------------ I am volunteering, Nate, using either method. ----- Message from nathanielhill@gmail.com --------- Date: Wed, 19 Sep 2012 16:19:04 -0400 From: Nate Hill Reply-To: Chattanooga Unix Gnu Android Linux Users Group Subject: Re: [Chugalug] intern To: Chattanooga Unix Gnu Android Linux Users Group

=============================================================== From: Ryan Van Dolson ------------------------------------------------------ Looks like I was too slow to throw my hat in the ring. But if this becomes a collaborative effort, count me in! :)

=============================================================== From: Sean Brewer ------------------------------------------------------ I wish there was an open source reCAPTCHA. This would be a great way for libraries to digitize their archives easily.

=============================================================== From: Rob Huston ------------------------------------------------------ While I don't have the technical expertise, if proofreading becomes necessary, I could volunteer some time. Thanks, -rob From: chugalug-bounces@chugalug.org [mailto:chugalug-bounces@chugalug.org] On Behalf Of Sean Brewer Sent: Wednesday, September 19, 2012 11:29 PM To: Chattanooga Unix Gnu Android Linux Users Group Subject: Re: [Chugalug] intern

=============================================================== From: Nate Hill ------------------------------------------------------ isn't that what CAPTCHA is doing now? I thought that was the genius behind it... that every time you fill one out you are helping with character correction in a digitization project. This would be an interesting thing to make. A lot of libraries and businesses have a 'labs' division. I'm sort of toying with giving our library a 'public labs' division that could meet and work on things like this during regular events like this 'Hack the Library' thing I'm sort of cooking up right now (stay tuned). What kind of resources might go into making something like this?

=============================================================== From: Rob Huston ------------------------------------------------------ Damnit, that was more meant for Nate than for the group, as a whole, and here I went and made it worse by double posting. Sorry -r From: Rob Huston [mailto:hellinabucket@gmail.com] Sent: Thursday, September 20, 2012 8:37 AM To: 'Chattanooga Unix Gnu Android Linux Users Group' Subject: RE: [Chugalug] intern

=============================================================== From: Nate Hill ------------------------------------------------------ Amazing overwhelming response from everyone! Thanks. Ed got to me first so I'm gonna pass him the files to play with. Once we have digital transcripts of everything there is an opportunity to do all kinds of fun visualizations and ngrams and stuff, so stay tuned... On Thu, Sep 20, 2012 at 8:45 AM, Rob Huston wrote:

=============================================================== From: Ed King ------------------------------------------------------ Based on the translation accuracy I've seen from Google's speech-to-text results, I suspect that there will be a fair amount of proofreading that will need to be done, and I would be glad to have some help with that part of the project. Let me get the files and start chewing on them... I'll keep everyone updated with my progress and share my scripts/notes, etc.

=============================================================== From: Rod-Lists ------------------------------------------------------ Ripping CDs should be no problem. What file format do you want? It is amazing what my wife is doing with Wiltsy diaries. Not everyone can font match and correct 200-year-old fonts.

=============================================================== From: Sean Brewer ------------------------------------------------------ That's what reCAPTCHA is doing, yeah. But as far as I know, they aren't accepting collections outside of the New York Times and the books in Google's collection for Google Books.

Basically it would go like this:

  - Scan items
  - Automatically extract word images from the scans and store them (not sure how to do this)
  - Pair unknown words from the scans with known ones for the user to digitize, and repeat until certain requirements are met

I'd have to check von Ahn's paper for more details, but that's the gist. There's also Distributed Proofreaders (http://www.pgdp.net/c/); that would be another way to do it.
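To make the "pair unknown words with known ones and repeat until certain requirements are met" step a bit more concrete, here is a rough Python sketch of one way the bookkeeping could work. It is a guess at the general idea, not reCAPTCHA's actual implementation and not taken from von Ahn's paper: the `WordConsensus` class, the `min_votes` and `min_agreement` thresholds, and the image IDs are all made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


def normalize(word: str) -> str:
    """Compare answers case-insensitively and ignore stray whitespace."""
    return word.strip().lower()


class WordConsensus:
    """Collect typed answers for unknown word images until people agree."""

    def __init__(self, min_votes: int = 3, min_agreement: float = 0.75):
        # Both thresholds are illustrative assumptions, not published values.
        self.min_votes = min_votes
        self.min_agreement = min_agreement
        self.votes: Dict[str, Counter] = defaultdict(Counter)

    def submit(self, image_id: str, unknown_answer: str,
               control_answer: str, control_truth: str) -> bool:
        """Count a vote only if the user got the known (control) word right."""
        if normalize(control_answer) != normalize(control_truth):
            return False
        self.votes[image_id][normalize(unknown_answer)] += 1
        return True

    def resolved(self, image_id: str) -> Optional[str]:
        """Return the agreed reading, or None if more votes are needed."""
        tally = self.votes.get(image_id)
        if not tally:
            return None
        total = sum(tally.values())
        word, count = tally.most_common(1)[0]
        if total >= self.min_votes and count / total >= self.min_agreement:
            return word
        return None
```

The known word acts as the check that a person is answering carefully; only then does their reading of the unknown word count toward the tally. Words that never reach agreement would presumably fall through to human proofreaders.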

=============================================================== From: "Alex Smith (K4RNT)" ------------------------------------------------------ That would probably take a *very long time*, longer than just having someone digitize what OCR can't accomplish automatically.

=============================================================== From: Dan Lyke ------------------------------------------------------ On Thu, 20 Sep 2012 11:12:10 -0400 "Alex Smith (K4RNT)" wrote: I thought the point with reCAPTCHA was to deal with some of the cases that OCR doesn't do very well. And, especially given some of the silliness people are using the Amazon Turk for, never underestimate the power of an infinite number of monkeys on typewriters... Dan