Old Computer Mag Search Engine

Post by **IngoQ** » Sat Sep 02, 2017 10:16 am

Hiya,

I recently had the idea (most likely I was not the first) to set up a site where you can full-text search old computer mags. Technically it should not be too hard, provided I can find a simple file-based search engine that will search PDFs.

The more difficult task will be finding good-quality PDFs of the mags and OCRing them, if that has not already been done. The finding part might be the easier one, because there are already a bunch of collections. But so far I have not found any that were already OCRed, and some are not even of sufficient quality to allow OCR.

So the general process would be (after having set up the system of course):

1. Find collections of relevant computer mags
2. Check quality
3. Upload them to the search engine server
4. OCR them
5. Build an index for them

What do you think of this idea? Would you use a site like that and find it helpful? Would you participate, for example by finding collections? Where do you see possible issues or problems? Do you know of a project like this, or has it already been done somewhere?

Let me know

Post by **IngoQ** » Sat Sep 02, 2017 2:36 pm

I did some testing with PDF files to see how well OCRing will actually work. Here are some examples of the results:

First of all, an excerpt of the German "Atari Magazin". Although it looks okay at first glance, the quality is actually so bad that it is not even recognised as text. File size is about 16-20 MB per issue:

Atari-Magazin-87-01.excerpt.pdf (435.47 KiB)

Next is one of exxos's scans of "Atari computing". The file size of this one was 150 MB and, as you can imagine, the quality is really good:

AC1.ocr.excerpt.pdf (3.55 MiB)

And last but not least the "ST Amiga Format" from here: http://stformat.com/
File size is 44 MB and the quality is still good enough to allow OCR:

staf01.ocr.excerpt.pdf (3.22 MiB)

Post by **exxos** » Sat Sep 02, 2017 5:18 pm

Interesting idea

I know it would be useful when hunting for info.

I did consider it some years ago, but the OCR got so confused and made so many mistakes that I gave up in the end. Though if you can do it for mags like STF which are online, then that would be really cool. Though a huge amount of work in checking every word?

A searchable database would be good. I guess if each issue was done as a simple text file, then a PHP script could search each issue's text and state the page and PDF it is located in.

If you check my floppy shop site, that does just that: it searches through all the texts, and does it pretty fast as well.
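A minimal sketch of the plain-text search described above, written in Python rather than PHP. It assumes each issue has already been reduced to a list of page texts; the `ISSUES` data, the file names, and the function name `search_issues` are made up for illustration and are not taken from any existing site.

```python
from typing import Dict, List, Tuple

# Hypothetical sample data: PDF name -> list of page texts (list index 0 is page 1).
ISSUES: Dict[str, List[str]] = {
    "issue01.pdf": ["Welcome to the first issue", "GFA Basic tutorial part one"],
    "issue02.pdf": ["Letters page", "More GFA Basic tips and a blitter article"],
}


def search_issues(issues: Dict[str, List[str]], term: str) -> List[Tuple[str, int]]:
    """Return (pdf_name, page_number) for every page containing the term.

    This is a plain case-insensitive substring search over every page,
    with no index involved.
    """
    needle = term.lower()
    hits = []
    for pdf_name in sorted(issues):
        for page_number, text in enumerate(issues[pdf_name], start=1):
            if needle in text.lower():
                hits.append((pdf_name, page_number))
    return hits
```

Calling `search_issues(ISSUES, "gfa basic")` lists every PDF and page where the phrase occurs. Every query reads every page, which is fine for a small collection but grows linearly with the amount of text.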

Post by **IngoQ** » Sat Sep 02, 2017 5:51 pm

exxos wrote: ↑Sat Sep 02, 2017 5:18 pm Though a huge amount of work in checking every word?

Yes, it definitely would, but I am not insane enough to try.

My two options for OCRing are ABBYY FineReader and Tesseract. Even with the smaller ST Format issue I get recognition rates of about 99%. This is absolutely sufficient for a full-text search. The whole process for this issue (100 pages) took roughly 5 minutes on my system (from PDF to PDF/A with text layer), with no interaction required. And I guess this could still be improved; it's only a test to get a rough idea.
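One way a recognition rate like this could be estimated, as a rough sketch: hand-type a short reference passage from a page and compare it with the OCR output. How the 99% figure above was actually measured is not stated, so the metric below (share of reference characters that the OCR output reproduces in order) and the name `recognition_rate` are assumptions.

```python
from difflib import SequenceMatcher


def recognition_rate(reference: str, ocr_output: str) -> float:
    """Fraction of reference characters that also appear, in order, in the OCR output."""
    if not reference:
        return 1.0
    # autojunk=False stops difflib from ignoring very frequent characters on long texts.
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)
```

One misread character in a 100-character passage gives 0.99 by this measure. Note that `SequenceMatcher` gets slow on very long inputs, so a sample of a few paragraphs per issue is more practical than whole pages.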

Regarding the interface I am not completely sure. I have set up a test machine with Open Semantic Search, but I am not entirely happy with it...

2017-09-02 18_49_17-Search word processor.png (49.13 KiB)

Post by **exxos** » Sat Sep 02, 2017 6:39 pm

IngoQ wrote: ↑Sat Sep 02, 2017 5:51 pm Yes, it definitely would, but I am not insane enough to try.
My two options for OCRing are ABBYY FineReader and Tesseract. Even with the smaller ST Format issue I get recognition rates of about 99%. This is absolutely sufficient for a full-text search. The whole process for this issue (100 pages) took roughly 5 minutes on my system (from PDF to PDF/A with text layer), with no interaction required. And I guess this could still be improved; it's only a test to get a rough idea.

Sounds like you could blast through the OCR'ing. I guess we could get some people into checking against the original text and such. I know when I tried it, it got confused between 0, o and O, which really broke some code examples.

IngoQ wrote: ↑Sat Sep 02, 2017 5:51 pm Regarding the interface I am not completely sure. I have set up a test machine with Open Semantic Search, but I am not entirely happy with it...

It looks fine for a test. I can't remember off-hand what programming my PDL site took, but it is basic and does the job. So that type of thing could be an option on a final site. I mean, searching a file doesn't really take much programming; it's basically just a string search.

Post by **IngoQ** » Sat Sep 02, 2017 6:54 pm

exxos wrote: ↑Sat Sep 02, 2017 6:39 pm Sounds like you could blast through the OCR'ing. I guess we could get some people into checking against the original text and such. I know when I tried it, it got confused between 0, o and O, which really broke some code examples.

Yes, I would not rely on OCRed source code at all. Even humans have difficulty distinguishing zero from the letter O, the number 1 from upper and lower case L, etc. But my goal is to allow full-text search, not copy and paste of source code. Otherwise this would take years for only one mag series, and I plan to have many of them.
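For search (as opposed to copy and paste), one possible way to live with these confusions is to fold look-alike characters to a single form in both the OCRed text and the query before comparing. This is only a sketch of that idea, not something anyone in the thread has implemented; the confusion table is an assumption based on the examples mentioned here (0/O/o and 1/l), with lower-case i added as a further guess.

```python
# Assumed look-alike groups; extend or trim to match the errors your OCR actually makes.
_CONFUSABLE = str.maketrans({"o": "0", "l": "1", "i": "1"})


def search_key(text: str) -> str:
    """Lower-case the text and fold look-alike characters to one representative."""
    return text.lower().translate(_CONFUSABLE)


def tolerant_contains(page_text: str, query: str) -> bool:
    """True if the query occurs in the page once both sides are folded."""
    return search_key(query) in search_key(page_text)
```

With this, a query for `1040ST` still finds a page where the OCR produced `1O4OST`. The price is more false positives, since distinct words that differ only in these characters now collide.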

exxos wrote: ↑Sat Sep 02, 2017 6:39 pm It looks fine for a test. I can't remember off-hand what programming my PDL site took, but it is basic and does the job. So that type of thing could be an option on a final site. I mean, searching a file doesn't really take much programming; it's basically just a string search.

In theory, yes. Searching one issue is not a problem, although it would take a couple of seconds. Multiply this by 100 issues and you are into minutes. Multiply by several mags... several languages... you get the idea. This is why building an index is the only way.

I used "Open Semantic Search" https://www.opensemanticsearch.org/ which in turn uses Apache Solr for the index search. This is a nice basis, but I am not entirely satisfied with the interface. Right now it only tells you that your keywords appear in a specific issue, but not where. But I guess you're right, it will be sufficient for a start. More important are the OCRed PDFs.
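To illustrate why an index helps, here is a toy inverted index that also records the page, so a hit says where in the issue the words appear. It is a sketch of the general idea only and does not reflect how Solr or Open Semantic Search work internally; the function names and the input layout (PDF name mapped to a list of page texts) are assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Location = Tuple[str, int]  # (pdf_name, page_number)


def tokenize(text: str) -> List[str]:
    """Split text into lower-case runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(issues: Dict[str, List[str]]) -> Dict[str, Set[Location]]:
    """Map every word to the set of (pdf_name, page_number) where it occurs."""
    index: Dict[str, Set[Location]] = defaultdict(set)
    for pdf_name, pages in issues.items():
        for page_number, text in enumerate(pages, start=1):
            for word in tokenize(text):
                index[word].add((pdf_name, page_number))
    return index


def lookup(index: Dict[str, Set[Location]], query: str) -> List[Location]:
    """Return the pages that contain every word of the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)
```

The index is built once after OCR. A query then costs a few dictionary lookups and set intersections, however many issues have been indexed, instead of a scan over all the text.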

Post by **exxos** » Sat Sep 02, 2017 7:04 pm

I've not heard of Apache Solr.. sounds like it's built for this stuff.

I do remember that years ago I did a search system (can't remember what for) in VB6. It was really slow, so I asked in the groups, and a few people wrote methods, which was really interesting. The best one was where all the words were sorted alphabetically, so rather than searching the entire text for one word, you would only need to find the first occurrence of that word, and all the other occurrences would be right after it, so you only had to loop until the end of the run for that word.
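The sorted-word approach could look roughly like this in Python. The original was VB6 and its details are not given, so this is a reconstruction under assumptions: words are stored together with their position in the text, and the first occurrence is found with a binary search (Python's `bisect`) rather than whatever the original used.

```python
from bisect import bisect_left
from typing import List, Tuple


def build_sorted_words(text: str) -> List[Tuple[str, int]]:
    """Return (word, position) pairs sorted alphabetically by word, then by position."""
    words = text.lower().split()
    return sorted((word, pos) for pos, word in enumerate(words))


def find_positions(sorted_words: List[Tuple[str, int]], word: str) -> List[int]:
    """Return all positions of the word by scanning only its run in the sorted list."""
    word = word.lower()
    # Positions are never negative, so (word, -1) sorts just before the first real entry.
    i = bisect_left(sorted_words, (word, -1))
    positions = []
    while i < len(sorted_words) and sorted_words[i][0] == word:
        positions.append(sorted_words[i][1])
        i += 1
    return positions
```

Finding the start of the run takes a logarithmic number of comparisons, and the loop then touches only the entries for that one word rather than the whole text.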
