Old Computer Mag Search Engine

IngoQ
Posts: 399
Joined: Tue Aug 22, 2017 8:38 am
Location: Germany

Post by IngoQ » Sat Sep 02, 2017 10:16 am

Hiya,

I recently had the idea (most likely I was not the first) to set up a site where you can full-text search old computer mags. Technically it should not be too hard, provided I can find a simple file-based search engine that will search PDFs.

More difficult will be the task of finding good-quality PDFs of mags and OCRing them, if that has not already been done. Finding them might be the easier part, because there are already a bunch of collections. But so far I have not found any that were already OCRed, and some do not even have the quality to allow OCR.

So the general process would be (after having set up the system of course):
  1. Find collections of relevant computer mags
  2. Check quality
  3. Upload them to the search engine server
  4. OCR them
  5. Build an index for them
What do you think of this idea? Would you use a site like that and find it helpful? Would you participate, for example in finding collections? Where would you see possible issues or problems? Do you know of a project like this, or is it already done somewhere?

Let me know :)
Ingo :geek:

“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.” - Antoine de Saint-Exupéry

Post by IngoQ » Sat Sep 02, 2017 2:36 pm

I did some testing with PDF files to see how well OCRing will actually work. Here are some examples of the results:

First of all, an excerpt of the German "Atari Magazin". Although it looks okay at first glance, the quality is actually so bad that it is not even recognised as text. File size is about 16-20 MB per issue:
Atari-Magazin-87-01.excerpt.pdf
(435.47 KiB)
Next is one of exxos's scans of "Atari computing". The file size of this one was 150 MB and, as you can imagine, the quality is really good:
AC1.ocr.excerpt.pdf
(3.55 MiB)
And last but not least, the "ST Amiga Format" from here: http://stformat.com/
The file size is 44 MB and the quality is still good enough to allow OCR:
staf01.ocr.excerpt.pdf
(3.22 MiB)
Ingo :geek:

exxos
Site Admin
Posts: 1448
Joined: Wed Aug 16, 2017 11:19 pm
Location: UK

Post by exxos » Sat Sep 02, 2017 5:18 pm

Interesting idea :) I know it would be useful when hunting for info.

I did consider it some years ago, but the OCR got so confused and made so many mistakes that I gave up in the end. Though if you can do it for mags like STF which are online, then that would be really cool. Though a huge amount of work in checking every word?

A searchable database would be good. I guess if each issue was done as a simple text file, then a PHP script could search each issue's text and state the page and PDF it is located in.

If you check my floppy shop site, that does just that, searches through all the texts, does it pretty fast as well.
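
The per-issue text search described above could be sketched roughly like this. It is only an illustration in Python rather than PHP, and it assumes one UTF-8 text file per issue with pages separated by a form feed character (the layout that tools such as pdftotext produce). The function name and file layout are hypothetical, not taken from the floppy shop site.

```python
from pathlib import Path


def search_issues(text_dir, term):
    """Return (issue, page, line) for every line that contains `term`.

    Assumption: one UTF-8 text file per issue, pages separated by "\f".
    """
    needle = term.casefold()
    hits = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        # Splitting on the form feed gives one string per page
        pages = path.read_text(encoding="utf-8").split("\f")
        for page_no, page in enumerate(pages, start=1):
            for line in page.splitlines():
                if needle in line.casefold():
                    hits.append((path.stem, page_no, line.strip()))
    return hits
```

Because the file name is the issue name, each hit already says which PDF and which page to open. Every query reads every file, though, so this only stays fast while the collection is small.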
4MB STFM 1.44 FD- VELOCE+ 020 STE - 4MB STE 32MHz - STFM 16MHz - STM - MEGA ST - Falcon 030 CT60 - Atari 2600 - Atari 7800 - Gigafile - SD Floppy Emulator - PeST - HxC - CosmosEx - Ultrasatan - various clutter

https://www.exxoshost.co.uk/atari/ All my hardware guides - mods - games - STOS
https://www.exxoshost.co.uk/atari/last/storenew/ - All my hardware mods for sale - Please help support by making a purchase.

Post by IngoQ » Sat Sep 02, 2017 5:51 pm

exxos wrote:
Sat Sep 02, 2017 5:18 pm
Though a huge amount of work in checking every word?
Yes, it definitely would, but I am not insane enough to try ;)

My two options for OCRing are ABBYY FineReader and Tesseract. Even with the smaller ST Format issue I get recognition rates of about 99%. This is absolutely sufficient for a full-text search. The whole process for this issue (100 pages) took roughly 5 minutes on my system (from PDF to PDF/A with text layer), no interaction required. And I guess this could still be improved; it's only a test to get a rough idea.
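
For anyone wanting to try the unattended PDF to PDF/A step, here is a minimal sketch. It assumes the OCRmyPDF command-line tool (a wrapper around Tesseract) is installed; the thread does not say which front end was actually used, so that choice and the function names are assumptions.

```python
import subprocess


def ocr_command(src, dst, languages=("eng",)):
    """Build an OCRmyPDF command line that writes a PDF/A with a text layer."""
    return [
        "ocrmypdf",
        "--language", "+".join(languages),  # e.g. "deu+eng" for German mags
        "--output-type", "pdfa",
        src,
        dst,
    ]


def ocr_issue(src, dst, languages=("eng",)):
    # Assumption: ocrmypdf and the Tesseract language data are installed
    subprocess.run(ocr_command(src, dst, languages), check=True)
```

Calling `ocr_issue("staf01.pdf", "staf01.ocr.pdf")` would then process one issue without interaction. The input file name here is just an example.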

Regarding the interface I am not completely sure. I have set up a test machine with Open Semantic Search, but I am not entirely happy with it...
2017-09-02 18_49_17-Search word processor.png
Ingo :geek:

Post by exxos » Sat Sep 02, 2017 6:39 pm

IngoQ wrote:
Sat Sep 02, 2017 5:51 pm
Yes, it definitely would, but I am not insane enough to try ;)
My two options for OCRing are ABBYY FineReader and Tesseract. Even with the smaller ST Format issue I get recognition rates of about 99%. This is absolutely sufficient for a full-text search. The whole process for this issue (100 pages) took roughly 5 minutes on my system (from PDF to PDF/A with text layer), no interaction required. And I guess this could still be improved; it's only a test to get a rough idea.
Sounds like you could blast through the OCR'ing. I guess we could get some people into checking the original text and such. I know when I tried it, it got confused with 0 & o & O.. that really broke some code examples ;)

IngoQ wrote:
Sat Sep 02, 2017 5:51 pm
Regarding the interface I am not completely sure. I have set up a test machine with Open Semantic Search, but I am not entirely happy with it...
It looks fine for a test. I can't remember off-hand what programming my PDL site took, but it is basic and does the job. So that type of thing could be an option on a final site. I mean, searching a file doesn't really take much programming; it's basically just a string search.

Post by IngoQ » Sat Sep 02, 2017 6:54 pm

exxos wrote:
Sat Sep 02, 2017 6:39 pm
Sounds like you could blast through the OCR'ing. I guess we could get some people into checking the original text and such. I know when I tried it, it got confused with 0 & o & O.. that really broke some code examples ;)
Yes, I would not rely on OCRed source code at all. Even humans have difficulty distinguishing zero from the letter O, the number 1 from upper or lower case L, etc. But my goal is to allow full-text search, not copy and paste of source code. Otherwise this would take years for only one mag series, and I plan to have many of them :)
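
One way a search could tolerate exactly these mix-ups is to fold the confusable characters together before comparing. This is not something proposed in the thread, only a sketch of one possible approach; the mapping and the function names are made up for illustration.

```python
# Assumed set of look-alikes: zero/O and one/l/I collapse to one character each
CONFUSABLE = str.maketrans({"0": "o", "1": "l", "i": "l"})


def fold(text):
    """Lower-case text and collapse 0/O and 1/l/I so OCR mix-ups still match."""
    return text.lower().translate(CONFUSABLE)


def matches(query, ocr_text):
    """True if the folded query occurs in the folded OCR text."""
    return fold(query) in fold(ocr_text)
```

With this, a query for MC68000 still finds a page where the OCR produced MC68OOO. The price is some false positives, since words that differ only in those characters become identical.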
exxos wrote:
Sat Sep 02, 2017 6:39 pm
It looks fine for a test. I can't remember off-hand what programming my PDL site took, but it is basic and does the job. So that type of thing could be an option on a final site. I mean, searching a file doesn't really take much programming; it's basically just a string search.
In theory, yes. Searching one issue is not a problem, although it would take a couple of seconds. Multiply this by 100 issues and you are into minutes. Multiply by several mags... several languages... you get the idea. This is why building an index is the only way.
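
To show why an index avoids rescanning every issue, here is a toy inverted index. It is a sketch only, much simpler than what Solr does internally; the tuple layout and function names are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of (issue, page) pairs where it occurs.

    `pages` is an iterable of (issue, page_number, text) tuples.
    """
    index = defaultdict(set)
    for issue, page_no, text in pages:
        for word in re.findall(r"\w+", text.lower()):
            index[word].add((issue, page_no))
    return index


def lookup(index, query):
    """Return the locations that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        # Intersect so that only pages containing all words remain
        result &= index.get(word, set())
    return result
```

The text is read once when the index is built. After that a query costs a few dictionary lookups and set intersections, however many issues are indexed.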

I used "Open Semantic Search" https://www.opensemanticsearch.org/ which in turn uses Apache Solr for the index search. This is a nice basis, but I am not entirely satisfied with the interface. Right now it only tells you that your keywords appear in a specific issue, but not where. But I guess you're right, it will be sufficient for a start. More important are the OCRed PDFs.
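
Solr itself can return highlighted snippets around each match, which might help with the "not where" problem. The sketch below only builds such a query URL. The core name and the field name `content` are assumptions, since the fields Open Semantic Search actually defines are not given here, and the field must be stored in Solr for highlighting to work.

```python
from urllib.parse import urlencode


def solr_query_url(base_url, core, terms, text_field="content"):
    """Build a Solr select URL that also asks for highlighted snippets."""
    params = {
        "q": f"{text_field}:({terms})",
        "hl": "true",         # turn highlighting on
        "hl.fl": text_field,  # field to take the snippets from
        "hl.snippets": 3,     # up to three snippets per document
        "wt": "json",
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"
```

Snippets show the surrounding text but not the page number. Getting the page as well would probably mean indexing each page as its own document, which is a design choice this thread does not settle.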
Ingo :geek:

Post by exxos » Sat Sep 02, 2017 7:04 pm

I've not heard of Apache Solr.. sounds like it's built for this stuff.

I do remember years ago I did a search system (can't remember what for) in VB6. It was really slow, so I asked in the groups; a few people wrote methods and it was really interesting. Basically, the best one was where all the words were sorted alphabetically. So rather than searching the entire text, if you were just looking for one word, you would only need to find its first occurrence, and all the other occurrences would be right after it, so you only had to loop to the end of that word's run of entries.
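
The sorted-word idea can be sketched in a few lines. This is a reconstruction in Python, not the original VB6 code; it uses a second binary search to find the end of the run instead of looping, and all names are made up for the example.

```python
from bisect import bisect_left, bisect_right


def build_sorted_words(entries):
    """Sort (word, location) pairs so equal words sit next to each other.

    Returns two parallel lists: the sorted words and their locations.
    """
    pairs = sorted((word.lower(), location) for word, location in entries)
    words = [w for w, _ in pairs]
    locations = [loc for _, loc in pairs]
    return words, locations


def find_word(words, locations, word):
    """Return every location of `word` using two binary searches."""
    word = word.lower()
    start = bisect_left(words, word)   # first entry equal to word
    end = bisect_right(words, word)    # one past the last such entry
    return locations[start:end]
```

Finding a word then takes a number of comparisons that grows with the logarithm of the word count, rather than a scan of the whole text.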