exxos wrote: ↑Sat Sep 02, 2017 6:39 pm
Sounds like you could blast through the OCR'ing. I guess we could get some people to check the original text and such. I know when I tried it, it got confused with 0, o and O, and that really broke some code examples.
Yes, I would not rely on OCRed source code at all. Even humans have difficulty distinguishing zero from the letter O, the number 1 from upper- and lower-case L, etc. But my goal is to allow full-text search, not copy and paste of source code. Otherwise this would take years for only one mag series, and I plan to have many of them.
exxos wrote: ↑Sat Sep 02, 2017 6:39 pm
It looks fine for a test. I can't remember offhand what programming my PDL site took, but it is basic and does the job. So that type of thing could be an option on a final site. I mean, searching a file doesn't really take much programming; it's basically just a string search.
In theory, yes. Searching one issue is not a problem, although it would take a couple of seconds. Multiply this by 100 issues and you are into minutes. Multiply by several mags... several languages... you get the idea. This is why building an index is the only way.
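To illustrate the difference, here is a minimal sketch of an inverted index in Python. It is not how Solr works internally, only the basic idea: scan every issue once up front, then answer each query with a dictionary lookup instead of re-reading all the text. The issue names and sample texts are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(issues):
    """Map each word to the set of issue names that contain it.

    `issues` is a dict of issue name -> OCRed text (hypothetical data).
    """
    index = defaultdict(set)
    for name, text in issues.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the issues that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Made-up sample data standing in for OCRed magazine issues.
issues = {
    "issue-01": "Type in this BASIC listing for the blitter demo",
    "issue-02": "Review of a new hard disk and a BASIC compiler",
    "issue-03": "Assembler tutorial: the blitter chip explained",
}
index = build_index(issues)
print(search(index, "basic blitter"))
```

The expensive pass over the text happens once in build_index; after that, the cost of a query depends on the number of query words and matching issues, not on the total amount of scanned text.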
I used "Open Semantic Search" (https://www.opensemanticsearch.org/), which in turn uses Apache Solr for the index search. This is a nice basis, but I am not entirely satisfied with the interface. Right now it only tells you that your keywords appear in a specific issue, but not where. But I guess you're right, it will be sufficient for a start. More important are the OCRed PDFs.
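On the "but not where" part: Solr itself has a highlighting feature that returns the matching text snippets along with the hits. Below is a rough sketch of how such a query URL could be put together in Python. The host, the core name "magazines" and the field name "content" are assumptions for illustration, and I have not checked whether the Open Semantic Search interface exposes this.

```python
from urllib.parse import urlencode

def solr_highlight_url(base_url, core, query, field="content", snippets=3):
    """Build a Solr select URL that asks for highlighted snippets.

    The core and field names are assumptions; adjust them to the real schema.
    """
    params = {
        "q": f"{field}:({query})",  # search only in the given field
        "hl": "true",               # turn highlighting on
        "hl.fl": field,             # field to take snippets from
        "hl.snippets": snippets,    # max snippets per document
        "wt": "json",               # response format
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"

# Hypothetical local Solr instance and core name.
url = solr_highlight_url("http://localhost:8983/solr", "magazines", "blitter")
print(url)
```

The JSON response to such a request has a "highlighting" section next to the normal result list, so a front end could show the surrounding text for each hit. Getting an actual page number would additionally require indexing each page as its own document, or storing the page number in a field.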