KnowledgeTree 3.7.0.2 Document Indexing and Indexer issues (SOLVED)

code, knowledgetree, open source software, php, sysadmin, ubuntu

Argh. Well, it’s really not quite out of the box, at least on Ubuntu Server 9.10 (even after the reinstallations that the initial failures required), so this is a selection of the fixes that made importing and indexing 160 000 files (102 GB) possible. For KnowledgeTree 3.7.0.2 Commercial Edition (the same holds true for the Community Edition), the following should help:

Follow the best-practice advice when doing the local file system import: do 10 000 files at a time rather than 100 000 at once. Really. Trust me. It defeats the whole idea of just running a batch job. Completely. You’d expect to be able to say “just transfer all data in directory X”. But alas, that doesn’t work. So do it in batches. Manually.
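If the files to import sit in one flat directory, the batching can at least be prepared with a small script. The following is only a sketch under that assumption: the function name split_into_batches, the batch_NNNN directory naming, and the paths in the usage note are all made up for illustration, and it moves files without descending into subdirectories.

```shell
#!/usr/bin/env bash
# Sketch: split a flat source directory into numbered batch directories so
# that each batch can be imported separately. All names are illustrative.

# split_into_batches SRC DEST SIZE
# Moves the regular files directly inside SRC into DEST/batch_0001,
# DEST/batch_0002, ... with at most SIZE files per batch directory.
# Prints the number of batch directories created.
split_into_batches() {
    local src="$1" dest="$2" size="$3"
    local count=0 batch=0 dir="" f

    for f in "$src"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and an unmatched glob
        if [ $((count % size)) -eq 0 ]; then
            batch=$((batch + 1))         # start a new batch every SIZE files
            dir=$(printf '%s/batch_%04d' "$dest" "$batch")
            mkdir -p "$dir"
        fi
        mv -- "$f" "$dir/"
        count=$((count + 1))
    done
    echo "$batch"
}
```

A call such as split_into_batches /data/to-import /data/batches 10000 (both paths hypothetical) would leave one directory per batch, each of which can then be pointed at by a separate import run. Try it on a copy first, since the files are moved rather than copied.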
The Tika Apache Indexer for Lucene doesn’t cope well with PDF, DOC, XLS or PPT files. Install catdoc (which includes catppt and xls2csv) and pdftotext (which you’ll find in xpdf-utils) instead.
1. apt-get install catdoc xpdf-utils
2. Modify knowledgetree/search2/indexing/extractors/TikaApacheExtractor.inc.php and comment out the affected mime types (PDF, XLS, DOC and PPT) in the array returned by getSupportedMimeTypes():
  - 'application/pdf'
  - 'application/vnd.ms-excel'
  - 'application/vnd.ms-powerpoint'
  - 'application/msword'
  Then update the dedicated extractors in the same directory, which return a blank array by default:
  - PDFExtractor.inc.php
  - ExcelExtractor.inc.php
  - PowerpointExtractor.inc.php
  - WordExtractor.inc.php
  In each of these, look for the same getSupportedMimeTypes() function and comment out the return array(); statement, which overrides the mime type the extractor would otherwise report. Typically the change is just
  // return array();
  return array('application/vnd.ms-excel');
  for the Excel file format.
3. Update the database to force the mime type table that’s used for the extractors to be registered again:
  UPDATE system_settings SET value = 0 WHERE name = 'mimeTypesRegistered';
4. You’ll find another write-up that I used as a source here.
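Before re-running the indexer, it can be worth confirming that the external tools from step 1 are actually reachable. This is a minimal sketch; missing_tools is a made-up helper name, and it only checks the PATH of the shell it runs in, which may differ from the environment the web server or cron job uses.

```shell
#!/usr/bin/env bash
# Sketch: report which external text-extraction tools are not on PATH.

# missing_tools TOOL...
# Prints the name of every TOOL that cannot be resolved, one per line.
# Prints nothing when all tools are found.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running missing_tools catdoc catppt xls2csv pdftotext should print nothing if the packages installed correctly; any name it does print is a tool the extractors will not be able to call from that environment.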
Force indexing manually
1. You’re almost done. I needed to run cronIndexer.php manually, because after 1997 files no further indexing was being done, even though no documentIndexer.lock files were present. Running it produced complaints about the getFileSize() function for $document. To fix this, /usr/share/knowledgetree/search2/bin/cronIndexer.php needs modification:
  - In the public function processDocument, change the signature from processDocument($document, $docinfo) to processDocument($docinfo)
  - after
    Indexer::incrementCount();
    insert
    $document = & Document::get($docinfo['document_id']);
  - Why remove the passed $document variable? It seems that what arrives as $document is actually $docinfo; otherwise everything would fail completely, I guess…
2. Calling cronIndexer.php manually did not work with php for me, but it did with zend-php:
  - /usr/local/zend/bin/php -f /usr/share/knowledgetree/search2/bin/cronIndexer.php
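Since the working PHP binary can differ between machines, a wrapper can try a list of candidates in order. This is only a sketch: pick_php is a hypothetical helper, and the candidate list in the usage comment simply mirrors the Zend path above followed by whatever php is on PATH.

```shell
#!/usr/bin/env bash
# Sketch: choose the first usable PHP binary from a list of candidates.

# pick_php CANDIDATE...
# Prints the first candidate that resolves to an executable (either an
# absolute path or a command name on PATH). Returns non-zero if none does.
pick_php() {
    local candidate
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Example use (commented out so that sourcing this file has no side effects):
# php_bin=$(pick_php /usr/local/zend/bin/php php) &&
#     "$php_bin" -f /usr/share/knowledgetree/search2/bin/cronIndexer.php
```

Listing the Zend binary first keeps the behaviour described above, with the plain php command only as a fallback.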

Perhaps you don’t have any of these issues. If not, good for you! If you do, hope this helps.

document indexing document management knowledgetree solved

Comments

4 responses to “KnowledgeTree 3.7.0.2 Document Indexing and Indexer issues (SOLVED)”

March 22, 2010

Sven Welzel

Quick update: use of cronIndexer.php is deprecated, it seems.

Rather call the document processor cron in /usr/share/knowledgetree/search2/bin/cronDocumentProcessor.php.

This also resolves the issue that would keep popping up relating to a Fatal error: Call to a member function temporaryFile() on a non-object in /usr/share/knowledgetree/search2/indexing/indexerCore.inc.php on line 1434.

So replace the last point above with /usr/local/zend/bin/php -f /usr/share/knowledgetree/search2/bin/cronDocumentProcessor.php

Happy thoughts!!
April 7, 2010

Sven Welzel

Here’s a fun new one:

Cannot use object of type Document as array in /usr/share/knowledgetree/search2/indexing/indexerCore.inc.php on line 1347

More on that when it’s solved…
June 29, 2011

Scott

Make sure the cache directory exists. For some reason, the cache directory under /knowledgetree/var/cache doesn’t exist on the initial install (I think; I can’t imagine I deleted it by accident), and without it, the indexer can’t write a couple of lock files to it. Without it, I get the error Sven described about using Document as an array.
June 29, 2011

Sven Welzel

Thanks, Scott!

Will give this a try at the next install…
