Argh. Well, it's really not quite out of the box, at least on Ubuntu Server 9.10 (after the reinstallations required by the initial failures…), so this is just a selection of the fixes that made the import and indexing of 160 000 files (102 GB) possible. For KnowledgeTree 3.7.0.2 Commercial Edition (the same holds true for the Community Edition), the following should help:
- Use the best-practice advice when doing the local file system import: do 10 000 files at a time rather than 100 000 at once. Really. Trust me. It defeats the whole idea of just running a batch job, completely. You'd expect to be able to just say "transfer all the data in directory X", but alas, that doesn't work. So do it in batches. Manually.
- The Tika Apache Indexer for Lucene: not so much on PDF, DOC, XLS or PPT files. Install catdoc (which includes catppt and xls2csv) and pdftotext (which you'll find in xpdf-utils):
apt-get install catdoc xpdf-utils
- Modify knowledgetree/search2/indexing/extractors/TikaApacheExtractor.inc.php and comment out the mime types affected above (PDF, XLS, DOC and PPT) in the array returned by getSupportedMimeTypes():
'application/pdf'
'application/vnd.ms-excel'
'application/vnd.ms-powerpoint'
'application/msword'
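For reference, here's roughly what the edited method ends up looking like. Treat it as a sketch: the exact declaration and whatever other entries your version of the file returns may well differ.

```php
// Sketch of the edited getSupportedMimeTypes() in TikaApacheExtractor.inc.php.
// Assumption: the method just returns an array of mime-type strings; any other
// entries your file lists stay in place, only these four get commented out.
function getSupportedMimeTypes()
{
    return array(
        // ... any other mime types stay as they are ...

        // commented out so the dedicated extractors handle these formats instead:
        // 'application/pdf',
        // 'application/vnd.ms-excel',
        // 'application/vnd.ms-powerpoint',
        // 'application/msword',
    );
}
```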
Then it's a matter of updating the indexing parsers and commenting out the blank array they return:
PDFExtractor.inc.php
ExcelExtractor.inc.php
PowerpointExtractor.inc.php
WordExtractor.inc.php
Look for the same getSupportedMimeTypes() function in each and comment out the early return array(), which overrides the extractor's own declaration of which mime types it supports. Typically, for the Excel file format, it's just:
// return array();
return array( 'application/vnd.ms-excel');
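To make the pattern concrete, here's a sketch of the same edit as it would look in PDFExtractor.inc.php (and likewise in PowerpointExtractor.inc.php and WordExtractor.inc.php with their own mime types). The declaration and anything else in the function body is assumed; the actual change is only commenting out the early return.

```php
// Sketch of the edit in PDFExtractor.inc.php; only the commented-out
// "return array();" is the real change, the rest is assumed.
function getSupportedMimeTypes()
{
    // return array();                  // this early return disabled the extractor
    return array('application/pdf');    // let the PDF extractor advertise its mime type again
}
```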
- Update the database and force a refresh of the mime table that's used for the extractors etc.:
UPDATE system_settings SET value = 0 WHERE name = 'mimeTypesRegistered';
- You’ll find another write-up that I used as a source here.
- Force indexing manually
- You're almost done. When trying to run cronIndexer.php manually (which I needed to do, as after 1997 files no further indexing was being done, with no documentIndexer.lock files present), there were complaints about the getFileSize() function for $document. Here, /usr/share/knowledgetree/search2/bin/cronIndexer.php needs modification (a sketch of the patched function follows this list):
  - In the public function processDocument, change the signature from processDocument($document, $docinfo) to processDocument($docinfo).
  - After Indexer::incrementCount(); insert $document = & Document::get($docinfo['document_id']);
  - Removing the passed $document variable…? Yup, it seems that what was being passed in as $document was really $docinfo, else it all would have failed completely, I guess…
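Putting those three changes together, the patched function should look roughly like this sketch. Everything apart from the changed signature and the added Document::get() line is assumed and abbreviated; your actual function body will contain more than this.

```php
// Minimal sketch of the patched processDocument() in cronIndexer.php.
public function processDocument($docinfo)   // was: processDocument($document, $docinfo)
{
    Indexer::incrementCount();

    // Added: load the real Document object from the id in $docinfo instead of
    // trusting the old $document parameter (which turned out to be $docinfo).
    $document = & Document::get($docinfo['document_id']);

    // ... the rest of the original function keeps using $document and $docinfo ...
}
```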
- Calling the cronIndexer.php manually did not work with plain php for me, but it did with zend-php:
/usr/local/zend/bin/php -f /usr/share/knowledgetree/search2/bin/cronIndexer.php
Perhaps you don't have any of these issues. If so, good for you! If you do, I hope this helps.