[siren-user] basic usage questions
Mike Grove
mike at clarkparsia.com
Wed May 12 20:54:34 IST 2010
On Wed, May 12, 2010 at 2:17 PM, Renaud Delbru <renaud.delbru at deri.org> wrote:
>
>> Is this the correct way I should be saving and indexing the data?
>>
>
> Ok, this is the right way to do it. So you are not storing the data, which is
> good. However, I see that you are calling commit after adding the document.
> Are you doing that each time you add a document?
>
> If the answer is yes, then this is the problem. You should call commit
> after a certain number of documents. Or, let Lucene handle the commit for
> you.
> Here is the way I have configured the IndexWriter for indexing the BTC
> dataset:
>
> dir = new NIOFSDirectory(indexDir);
> final PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new
> StandardAnalyzer(Version.LUCENE_30));
> analyzer.addAnalyzer(DEFAULT_TUPLE_FIELD, new TupleAnalyzer());
> writer = new IndexWriter(dir, analyzer, MaxFieldLength.UNLIMITED);
> // Configure an automatic flush whenever 32MB of RAM is used
> writer.setRAMBufferSizeMB(32);
> writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
> writer.setMaxBufferedDeleteTerms(IndexWriter.DISABLE_AUTO_FLUSH);
> // Disable compound file
> writer.setUseCompoundFile(false);
> // Increase mergeFactor to optimise indexing
> writer.setMergeFactor(20);
>
> By using setRAMBufferSizeMB(32), I am telling Lucene to flush the buffered
> documents to disk when 32MB of RAM is used. This is the best practice with
> Lucene. You can increase it if you have more memory, or decrease it if your
> memory is limited. But even with 8 or 16 MB, you should see a big boost in
> terms of indexing performance.
>
> Then, the other parameters are used to optimise indexing
> (setUseCompoundFile(false) and setMergeFactor(20)).
>
> When the index is created, don't forget to optimise it (call
> IndexWriter.optimize()). This operation will take some time, especially
> if your index is large, but it will improve the query performance.
>
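
To illustrate the first option in the quoted advice (calling commit after a certain number of documents rather than after every one), here is a minimal sketch. It deliberately avoids any Lucene dependency: DocumentSink, BatchingIndexer and CountingSink are hypothetical names invented for this example, and DocumentSink only stands in for the addDocument/commit pair on IndexWriter. The batch size of 1000 is an arbitrary assumption, not a recommendation from this thread.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchCommitDemo {

    /** Hypothetical stand-in for the two IndexWriter operations used here. */
    interface DocumentSink<D> {
        void addDocument(D doc) throws Exception;
        void commit() throws Exception;
    }

    /** Commits once every batchSize documents instead of after each one. */
    static final class BatchingIndexer<D> {
        private final DocumentSink<D> sink;
        private final int batchSize;
        private int pending = 0;

        BatchingIndexer(DocumentSink<D> sink, int batchSize) {
            if (batchSize < 1) {
                throw new IllegalArgumentException("batchSize must be >= 1");
            }
            this.sink = sink;
            this.batchSize = batchSize;
        }

        void add(D doc) throws Exception {
            sink.addDocument(doc);
            pending++;
            if (pending >= batchSize) {
                sink.commit();
                pending = 0;
            }
        }

        /** Commits any documents left over from an incomplete final batch. */
        void finish() throws Exception {
            if (pending > 0) {
                sink.commit();
                pending = 0;
            }
        }
    }

    /** In-memory sink that only counts calls, so the sketch runs without Lucene. */
    static final class CountingSink implements DocumentSink<String> {
        final List<String> docs = new ArrayList<String>();
        int commits = 0;

        public void addDocument(String doc) { docs.add(doc); }
        public void commit() { commits++; }
    }

    public static void main(String[] args) throws Exception {
        CountingSink sink = new CountingSink();
        BatchingIndexer<String> indexer = new BatchingIndexer<String>(sink, 1000);
        for (int i = 0; i < 2500; i++) {
            indexer.add("doc-" + i);
        }
        indexer.finish();
        // prints "2500 documents, 3 commits"
        System.out.println(sink.docs.size() + " documents, " + sink.commits + " commits");
    }
}
```

With a real index, the sink would be a thin adapter that forwards both calls to the IndexWriter. The call to finish() matters: without it, the last partial batch would be added but never committed.
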
Yep, that greatly decreased the index time; we're seeing between 3.5 and 4
minutes for the same file. The index generated by Lucene is now up to 1.7G
for the 471M RDF file, but searches are definitely faster. I don't care so
much about the index size since disk space is cheap and the searches are
plenty fast for our use case, but I am surprised it's that big. I guess
that has to do with our resource-per-document indexing strategy?
I appreciate the speedy, high-quality answers on this stuff =)
Cheers,
Mike