[siren-user] basic usage questions

Mike Grove mike at clarkparsia.com
Wed May 12 20:54:34 IST 2010


On Wed, May 12, 2010 at 2:17 PM, Renaud Delbru <renaud.delbru at deri.org> wrote:

>
>> Is this the correct way I should be saving and indexing the data?
>>
>
> OK, this is the right way to do it. So you are not storing the data, which is
> good. However, I see that you are calling commit after adding the document.
> Are you doing that each time you add a document?
>
> If the answer is yes, then this is the problem. You should call commit
> after a certain number of documents. Or, let Lucene handle the commit for
> you.
> Here is the way I have configured the IndexWriter for indexing the BTC
> dataset:
>
>    dir = new NIOFSDirectory(indexDir);
>    final PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
>        new StandardAnalyzer(Version.LUCENE_30));
>    analyzer.addAnalyzer(DEFAULT_TUPLE_FIELD, new TupleAnalyzer());
>    writer = new IndexWriter(dir, analyzer, MaxFieldLength.UNLIMITED);
>    // Configure an automatic flush whenever 32MB of RAM is used
>    writer.setRAMBufferSizeMB(32);
>    writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
>    writer.setMaxBufferedDeleteTerms(IndexWriter.DISABLE_AUTO_FLUSH);
>    // Disable compound file
>    writer.setUseCompoundFile(false);
>    // Increase mergeFactor to optimise indexing
>    writer.setMergeFactor(20);
>
> By using setRAMBufferSizeMB(32), I am telling Lucene to flush the buffered
> documents to disk when 32MB of RAM is used. This is the best practice with
> Lucene. You can increase it if you have more memory, or decrease it if your
> memory is limited. But even with 8 or 16 MB, you should see a big boost in
> terms of indexing performance.
>
> Then, the other parameters are used to optimise indexing
> (setUseCompoundFile(false) and setMergeFactor(20)).
>
> When the index is created, don't forget to optimise it (call
> IndexWriter.optimize()). This operation will take a certain time, especially
> if your index is large, but it will improve the query performance.
>
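
For the archive, the "commit after a certain number of documents" advice
quoted above could be sketched roughly as below. This is only an
illustration under assumptions: DocSink is a made-up stand-in for the two
IndexWriter calls involved (so the snippet compiles without Lucene on the
classpath), and BatchedCommit, indexAll and the batch size are hypothetical
names and values, not anything from SIREn or from our actual code.

```java
import java.util.List;

public class BatchedCommit {

    /** Hypothetical stand-in for IndexWriter's addDocument/commit calls. */
    interface DocSink {
        void addDocument(String doc);
        void commit();
    }

    /**
     * Adds every document, committing once per batchSize documents and once
     * more at the end if a partial batch is left over.
     * Returns the number of commits issued.
     */
    static int indexAll(List<String> docs, DocSink sink, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        int pending = 0; // documents added since the last commit
        for (String doc : docs) {
            sink.addDocument(doc);
            pending++;
            if (pending == batchSize) {
                sink.commit();
                commits++;
                pending = 0;
            }
        }
        if (pending > 0) {
            // Commit the final, partially filled batch.
            sink.commit();
            commits++;
        }
        return commits;
    }
}
```

With a real IndexWriter the same loop shape would apply, with the sink
methods replaced by the writer's own addDocument and commit calls.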

Yep, that greatly decreased the indexing time: we're now seeing between 3.5
and 4 minutes for the same file.  The index generated by Lucene is now up to
1.7G for the 471M RDF file, but searches are definitely faster.  I don't care
so much about the index size, since disk space is cheap and the searches are
plenty fast for our use case, but I am surprised it's that big.  I guess
that has to do with our resource-per-document indexing strategy?

I appreciate the speedy and quality answers on this stuff =)

Cheers,

Mike