[siren-user] basic usage questions

Renaud Delbru renaud.delbru at deri.org
Wed May 12 23:30:19 IST 2010


Mike,

On 12/05/10 20:54, Mike Grove wrote:
>
>
>     By using setRAMBufferSizeMB(32), I am telling Lucene to flush its
>     buffered documents to disk when 32MB of RAM is used. This is the
>     best practice to use with Lucene. You can increase it if you have
>     more memory, or decrease it if your memory is limited. But even
>     with 8 or 16 MB, you should see a big boost in terms of indexing
>     performance.
>
>     Then, the other parameters are used to optimise indexing
>     (setUseCompoundFile(false) and setMergeFactor(20)).
>
>     When the index is created, don't forget to optimise it (call
>     IndexWriter.optimize()). This operation will take some time,
>     especially if your index is large, but it will improve the query
>     performance.
>
>
> Yep, that greatly decreased the index time, we're seeing between 3.5
> and 4 minutes for the same file. The index generated by lucene is now
> up to 1.7G for the 471M RDF file, but searches are definitely faster.
> I don't care so much about the index size since disk space is cheap
> and the searches are plenty fast for our use case, but I am surprised
> it's that big. I guess that has to do with our resource per document
> indexing strategy?
>
No, this is definitely unusual. Even if you are indexing on a per-entity
basis instead of a per-document basis, this should not increase the
index size that much. In the end, you are indexing the same amount of data.

First, between indexing attempts, did you wipe out the previously created
index? I am asking because if you index your dataset multiple times into
the same index directory, Lucene/SIREn will not erase the previously
indexed documents/entities, even if they have the same URL/URI. The
notion of a unique key is absent from Lucene/SIREn (you need to manage it
yourself, by first executing a delete query and then adding the entity).
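
The delete-then-add step can be illustrated with a small in-memory model.
This is only a sketch of the behaviour described above, not Lucene or
SIREn code: the class AppendOnlyIndexModel and its methods are
hypothetical. With Lucene's IndexWriter, the corresponding calls would be
deleteDocuments(Term) followed by addDocument(Document), or the combined
updateDocument(Term, Document); which field holds the entity URI is up to
the application.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for an append-only index (not a Lucene/SIREn class). */
final class AppendOnlyIndexModel {
    // Each entry is {key, content}; nothing enforces that keys are unique.
    private final List<String[]> docs = new ArrayList<>();

    /** Appends unconditionally, as an index without a unique-key notion does. */
    void add(String key, String content) {
        docs.add(new String[] {key, content});
    }

    /** Removes every entry stored under the key and returns how many were removed. */
    int deleteByKey(String key) {
        int before = docs.size();
        docs.removeIf(d -> d[0].equals(key));
        return before - docs.size();
    }

    /** Caller-managed unique key: delete first, then add. */
    void replace(String key, String content) {
        deleteByKey(key);
        add(key, content);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        AppendOnlyIndexModel index = new AppendOnlyIndexModel();
        index.add("http://example.org/e1", "first run");
        index.add("http://example.org/e1", "second run");    // same URI, stored twice
        System.out.println(index.size());                    // prints 2
        index.replace("http://example.org/e1", "third run"); // old copies removed first
        System.out.println(index.size());                    // prints 1
    }
}
```

Re-running a plain add over the same dataset doubles the number of stored
entries in this model, which is the kind of growth to rule out first.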

If this is not the case, I would recommend checking your pre-processing
step, the one that reads the RDF document and splits it into entities.
Maybe something is going wrong at that point and data is being
duplicated, which would explain the index size.
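
One way to check the pre-processing output for duplicates is to count how
often each entity identifier is emitted before anything is indexed. The
sketch below assumes the splitter can hand over the list of entity URIs it
produced; the class and method names are hypothetical and not part of
SIREn.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reports entity identifiers emitted more than once. */
final class DuplicateEntityCheck {

    /** Returns each identifier that occurs at least twice, mapped to its count. */
    static Map<String, Integer> findDuplicates(List<String> entityIds) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : entityIds) {
            counts.merge(id, 1, Integer::sum);
        }
        // Keep only the identifiers that were seen more than once.
        counts.values().removeIf(c -> c < 2);
        return counts;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList(
                "http://example.org/e1",
                "http://example.org/e2",
                "http://example.org/e1");
        // Expect a single duplicate: e1, seen twice.
        System.out.println(findDuplicates(emitted));
    }
}
```

An empty result means every entity was emitted once, and the extra index
size has to come from somewhere else.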

Cheers,
-- 
Renaud Delbru


