[siren-user] basic usage questions
renaud.delbru at deri.org
Wed May 12 23:30:19 IST 2010
On 12/05/10 20:54, Mike Grove wrote:
> By using setRAMBufferSizeMB(32), I am telling Lucene to perform a
> commit when 32MB of RAM is used. This is the best practice to use
> with Lucene. You can increase it if you have more memory, or
> decrease it if your memory is limited. But even with 8 or 16 MB,
> you should see a big boost in terms of indexing performance.
> Then, the other parameters are used to optimise indexing
> (setUseCompoundFile(false) and setMergeFactor(20)).
> When the index is created, don't forget to optimise it (call
> IndexWriter.optimize()). This operation will take a certain time,
> especially if your index is large, but it will improve the query
> performance.
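For reference, a minimal sketch of that tuning, assuming the pre-4.0 Lucene API where these setters live directly on IndexWriter; the index path and analyzer below are placeholders, not something from the original advice:

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Sketch only: Lucene 2.x/3.x-era API; adjust for your version.
IndexWriter writer = new IndexWriter(
    FSDirectory.open(new File("/path/to/index")),   // hypothetical path
    new StandardAnalyzer(Version.LUCENE_30),
    true,                                           // create a fresh index
    IndexWriter.MaxFieldLength.UNLIMITED);

writer.setRAMBufferSizeMB(32);      // flush once 32MB of RAM is used
writer.setUseCompoundFile(false);   // no compound files: faster indexing
writer.setMergeFactor(20);          // merge segments less aggressively

// ... add your documents/entities here ...

writer.optimize();                  // merge down to one segment; slow on
writer.close();                     // large indexes, but speeds up queries
```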
> Yep, that greatly decreased the index time; we're seeing between 3.5
> and 4 minutes for the same file. The index generated by Lucene is now
> up to 1.7G for the 471M RDF file, but searches are definitely faster.
> I don't care so much about the index size since disk space is cheap
> and the searches are plenty fast for our use case, but I am surprised
> it's that big. I guess that has to do with our resource-per-document
> indexing strategy?
No, this is definitely unusual. Even if you are indexing on a per-entity
basis instead of a per-document basis, this should not increase the index
size that much. In the end, you are indexing the same amount of data.
First, between indexing runs, have you wiped out the previously created
index? I am asking because if you index your dataset multiple times into
the same index directory, Lucene/SIREn will not erase the previously
indexed documents/entities, even if they have the same URL/URI. The notion
of a unique key is absent from Lucene/SIREn (you need to manage it
yourself, first executing a delete query, then adding the entity).
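A sketch of that delete-then-add pattern, assuming the same pre-4.0 API as above; the "uri" field name is my assumption, and it would need to be indexed and not analyzed for the deletion to match:

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Sketch only: emulate a unique key by deleting any previously indexed
// entity with the same URI before adding the new version.
void updateEntity(IndexWriter writer, String uri, Document doc)
        throws IOException {
    writer.deleteDocuments(new Term("uri", uri));  // drop the old version
    writer.addDocument(doc);                       // index the new one
}
```

Lucene also offers IndexWriter.updateDocument(Term, Document), which performs the delete and add as a single operation.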
If this is not the case, I would recommend checking your pre-processing
step, the one that reads the RDF document and splits it into entities.
Maybe something is wrong at this point and data is being duplicated,
which would explain the index size.