[siren-user] basic usage questions
Mike Grove
mike at clarkparsia.com
Thu May 13 16:24:34 IST 2010
On Wed, May 12, 2010 at 6:30 PM, Renaud Delbru <renaud.delbru at deri.org> wrote:
> Mike,
>
>
> On 12/05/10 20:54, Mike Grove wrote:
>
>>
>>
>> By using setRAMBufferSizeMB(32), I am telling Lucene to perform a
>> commit when 32 MB of RAM is used. This is the best practice to use
>> with Lucene. You can increase it if you have more memory, or
>> decrease it if your memory is limited. But even with 8 or 16 MB,
>> you should see a big boost in terms of indexing performance.
>>
>> Then, the other parameters are used to optimise indexing
>> (setUseCompoundFile(false) and setMergeFactor(20)).
>>
>> When the index is created, don't forget to optimise it (call
>> IndexWriter.optimize()). This operation will take a certain time,
>> especially if your index is large, but it will improve the query
>> performance.
>>
>>
>> Yep, that greatly decreased the index time, we're seeing between 3.5 and 4
>> minutes for the same file. The index generated by Lucene is now up to 1.7G
>> for the 471M RDF file, but searches are definitely faster. I don't care so
>> much about the index size since disk space is cheap and the searches are
>> plenty fast for our use case, but I am surprised it's that big. I guess
>> that has to do with our resource per document indexing strategy?
>>
> No, this is definitely unusual. Even if you are indexing on a per-entity
> basis instead of a per-document basis, this should not increase the index
> size that much. In the end, you are indexing the same amount of data.
>
> First, between each indexing attempt, have you wiped out the previously
> created indexes? I am asking because if you index your dataset multiple
> times into the same index directory, Lucene/SIREn will not erase the
> previously indexed documents/entities, even if they have the same
> URL/URI. The notion of a unique key is absent from Lucene/SIREn (you need to
> manage it yourself, first executing a delete query, then adding the
> entity).
>
Yep, building the index from a clean slate seemed to still generate an index
of the same size.
>
> If this is not the case, I would recommend checking your pre-processing
> step, the one that reads the RDF document and splits it into entities. Maybe
> something is wrong at this point and data is being duplicated, which would
> explain the index size.
>
I checked this and I'm only creating one document per entity.

I'm seeing some curious search behavior, hopefully you can help me
understand it -- I don't know much about either Lucene or SIREn.

If I create and execute this query:

    BooleanQuery bq = new BooleanQuery();
    bq.add(new SirenTermQuery(
               new org.apache.lucene.index.Term(SearchConstants.DEFAULT_FIELD, "diffusion")),
           BooleanClause.Occur.MUST_NOT);
    bq.add(new SirenTermQuery(
               new org.apache.lucene.index.Term(SearchConstants.DEFAULT_FIELD, "engineer")),
           BooleanClause.Occur.MUST);

I get 15719 results, which I'd guess is correct. If I remove the first
clause, 'not diffusion', I get 15810, which makes sense -- there are only a
few entries on diffusion out of all the engineering-related entries. But if
I remove the second clause, so my search is just "not diffusion", I get zero
results. From the previous two searches, I know there are at least 91 entries
that would be search hits for 'diffusion', and there are ~300,000 total
entries in the index, so I would have expected the hit count to be nearly
every document, not zero.

I seem to get this behavior any time there's a single MUST_NOT in a boolean
query. If I use the above code, but add the following three lines:

    BooleanQuery bq2 = new BooleanQuery();
    bq2.add(new SirenTermQuery(
                new org.apache.lucene.index.Term(SearchConstants.DEFAULT_FIELD, "rocket")),
            BooleanClause.Occur.MUST_NOT);
    bq.add(bq2, BooleanClause.Occur.MUST);

I again get zero results. I think I just added a clause to the original
query saying that my results should only include hits that don't contain the
term 'rocket'. I know there are documents that satisfy this constraint, so
the result should not be empty.

I think I'm missing something obvious about constructing the query objects,
but I don't know what that is. Any ideas?
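
One possible explanation, stated as an assumption rather than something
confirmed in this thread: in stock Lucene, a BooleanQuery uses MUST_NOT
clauses only to exclude documents that a MUST or SHOULD clause has already
selected, so a query made up purely of MUST_NOT clauses has nothing to
subtract from and matches nothing. Whether SIREn's query classes behave the
same way is not verified here. The sketch below is a toy model in plain Java
(sets of terms per document, no Lucene or SIREn classes; the class, method,
and sample data are all hypothetical) that reproduces the three observations
above under that assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NegativeQuerySketch {

    /**
     * Toy model of boolean search, NOT Lucene's implementation.
     * must     - terms a document has to contain
     * mustNot  - terms a document must not contain
     * matchAll - hypothetical stand-in for a "match every document" clause
     */
    public static Set<Integer> search(Map<Integer, Set<String>> docs,
                                      List<String> must,
                                      List<String> mustNot,
                                      boolean matchAll) {
        Set<Integer> hits = new TreeSet<>();
        // Assumption being modelled: with no positive clause there is no
        // candidate set, so the prohibited terms have nothing to filter.
        if (must.isEmpty() && !matchAll) {
            return hits;
        }
        for (Map.Entry<Integer, Set<String>> doc : docs.entrySet()) {
            Set<String> terms = doc.getValue();
            // Vacuously true when 'must' is empty and matchAll is set.
            boolean selected = terms.containsAll(must);
            boolean excluded = mustNot.stream().anyMatch(terms::contains);
            if (selected && !excluded) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    /** Three made-up documents, keyed by document id. */
    public static Map<Integer, Set<String>> sampleDocs() {
        return Map.of(
            1, Set.of("engineer", "diffusion"),
            2, Set.of("engineer", "rocket"),
            3, Set.of("chemistry"));
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> docs = sampleDocs();
        // engineer AND NOT diffusion
        System.out.println(search(docs, List.of("engineer"), List.of("diffusion"), false)); // [2]
        // NOT diffusion on its own: no positive clause, so no hits
        System.out.println(search(docs, List.of(), List.of("diffusion"), false)); // []
        // match-all AND NOT diffusion
        System.out.println(search(docs, List.of(), List.of("diffusion"), true)); // [2, 3]
    }
}
```

Under the same assumption, the nested bq2 case would also come back empty:
bq2 on its own selects nothing, and requiring it with MUST then empties the
outer query. In plain Lucene the usual remedy is to add a
MatchAllDocsQuery with BooleanClause.Occur.MUST to the otherwise purely
negative query; I have not checked how that interacts with SirenTermQuery.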

Thanks for the help.

Mike