[siren-user] basic usage questions

Mike Grove mike at clarkparsia.com
Thu May 13 16:24:34 IST 2010


On Wed, May 12, 2010 at 6:30 PM, Renaud Delbru <renaud.delbru at deri.org> wrote:

> Mike,
>
>
> On 12/05/10 20:54, Mike Grove wrote:
>
>>
>>
>>    By using setRAMBufferSizeMB(32), I am telling Lucene to perform a
>>    commit when 32MB of Ram is used. This is the best practice to use
>>    with Lucene. You can increase it if you have more memory, or
>>    decrease it if your memory is limited. But even with 8 or 16 MB,
>>    you should see a big boost in term of indexing performance.
>>
>>    Then, the other parameters are used to optimise indexing
>>    (setUseCompoundFile(false) and setMergeFactor(20)).
>>
>>    When the index is created, don't forget to optimise it (call
>>    IndexWriter.optimise). This operation will take a certain time,
>>    especially if your index is large, but it will improve the query
>>    performance.
>>
>>
>> Yep, that greatly decreased the index time, we're seeing between 3.5 and 4
>> minutes for the same file.  The index generated by lucene is now up to 1.7G
>> for the 471M RDF file, but searches are definitely faster.  I don't care so
>> much about the index size since disk space is cheap and the searches are
>> plenty fast for our use case, but I am surprised it's that big.  I guess
>> that has to do with our resource per document indexing strategy?
>
> No, this is definitely unusual. Even if you are indexing on a per entity
> basis instead of on a per document basis, this should not increase that much
> the index size. At the end, you are indexing the same amount of data.
>
> First, between each index try, have you wiped out the previous created
> indexes ? I am asking that because if you are performing multiple times the
> indexing of your dataset on the same index directory, Lucene/SIREn will not
> erase the previous indexed document/entity, even if they have the same
> URL/URI. The notion of unique key is absent from Lucene/SIREn (you need to
> manage it by yourself, first executing a delete query, then adding the
> entity).
>

Yep, building the index from a clean slate seemed to still generate an index
of the same size.


>
> If this is not the case, I would recommend to check your pre-processing
> step, the one that read the RDF document and split it into entities. Maybe
> there is something wrong at this point, and data are duplicated which will
> explain the index size.
>

I checked this and I'm only creating one document per entity.

I'm seeing some curious search behavior, hopefully you can help me
understand it -- I don't know much about either Lucene or SIREn.

If I create and execute this query:

        BooleanQuery bq = new BooleanQuery();
        bq.add(new SirenTermQuery(new org.apache.lucene.index.Term(
                   SearchConstants.DEFAULT_FIELD, "diffusion")),
               BooleanClause.Occur.MUST_NOT);
        bq.add(new SirenTermQuery(new org.apache.lucene.index.Term(
                   SearchConstants.DEFAULT_FIELD, "engineer")),
               BooleanClause.Occur.MUST);

I get 15719 results, which I'd guess is correct.  If I remove the first
clause, 'not diffusion', I get 15810, which makes sense -- there are only a
few entries on diffusion among all the engineering-related entries.  But if
I remove the second clause instead, so my search is just "not diffusion", I
get zero results.  From the previous two searches, I know there are at least
91 entries (15810 - 15719) that would be search hits for 'diffusion', and
there are ~300000 total entries in the index, so I would have expected the
hit count to be nearly every document, not zero.
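
To spell out what I expected, here is a toy model using plain Java sets (not
Lucene or SIREn code; the class name, method names, and document ids are all
made up for illustration).  It treats a lone "not diffusion" as the complement
of the matching set over the whole index, which is only my assumption about
how MUST_NOT ought to behave, not something I have confirmed:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model only: plain sets stand in for the index, no Lucene involved.
public class ExpectedMustNot {

    // Docs in `must` that are not in `mustNot` (MUST + MUST_NOT together).
    static Set<Integer> mustButNot(Set<Integer> must, Set<Integer> mustNot) {
        Set<Integer> hits = new HashSet<Integer>(must);
        hits.removeAll(mustNot);
        return hits;
    }

    // What I assumed a lone MUST_NOT would do: every doc except the matches.
    static Set<Integer> loneMustNot(Set<Integer> allDocs, Set<Integer> mustNot) {
        return mustButNot(allDocs, mustNot);
    }

    public static void main(String[] args) {
        // Hypothetical five-document index.
        Set<Integer> allDocs = new HashSet<Integer>(Arrays.asList(1, 2, 3, 4, 5));
        Set<Integer> engineer = new HashSet<Integer>(Arrays.asList(1, 2, 3));
        Set<Integer> diffusion = new HashSet<Integer>(Arrays.asList(2));

        System.out.println("engineer, not diffusion: "
                + mustButNot(engineer, diffusion).size());
        System.out.println("not diffusion (expected): "
                + loneMustNot(allDocs, diffusion).size());
    }
}
```

With the real index, that second number is what I expected to be close to
300000, instead of the zero I actually get.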

I seem to get this behavior any time a boolean query consists of a single
MUST_NOT clause.  If I use the above code, but add the following three lines:

        BooleanQuery bq2 = new BooleanQuery();
        bq2.add(new SirenTermQuery(new org.apache.lucene.index.Term(
                    SearchConstants.DEFAULT_FIELD, "rocket")),
                BooleanClause.Occur.MUST_NOT);

        bq.add(bq2, BooleanClause.Occur.MUST);

I again get zero results.  I think I just added a clause to the original
query saying that my results should be limited to hits that don't include
the term 'rocket'.  I know there are documents that satisfy this constraint,
so the result should not be zero.

I think I'm missing something obvious about constructing the query objects,
but I don't know what that is.  Any ideas?

Thanks for the help.

Mike