[siren-user] basic usage questions
Renaud Delbru
renaud.delbru at deri.org
Thu May 13 17:47:46 IST 2010
Hi,
On 13/05/10 16:24, Mike Grove wrote:
>
> First, between each index try, have you wiped out the previous
> created indexes ? I am asking that because if you are performing
> multiple times the indexing of your dataset on the same index
> directory, Lucene/SIREn will not erase the previous indexed
> document/entity, even if they have the same URL/URI. The notion of
> unique key is absent from Lucene/SIREn (you need to manage it by
> yourself, first executing a delete query, then adding the entity).
>
>
> Yep, building the index from a clean slate seemed to still generate an
> index of the same size.
There is definitely a problem. Is it possible to share the code and the
data so that I can have a look, or is the data private/confidential?
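To illustrate the unique-key point quoted above, here is a minimal
sketch in plain Java. It is a toy model, not Lucene or SIREn code: the
class AppendOnlyIndexSketch and its methods are hypothetical names. It
only shows why adding the same URI twice leaves two copies, and how a
delete followed by an add emulates a unique key. In plain Lucene,
IndexWriter.updateDocument(Term, Document) does this delete-then-add in
one call.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an index without a unique key: add() only ever appends.
class AppendOnlyIndexSketch {
    private final List<String[]> docs = new ArrayList<>(); // each entry: {uri, content}

    void add(String uri, String content) {
        docs.add(new String[] {uri, content});
    }

    // Analogue of running a delete query on the URI field.
    void deleteByUri(String uri) {
        docs.removeIf(d -> d[0].equals(uri));
    }

    // Emulate a unique key: delete any previous copies first, then add.
    void update(String uri, String content) {
        deleteByUri(uri);
        add(uri, content);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        AppendOnlyIndexSketch index = new AppendOnlyIndexSketch();
        index.add("http://example.org/e1", "engineer");
        index.add("http://example.org/e1", "engineer"); // second indexing run
        System.out.println(index.size()); // two copies of the same entity

        index.update("http://example.org/e1", "engineer");
        System.out.println(index.size()); // a single copy again
    }
}
```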
How are you splitting the document to create the entity descriptions? By
taking all triples that have the same URI in the subject position (i.e.,
all outgoing RDF triples), or all triples that have the same URI in the
subject or object position (i.e., all incoming and outgoing triples)?
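To make the two options concrete, here is a minimal sketch in plain
Java. It is a toy model with hypothetical names (EntitySplitSketch,
split), and it assumes every object is a URI; a real splitter would
have to skip literal objects when collecting incoming triples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EntitySplitSketch {
    // Groups triples {subject, predicate, object} into one description per URI.
    // includeIncoming = false: a triple is attached to its subject only.
    // includeIncoming = true:  a triple is also attached to its object.
    // Assumption: all objects are URIs (no literals).
    static Map<String, List<String[]>> split(List<String[]> triples, boolean includeIncoming) {
        Map<String, List<String[]>> entities = new LinkedHashMap<>();
        for (String[] t : triples) {
            entities.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t);
            // Avoid attaching a self-referencing triple twice to the same entity.
            if (includeIncoming && !t[2].equals(t[0])) {
                entities.computeIfAbsent(t[2], k -> new ArrayList<>()).add(t);
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
            new String[] {"ex:alice", "ex:knows", "ex:bob"},
            new String[] {"ex:bob", "ex:worksAs", "ex:engineer"});

        System.out.println(split(triples, false).size()); // 2 entity descriptions
        System.out.println(split(triples, true).size());  // 3 entity descriptions
    }
}
```

With incoming triples included, each triple is stored under two
entities in this model, so the choice has a direct effect on how much
data ends up in the index.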
> If I create and execute this query:
>
> BooleanQuery bq = new BooleanQuery();
> bq.add(new SirenTermQuery(new
> org.apache.lucene.index.Term(SearchConstants.DEFAULT_FIELD,
> "diffusion")), BooleanClause.Occur.MUST_NOT);
> bq.add(new SirenTermQuery(new
> org.apache.lucene.index.Term(SearchConstants.DEFAULT_FIELD,
> "engineer")), BooleanClause.Occur.MUST);
>
> I get 15719 results, which I'd guess is correct. If I remove the
> first clause, 'not diffusion', I get 15810, which makes sense --
> there's only a few entries on diffusion out of all the engineering
> related entries. But if I remove the second clause, so my search is
> just "not diffusion" I get zero results. From the previous two
> searches, I know there's at least 91 entries that would be search hits
> for 'diffusion' and there's ~300000 total entries in the index, so I
> would have expected to see the hit count be nearly every document, not
> zero.
Yes, pure negative queries are not allowed and therefore return 0
results. A MUST_NOT clause should always be used together with at least
one other clause, either MUST or SHOULD.
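The reason can be sketched with a toy model in plain Java. This is not
Lucene or SIREn code: NegativeQuerySketch and search are hypothetical
names, and only MUST and MUST_NOT are modelled. A MUST_NOT clause only
filters the documents selected by a positive clause, so without one
there is nothing to filter. In plain Lucene the usual way to express
"everything except X" is to pair the MUST_NOT clause with a
MatchAllDocsQuery added as MUST; whether that combines with
SirenTermQuery in the same way is an assumption worth testing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

class NegativeQuerySketch {
    // A document is a hit only if every MUST clause selects it and no
    // MUST_NOT clause does. With no positive clause the result is empty.
    static List<String> search(List<String> docs,
                               List<Predicate<String>> must,
                               List<Predicate<String>> mustNot) {
        List<String> hits = new ArrayList<>();
        if (must.isEmpty()) {
            return hits; // pure negative query: no candidates to filter
        }
        for (String d : docs) {
            boolean selected = must.stream().allMatch(p -> p.test(d));
            boolean excluded = mustNot.stream().anyMatch(p -> p.test(d));
            if (selected && !excluded) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("engineer diffusion", "engineer bridge", "chemist");
        Predicate<String> diffusion = d -> d.contains("diffusion");
        Predicate<String> matchAll = d -> true; // plays the role of a match-all clause

        List<Predicate<String>> none = Collections.emptyList();
        // "not diffusion" alone: zero hits.
        System.out.println(search(docs, none, Collections.singletonList(diffusion)).size());
        // match-all MUST plus "not diffusion": every document without the term.
        System.out.println(search(docs, Collections.singletonList(matchAll),
                                  Collections.singletonList(diffusion)).size());
    }
}
```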
--
Renaud Delbru