[siren-user] basic usage questions

Mike Grove mike at clarkparsia.com
Thu May 13 17:58:58 IST 2010


On Thu, May 13, 2010 at 12:47 PM, Renaud Delbru <renaud.delbru at deri.org> wrote:

> Hi,
>
>
> On 13/05/10 16:24, Mike Grove wrote:
>
>>
>>    First, between indexing attempts, have you wiped out the previously
>>    created indexes? I am asking because if you index your dataset
>>    multiple times into the same index directory, Lucene/SIREn will not
>>    erase the previously indexed document/entity, even if they have the
>>    same URL/URI. The notion of a unique key is absent from Lucene/SIREn
>>    (you need to manage it yourself, first executing a delete query,
>>    then adding the entity).
>>
>>
>> Yep, building the index from a clean slate seemed to still generate an
>> index of the same size.
>>
> There is definitely a problem. Is it possible to share the code and the
> data to have a look, or is it private/confidential data ?
>
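For reference, the delete-then-add pattern described above can be modelled in a few lines of plain Java. This is only a sketch of the idea, not Lucene or SIREn code: AppendOnlyIndex and its methods are made-up names, and the "index" is just an in-memory list. In Lucene itself the corresponding calls are IndexWriter.deleteDocuments(Term) followed by addDocument, or IndexWriter.updateDocument(Term, Document), which does both; whether SIREn adds any constraints on top of that is not covered in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of an index with no unique-key notion.
final class AppendOnlyIndex {
    // Each entry is {key, body}; adding never overwrites an existing entry.
    private final List<String[]> docs = new ArrayList<>();

    // Plain add: a second add with the same key creates a duplicate.
    void add(String key, String body) {
        docs.add(new String[] {key, body});
    }

    // Removes every entry stored under the given key.
    void deleteByKey(String key) {
        docs.removeIf(d -> d[0].equals(key));
    }

    // Emulates a unique key: delete first, then add.
    void update(String key, String body) {
        deleteByKey(key);
        add(key, body);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        AppendOnlyIndex index = new AppendOnlyIndex();
        index.add("http://example.org/a", "first version");
        index.add("http://example.org/a", "second version");
        System.out.println("after two adds: " + index.size());
        index.update("http://example.org/a", "third version");
        System.out.println("after update: " + index.size());
    }
}
```

Re-indexing into the same directory with plain adds behaves like the two add calls here, which is why the index keeps growing unless the old entry is deleted first.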

Yes, unfortunately the exact data I'm testing on is private.  I have a
public dataset [1] that is about half the size (3+ million triples) that
probably would exhibit the same behavior.



> How are you splitting the document to create entity descriptions? By taking
> all triples having the same URI in the subject position (i.e., all outgoing
> RDF triples), or all triples having the same URI in the subject or object
> position (i.e., all incoming and outgoing triples)?


This is the code snippet I'm using to split up the kb into documents (uses
our public sesame utils, so won't compile with just sesame API, but you
should get the idea of what I'm doing).

        // Collect every distinct subject that is a URI (blank nodes are skipped).
        Collection<URI> aSubjs = new HashSet<URI>();

        TupleQueryResult aResult = mRepo.selectQuery(
                SesameQuery.serql("select distinct s from {s} p {o}"));
        for (BindingSet aBinding : iterable(aResult)) {
            Value aValue = aBinding.getValue("s");
            if (aValue instanceof URI) {
                aSubjs.add((URI) aValue);
            }
        }

        // Index one Lucene document per subject.
        for (URI aSubj : aSubjs) {

            // Serialize the subject's description as N-Triples.
            ExtGraph aGraph = aSource.describe(aSubj);

            StringWriter aStrWriter = new StringWriter();
            aGraph.write(aStrWriter, RDFFormat.NTRIPLES);
            String aGraphAsNTriples = aStrWriter.toString();

            Document aDoc = new Document();
            aDoc.add(new Field("url", aSubj.stringValue(), Field.Store.YES,
                    Field.Index.NOT_ANALYZED_NO_NORMS));
            aDoc.add(new Field(SearchConstants.DEFAULT_FIELD, aGraphAsNTriples,
                    Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));

            aWriter.addDocument(aDoc);
        }

Pretty basic code.
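
For anyone without the Sesame utilities at hand, the same one-document-per-subject split can be sketched in plain Java over N-Triples lines. This takes the outgoing-only reading from the question above (group by subject URI); whether describe() returns exactly that set of triples is not stated here. EntitySplitter and groupBySubject are made-up names, and the parsing is deliberately naive: it only looks at the leading <...> term of each line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups N-Triples lines into one entry per subject URI.
final class EntitySplitter {

    static Map<String, List<String>> groupBySubject(List<String> lines) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        for (String line : lines) {
            String triple = line.trim();
            // Only URI subjects start with '<'; blank nodes ("_:x"), comments
            // and empty lines are skipped, like the instanceof URI check.
            if (!triple.startsWith("<")) {
                continue;
            }
            int end = triple.indexOf('>');
            if (end < 0) {
                continue; // malformed line, ignore
            }
            String subject = triple.substring(1, end);
            docs.computeIfAbsent(subject, k -> new ArrayList<>()).add(triple);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<http://example.org/a> <http://example.org/p> <http://example.org/b> .",
                "<http://example.org/a> <http://example.org/q> \"some text\" .",
                "<http://example.org/b> <http://example.org/p> \"other text\" .");
        for (Map.Entry<String, List<String>> e : groupBySubject(lines).entrySet()) {
            System.out.println(e.getKey() + " has " + e.getValue().size() + " triple(s)");
        }
    }
}
```

Each map entry corresponds to one document: the key would go into the "url" field and the joined triples into the default field.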


>> If I create and execute this query:
>>
>>        BooleanQuery bq = new BooleanQuery();
>>        bq.add(new SirenTermQuery(new org.apache.lucene.index.Term(
>>                SearchConstants.DEFAULT_FIELD, "diffusion")),
>>                BooleanClause.Occur.MUST_NOT);
>>        bq.add(new SirenTermQuery(new org.apache.lucene.index.Term(
>>                SearchConstants.DEFAULT_FIELD, "engineer")),
>>                BooleanClause.Occur.MUST);
>>
>> I get 15719 results, which I'd guess is correct.  If I remove the first
>> clause, 'not diffusion', I get 15810, which makes sense -- there are only a
>> few entries on diffusion out of all the engineering-related entries.  But if
>> I remove the second clause, so my search is just "not diffusion", I get zero
>> results.  From the previous two searches, I know there are at least 91 entries
>> that would be search hits for 'diffusion' and there are ~300000 total entries
>> in the index, so I would have expected the hit count to be nearly every
>> document, not zero.
>>
> Yes, pure negative queries are not allowed, and therefore return 0 results.
> A MUST_NOT clause should always be used with another clause, either MUST
> or SHOULD.
>

Ok, that seems reasonable, thanks for clarifying that.
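
A small set-based model makes the behaviour easier to see. This is a plain-Java sketch of the semantics as explained above, not how Lucene or SIREn evaluates queries internally; BooleanModel and evaluate are made-up names, and SHOULD clauses are left out for brevity. Candidates come only from the positive clauses, and MUST_NOT can only remove documents from that candidate set, so with no positive clause there is nothing to subtract from.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of MUST / MUST_NOT evaluation over sets of document ids.
final class BooleanModel {

    static Set<Integer> evaluate(List<Set<Integer>> must, List<Set<Integer>> mustNot) {
        Set<Integer> hits = new HashSet<>();
        if (must.isEmpty()) {
            return hits; // pure negative query: no candidates, so no results
        }
        // Intersect all required clauses to get the candidate set.
        hits.addAll(must.get(0));
        for (Set<Integer> required : must) {
            hits.retainAll(required);
        }
        // Prohibited clauses only filter the candidates.
        for (Set<Integer> prohibited : mustNot) {
            hits.removeAll(prohibited);
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<Integer> engineer = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> diffusion = new HashSet<>(Arrays.asList(3, 4));
        Set<Integer> allDocs = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));

        System.out.println(evaluate(Arrays.asList(engineer), Arrays.asList(diffusion)).size());
        System.out.println(evaluate(Collections.<Set<Integer>>emptyList(),
                Arrays.asList(diffusion)).size());
        System.out.println(evaluate(Arrays.asList(allDocs), Arrays.asList(diffusion)).size());
    }
}
```

The third call shows the usual way around the restriction: pair the MUST_NOT clause with a positive clause that matches every document. In plain Lucene that role is played by MatchAllDocsQuery; whether it combines cleanly with SIREn's query classes is not confirmed in this thread.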

Cheers,

Mike

[1] http://clarkparsia.com/files/baseball.rdf.gz