[siren-user] basic usage questions

Renaud Delbru renaud.delbru at deri.org
Wed May 12 19:17:57 IST 2010


Mike,

answers below.

On 12/05/10 18:58, Mike Grove wrote:
>>
>     > This way, once the KB is indexed, my search hits are the resources
>     > that match the search term.  Not 100% what I was going for, but
>     > close enough.
>     Can you tell more about that ? Why is it not 100% ? Is there some
>     entities that are not retrieved by your queries (but which should be
>     retrieved) ?
>
>
> It's not 100% what I was looking for because I was trying to get the 
> specific triples where the search query occurs, so only knowing which 
> resource has a triple where the search term occurs is less granular 
> than I was hoping.  But for what I hope to ultimately build out with 
> this framework, that will probably be sufficient.
You can find out which triples match with a little bit of coding and 
hacking with SIREn. The information is there, but it is currently not 
returned to the user. If you are interested, we can discuss later on 
how to do it. Let's first try to solve your problem.
>>
>     Please, correct me if I am wrong.
>     I think you are currently indexing the triples, but also storing them
>     within Lucene/SIREn, that is you define your Lucene document field as:
>
>     new Field("url", myData, Store.YES, Index.ANALYZED_NO_NORMS)
>
>
> I'm not sure =)
>
> This is my exact code for creating the Document which gets indexed, I 
> adapted this from the demo code:
>
> Document aDoc = new Document();
> aDoc.add(new Field("url", aSubjURI, Field.Store.YES, 
> Field.Index.NOT_ANALYZED_NO_NORMS));
> aDoc.add(new Field(DEFAULT_FIELD, aSubjGraphAsNTriples, Field.Store.NO, 
> Field.Index.ANALYZED_NO_NORMS));
> aWriter.addDocument(aDoc);
> aWriter.commit();
>
> aSubjURI is a string which is the URI of the subject whose graph is 
> being indexed, aSubjGraphAsNTriples is what it sounds like, the graph 
> of triples where the specified URI is the subject, serialized as NTriples.
>
> Is this the correct way I should be saving and indexing the data?
OK, this is the right way to do it. So you are not storing the data, which 
is good. However, I see that you are calling commit after adding the 
document. Are you doing that each time you add a document?

If the answer is yes, then this is the problem. You should call commit 
only after a certain number of documents, or let Lucene decide when to 
flush buffered documents to disk and commit once at the end.
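
To illustrate the first option, here is a minimal sketch of committing 
once per batch rather than once per document. It is plain Java with no 
Lucene dependency: DocSink, BatchCommitter and CountingSink are 
hypothetical names made up for this example. In real code, the sink 
would presumably delegate to IndexWriter.addDocument and 
IndexWriter.commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an index writer: accepts documents and commits. */
interface DocSink {
    void add(String doc);
    void commit();
}

/** Commits once every batchSize documents instead of after each one. */
final class BatchCommitter {
    private final DocSink sink;
    private final int batchSize;
    private int pending = 0;

    BatchCommitter(DocSink sink, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Adds one document and commits only when the batch is full. */
    void add(String doc) {
        sink.add(doc);
        pending++;
        if (pending >= batchSize) {
            sink.commit();
            pending = 0;
        }
    }

    /** Commits whatever is left; call once at the end of indexing. */
    void finish() {
        if (pending > 0) {
            sink.commit();
            pending = 0;
        }
    }
}

/** In-memory sink that only counts calls, to show how many commits happen. */
final class CountingSink implements DocSink {
    final List<String> docs = new ArrayList<>();
    int commits = 0;

    @Override
    public void add(String doc) {
        docs.add(doc);
    }

    @Override
    public void commit() {
        commits++;
    }
}
```

With 2500 documents and a batch size of 1000, this issues three commits 
(two full batches plus the final partial one) instead of 2500.
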
Here is the way I have configured the IndexWriter for indexing the BTC 
dataset:

     dir = new NIOFSDirectory(indexDir);
     final PerFieldAnalyzerWrapper analyzer =
         new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
     analyzer.addAnalyzer(DEFAULT_TUPLE_FIELD, new TupleAnalyzer());
     writer = new IndexWriter(dir, analyzer, MaxFieldLength.UNLIMITED);
     // Flush buffered documents to disk whenever 32 MB of RAM is used
     writer.setRAMBufferSizeMB(32);
     // Disable flushing by document count and by buffered delete terms,
     // so that RAM usage is the only flush trigger
     writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
     writer.setMaxBufferedDeleteTerms(IndexWriter.DISABLE_AUTO_FLUSH);
     // Disable the compound file format
     writer.setUseCompoundFile(false);
     // Increase mergeFactor to optimise indexing
     writer.setMergeFactor(20);

By using setRAMBufferSizeMB(32), I am telling Lucene to flush the buffered 
documents to disk whenever 32 MB of RAM is used. This is the recommended 
practice with Lucene. You can increase it if you have more memory, or 
decrease it if your memory is limited. But even with 8 or 16 MB, you 
should see a big boost in terms of indexing performance.

Then, the other parameters are used to optimise indexing 
(setUseCompoundFile(false) and setMergeFactor(20)).

When the index is created, don't forget to optimise it (call 
IndexWriter.optimize()). This operation will take some time, 
especially if your index is large, but it will improve the query 
performance.

-- 
Renaud Delbru

