[siren-user] basic usage questions
Renaud Delbru
renaud.delbru at deri.org
Wed May 12 19:17:57 IST 2010
Mike,
answers below.
On 12/05/10 18:58, Mike Grove wrote:
>
>
> > This way, once the KB is indexed, my search hits are the resources
> > that match the search term. Not 100% what I was going for, but
> > close enough.
> Can you tell me more about that? Why is it not 100%? Are there some
> entities that are not retrieved by your queries (but which should be
> retrieved)?
>
>
> It's not 100% what I was looking for because I was trying to get the
> specific triples where the search query occurs, so only knowing which
> resource has a triple where the search term occurs is less granular
> than I was hoping. But for what I hope to ultimately build out with
> this framework, that will probably be sufficient.
You can find out which triples match with a little bit of coding and
hacking on SIREn. The information is there, but I currently do not
return it to the user. If you are interested, we can discuss later on
how to do it. Let's first try to solve your problem.
>
>
> Please, correct me if I am wrong.
> I think you are currently indexing the triples, but also storing them
> within Lucene/SIREn, that is you define your Lucene document field as:
>
> new Field("url", myData, Store.YES, Index.ANALYZED_NO_NORMS)
>
>
> I'm not sure =)
>
> This is my exact code for creating the Document which gets indexed, I
> adapted this from the demo code:
>
> Document aDoc = new Document();
> aDoc.add(new Field("url", aSubjURI, Field.Store.YES,
> Field.Index.NOT_ANALYZED_NO_NORMS));
> aDoc.add(new Field(DEFAULT_FIELD, aSubjGraphAsNTriples, Field.Store.NO,
> Field.Index.ANALYZED_NO_NORMS));
> aWriter.addDocument(aDoc);
> aWriter.commit();
>
> aSubjURI is a string which is the URI of the subject whose graph is
> being indexed, aSubjGraphAsNTriples is what it sounds like, the graph
> of triples where the specified URI is the subject, serialized as NTriples.
>
> Is this the correct way I should be saving and indexing the data?
Ok, this is the right way to do it. So you are not storing the data,
which is good. However, I see that you are calling commit after adding
the document. Are you doing that each time you add a document?
If the answer is yes, then this is the problem. You should call commit
only after a certain number of documents, or let Lucene handle the
commit for you.
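To illustrate the "commit after a certain number of documents" idea,
here is a minimal sketch in plain Java. It does not use the Lucene API:
DocumentSink is a hypothetical stand-in for the IndexWriter.addDocument
and IndexWriter.commit calls in your code, and BatchingIndexer,
CountingSink and the batch size are likewise names invented for this
example.

```java
// Hypothetical stand-in for the two IndexWriter calls used in this thread.
// With Lucene you would forward to writer.addDocument(doc) and writer.commit().
interface DocumentSink {
    void add(String doc);
    void commit();
}

// Commits once every batchSize added documents instead of after each one.
final class BatchingIndexer {
    private final DocumentSink sink;
    private final int batchSize;
    private int pending = 0; // documents added since the last commit

    BatchingIndexer(DocumentSink sink, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    void add(String doc) {
        sink.add(doc);
        pending++;
        if (pending >= batchSize) {
            commitPending();
        }
    }

    // Call once at the end so the last partial batch is committed too.
    void finish() {
        commitPending();
    }

    private void commitPending() {
        if (pending > 0) {
            sink.commit();
            pending = 0;
        }
    }
}

// Simple sink that only counts calls, to show how many commits happen.
final class CountingSink implements DocumentSink {
    int added = 0;
    int commits = 0;

    public void add(String doc) { added++; }
    public void commit() { commits++; }
}
```

With a batch size of 1000, adding 2500 documents and then calling
finish() results in 3 commits instead of 2500.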
Here is the way I have configured the IndexWriter for indexing the BTC
dataset:
dir = new NIOFSDirectory(indexDir);
final PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
analyzer.addAnalyzer(DEFAULT_TUPLE_FIELD, new TupleAnalyzer());
writer = new IndexWriter(dir, analyzer, MaxFieldLength.UNLIMITED);
// Flush buffered documents to disk whenever 32MB of RAM is used
writer.setRAMBufferSizeMB(32);
writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
writer.setMaxBufferedDeleteTerms(IndexWriter.DISABLE_AUTO_FLUSH);
// Disable compound file
writer.setUseCompoundFile(false);
// Increase mergeFactor to optimise indexing
writer.setMergeFactor(20);
By using setRAMBufferSizeMB(32), I am telling Lucene to flush its
buffered documents to disk whenever 32MB of RAM is used. This is the
recommended practice with Lucene. You can increase the value if you have
more memory, or decrease it if your memory is limited. But even with 8
or 16 MB, you should see a big boost in terms of indexing performance.
Then, the other parameters are used to optimise indexing
(setUseCompoundFile(false) and setMergeFactor(20)).
Once the index is created, don't forget to optimise it (call
IndexWriter.optimize()). This operation will take some time,
especially if your index is large, but it will improve the query
performance.
--
Renaud Delbru