[siren-user] basic usage questions

Renaud Delbru renaud.delbru at deri.org
Wed May 12 18:46:08 IST 2010


Hi Mike,

See my comments below.

On 12/05/10 17:35, Mike Grove wrote:
> Hi.  I'm just getting into playing around with Siren and my initial 
> results are a little surprising.  Let me describe my approach and the 
> results and hopefully I can get an idea if I'm going at this the wrong 
> way or if this is expected.
>
> I have a single RDF file I want to index; getting this indexed works 
> fine (and quickly), but there's only ever one search result, the 
> original document.  So that doesn't tell me anything, I already knew 
> that my search term was in the kb document.
>
> Ideally, what I want is to be able to search for some search string and 
> get the list of triples that match.  I didn't see any way 
> to make this work -- someone correct me if I'm wrong.
>
> So what I did was parse the RDF and grab the list of all resources 
> used as subjects in the graph.  Then for each subject, I grabbed all 
> its triples, serialized them as ntriples, and created a "virtual" 
> document where the URI of the document was the URI of the subject of 
> the triples and I indexed these.
Yes, this is the right approach if you want to retrieve the "entity" 
that matches your query.
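
As a rough illustration of the grouping step described above, here is a 
minimal sketch in plain Java. It is not SIREn or Lucene code: the class 
EntityGrouper and the method groupBySubject are hypothetical names, and 
the sketch assumes the input is already serialized as N-Triples, one 
triple per line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of SIREn or Lucene.
class EntityGrouper {

    // Groups N-Triples lines by subject. Each map entry corresponds to one
    // "virtual" document: the key is the subject term (angle brackets kept),
    // the value is the list of triples describing that subject.
    static Map<String, List<String>> groupBySubject(List<String> ntriplesLines) {
        Map<String, List<String>> entities = new LinkedHashMap<>();
        for (String line : ntriplesLines) {
            String trimmed = line.trim();
            // Skip blank lines and comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // In N-Triples the subject is the first whitespace-delimited token
            // (an IRI or a blank node label, neither of which contains spaces).
            String subject = trimmed.split("\\s+", 2)[0];
            entities.computeIfAbsent(subject, k -> new ArrayList<>()).add(trimmed);
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<http://example.org/a> <http://example.org/name> \"Alice\" .",
            "<http://example.org/b> <http://example.org/name> \"Bob\" .",
            "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .");
        groupBySubject(lines).forEach((subject, triples) ->
            System.out.println(subject + " -> " + triples.size() + " triple(s)"));
    }
}
```

Each entry of the resulting map would then become one document to index, 
with the subject as the document URI.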
> This way, once the KB is indexed, my search hits are the resources 
> that match the search term.  Not 100% what I was going for, but close 
> enough.
Can you tell me more about that? Why is it not 100%? Are there some 
entities that your queries do not retrieve (but which should be 
retrieved)?
> This however creates a massive index, and is unexpectedly slow.  The 
> index is about 2x the size of the original RDF; I know the lucene and 
> siren pages claim the indexes are supposed to be 10-30% the size of 
> the original data, so for it to be 2x larger is surprising (471M for 
> the RDF, 929M for the index in my test).
Please correct me if I am wrong.
I think you are currently indexing the triples but also storing them 
within Lucene/SIREn; that is, you define your Lucene document field as:

new Field("url", myData, Store.YES, Index.ANALYZED_NO_NORMS)

The first remark is that Lucene/SIREn is meant to be an index, not a 
data store. Storing that amount of data (the full RDF entity description 
for each document) is not efficient. It will slow down indexing and 
dramatically increase the size of the index (the stored data is not 
compressed). If you are planning to index a large dataset, don't store 
this amount of data directly within Lucene/SIREn. However, you can store 
a few short values in Lucene document fields (timestamp, url / uri, 
list of classes, etc.) that can be useful at retrieval time. But keep in 
mind that the more data you store inside the index, the more the index 
will grow and the more performance is likely to decrease. The performance 
reported on the SIREn website was obtained by indexing data with only two 
fields:
- 'url/uri', stored and indexed, in order to retrieve the uri of the 
entity at query time
- 'content', indexed but not stored, to index the RDF data with SIREn.

You should rely on an external system to store and retrieve the entity 
description. In the Sindice project, we are using HBase as a "document 
repository". SIREn is solely used to index entities and store a few 
values (like the uri of the entity). Then, after having retrieved the 
entities of interest, we send requests to HBase to retrieve 
the content of each entity. This works quite well and is quite 
scalable.
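
The split between index and document repository can be sketched as 
follows. This is a toy model in plain Java, not the Sindice code: the 
class IndexPlusStore and its methods are hypothetical, a small in-memory 
inverted index stands in for SIREn, and a HashMap stands in for HBase. 
The point is only that a search returns URIs, and the full description 
is fetched from the repository in a second step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model for illustration only.
class IndexPlusStore {
    // Stand-in for the index: term -> URIs of matching entities.
    // Only the URI is kept here, never the full description.
    private final Map<String, Set<String>> postings = new HashMap<>();
    // Stand-in for the external repository: URI -> full entity description.
    private final Map<String, String> repository = new HashMap<>();

    void add(String uri, String description) {
        repository.put(uri, description);
        // Naive tokenization: lowercase, split on non-word characters.
        for (String term : description.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(uri);
            }
        }
    }

    // Step 1: query the index; the result is a list of entity URIs.
    List<String> search(String term) {
        Set<String> hits = postings.get(term.toLowerCase());
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    // Step 2: fetch the full description from the repository by URI.
    String fetch(String uri) {
        return repository.get(uri);
    }

    public static void main(String[] args) {
        IndexPlusStore s = new IndexPlusStore();
        s.add("http://example.org/a", "name Alice knows Bob");
        s.add("http://example.org/b", "name Bob");
        for (String uri : s.search("bob")) {
            System.out.println(uri + " : " + s.fetch(uri));
        }
    }
}
```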

A second remark: check that you are using the parameter 
Index.ANALYZED_NO_NORMS (Lucene norms are not useful when using SIREn). 
It will reduce the index size and improve performance.

Hope this helps. Keep me informed if you encounter other problems.

As a side note, I recently used SIREn to index the Billion Triple 
Challenge dataset in order to participate in the SemSearch challenge. I 
was able to index the full BTC dataset in 2 or 3 hours (I don't remember 
the exact time), using the scheme given above.

Cheers,
-- 
Renaud Delbru

