[siren-user] basic usage questions
renaud.delbru at deri.org
Wed May 12 18:46:08 IST 2010
See my comments below.
On 12/05/10 17:35, Mike Grove wrote:
> Hi. I'm just getting into playing around with Siren and my initial
> results are a little surprising. Let me describe my approach and the
> results and hopefully I can get an idea if I'm going at this the wrong
> way or if this is expected.
> I have a single RDF file I want to index; getting this indexed works
> fine (and quickly), but there's only ever one search result, the
> original document. So that doesn't tell me anything, I already knew
> that my search term was in the kb document.
> Ideally, what I want to be able to search for some search string, and
> be able to get the list of triples that match. I didn't see any way
> to make this work -- someone correct me if I'm wrong.
> So what I did was parse the RDF and grab the list of all resources
> used as subjects in the graph. Then for each subject, I grabbed all
> its triples, serialized them as ntriples, and created a "virtual"
> document where the URI of the document was the URI of the subject of
> the triples and I indexed these.
Yes, this is the right way to do it if you want to retrieve the "entity"
that matches your query.
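A minimal sketch of that grouping step, assuming the input is a list of N-Triples lines (the class and method names here are illustrative, not part of SIREn):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EntityGrouper {

    /**
     * Groups N-Triples lines by their subject, producing one
     * "virtual document" (subject URI -> its triples) per entity.
     */
    public static Map<String, List<String>> groupBySubject(List<String> ntriples) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        for (String triple : ntriples) {
            // The subject is the first whitespace-delimited token of the line.
            String subject = triple.split("\\s+", 2)[0];
            docs.computeIfAbsent(subject, k -> new ArrayList<>()).add(triple);
        }
        return docs;
    }
}
```

Each map entry can then be indexed as one document whose URI is the subject, as you describe.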
> This way, once the KB is indexed, my search hits are the resources
> that match the search term. Not 100% what I was going for, but close
Can you tell me more about that? Why is it not 100%? Are there
entities that are not retrieved by your queries (but which should be)?
> This however creates a massive index, and is unexpectedly slow. The
> index is about 2x the size of the original RDF; I know the lucene and
> siren pages claim the indexes are supposed to be 10-30% the size of
> the original data, so for it to be 2x larger is surprising (471M for
> the RDF, 929M for the index in my test).
Please, correct me if I am wrong.
I think you are currently indexing the triples, but also storing them
within Lucene/SIREn, that is you define your Lucene document field as:
new Field("url", myData, Store.YES, Index.ANALYZED_NO_NORMS)
The first remark is that Lucene/SIREn is meant to be an index, not a
data store. Storing that amount of data (the full RDF entity description
for each document) is not efficient: it slows down indexing and
dramatically increases the size of the index (the stored data is not
compressed). If you are planning to index a large dataset, don't store
this amount of data directly within Lucene/SIREn. However, you can store
a few short values in Lucene document fields (timestamp, url/uri,
list of classes, etc.) that can be useful at retrieval time. But keep in
mind that the more data you store inside the index, the more your index
will grow and the more performance is likely to decrease. The performance I
am reporting on the SIREn website was measured by indexing data with only
two fields:
- 'url/uri': stored and indexed, in order to retrieve the uri of the
entity at query time
- 'content': not stored but indexed, to index the RDF data with SIREn.
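With the Lucene 3.x field API quoted above, that two-field schema might look like the following sketch (the method and variable names are illustrative, and SIREn's analyzer wiring is omitted):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** Builds the two-field document described above (sketch only). */
static Document buildEntityDoc(String entityUri, String entityTriples) {
    Document doc = new Document();
    // 'url': stored and indexed, so the entity URI comes back with each hit.
    doc.add(new Field("url", entityUri,
                      Field.Store.YES, Field.Index.ANALYZED_NO_NORMS));
    // 'content': indexed but NOT stored; the full RDF description
    // should live in an external store, not in the index.
    doc.add(new Field("content", entityTriples,
                      Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
    return doc;
}
```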
You should rely on an external system to store and retrieve the entity
description. In the Sindice project, we are using HBase as a "document
repository". SIREn is solely used to index entities and store a few
values (like the uri of the entity). Then, after having retrieved the
entities of interest, we perform requests to HBase to retrieve
the content of each entity. This works quite well and is quite efficient.
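The search-then-fetch split can be sketched as follows; `EntityStore` is a hypothetical interface standing in for the external repository (e.g. HBase) — none of these names come from SIREn or Sindice:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over the external document repository. */
interface EntityStore {
    String fetchDescription(String entityUri);
}

class SearchThenFetch {
    /**
     * The index returns only entity URIs (the stored 'url' field);
     * the full descriptions are fetched from the external store.
     */
    static List<String> retrieve(List<String> matchingUris, EntityStore store) {
        List<String> descriptions = new ArrayList<>();
        for (String uri : matchingUris) {
            descriptions.add(store.fetchDescription(uri));
        }
        return descriptions;
    }
}
```

This keeps the index small: SIREn answers "which entities match", and the repository answers "what does each entity look like".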
A second remark: check that you are using the parameter
Index.ANALYZED_NO_NORMS (Lucene norms are not useful when using SIREn).
It will reduce the index size and improve performance.
Hope this helps, keep me informed if you encounter other problems.
As a side note, I have recently used SIREn to index the Billion Triple
Challenge dataset for participating in the SemSearch challenge. I was
able to index the full BTC dataset in 2 or 3 hours (I don't remember
the exact time) with the schema given above.