[siren-user] basic usage questions

Mike Grove mike at clarkparsia.com
Wed May 12 18:58:20 IST 2010


On Wed, May 12, 2010 at 1:46 PM, Renaud Delbru <renaud.delbru at deri.org> wrote:

> Hi Mike,
>
> >
> > So what I did was parse the RDF and grab the list of all resources
> > used as subjects in the graph.  Then for each subject, I grabbed all
> > its triples, serialized them as ntriples, and created a "virtual"
> > document where the URI of the document was the URI of the subject of
> > the triples and I indexed these.
> Yes, this is the right way to do it if you want to retrieve the "entity"
> that matches your query.
>

Ok, thanks for the sanity check.
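
For illustration, the grouping step described above (one "virtual" document per subject) could be sketched roughly as follows. This is a minimal sketch and not the actual code used: the class name SubjectGrouper and the method groupBySubject are hypothetical, and it assumes the RDF is already serialized as N-Triples with one triple per line, so the subject is the first whitespace-delimited token.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups N-Triples lines into one text blob per subject.
final class SubjectGrouper {

    // Returns a map from subject (URI without angle brackets, or blank node
    // label as-is) to the N-Triples lines that have that subject.
    static Map<String, String> groupBySubject(List<String> ntriplesLines) {
        Map<String, StringBuilder> grouped = new LinkedHashMap<>();
        for (String line : ntriplesLines) {
            String trimmed = line.trim();
            // Skip blank lines and N-Triples comments.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Assumption: the subject is the first whitespace-delimited token.
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // not a complete triple, ignore it
            }
            String subject = parts[0];
            if (subject.startsWith("<") && subject.endsWith(">")) {
                subject = subject.substring(1, subject.length() - 1);
            }
            grouped.computeIfAbsent(subject, k -> new StringBuilder())
                   .append(trimmed)
                   .append('\n');
        }
        // Freeze the builders into plain strings, preserving first-seen order.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : grouped.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }
}
```

Each map entry would then correspond to one document: the key is the document URI and the value is the text to index.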


> > This way, once the KB is indexed, my search hits are the resources
> > that match the search term.  Not 100% what I was going for, but close
> > enough.
> Can you tell me more about that? Why is it not 100%? Are there some
> entities that are not retrieved by your queries (but which should be
> retrieved)?
>

It's not 100% what I was looking for because I was trying to get the
specific triples where the search query occurs, so only knowing which
resource has a triple where the search term occurs is less granular than I
was hoping.  But for what I hope to ultimately build out with this
framework, that will probably be sufficient.
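
As an aside, one possible way to get back to triple-level granularity from entity-level hits is to post-filter the matched entity's triples outside the index. The sketch below is only an illustration under assumptions: TripleFilter and matchingTriples are hypothetical names, the entity's triples are assumed to be available as N-Triples text (for example fetched from the RDF KB), and the match is a plain case-insensitive substring test, which will not behave exactly like the index's analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: narrows an entity-level hit down to individual triples.
final class TripleFilter {

    // Returns the N-Triples lines of one entity that contain the search term.
    // Assumption: a case-insensitive substring match is an acceptable
    // approximation of what the index matched.
    static List<String> matchingTriples(String entityNTriples, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String line : entityNTriples.split("\n")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```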


> > This however creates a massive index, and is unexpectedly slow.  The
> > index is about 2x the size of the original RDF; I know the lucene and
> > siren pages claim the indexes are supposed to be 10-30% the size of
> > the original data, so for it to be 2x larger is surprising (471M for
> > the RDF, 929M for the index in my test).
>


> Please, correct me if I am wrong.
> I think you are currently indexing the triples, but also storing them
> within Lucene/SIREn, that is you define your Lucene document field as:
>
> new Field("url", myData, Store.YES, Index.ANALYZED_NO_NORMS)
>

I'm not sure =)

This is my exact code for creating the Document which gets indexed, I
adapted this from the demo code:

Document aDoc = new Document();
aDoc.add(new Field("url", aSubjURI, Field.Store.YES,
                   Field.Index.NOT_ANALYZED_NO_NORMS));
aDoc.add(new Field(DEFAULT_FIELD, aSubjGraphAsNTriples, Field.Store.NO,
                   Field.Index.ANALYZED_NO_NORMS));
aWriter.addDocument(aDoc);
aWriter.commit();

aSubjURI is a string which is the URI of the subject whose graph is being
indexed; aSubjGraphAsNTriples is what it sounds like: the graph of triples
where the specified URI is the subject, serialized as NTriples.

Is this the correct way I should be saving and indexing the data?


>
> The first remark is that Lucene/SIREn is meant to be an index, not a
> data store.


Agreed.


> Storing that amount of data (the full RDF entity description
> for each document) is not efficient. It will slow down the indexing, and
> dramatically increase the size of the index (the stored data is not
> compressed).


Again, agreed.  I don't want to save the data in the index, I've got an RDF
KB for that.  I just want the data indexed.


> If you are planning to index a large dataset, don't store
> this amount of data directly within Lucene/SIREn.


If I've inadvertently stored the data in the index (see above code) it was
not intentional =)


> However, you can store
> a few short values with Lucene document fields (timestamp, url / uri,
> list of classes, etc.) that can be useful at retrieval time. But keep in
> mind that the more data you store inside the index, the more your index
> will grow and the more the performance is likely to decrease. The performance I
> am reporting on the SIREn website is from indexing data with only two fields:
> - 'url/uri', stored and indexed, in order to retrieve the uri of the
> entity at query time
> - 'content', non stored and indexed, to index RDF data with SIREn.
>

This sounds like what I'm going for.


>
> You should rely on an external system to store and retrieve the entity
> description. In the Sindice project, we are using HBase as a "document
> repository". SIREn is solely used to index entities and store a few
> values (like the uri of the entity). Then, after having retrieved the
> entities of interest, we perform requests to HBase to retrieve
> the content of each entity. This works quite well and is quite
> scalable.
>

Again, this is fine.  I've already got the RDF in a database, I planned on
using the URI from the search hits to cobble together the relevant pieces of
RDF from the database to return to the user.
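
The retrieval flow discussed here (the index returns only URIs, and the entity descriptions come from an external store) might look roughly like the sketch below. It is only an illustration under assumptions: EntityLookup, EntityStore, and resolveHits are hypothetical names, and the store interface stands in for whatever backend holds the data (HBase in Sindice, an RDF database in this case).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: resolves search-hit URIs against an external store.
final class EntityLookup {

    // Stand-in for the external document repository. Returns the entity
    // description for a URI, or null if the store has nothing for it.
    interface EntityStore {
        String describe(String uri);
    }

    // Fetches the description for each hit URI, in hit order, skipping
    // URIs the store does not know about.
    static List<String> resolveHits(List<String> hitUris, EntityStore store) {
        List<String> descriptions = new ArrayList<>();
        for (String uri : hitUris) {
            String description = store.describe(uri);
            if (description != null) {
                descriptions.add(description);
            }
        }
        return descriptions;
    }
}
```

Keeping the store behind a small interface means the index only ever needs the stored URI field, which matches the two-field setup described above.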


>
> A second remark: check that you are using the parameter
> Index.ANALYZED_NO_NORMS (Lucene norms are not useful when using SIREn).
> It will reduce the index size and improve performance.
>
> Hope this helps, keep me informed if you encounter other problems.
>

Sounds like I've just got a newbie setup problem, so I'm curious to hear
your sanity check on my above code.

Thanks for the help.

Cheers,

Mike