[siren-user] siren question

Marc Hadfield marc at hadfield.org
Fri Jan 8 22:59:18 GMT 2010


Hi Renaud -

Thanks for the quick response.  I'm very familiar with Lucene, but 
wasn't sure what role the Document had in your library -- thanks for the 
clarification.  For example, a "Document" could be a named graph.  The 
use-case of each unique URI being a unique document makes sense -- in 
fact, that's how I have indexed semantic data previously via Lucene.
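
As a rough sketch of that one-document-per-URI grouping: the snippet below is plain Python, not the SIREn or Lucene API. The function name, the sample URIs, and the simplified line parsing (the subject is taken to be everything up to the first whitespace) are assumptions made for illustration only.

```python
from collections import defaultdict


def group_triples_by_subject(ntriples_text):
    """Group N-Triples lines by subject (hypothetical helper).

    Simplified parsing, assumed for illustration: the subject is the
    first whitespace-delimited token of each line.
    """
    entities = defaultdict(list)
    for line in ntriples_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # Malformed line: no predicate/object part.
        entities[parts[0]].append(line)
    return dict(entities)


# Made-up sample data: two subjects, three triples.
SAMPLE = """\
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
"""

# Each key would become one "document"; its value is that entity's triples.
docs = group_triples_by_subject(SAMPLE)
for subject, triples in docs.items():
    print(subject, len(triples))
```

Each key of the resulting dictionary would then map to one indexed document, with the grouped triples as its content.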

Regarding SVN, I wanted to make sure I had the most up-to-date 
source code.

Best,
Marc


Renaud Delbru wrote:
> Hi Marc,
>
> On 08/01/10 22:41, Marc Hadfield wrote:
>> I just started to take a look at the siren-0.1 library.  Very 
>> interesting.
>>    
> Thanks.
>> some questions:
>>
>> 1) what is the role of Lucene Documents in the index?  I'm looking at
>> the N-triples example, and I see multiple documents being added, and
>> documents being reported as the "hits".  At index time should triples be
>> grouped into documents -- such as each unique URI getting a document?
>> Each document should be a unique "entity"?
>>    
> In Lucene, a document is the unit of information that will be 
> retrieved. In fact, you can use it to index not only documents per se 
> (web documents, documents on your desktop, etc.) but also other kinds 
> of entities (a record in a database, an RDF resource, etc.). The 
> approach to adopt depends on your use case.
> In the N-Triples example, we are indexing RDF documents (we adopt a 
> document-centric view), that is, the RDF files that have been crawled 
> on the web. In that case, the content of the "Lucene Document" will be 
> the list of N-Triples found in the RDF file. Hence, it can contain 
> descriptions of multiple RDF resources (more than one set of 
> N-Triples with the same subject).
> In the second example, "Entity centric indexing and searching", a 
> Lucene document represents a single RDF resource (we adopt an 
> entity-centric view). In that case, the content of the Lucene document 
> will be a set of triples having the same subject. In general, some data 
> pre-processing is required in order to group triples having the same 
> subject from an RDF document or database.
> So, as you can see from these two examples, SIREn is flexible in the 
> way data is indexed. You can either group triples on a per-document 
> basis if you want to retrieve the document or URL of the document 
> matching a query (as is currently done in Sindice beta1), or group 
> triples on a per-entity basis if you want to retrieve the entity or 
> URI of the entity matching a query (as will be done in Sindice 
> beta2).
>> 2) is there a SVN or other repository for the siren library?
>>    
> For the moment, the code repository is restricted to the people in our 
> institute. We need to look at some solutions to provide read-only 
> access to the repository. Would you be interested in contributing?
>> 3) is the code available that was used for the benchmarking, such as
>> loading the billion triple set?
>>    
> The benchmark was performed with an early prototype of SIREn, but 
> you should get similar results with the code available in the first 
> SIREn release.
>
> If you have more questions, feel free to ask.
> Cheers,


