[siren-user] siren question
Marc Hadfield
marc at hadfield.org
Fri Jan 8 22:59:18 GMT 2010
Hi Renaud -
Thanks for the quick response. I'm very familiar with Lucene, but
wasn't sure what role the Document had in your library -- thanks for the
clarification. For example, a "Document" could be a named graph. The
use-case of each unique URI being a unique document makes sense -- in
fact, that's how I have indexed semantic data previously via Lucene.
Regarding the svn, I wanted to make sure I had the most up-to-date
source code.
Best,
Marc
Renaud Delbru wrote:
> Hi Marc,
>
> On 08/01/10 22:41, Marc Hadfield wrote:
>> I just started to take a look at the siren-0.1 library. Very
>> interesting.
>>
> Thanks.
>> some questions:
>>
>> 1) what is the role of Lucene Documents in the index? I'm looking at
>> the N-triples example, and I see multiple documents being added, and
>> documents being reported as the "hits". At index time should triples be
>> grouped into documents -- such as each unique URI getting a document?
>> Each document should be a unique "entity"?
>>
> In Lucene, the concept of document represents the unit of information
> that will be retrieved. In fact, you can use it to index not only
> documents per se (web documents, documents on your desktop, etc.) but
> also other kinds of entities (a record in a database, an RDF resource,
> etc.). The position to adopt depends on your use-case.
> In the n-triple example, we are indexing RDF documents (we adopt a
> document-centric view), that is the RDF files that have been crawled
> on the web. In that case, the content of the "Lucene Document" will be
> the list of n-triples found in the RDF file. Hence, it can contain
> descriptions of multiple RDF resources (more than one set of
> n-triples with the same subject).
> In the second example "Entity centric indexing and searching", a
> Lucene document represents a single RDF resource (we adopt an
> entity-centric view). In that case, the content of the Lucene document will
> be a set of triples having the same subject. In general, some data
> pre-processing is required in order to group triples having the same
> subject from an RDF document or database.
> So, as you can see with these two examples, SIREn is flexible in the
> way data is indexed. You could choose either to group triples on a
> per-document basis if you want to retrieve the document or URL of the
> document matching a query (as it is currently done in Sindice beta1),
> or to group triples on a per-entity basis if you want to
> retrieve the entity or URI of the entity matching a query (as it will
> be done in Sindice beta2).
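The two grouping strategies described above can be sketched independently of any indexing library. The snippet below is only an illustration under stated assumptions: it does not use the SIREn or Lucene API, the helper names (parse_ntriple, group_by_document, group_by_entity) are hypothetical, and the parser is deliberately naive (whitespace-separated terms, no escaping rules), not a full N-Triples parser. Each resulting group is what would become the content of one "Lucene Document".

```python
from collections import defaultdict


def parse_ntriple(line):
    """Naively split one N-Triples line into (subject, predicate, object).

    Assumption: terms are whitespace-separated and the statement ends with '.'.
    Returns None for blank lines and comments.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    if line.endswith("."):
        line = line[:-1].rstrip()
    subject, predicate, obj = line.split(None, 2)
    return subject, predicate, obj


def group_by_document(files):
    """Document-centric view: one group per RDF file (keyed by its URL).

    A group may describe several resources, i.e. contain several subjects.
    """
    groups = {}
    for url, lines in files.items():
        groups[url] = [t for t in map(parse_ntriple, lines) if t is not None]
    return groups


def group_by_entity(files):
    """Entity-centric view: one group per subject URI, across all files.

    This is the pre-processing step that collects triples sharing a subject.
    """
    groups = defaultdict(list)
    for lines in files.values():
        for triple in map(parse_ntriple, lines):
            if triple is not None:
                groups[triple[0]].append(triple)
    return dict(groups)
```

With the first function, a query hit identifies the RDF file (its URL); with the second, a hit identifies the entity (its URI), at the cost of the extra grouping pass over the data.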
>> 2) is there a SVN or other repository for the siren library?
>>
> For the moment, the code repository is restricted to the people in our
> institute. We need to look at some solutions to provide read-only
> access to the repository. Would you be interested in contributing?
>> 3) is the code available that was used for the benchmarking, such as
>> loading the billion triple set?
>>
> The benchmark was performed with an early prototype of SIREn, but
> you should get similar results with the code available in the first
> SIREn release.
>
> If you have more questions, feel free to ask.
> Cheers,