[siren-user] basic usage questions

Renaud Delbru renaud.delbru at deri.org
Thu May 13 18:36:22 IST 2010


On 13/05/10 17:58, Mike Grove wrote:
>
> Yes, unfortunately the exact data I'm testing on is private. I have a 
> public dataset [1] that is about half the size (3+ million triples) 
> that probably would exhibit the same behavior.
OK, I may have a look, but it seems the cause is the way you are 
splitting your kb.
>
>     How are you splitting the document to create entity descriptions?
>     By taking all triples having the same URI in the subject position
>     (i.e., all outgoing RDF triples) or all triples having the same
>     URI in the subject or object position (i.e., all incoming and
>     outgoing triples)?
>
>
> This is the code snippet I'm using to split up the kb into documents 
> (uses our public sesame utils, so won't compile with just sesame API, 
> but you should get the idea of what I'm doing).
>
>     // Collect every distinct subject URI in the kb
>     Collection<URI> aSubjs = new HashSet<URI>();
>
>     TupleQueryResult aResult =
>         mRepo.selectQuery(SesameQuery.serql("select distinct s from {s} p {o}"));
>     for (BindingSet aBinding : iterable(aResult)) {
>         Value aValue = aBinding.getValue("s");
>         if (aValue instanceof URI) {
>             aSubjs.add((URI) aValue);
>         }
>     }
>
>     // Index one Lucene document per subject URI
>     for (URI aSubj : aSubjs) {
>
>         ExtGraph aGraph = aSource.describe(aSubj);
>
>         StringWriter aStrWriter = new StringWriter();
>         aGraph.write(aStrWriter, RDFFormat.NTRIPLES);
>         String aGraphAsNTriples = aStrWriter.toString();
>
>         Document aDoc = new Document();
>         aDoc.add(new Field("url", aSubj.stringValue(),
>             Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
>         aDoc.add(new Field(SearchConstants.DEFAULT_FIELD, aGraphAsNTriples,
>             Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
>
>         aWriter.addDocument(aDoc);
>     }
>
> Pretty basic code.
Do you know what aSource.describe(aSubj) is doing? I know that, 
depending on the triple store, the semantics of the describe clause can 
be different [1]. Some will split based on the subject only, others based 
on the subject and object, or based on more complex techniques (like 
cbd, scbd, msgs [2]).
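
To make the difference between the two simplest strategies concrete, here 
is a minimal sketch in plain Java. It does not use Sesame or SIREn; the 
Triple class, both method names and the "starts with http" URI test are 
assumptions made for illustration only, and it says nothing about what 
your describe() actually does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EntitySplitSketch {

    /** Hypothetical stand-in for an RDF triple; terms are plain strings. */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    /** Outgoing triples only: every triple lands in exactly one entity. */
    static Map<String, List<Triple>> splitBySubject(List<Triple> triples) {
        Map<String, List<Triple>> docs = new LinkedHashMap<>();
        for (Triple t : triples) {
            docs.computeIfAbsent(t.s, k -> new ArrayList<>()).add(t);
        }
        return docs;
    }

    /** Incoming and outgoing triples: a triple with a URI object is copied into a second entity. */
    static Map<String, List<Triple>> splitBySubjectAndObject(List<Triple> triples) {
        Map<String, List<Triple>> docs = splitBySubject(triples);
        for (Triple t : triples) {
            // Assumption: a term is a URI if it starts with "http"; literals are skipped.
            if (t.o.startsWith("http") && !t.o.equals(t.s)) {
                docs.computeIfAbsent(t.o, k -> new ArrayList<>()).add(t);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Triple> kb = List.of(
            new Triple("http://ex.org/a", "http://ex.org/knows", "http://ex.org/b"),
            new Triple("http://ex.org/a", "http://ex.org/name", "Alice"),
            new Triple("http://ex.org/b", "http://ex.org/name", "Bob"));
        System.out.println(splitBySubject(kb).get("http://ex.org/b").size());          // 1
        System.out.println(splitBySubjectAndObject(kb).get("http://ex.org/b").size()); // 2
    }
}
```

With subject-only splitting the entity for http://ex.org/b holds one 
triple; with subject-and-object splitting it also receives the "knows" 
triple, which is already stored under http://ex.org/a.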

If you are not splitting by subject only, then you start to duplicate 
data (many identical triples will appear in different entities). And 
given your index size (3 to 4 times the size of the original dataset), 
I guess that your describe method is not splitting by subject only. I 
doubt that the index size overhead is due to storing the URI of each 
entity in the 'url' field.
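
One rough way to check for this kind of duplication is to compare the 
number of triples written to the index with the number of triples in the 
original dataset. The sketch below is only an illustration under stated 
assumptions: triples are String[3] arrays, a term is treated as a URI if 
it starts with "http", and every URI object gets its own entity. The 
class and method names are hypothetical.

```java
import java.util.List;

public class DuplicationFactorSketch {

    /**
     * Ratio of indexed triples to original triples when each entity
     * holds both its incoming and its outgoing triples.
     */
    static double duplicationFactor(List<String[]> triples) {
        if (triples.isEmpty()) {
            return 1.0;
        }
        long indexed = 0;
        for (String[] t : triples) {
            String subject = t[0];
            String object = t[2];
            indexed++; // copy stored in the subject's entity
            // Assumption: URI objects start with "http" and get their own entity.
            if (object.startsWith("http") && !object.equals(subject)) {
                indexed++; // second copy stored in the object's entity
            }
        }
        return (double) indexed / triples.size();
    }

    public static void main(String[] args) {
        List<String[]> kb = List.of(
            new String[] {"http://ex.org/a", "http://ex.org/knows", "http://ex.org/b"},
            new String[] {"http://ex.org/a", "http://ex.org/name", "Alice"},
            new String[] {"http://ex.org/b", "http://ex.org/name", "Bob"});
        // 4 indexed triples for 3 original triples
        System.out.println(duplicationFactor(kb));
    }
}
```

A factor of exactly 1 corresponds to subject-only splitting. This counts 
triples only, so it is not a prediction of index size on disk, and a 
describe implementation that expands further than one hop would 
duplicate more than this sketch shows.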

[1] http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050419/#describe
[2] http://sw.deri.org/2007/07/sitemapextension/#slicing
-- 
Renaud Delbru

