[siren-user] basic usage questions
Renaud Delbru
renaud.delbru at deri.org
Thu May 13 18:36:22 IST 2010
On 13/05/10 17:58, Mike Grove wrote:
>
> Yes, unfortunately the exact data I'm testing on is private. I have a
> public dataset [1] that is about half the size (3+ million triples)
> that probably would exhibit the same behavior.
OK, I may have a look, but it seems the problem is the way you are
splitting your kb.
>
> How are you splitting the document to create entity description ?
> By taking all triples having the same URI on the subject position
> (i.e., all outgoing RDF triples) or all triples having the same
> URI on the subject and object position (i.e. all incoming and
> outgoing triples).
>
>
> This is the code snippet I'm using to split up the kb into documents
> (uses our public sesame utils, so won't compile with just sesame API,
> but you should get the idea of what I'm doing).
>
>     Collection<URI> aSubjs = new HashSet<URI>();
>
>     TupleQueryResult aResult =
>         mRepo.selectQuery(SesameQuery.serql("select distinct s from {s} p {o}"));
>     for (BindingSet aBinding : iterable(aResult)) {
>         Value aValue = aBinding.getValue("s");
>         if (aValue instanceof URI) {
>             aSubjs.add((URI) aValue);
>         }
>     }
>
>     for (URI aSubj : aSubjs) {
>
>         ExtGraph aGraph = aSource.describe(aSubj);
>
>         StringWriter aStrWriter = new StringWriter();
>         aGraph.write(aStrWriter, RDFFormat.NTRIPLES);
>         String aGraphAsNTriples = aStrWriter.toString();
>
>         Document aDoc = new Document();
>         aDoc.add(new Field("url", aSubj.stringValue(),
>             Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
>         aDoc.add(new Field(SearchConstants.DEFAULT_FIELD,
>             aGraphAsNTriples, Field.Store.NO,
>             Field.Index.ANALYZED_NO_NORMS));
>
>         aWriter.addDocument(aDoc);
>     }
>
> Pretty basic code.
Do you know what aSource.describe(aSubj) is doing? I know that,
depending on the triple store, the semantics of the describe clause can
differ [1]. Some will split based on the subject only, others based
on the subject and object, or based on more complex techniques (like
CBD, SCBD, MSGs [2]).
If you are not splitting by subject only, then you start to duplicate
data (many identical triples will appear in different entities). And
given your index size (3 to 4 times the size of the original dataset),
I guess that your describe method is not splitting by subject only. I
doubt that the index size overhead is due to storing the URI of each
entity in the 'url' field.
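
To illustrate the duplication, here is a minimal sketch in plain Java.
It does not use the Sesame or SIREn API: the Triple class, the
EntitySplitSketch class and its method names are made up for this
example, and it assumes that any object starting with "http" is a URI.
It only compares how many triples end up in the entity documents under
the two splitting strategies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain JDK, not the Sesame or SIREn API.
class EntitySplitSketch {

    /** A minimal RDF statement; all three terms are kept as plain strings. */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    /** Subject-only split: every triple lands in exactly one entity. */
    static Map<String, List<Triple>> bySubject(List<Triple> kb) {
        Map<String, List<Triple>> docs = new LinkedHashMap<>();
        for (Triple t : kb) {
            docs.computeIfAbsent(t.s, k -> new ArrayList<>()).add(t);
        }
        return docs;
    }

    /** Subject+object split: a triple is also copied into the entity of its object. */
    static Map<String, List<Triple>> bySubjectAndObject(List<Triple> kb) {
        Map<String, List<Triple>> docs = bySubject(kb);
        for (Triple t : kb) {
            // Assumption for the sketch: objects starting with "http" are URIs.
            if (t.o.startsWith("http") && !t.o.equals(t.s)) {
                docs.computeIfAbsent(t.o, k -> new ArrayList<>()).add(t);
            }
        }
        return docs;
    }

    /** Total number of triples across all entity documents. */
    static int indexedTriples(Map<String, List<Triple>> docs) {
        int n = 0;
        for (List<Triple> d : docs.values()) {
            n += d.size();
        }
        return n;
    }

    /** A tiny made-up kb: two resources that reference each other. */
    static List<Triple> sampleKb() {
        return List.of(
            new Triple("http://ex.org/a", "http://ex.org/knows", "http://ex.org/b"),
            new Triple("http://ex.org/a", "http://ex.org/name", "Alice"),
            new Triple("http://ex.org/b", "http://ex.org/knows", "http://ex.org/a"),
            new Triple("http://ex.org/b", "http://ex.org/name", "Bob"));
    }

    public static void main(String[] args) {
        List<Triple> kb = sampleKb();
        System.out.println("subject only:     " + indexedTriples(bySubject(kb)));
        System.out.println("subject + object: " + indexedTriples(bySubjectAndObject(kb)));
    }
}
```

With this 4-triple sample, the subject-only split indexes 4 triples and
the subject+object split indexes 6, since each "knows" triple is stored
in two entities. On a densely linked dataset the ratio can grow much
higher.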
[1] http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050419/#describe
[2] http://sw.deri.org/2007/07/sitemapextension/#slicing
--
Renaud Delbru