[siren-user] NumericRangeQuery in Siren...?
Renaud Delbru
renaud.delbru at deri.org
Tue Dec 14 16:44:57 GMT 2010
Hi Mike, Jeroen,
On 09/12/10 14:46, Mike Hugo wrote:
> I was also looking at extending SIREn to support prefix and wildcard
> queries but probably won't be able to get to it for a while. I
> noticed that classes like SirenPhraseQuery and SirenTupleQuery were
> basically taken from Lucene and adapted for the Siren use case -- is
> there a reason these classes cannot just extend from the Lucene Query
> classes, rather than wholesale copying and adapting?
No, this is not possible, because of the way SIREn works.
SIREn uses a different index data structure than Lucene. SIREn stores
additional information, such as the tuple and cell ids, which is used
during query processing. Lucene query classes are not aware of this
additional information, and therefore cannot be used to construct
Cell or Tuple queries.
How does query processing work in Lucene and SIREn? Everything starts
with the basic query classes, such as TermQuery (or SirenTermQuery) or
PhraseQuery (or SirenPhraseQuery); PrefixTermQuery, FuzzyTermQuery,
RangeQuery, and others are also what I call basic query classes. These
classes are the building blocks for creating more complex queries.
During query processing, Lucene and SIREn first process these basic
queries in order to get, upfront, the information needed to answer more
complex queries (doc ids and term positions in Lucene; doc ids, term
positions, tuple and cell ids in SIREn).
If you use a Lucene TermQuery, then the tuple and cell information
will not be available upfront, and therefore SIREn will not be able to
compute the correct results for SirenTupleQuery or SirenCellQuery.
This is the reason why, in SIREn, we have to reimplement all the basic
query classes. Most of the time, this is not much work: it is just a
question of making the tuple and cell ids available for upfront query
processing.
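To illustrate the point above, here is a minimal sketch of why postings
that carry only doc ids and positions cannot answer a "same cell"
constraint, while postings that also carry tuple and cell ids can. The
names Posting, CellMatcher, withoutStructure and sameCell are
hypothetical and are not the actual Lucene or SIREn classes; the
matching loop is deliberately naive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of one index entry for a term.
// This is an illustration only, not the real Lucene or SIREn structure.
final class Posting {
    final int doc;
    final int position;
    final int tuple; // -1 when the index does not record a tuple id
    final int cell;  // -1 when the index does not record a cell id

    Posting(int doc, int position, int tuple, int cell) {
        this.doc = doc;
        this.position = position;
        this.tuple = tuple;
        this.cell = cell;
    }

    // What a plain term query would provide: doc id and position only.
    static Posting withoutStructure(int doc, int position) {
        return new Posting(doc, position, -1, -1);
    }
}

final class CellMatcher {
    // Returns the doc ids in which both terms occur in the same tuple
    // and the same cell. Postings without tuple and cell ids can never
    // satisfy the constraint, because there is nothing to compare.
    static List<Integer> sameCell(List<Posting> a, List<Posting> b) {
        List<Integer> docs = new ArrayList<>();
        for (Posting x : a) {
            for (Posting y : b) {
                boolean structured =
                    x.tuple >= 0 && x.cell >= 0 && y.tuple >= 0 && y.cell >= 0;
                boolean sameDoc = x.doc == y.doc;
                boolean samePlace = x.tuple == y.tuple && x.cell == y.cell;
                if (structured && sameDoc && samePlace && !docs.contains(x.doc)) {
                    docs.add(x.doc);
                }
            }
        }
        return docs;
    }
}
```

With structured postings the matcher can tell apart two terms that share
a document but sit in different tuples; with postings built by
withoutStructure it has no basis for a decision and returns nothing.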
> Most of the code looks relatively the same with the exception of the
> scorer - is that the essential difference? In the current form it's
> difficult to see what needs to be changed to support the SIREn use case.
Yes, most of the time, it is only the Scorer that needs to be
reimplemented. In Lucene, the *Query classes are just an end-user
interface. All the query processing is in fact done in the associated
*Scorer classes (which are hidden from the end user).
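The split described above can be sketched in miniature as follows. This
is an assumption-laden illustration: MiniScorer, MiniTermQuery,
MiniSirenTermQuery, DocOnlyScorer and StructuredScorer are hypothetical
names, and their signatures do not match the real Lucene or SIREn APIs.
The two query classes look almost identical; the difference is in what
their scorers expose.

```java
// Hypothetical miniature of the Query/Scorer split (not a real API).
interface MiniScorer {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Advances to the next matching entry and returns its doc id,
    // or NO_MORE_DOCS when the entries are exhausted.
    int nextDoc();
}

// Scorer that exposes doc ids only. Each entry is a row whose first
// column is the doc id; rows are assumed sorted, one row per doc.
final class DocOnlyScorer implements MiniScorer {
    private final int[][] entries;
    private int index = -1;

    DocOnlyScorer(int[][] entries) { this.entries = entries; }

    public int nextDoc() {
        index++;
        return index < entries.length ? entries[index][0] : NO_MORE_DOCS;
    }
}

// Scorer that also exposes the tuple and cell ids of the current entry.
// Rows are assumed to be {doc, tuple, cell}.
final class StructuredScorer implements MiniScorer {
    private final int[][] entries;
    private int index = -1;

    StructuredScorer(int[][] entries) { this.entries = entries; }

    public int nextDoc() {
        index++;
        return index < entries.length ? entries[index][0] : NO_MORE_DOCS;
    }

    // Only valid after nextDoc() has returned a real doc id.
    int tuple() { return entries[index][1]; }
    int cell() { return entries[index][2]; }
}

// The query objects only carry the user's request; the work is in the scorer.
final class MiniTermQuery {
    final String term;
    MiniTermQuery(String term) { this.term = term; }
    DocOnlyScorer scorer(int[][] entries) { return new DocOnlyScorer(entries); }
}

final class MiniSirenTermQuery {
    final String term;
    MiniSirenTermQuery(String term) { this.term = term; }
    StructuredScorer scorer(int[][] entries) { return new StructuredScorer(entries); }
}
```

A tuple- or cell-level query built on top of these would need the
tuple() and cell() accessors, which only the second scorer provides.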
> SIREn has been a huge help to us in searching RDF triples, I'm hoping
> that we'll be able to contribute back at some point. I know there
> isn't a public source repository (yet) - what do you think of creating
> a repository at github.com <http://github.com> (git),
> http://bitbucket.org/ (mercurial) or code.google.com
> <http://code.google.com> / assembla.com <http://assembla.com>
> (subversion) ?
The main problem is that the SIREn repository is part of the Sindice
project, and is therefore linked to other components of the Sindice
project. We are not able to make the svn repository public at the
moment for security reasons.
One solution would be to periodically synchronise our private svn
repository to a public one (e.g., github), but I don't know if this is
easy to do. If you have experience with this problem, comments and
advice are welcome.
Kind Regards,
--
Renaud Delbru