[siren-user] Length normalisation
renaud.delbru at deri.org
Tue Dec 14 16:52:09 GMT 2010
On 09/12/10 14:51, Jeroen Steggink wrote:
> Hi Renaud,
> If I'm correct, there is currently no length normalisation in the
> object cell. I'd like to add this, but I'm not sure how I can get a
> count of the total number of words in the object cell.
Do you want to know the total number of words in the object cell for
filtering purposes? E.g., to ignore certain object cells if they contain
too many words?
If this is the case, then you just have to implement a new Filter, which
counts the number of tokens processed and stops returning tokens after a
certain number of tokens has been processed.
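To illustrate the idea, here is a minimal sketch in plain Java. It is not
SIREn or Lucene code: the class name TokenLimitFilter, the maxTokens
parameter and the use of Iterator<String> as a token source are all
assumptions made for the example. In Lucene, the same counting logic would
presumably go into the incrementToken() method of a TokenFilter subclass.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical stand-in for a token filter (not the Lucene/SIREn API):
 * wraps a token source and stops returning tokens once maxTokens
 * tokens have been emitted.
 */
final class TokenLimitFilter implements Iterator<String> {
    private final Iterator<String> input; // upstream token source
    private final int maxTokens;          // assumed cut-off, chosen by the caller
    private int emitted = 0;              // number of tokens returned so far

    TokenLimitFilter(Iterator<String> input, int maxTokens) {
        if (maxTokens < 0) {
            throw new IllegalArgumentException("maxTokens must be >= 0");
        }
        this.input = input;
        this.maxTokens = maxTokens;
    }

    @Override
    public boolean hasNext() {
        // Stop as soon as the limit is reached, even if input has more tokens.
        return emitted < maxTokens && input.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        emitted++;
        return input.next();
    }

    /** Number of tokens this filter has returned so far. */
    int tokensEmitted() {
        return emitted;
    }

    /** Helper for the example: drains an iterator into a list. */
    static List<String> collect(Iterator<String> tokens) {
        List<String> out = new ArrayList<>();
        while (tokens.hasNext()) {
            out.add(tokens.next());
        }
        return out;
    }
}
```

Wrapping the tokens of an object cell in new TokenLimitFilter(tokens, 2)
and draining it with collect() keeps only the first two tokens. The
running count is also available through tokensEmitted(), which could
serve as a starting point if the count is needed for something other
than truncation.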
> It is quite hard to get a good understanding of the workings of Siren
> because of its complexity and at some points the lack of
> documentation in the code. Do you have any documentation available to
> make it all a little bit more clear so I know how to extend it?
For the theoretical part, you can refer to the research papers published
about SIREn [1,2]. They will give you an idea of the type of index data
structure we are using, and the way query processing works.
Once this is clear in your mind, I can explain how we implement this on
top of Lucene, and point you to the important parts of the code.
> Kind regards,