[siren-user] Length normalisation

Renaud Delbru renaud.delbru at deri.org
Tue Dec 14 16:52:09 GMT 2010


Hi Jeroen,

On 09/12/10 14:51, Jeroen Steggink wrote:
> Hi Renaud,
> If I'm correct, there is currently no length normalisation in the 
> object cell. I'd like to add this, but I'm not sure how I can get a 
> count of the total number of words in the object cell.
Do you want to know the total number of words in the object cell for 
filtering purposes? E.g., to ignore certain object cells if they contain 
too many words?
If so, you just have to implement a new Filter that counts the tokens 
it has processed and stops returning tokens once a certain number has 
been reached.
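To illustrate the idea, here is a minimal sketch of such a counting filter. It is not SIREn or Lucene code: to stay self-contained it wraps a plain java.util.Iterator<String> instead of a Lucene TokenStream, and the names LimitingTokenIterator, maxTokens and tokensSeen are hypothetical. In a real Lucene TokenFilter the same counter logic would presumably sit in incrementToken().

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical stand-in for a token filter: passes tokens through from the
 * wrapped iterator and stops once maxTokens tokens have been returned.
 */
final class LimitingTokenIterator implements Iterator<String> {
    private final Iterator<String> input; // upstream token source
    private final int maxTokens;          // assumed limit, chosen by the caller
    private int seen = 0;                 // number of tokens returned so far

    LimitingTokenIterator(Iterator<String> input, int maxTokens) {
        if (maxTokens < 0) {
            throw new IllegalArgumentException("maxTokens must be >= 0");
        }
        this.input = input;
        this.maxTokens = maxTokens;
    }

    @Override
    public boolean hasNext() {
        // Stop as soon as the limit is reached, even if input has more tokens.
        return seen < maxTokens && input.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        seen++;
        return input.next();
    }

    /** Number of tokens returned so far. */
    int tokensSeen() {
        return seen;
    }

    /** Helper for the example: drains an iterator into a list. */
    static List<String> collect(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Wrapping the tokens of an object cell with a limit of 2 would return only the first two tokens. The counter could also be used the other way round, to measure the cell length without truncating it.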
> It is quite hard to get a good understanding of the workings of SIREn 
> because of its complexity and, at some points, the lack of 
> documentation in the code. Do you have any documentation available to 
> make it all a little bit clearer so I know how to extend it?
For the theoretical part, you can refer to the research papers published 
about SIREn [1,2]. They will give you an idea of the type of index data 
structure we are using, and of the way query processing works.
Once this is clear in your mind, I can explain how we implement this on 
top of Lucene, and point you to the important parts of the code.

[1] http://renaud.delbru.fr/doc/pub/eswc2010-siren.pdf
[2] http://renaud.delbru.fr/doc/pub/fdia2009-siren.pdf

Kind Regards,
-- 
Renaud Delbru
> Kind regards,
> Jeroen