[siren-user] Length normalisation
renaud.delbru at deri.org
Wed Dec 15 12:05:43 GMT 2010
On 15/12/10 09:57, Jeroen Steggink wrote:
> Hi Renaud,
> I have read the research papers and I get the theory. And after
> inspecting the code some more I also understand most of the code
> behind SIREn.
> To implement length normalisation to give shorter literal objects a
> higher score or filter on literals with a certain token length, I'm
> not sure what is the most efficient way to do this.
To filter literals with a certain token length, you just have to
implement a new Filter, as explained previously. This does not require
any changes to the SIREn code.
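A minimal sketch of the acceptance test such a Filter would apply. The class and method names, and the whitespace tokenisation, are assumptions for illustration; a real Filter would reuse SIREn's analyzer and plug into Lucene's Filter API rather than work on raw strings.

```java
// Sketch of the predicate a custom token-length Filter could apply.
// The naive whitespace split stands in for the analyzer actually used
// at indexing time; names here are illustrative only.
class TokenLengthPredicate {

    // Count tokens in a literal (assumption: whitespace tokenisation).
    public static int tokenLength(String literal) {
        String trimmed = literal.trim();
        if (trimmed.isEmpty()) return 0;
        return trimmed.split("\\s+").length;
    }

    // Accept only literals whose token length falls within [min, max].
    public static boolean accept(String literal, int min, int max) {
        int len = tokenLength(literal);
        return len >= min && len <= max;
    }
}
```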
Giving a higher score (so, for ranking purposes) is a bit more tricky,
since it requires modifying or extending the SIREn index.
Currently, SIREn extends the Lucene data structure by using the payloads
feature. SIREn is storing the tuple and cell ids for each token in its
payload. In your case, you'll have to store also the literal length
within the payload. I think this will be the best solution, even if it
means an increase of index size.
But first, I'll discuss the pros and cons of each of the solutions you
proposed.
> I have thought of different solutions. Maybe you can give your view on
> these or have a better solution?
> - Add another solr field and store the number of tokens with the
> corresponding tuple id and maybe cell id.
This could work, and it would be quite efficient in terms of index size,
but quite inefficient in terms of query time. Normally, scoring and
boosting are performed at query time (during query processing). By
storing this information in another field, this cannot be done. You'll
first have to compute the result set, then, for each document of the
result set:
- retrieve the stored field,
- extract the required information,
- recompute the score of each document based on the extracted information,
- sort the documents in the result set based on your new score.
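The post-processing loop above can be sketched as follows. The boost formula (dividing by the square root of the literal length) is my own assumption for the example, not SIREn's actual scoring, and the `Hit` class is hypothetical.

```java
import java.util.*;

// Sketch of the post-processing pass: rescore each hit with its stored
// literal length, then re-sort. The 1/sqrt(length) boost is an
// illustrative assumption, not SIREn's scoring formula.
class PostHocRescorer {

    static final class Hit {
        final int docId;
        double score;
        Hit(int docId, double score) { this.docId = docId; this.score = score; }
    }

    // lengths maps docId -> literal token length, as retrieved from the
    // stored field for each document of the result set.
    public static List<Hit> rescore(List<Hit> hits, Map<Integer, Integer> lengths) {
        for (Hit h : hits) {
            int len = lengths.getOrDefault(h.docId, 1);
            h.score = h.score / Math.sqrt(len); // shorter literals score higher
        }
        hits.sort((a, b) -> Double.compare(b.score, a.score));
        return hits;
    }
}
```

Note that this whole pass runs after query processing, which is exactly why it costs extra query time.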
So, it depends on what you are looking for. If you don't really care
about query time, this solution might be the best.
> - Using payload on every token of the literal field. But this means
> unnecessary increase of memory usage, which I don't want.
Yes, this will incur an increase in index size, but it is the most
efficient approach in terms of query time.
At query time, you retrieve the literal length information from the
payload, and then use this information to boost the term in the scoring
function.
However, this requires changes across the SIREn code, since you will have to:
- modify the way payloads are stored,
- modify the existing SIREn query classes (SIREnTermQuery and
SIREnPhraseQuery) and their associated scorer classes in order to take
the literal length into account when computing the score.
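A sketch of the first step, extending the per-token payload with a literal-length byte. The layout (the structural tuple/cell bytes followed by one length byte) is an assumption for illustration; SIREn's real payload encoding differs, and these class and method names are hypothetical.

```java
// Sketch of extending SIREn's per-token payload with a literal-length
// byte. Assumption: the length is appended as a single byte after the
// existing tuple/cell payload bytes; SIREn's actual format differs.
class LengthPayload {

    // Append the (clamped) literal length as one extra byte.
    public static byte[] encode(byte[] structuralPayload, int literalLength) {
        int clamped = Math.min(literalLength, 255);
        byte[] out = new byte[structuralPayload.length + 1];
        System.arraycopy(structuralPayload, 0, out, 0, structuralPayload.length);
        out[out.length - 1] = (byte) clamped;
        return out;
    }

    // Recover the literal length from the last payload byte at query
    // time, e.g. inside a modified scorer.
    public static int decodeLength(byte[] payload) {
        return payload[payload.length - 1] & 0xFF;
    }
}
```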
Concerning the increase in index size, this could be limited by storing
a kind of normalised literal length boost. For example, you could
normalise your literal length boost onto a small range of values, like 1
to 16, which would represent just 4 bits per token. Or, if you need a
more precise scale, you could increase the range to 1 to 256, which
would represent 8 bits per token.
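One way to quantise the length onto the 1-to-16 range could look like this. The log-scale bucketing is my own assumption about how to spread lengths across the buckets; a linear scale would work too.

```java
// Sketch of the normalised literal length boost: quantise the token
// length onto 1..16, i.e. 4 bits per token. The log-scale bucketing is
// an illustrative assumption, chosen so that short literals keep
// fine-grained distinctions while long ones share buckets.
class QuantisedLengthBoost {

    public static int quantise(int literalLength) {
        // log2(length) scaled by 4, shifted to start at 1, then clamped.
        int bucket = (int) Math.round(4.0 * (Math.log(literalLength) / Math.log(2))) + 1;
        return Math.max(1, Math.min(16, bucket));
    }
}
```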
> - Using payload on the complete predicate URI. Which means I can't
> tokenise the predicate URI. But this means I won't have to store a
> payload on every token. But indexing complete URIs could also mean
> an increase of the size of the index.
This will be more efficient in terms of index size, but it will limit
the query expressiveness. You'll be able to compute the score boost only
if your query includes the predicate URIs. So for simple keyword search,
you will not be able to retrieve the information stored in the payload
of the predicate URI, since your query does not include the predicate URI.
> - Using payload on the last part of the URI after the slash.
Same problem as before.
Hope this helps.