Sunday, April 8, 2012

Convert MI score to MI(k): Setting the cut-off threshold of collocation

The Mutual Information (MI) score is widely used as a statistical measure of collocation in linguistic studies.  The number of bits of "shared information" between two words is calculated from the observed co-occurrence frequency (O) and the expected co-occurrence frequency (E):
MI = log2(O/E)
The MI score is then used as a cut-off threshold for collocate selection.  In practice, however, MI tends to assign inflated scores to low-frequency word pairs with E << 1, especially in data from large corpora.  Thus, even a single co-occurrence of two word types can yield a fairly high association score (see Evert's extended manuscript of Corpora and collocations).  To increase the influence of the observed co-occurrence frequency relative to the expected one, the ratio O/E is multiplied by further powers of O, resulting in the formula log2(O^k/E) with k >= 1 (the well-known MIk family).
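To make the low-frequency problem concrete, here is a minimal Python sketch of the MIk family. The function name mi_k and the two word pairs are made up for illustration; they are not taken from Evert's manuscript or from any real corpus.

```python
import math

def mi_k(observed, expected, k=1):
    """Return log2(O^k / E). With k=1 this is the plain MI score."""
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected must be positive")
    return k * math.log2(observed) - math.log2(expected)

# Hypothetical pairs: (observed co-occurrence O, expected co-occurrence E)
rare_pair = (1, 0.001)       # a single co-occurrence with E << 1
frequent_pair = (100, 10.0)  # a well-attested pair

for k in (1, 2, 3):
    print(k, round(mi_k(*rare_pair, k), 3), round(mi_k(*frequent_pair, k), 3))
```

With k=1 the rare pair outscores the frequent one. Because log2(1) = 0, the rare pair's score does not change as k grows, while the frequent pair's score rises with every extra power of O.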


An MI score of 2.00 was found useful for producing a collocational network (see Magnusson and Vanharanta's Visualizing Sequences of Texts Using Collocational Networks) for a 3000-word corpus.  But what if we want to use MI3, for example?  How do we choose a cut-off value for the MIk score that corresponds to a given MI cut-off?  Here is my solution.

The MI score is calculated by the formula
MI = log2(O/E) = log2(O) - log2(E)
while the MIk score is given by
MIk = log2(O^k/E) = k * log2(O) - log2(E)
Subtracting the first formula from the second gives MIk - MI = (k-1) * log2(O), so a given MI score can be converted into MIk by
MIk = MI + (k-1) * log2(O)

Here the observed co-occurrence (O) is the minimum number of co-occurrences that you are interested in.  For O=10, an MI score of 2.0 corresponds to an MI3 score of 8.644.
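The conversion can be sketched in a few lines of Python. The function names mi_to_mik and mik_to_mi are my own for this illustration, and the second one is simply the same formula rearranged to go back from MIk to MI.

```python
import math

def mi_to_mik(mi, observed, k):
    """Convert an MI cut-off to MIk: MIk = MI + (k-1) * log2(O)."""
    if observed <= 0:
        raise ValueError("observed must be positive")
    return mi + (k - 1) * math.log2(observed)

def mik_to_mi(mik, observed, k):
    """Inverse conversion: MI = MIk - (k-1) * log2(O)."""
    if observed <= 0:
        raise ValueError("observed must be positive")
    return mik - (k - 1) * math.log2(observed)

# The example from the text: MI = 2.0, O = 10, k = 3
threshold = mi_to_mik(2.0, observed=10, k=3)
print(round(threshold, 3))  # 8.644
```

Note that the converted threshold depends on the O you choose, so it is only an exact equivalent of the MI cut-off for pairs with that number of co-occurrences.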

