
Convert MI score to MI(k): Setting the cut-off threshold of collocation

The Mutual Information (MI) score is widely used as a statistical measure of collocation in linguistic studies. The number of bits of "shared information" between two words is calculated from the observed co-occurrence frequency (O) and the expected co-occurrence frequency (E).
MI = log2(O/E)
The MI score is then used as a cut-off threshold for selecting collocates. In practice, however, MI tends to assign inflated scores to low-frequency word pairs with E << 1, especially in data from large corpora. Even a single co-occurrence of two word types can thus yield a fairly high association score (see Evert's extended manuscript of Corpora and collocations). To increase the influence of the observed co-occurrence frequency relative to the expected one, the ratio is multiplied by further factors of O, resulting in the formula log2(O^k/E) with k >= 1 (the well-known MIk family).
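The low-frequency bias described above can be illustrated with a minimal Python sketch. The function name mi_k and the two sets of counts are hypothetical, chosen only to show the effect; they do not come from any real corpus.

```python
import math

def mi_k(observed, expected, k=1):
    """Return log2(O^k / E). With k=1 this is the plain MI score."""
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected must be positive")
    return k * math.log2(observed) - math.log2(expected)

# Hypothetical counts, for illustration only.
pairs = {
    "rare": (1, 0.001),      # a single co-occurrence with E << 1
    "frequent": (200, 20.0), # a well-attested pair
}

for name, (o, e) in pairs.items():
    print(name, "MI =", round(mi_k(o, e), 3), "MI3 =", round(mi_k(o, e, k=3), 3))
```

With these made-up counts, plain MI ranks the single-occurrence pair above the frequent pair, while MI3 reverses the order, because log2(1) = 0 and the extra factors of O add nothing to the rare pair's score.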


An MI score of 2.00 was found useful for producing a collocational network (see Magnusson and Vanharanta's Visualizing Sequences of Texts Using Collocational Networks) from a 3000-word corpus. But what if we want to use MI3, for example? How do we choose the starting MIk threshold that corresponds to a given MI threshold? Here is my solution.

The MI score is calculated by the formula
MI = log2(O/E) = log2(O) - log2(E)
while the MIk score is given by
MIk = log2(O^k/E) = k * log2(O) - log2(E)
Subtracting the first from the second gives MIk - MI = (k-1) * log2(O). Thus, a given MI score can be converted into MIk by
MIk = MI + (k-1) * log2(O)

Here, the observed co-occurrence (O) is the minimum number of co-occurrences that you are interested in. For example, with O = 10, an MI threshold of 2.0 corresponds to an MI3 threshold of 2.0 + 2 * log2(10) = 8.644.

A simple converter takes an MI score, the observed co-occurrence count (O), and the target MIk measure, and returns the converted value.
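The conversion MIk = MI + (k-1) * log2(O) can be sketched in a few lines of Python. This is a minimal sketch and not the code behind the original converter; the function name convert_mi_to_mik and its argument checks are assumptions made for illustration.

```python
import math

def convert_mi_to_mik(mi, observed, k):
    """Convert an MI threshold to the equivalent MIk threshold.

    Applies MIk = MI + (k - 1) * log2(O), where O is the minimum
    number of co-occurrences of interest.
    """
    if observed <= 0:
        raise ValueError("observed must be positive")
    if k < 1:
        raise ValueError("k must be >= 1")
    return mi + (k - 1) * math.log2(observed)

# The example from the text: MI = 2.0, O = 10, converted to MI3.
print(round(convert_mi_to_mik(2.0, 10, 3), 3))  # 8.644
```

With k = 1 the function returns the MI score unchanged, and with O = 1 every MIk equals MI, since log2(1) = 0.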
