Indexing policy

1. structure

descriptive indexing: search fields, attributes
subject indexing: faceted analysis

2. exhaustivity
3. specificity

term weighting

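The drawback of searching with Boolean operators is the lack of flexibility: a match is all or nothing, which does not reflect real-world situations. Boolean search also cannot express grammatical relationships.
e.g. "bite AND man AND dog" retrieves documents without telling you whether a man bit a dog or a dog bit a man; both are possible.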
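
The limitation in the example above can be sketched in a few lines of Python. This is only an illustration: the function name boolean_and and the three sample documents are hypothetical, and terms are matched as exact words with no stemming.

```python
def boolean_and(query_terms, document):
    """Return True only if every query term occurs in the document."""
    words = set(document.lower().split())
    return all(term in words for term in query_terms)


# Hypothetical documents; query terms are matched as exact words (no stemming).
documents = ["man bites dog", "dog bites man", "dog chases cat"]
query = ["bites", "man", "dog"]

# One True/False answer per document, in the same order as `documents`.
matches = [boolean_and(query, doc) for doc in documents]
print(matches)
```

The first two documents both satisfy the query even though they say opposite things, and the third is rejected outright; there is no notion of a partial or better match.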
When you search for a document, the terms should not all carry the same weight.
"not all terms are created equal."

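Does a word that appears often carry more weight? Within a single document, yes; across all the documents in a database, no. A common way to calculate the weight of a term is therefore to combine the frequency of occurrence of the term in the record with the reciprocal of its frequency in the database.
〈TF*IDF〉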

The significance (weighting) of an index term to a document should reflect:
1. Its importance in relation to a document (TF)
2. Its discrimination value in a collection (IDF)
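
As a sketch, the two factors can be written as a single weight. The symbols are notation introduced here, not taken from the notes: tf_{t,d} is the number of occurrences of term t in document d, and cf_t is the number of occurrences of t in the whole database.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \times \frac{1}{\mathrm{cf}_{t}}
```

This is the plain reciprocal form described in these notes. Many systems use a logarithmic variant instead, such as \log(N / \mathrm{df}_{t}), where N is the number of documents and df_t is the number of documents containing t.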
 
 

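Suppose the terms occur in the following documents with these frequencies:

                  A    B    C
Document 1        5    6    2
Document 2        4    1    7
Document 3        3    2    4

Across the whole database, terms A, B, and C occur 50, 30, and 200 times respectively.

The weights of A, B, and C in document 1 are then:
A = 5 * 1/50 = 0.1
B = 6 * 1/30 = 0.2
C = 2 * 1/200 = 0.01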
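
The same arithmetic can be repeated for every document with a short Python sketch. The variable and function names (term_freq, collection_freq, term_weights) are made up for illustration; the numbers come from the table above, and the weight is the simple tf * 1/cf form used in these notes rather than a logarithmic IDF.

```python
# Term frequencies per document, copied from the table above.
term_freq = {
    "doc1": {"A": 5, "B": 6, "C": 2},
    "doc2": {"A": 4, "B": 1, "C": 7},
    "doc3": {"A": 3, "B": 2, "C": 4},
}

# Total occurrences of each term in the whole database.
collection_freq = {"A": 50, "B": 30, "C": 200}


def term_weights(tf_by_term, cf_by_term):
    """Weight each term as tf * (1 / cf), the reciprocal form used in the notes."""
    return {term: tf / cf_by_term[term] for term, tf in tf_by_term.items()}


weights = {doc: term_weights(tfs, collection_freq) for doc, tfs in term_freq.items()}

for doc, w in weights.items():
    print(doc, {term: round(value, 4) for term, value in w.items()})
```

For document 1 this reproduces the hand calculation. Note that C, although it appears in every document, ends up with low weights because it is so common in the database.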

“A useful index term must fulfill a dual function: on the one hand, it must be related to the information content so as to render the item retrievable when it is wanted (the recall function); on the other hand, a good index term also distinguishes the documents to which it is assigned from the remainder to prevent the indiscriminate retrieval of all items, whether wanted or not (the precision function)” (Salton 1989).

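This TF*IDF approach differs from human indexing. A human indexer either assigns a term or does not, so there is no way to tell how important the term is within the document; it is only "present" or "absent".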
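
The contrast can be sketched with term C from the example table. This is an illustration under a stated assumption: that a human indexer would assign C to all three documents because it occurs in each of them. The names tf_c, cf_c, binary, weighted, and ranking are hypothetical.

```python
# Occurrences of term C in each document (from the example table).
tf_c = {"doc1": 2, "doc2": 7, "doc3": 4}
cf_c = 200  # occurrences of C in the whole database

# Binary indexing (assumed): the term is either assigned (1) or not (0).
binary = {doc: 1 if tf > 0 else 0 for doc, tf in tf_c.items()}

# Weighted indexing: tf * (1 / cf), as in the notes.
weighted = {doc: tf / cf_c for doc, tf in tf_c.items()}

# Rank documents by weight, highest first.
ranking = sorted(weighted, key=weighted.get, reverse=True)

print(binary)
print(ranking)
```

With binary assignment all three documents are tied, while the weights allow them to be ordered by how strongly each one is about C.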



Posted by 山沙拉 on 痞客邦 (PIXNET)