Steps of automatic indexing

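List all the words.
Remove duplicate words and stop words.
Apply stemming.
Remove words that occur too infrequently.
Use a thesaurus to conflate stems.
Locate the remaining words (which documents each one appears in).
Compute a weight for each word.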
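The steps above can be sketched in a few lines of Python. This is only an illustration under several assumptions of my own, none of which come from the notes: the stop word list, the tiny thesaurus, the naive suffix stripper (a stand-in for a real stemmer such as Porter's), the `min_freq` threshold, and the use of plain TF-IDF as the weighting scheme are all hypothetical choices.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical resources, made up for this sketch.
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "are", "to"}
THESAURUS = {"archiv": "librar"}  # stem -> preferred stem

DOCS = {
    "d1": "Digital libraries index the documents quickly",
    "d2": "The library is indexing digital archives",
    "d3": "Indexing of archives and documents",
}


def stem(word):
    """Naive suffix stripper; a real system would use a proper stemmer."""
    for suffix in ("ies", "ing", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, min_freq=2):
    # Steps 1-3: list the words, drop stop words, stem what is left.
    stemmed = {}
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        stemmed[doc_id] = [stem(w) for w in words if w not in STOP_WORDS]

    # Step 4: count each stem over the whole collection to find rare ones.
    collection_freq = Counter(s for stems in stemmed.values() for s in stems)

    # Steps 5-6: conflate stems through the thesaurus and record
    # which documents each surviving term occurs in (with counts).
    postings = defaultdict(Counter)
    for doc_id, stems in stemmed.items():
        for s in stems:
            if collection_freq[s] < min_freq:
                continue  # too rare to be a useful index term
            postings[THESAURUS.get(s, s)][doc_id] += 1

    # Step 7: weight each term per document (TF-IDF, assumed here).
    n_docs = len(docs)
    weights = {}
    for term, counts in postings.items():
        idf = math.log(n_docs / len(counts))
        weights[term] = {doc_id: tf * idf for doc_id, tf in counts.items()}
    return weights


index = build_index(DOCS)
for term, per_doc in sorted(index.items()):
    print(term, per_doc)
```

Duplicates are collapsed by using each term as a dictionary key, while the counts are kept because the weighting step needs them. With plain TF-IDF, a term that appears in every document gets weight zero, which matches the intuition that such a term does not discriminate between documents.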

...there are pre-specified definitions of the interesting concepts in a domain and the relationships between those concepts. The quadrant is called knowledge-based information retrieval.
As for automatic indexing:
I've labelled that quadrant statistical information retrieval because the models of retrieval on which these automatic indexing techniques are based are statistical models. They look at the meaning of texts by looking at the statistics of word occurrences in texts.

Its advantages are 1) it is cheap; 2) it is independent of any discipline (it supports searching across disciplines).

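Differences between free text and controlled vocabulary
First, the longer a record is, the more access points it has, the more redundancy it carries, and the more likely it is to contain spurious relationships. Free text has a lot of redundancy. That gives it more access points, but the words that can express a single concept vary enormously, so a controlled vocabulary is still needed to suggest search terms. Otherwise recall will still be low.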

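For example:
Within a single article: if the article uses both programmed instruction and programmed learning, you only need to match one of the two terms to find it.
Across a collection: suppose you are looking for the concept of digital library. Document 1 uses digital library, document 2 uses web library, and document 3 uses digital archive. Searching for digital library reaches only document 1, which lowers recall.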
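The collection example can be made concrete with a small recall calculation. This is a minimal sketch, not a real retrieval system: the document texts, the fourth unrelated document, and the `ENTRY_TERMS` mapping are invented for illustration, and the controlled vocabulary is modelled simply as a table that maps variant terms to one preferred term and expands the query with them.

```python
# Hypothetical mini collection; documents 1-3 are about the same concept.
COLLECTION = {
    1: "a survey of digital library services",
    2: "building a web library for students",
    3: "preserving records in a digital archive",
    4: "stemming algorithms for text retrieval",
}
RELEVANT = {1, 2, 3}

# Hypothetical controlled vocabulary: entry term -> preferred term.
ENTRY_TERMS = {
    "digital library": "digital library",
    "web library": "digital library",
    "digital archive": "digital library",
}


def free_text_search(query):
    """Return the documents whose text contains the query string literally."""
    return {doc_id for doc_id, text in COLLECTION.items() if query in text}


def controlled_search(query):
    """Expand the query to every term that shares its preferred term."""
    preferred = ENTRY_TERMS.get(query, query)
    variants = {t for t, p in ENTRY_TERMS.items() if p == preferred}
    variants.add(preferred)
    return {
        doc_id
        for doc_id, text in COLLECTION.items()
        if any(v in text for v in variants)
    }


def recall(retrieved):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & RELEVANT) / len(RELEVANT)


print(recall(free_text_search("digital library")))   # one of three relevant documents
print(recall(controlled_search("digital library")))  # all three relevant documents
```

The free-text search finds only document 1, so recall is one third. Once the vocabulary links the three variant terms, the same query reaches all three relevant documents.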

Free text offers more access points, is more current, and carries more redundancy. It suits hot topics, special topics, and highly comprehensive topics, and users are also more familiar with its search terms.

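Controlled vocabulary, by contrast, provides more terms for expressing "concepts", and its terms are linked by hierarchical and cross-reference relationships. It is less current. It suits broad conceptual searches.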

“Free text records tend to be longer and thus provide more access points, will frequently include some terms more specific or more current than those in any controlled vocabulary, and will usually provide greater redundancy.

The controlled vocabulary, on the other hand, imposes consistency in the representation of subject matter among documents, provides the broad “concept” terms that are frequently lacking in text, and, by means of hierarchical and cross-reference structure, gives the user positive aid in identifying appropriate search terms.” (Lancaster 2003, p. 270)

Posted by 山沙拉 on 痞客邦