
Computational Linguistics and Chinese Language Processing
Vol. 7, No. 2, August 2002, pp. 59-76
© The Association for Computational Linguistics and Chinese Language Processing
Word Similarity Computing Based on How-net 1
(基於《知網》的辭彙語義相似度計算)

Qun LIU *, Sujian LI +
1 This research was supported by the National Key Basic Research Program of China (973), grants G1998030507-4 and G1998030510.
* Institute of Computational Linguistics, Peking University & Institute of Computing Technology, Chinese Academy of Sciences. E-mail: liuqun@ict.ac.cn
+ Institute of Computing Technology, Chinese Academy of Sciences. E-mail: lisujian@ict.ac.cn

摘要 (Abstract)

Word similarity computation is widely used in many fields, such as information retrieval, information extraction, text classification, word sense disambiguation, and example-based machine translation. There are two basic approaches to computing word similarity: one based on world knowledge (an ontology) or some classification system (a taxonomy), and one based on statistics over a context vector space model. Each approach has its own advantages and disadvantages.

How-net (《知網》) is a fairly detailed semantic knowledge dictionary that has attracted wide attention. However, How-net represents the meaning of a word in a multi-dimensional knowledge description form, which complicates the computation of word similarity. In this respect it differs from WordNet and Tongyici Cilin (《同義詞詞林》): there, all semantic items of the same kind (WordNet synsets or Cilin word groups) form a tree structure, so computing the distance between two semantic items amounts to computing the distance between the corresponding nodes of the tree. In How-net, word similarity computation faces the following problems:
1. The semantic description of each word consists of multiple sememes;
2. The sememes within a word's semantic description are not of equal status; they stand in complex relations to one another, expressed in a dedicated knowledge description language.

Our work mainly includes the following:
1. We study the syntax of How-net's knowledge description language, examine the relations among the multiple sememes used to describe a word sense, and distinguish the roles these play in word similarity computation. We rewrite the word definitions (DEF) in How-net in a more structured form, using the two abstract data structures of "set" and "feature structure".
2. We study methods for computing the similarity between sememes and the similarity between sets and between feature structures, and on this basis propose an algorithm for computing word similarity with How-net;
3. We verify the effectiveness of the algorithm through experiments and compare it with other algorithms.

Keywords: How-net (《知網》), Word Similarity Computing, Natural Language Processing
Abstract
Word similarity is broadly used in many applications, such as information retrieval,
information extraction, text classification, word sense disambiguation,
example-based machine translation, etc. There are two different methods used to
compute similarity: one is based on ontology or a semantic taxonomy; the other is
based on collocations of words in a corpus.
As a lexical knowledge base with rich semantic information, How-net has been
employed in a wide range of research. Unlike other thesauri, such as WordNet and
Tongyici Cilin, in which word similarity is defined based on the distance between
words in a semantic taxonomy tree, How-net defines a word in a complicated
multi-dimensional knowledge description language. As a result, a series of
problems arise in the process of word similarity computation using How-net. The
difficulties are outlined below:
1. The description of each word consists of a group of sememes. For example,
the Chinese word “暗箱 (camera obscura)” is described as “part|部件,
#TakePicture|拍攝, %tool|用具, body|身”, and the Chinese word “寫信
(write a letter)” is described as “write|寫, ContentProduct=letter|信件”;
2. The meaning of a word is not a simple combination of these sememes.
Sememes are organized using a specific knowledge description language.
To meet these challenges, our work includes:
1. A study on the How-net knowledge description language. We rewrite the
How-net definition of a word in a more structural format, using the abstract
data structure of set and feature structure.
2. A study on the algorithm used to compute word similarity based on How-net.
The similarity between sememes, that between sets, and that between feature
structures are given. To compute the similarity between two sememes, we

use the distance between the sememes in the semantic taxonomy, as is done in
WordNet and Tongyici Cilin. To compute the similarity between two sets or
two feature structures, we first establish a one-to-one mapping between the
elements of the sets or the feature structures. Then, the similarity between
the sets or feature structures is defined as the weighted average of the
similarity between their elements. For feature structures, a one-to-one
mapping is established according to the attributes. For sets, a one-to-one
mapping is established according to the similarity between their elements (see the sketch below).
3. Finally, we present experimental results to show the validity of the algorithm and
compare them with results obtained using other algorithms. Our results for
word similarity agree with people’s intuition to a large extent, and they are
better than the results of two comparative experiments.
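Below is a minimal sketch of the set-matching step just summarized; it is an illustration under assumed details, not the paper's implementation: elements are paired greedily by highest similarity, unmatched elements contribute zero, and a plain (unweighted) average stands in for the weighted average described above. The function sim is a placeholder for any element-level similarity.

```python
def set_similarity(xs, ys, sim):
    """Greedily build a one-to-one mapping between two sets by pairing
    the most similar elements first, then average the matched scores
    (unmatched elements contribute 0)."""
    xs, ys = list(xs), list(ys)
    if not xs and not ys:
        return 1.0                      # two empty sets: identical
    pairs = sorted(((sim(x, y), i, j)
                    for i, x in enumerate(xs)
                    for j, y in enumerate(ys)), reverse=True)
    used_i, used_j, total = set(), set(), 0.0
    for s, i, j in pairs:
        if i not in used_i and j not in used_j:
            used_i.add(i)
            used_j.add(j)
            total += s
    return total / max(len(xs), len(ys))
```

For feature structures, the same averaging would apply, but pairs are aligned by attribute name rather than by similarity.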
Keywords: How-net, Word Similarity Computing, Natural Language Processing
1. Introduction

The words of a natural language stand in highly complex relations to one another. In practical applications, it is sometimes necessary to reduce such complex relations to a single simple quantity, and word similarity is one such measure.

Word similarity computation is widely used in many fields, such as information retrieval, information extraction, text classification, word sense disambiguation, and example-based machine translation [Gauch & Chong 1995; Li, Szpakowicz & Matwin 1995; 王斌 1999; 李涓子 1999]. The background of this study is example-based machine translation, in which word similarity computation plays an important role. For example, to translate the phrase “張三寫的小說 (the novel written by Zhang San)”, we retrieve the following translation examples from the corpus:

(1) 李四寫的小說 / the novel written by Li Si
(2) 去年寫的小說 / the novel written last year

Similarity computation tells us that “張三 (Zhang San)” and “李四 (Li Si)” are both concrete persons and therefore semantically very similar, whereas “去年 (last year)” denotes a time and has low similarity to “張三”. We therefore choose the example “李四寫的小說” for translation by analogy and obtain the correct translation:

the novel written by Zhang San

Had we chosen the latter example instead, we would have obtained the incorrect translation:

* the novel written Zhang San

This example shows the role similarity computation plays in example-based machine translation.

Another important task in example-based translation is bilingual alignment, which requires computing word similarity across two languages; this is beyond the scope of this paper.

2. Word Similarity and Methods for Its Computation

2.1 The Meaning of Word Similarity

Word similarity is a highly subjective notion, with no clear-cut objective standard by which it can be measured. Divorced from a specific application, it is difficult to arrive at a unified definition of word similarity.

Since this study takes example-based machine translation as its background, in this paper we understand the similarity of two words as the degree to which they can be substituted for each other in various contexts without changing the syntactic and semantic structure of the text. The more likely it is that two words can be interchanged in various contexts without altering the text's syntactic-semantic structure, the higher their similarity; otherwise, the lower.

The notion of similarity involves lexical, syntactic, semantic, and even pragmatic properties of words. Among these, word meaning has the greatest influence on word similarity.

In this paper, similarity is defined as a real number between 0 and 1.
Word distance is closely related to word similarity. In fact, word distance and word similarity are two different manifestations of the same relational property of a pair of words, and a simple correspondence can be established between them, under the natural requirements that the similarity be 1 when the distance is 0 and that it decrease monotonically toward 0 as the distance grows. For two words $W_1$ and $W_2$, let their similarity be $\mathrm{Sim}(W_1, W_2)$ and their word distance be $\mathrm{Dis}(W_1, W_2)$. We can then define a simple conversion relation satisfying these conditions:

$$\mathrm{Sim}(W_1, W_2) = \frac{\alpha}{\mathrm{Dis}(W_1, W_2) + \alpha} \qquad (1)$$

where $\alpha$ is an adjustable parameter, whose meaning is the word distance at which the similarity equals 0.5.

This conversion relation is not the only possible one; we give here only one possibility.

In many cases, computing word similarity directly is rather difficult, so the usual practice is to compute the word distance first and then convert it into a similarity.
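Concretely, Eq. (1) can be transcribed as follows; the default alpha = 1.6 below is only an illustrative choice, since alpha is a tunable parameter:

```python
def sim_from_dis(dis: float, alpha: float = 1.6) -> float:
    """Eq. (1): convert a word distance into a similarity in (0, 1].
    alpha is the word distance at which the similarity equals 0.5."""
    return alpha / (dis + alpha)
```

With alpha = 1.6, a distance of 0 yields similarity 1, a distance of 1.6 yields 0.5, and the similarity tends to 0 as the distance grows.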
Word relatedness reflects the degree to which two words are associated with each other; it can be measured by the likelihood that the two words co-occur in the same context. Word relatedness and word similarity are two distinct concepts, with no direct correspondence between them.
2.2 Methods for Computing Word Similarity

There are two common classes of methods for computing word distance: one computes it from some form of world knowledge (an ontology) or classification system (a taxonomy), and the other derives it statistically from a large-scale corpus.

Methods that compute the semantic distance between words from world knowledge (an ontology) or a classification system (a taxonomy) generally rely on a thesaurus. A thesaurus typically organizes all words into one or several tree-shaped hierarchies. In a tree, there is exactly one path between any two nodes, so the length of that path can serve as a measure of the semantic distance between the two concepts.
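The following is a minimal sketch of this path-length measure, assuming a toy representation in which the taxonomy is given as a child-to-parent dictionary (not the format of any particular thesaurus):

```python
def path_length(parent: dict, a: str, b: str) -> int:
    """Length (in edges) of the unique path between nodes a and b
    in a tree; `parent` maps each node to its parent, and the root
    maps to None."""
    def chain(n):
        seq = []
        while n is not None:
            seq.append(n)
            n = parent.get(n)
        return seq
    up_a = chain(a)
    depth_from_b = {n: d for d, n in enumerate(chain(b))}
    for d, n in enumerate(up_a):
        if n in depth_from_b:            # lowest common ancestor
            return d + depth_from_b[n]
    raise ValueError("a and b are not in the same tree")

# e.g. path_length({"mammal": "animal", "reptile": "animal",
#                   "animal": "living thing", "plant": "living thing",
#                   "living thing": None}, "mammal", "reptile") == 2
```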

1
《同義詞詞林》語義分類樹狀圖
[王斌 1999] uses this method with Tongyici Cilin to compute the similarity between Chinese words (as shown in Figure 1). Some researchers consider more complicated factors. In computing semantic similarity with WordNet, [Agirre & Rigau 1995] take into account factors beyond the path length between nodes, for example:

Depth in the concept hierarchy: of two node pairs with the same path length, the pair located higher in the concept hierarchy is taken to have the greater semantic distance. For instance, “animal” vs. “plant” and “mammal” vs. “reptile” are both pairs connected by a path of length 2, but the former pair lies higher in the semantic tree and is therefore considered to have a greater semantic distance, while the latter pair lies lower and has a smaller semantic distance.

Regional density in the concept hierarchy: of two node pairs with the same path length, a pair located in a low-density region of the concept tree should have a greater semantic distance than a pair in a high-density region. Regional density is introduced because the granularity of description is uneven across a concept hierarchy; in WordNet, for example, the classification of animals and plants is extremely detailed, while some other regions are described rather coarsely, which can make path-based distance computation unreasonable.
The other way to compute word similarity is statistical, using a large-scale corpus. For example, word relatedness can be used to compute word similarity: first select a set of feature words, then compute the relatedness between each feature word and the given word (generally measured by the frequency with which the feature word appears in that word's contexts in a large corpus). Each word thereby receives a feature vector of relatedness values, and the similarity between these vectors (generally the cosine of the angle between them) is taken as the similarity between the two words. The underlying assumption is that semantically similar words occur in similar contexts. [李涓子 1999] uses this idea for automatic word sense disambiguation; [魯松 2001] studies how to compute word similarity from word relatedness; and [Dagan et al. 1995, 1999] use more sophisticated probabilistic models to compute word distance.
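As an illustration of this vector-space procedure, here is a minimal sketch; it assumes sentence-level co-occurrence as the relatedness measure, raw co-occurrence counts as vector weights, and pre-tokenized sentences as input, none of which are prescribed by the cited works:

```python
import math
from collections import Counter

def context_vector(sentences, word, feature_words):
    """Relatedness vector for `word`: how often each feature word
    co-occurs with it in the same sentence of the corpus."""
    vec = Counter()
    for sent in sentences:              # each sentence: a list of tokens
        if word in sent:
            vec.update(f for f in feature_words if f in sent)
    return vec

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def word_similarity(sentences, w1, w2, feature_words):
    """Similarity of w1 and w2 as the cosine of their context vectors."""
    return cosine(context_vector(sentences, w1, feature_words),
                  context_vector(sentences, w2, feature_words))
```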
The two classes of methods each have their own characteristics. The method based on world knowledge is simple and effective, requires no training on a corpus, and is intuitive and easy to understand; however, its results are strongly influenced by subjective human judgment and sometimes fail to reflect objective facts accurately. In addition, this method captures the semantic similarities and differences between words fairly accurately …

References

[Agirre & Rigau 1995] A Proposal for Word Sense Disambiguation Using Conceptual Distance.
[Dagan et al. 1995] Contextual Word Similarity and Estimation from Sparse Data.
[Dagan et al. 1999] Similarity-Based Models of Word Cooccurrence Probabilities.
[Li, Szpakowicz & Matwin 1995] A WordNet-based Algorithm for Word Sense Disambiguation.