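Feng Yanhong, Yu Hong, Sun Geng, Peng Song. Diversity Measures Method in High-dimensional Semantic Vector Based on Asymmetric Multi-valued Feature Jaccard Coefficient[J]. Computer Science, 2018, 45(6): 57-66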
Diversity Measures Method in High-dimensional Semantic Vector Based on Asymmetric Multi-valued Feature Jaccard Coefficient
  
DOI:10.11896/j.issn.1002-137X.2018.06.010
Keywords: Asymmetric multi-valued feature, Jaccard coefficient, High-dimensional semantic vector, Measures method, Measurement concentration
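Funding: This work was supported by the Dalian Science and Technology Plan Project "Research on Key Technologies for Marine Fishery Big Data Management and Integration" (2015A11GX022) and the Liaoning Province College Students' Innovation and Entrepreneurship Project "Research and Implementation of an Intelligent Question Answering System for the Fishery Domain" (201710158000131).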
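Author: Affiliation
Feng Yanhong: College of Information Engineering, Dalian Ocean University, Dalian 116023
    Key Laboratory of Marine Information Technology of Liaoning Province, Dalian Ocean University, Dalian 116023
Yu Hong: College of Information Engineering, Dalian Ocean University, Dalian 116023
    Key Laboratory of Marine Information Technology of Liaoning Province, Dalian Ocean University, Dalian 116023
Sun Geng: College of Information Engineering, Dalian Ocean University, Dalian 116023
    Key Laboratory of Marine Information Technology of Liaoning Province, Dalian Ocean University, Dalian 116023
Peng Song: College of Information Engineering, Dalian Ocean University, Dalian 116023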
Abstract:
      Measuring the diversity (dissimilarity) between semantic vectors is an important foundation for applying deep learning to natural language processing problems. High-dimensional semantic vectors suffer from “measurement concentration”, so the values produced by traditional measures fail to reflect the dissimilarity between semantic vectors. To address this problem, a dissimilarity measure based on the asymmetric multi-valued feature Jaccard coefficient is proposed. The statistical distribution of the dimension values of high-dimensional semantic vectors shows that, in some dimensions, the values are densely concentrated in a particular range, so those dimensions cannot contribute to the measured dissimilarity. Different dimensions therefore contribute unequally to the dissimilarity, that is, they are asymmetric. The method defines an importance function over dimension values, lets only the dimensions whose importance value satisfies a threshold take part in the dissimilarity calculation, and discards the dimensions that cannot contribute. This reduces dimensionality and alleviates the “measurement concentration” problem. Experiments on a fishery data set and on public data sets compared different measures on semantic vectors of different dimensionalities. The results show that, without a marked loss in semantic quality, the diversity index of the proposed method is much higher than that of the best existing measure.
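The abstract does not give the importance function or the exact form of the coefficient, so the following is only a minimal sketch of the general idea under stated assumptions. The function names, the importance function (distance of a value from the center of its dimension's dense range, in units of that dimension's spread), the threshold of 1.0, and the rule that two important values "match" when they lie on the same side of the center are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def importance(value: float, center: float, spread: float) -> float:
    """Hypothetical importance function (an assumption, not the paper's):
    how far a dimension value lies from the dense range, in spread units."""
    return abs(value - center) / spread


def side(value: float, center: float) -> int:
    """Return -1, 0 or 1 depending on which side of the center the value lies."""
    return (value > center) - (value < center)


def asym_jaccard_dissimilarity(
    u: Sequence[float],
    v: Sequence[float],
    centers: Sequence[float],
    spreads: Sequence[float],
    threshold: float = 1.0,
) -> float:
    """Jaccard-style dissimilarity over the dimensions that pass the threshold.

    Dimensions where neither vector has an important value are skipped,
    in the same way that 0-0 matches are ignored by the asymmetric binary
    Jaccard coefficient. This is what reduces the effective dimensionality.
    """
    if not (len(u) == len(v) == len(centers) == len(spreads)):
        raise ValueError("all inputs must have the same length")

    active = 0   # dimensions where at least one value is important
    matches = 0  # active dimensions where both values agree
    for a, b, c, s in zip(u, v, centers, spreads):
        imp_a = importance(a, c, s) >= threshold
        imp_b = importance(b, c, s) >= threshold
        if not (imp_a or imp_b):
            continue  # this dimension cannot contribute to the dissimilarity
        active += 1
        if imp_a and imp_b and side(a, c) == side(b, c):
            matches += 1

    if active == 0:
        return 0.0  # no informative dimension: treat the vectors as identical
    return 1.0 - matches / active


# Toy 6-dimensional vectors; the centers and spreads are made up for illustration.
centers = [0.0] * 6
spreads = [1.0] * 6
u = [0.1, 2.0, -1.5, 0.0, 1.2, -0.2]
v = [0.0, 1.8, 1.6, 0.1, 0.3, -0.1]
d = asym_jaccard_dissimilarity(u, v, centers, spreads)
print(round(d, 3))
```

In this toy example only three of the six dimensions pass the threshold for at least one vector, and the two vectors agree on one of them, so the dissimilarity is 1 - 1/3, about 0.667. In practice the centers and spreads would be estimated from the per-dimension statistics of the vector set, and the threshold would need tuning.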