ZHANG Jing, ZHU Guobin. Hot Topic Discovery Research of Stack Overflow Programming Website Based on CBOW-LDA Topic Model[J]. Computer Science, 2018, 45(4): 208-214
Hot Topic Discovery Research of Stack Overflow Programming Website Based on CBOW-LDA Topic Model
Received: 2017-03-21  Revised: 2017-06-11
DOI:10.11896/j.issn.1002-137X.2018.04.035
Keywords: Stack Overflow, LDA-CBOW language model, Topic detection, Hot topic, Perplexity
Funding: This work was supported by the National Key Technology R&D Program of China (2012BAH01F02)
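Authors and affiliations
ZHANG Jing  International School of Software, Wuhan University, Wuhan 430079
ZHU Guobin  International School of Software, Wuhan University, Wuhan 430079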
Abstract:
      Stack Overflow is a popular programming question-and-answer (Q&A) website. Mining the text semantics of the questions posted there reveals the programming topics that developers care about most. Because the short texts under study are high-dimensional and unevenly distributed, the extracted topics tend to be unclear. This paper proposes CBOW-LDA, a modeling method based on the LDA (Latent Dirichlet Allocation) topic model. The method clusters similar words in the target corpus by vector similarity before topic modeling, which reduces the dimensionality of the text input and yields a more distinct topic distribution. Experiments were run on POST, a dataset of question posts collected from Stack Overflow for 2010-2015. With the number of topics held equal, perplexity, a standard measure of model quality in text modeling, was used to assess performance at different dataset sizes. Compared with TF-LDA, an existing topic-modeling method that quantifies words by term-frequency weights, CBOW-LDA achieves lower perplexity, about 4.87% lower on the experimental corpus, which indicates that the proposed method performs better. CBOW-LDA was then applied to hot-topic mining on Stack Overflow, with TF-LDA as the comparison method. A manually annotated standard evaluation set was built to measure the recall, precision, and F1 of the hot topics and hot search terms obtained by the two methods. CBOW-LDA scored better than TF-LDA on these measures, confirming that it mines hot topics more effectively. According to the experimental results, "Java" is the hottest topic in the site's question posts, and "C" and "JavaScript" are the words mentioned most frequently in users' questions.
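The abstract names three technical ingredients: merging similar words before topic modeling, perplexity, and precision/recall/F1 against a hand-labelled set. The Python sketch below illustrates each one on toy data. It is not the authors' implementation. The 2-D vectors stand in for CBOW embeddings, and the greedy clustering rule, the 0.9 similarity threshold, and all function names are assumptions made for illustration, since the abstract does not specify the clustering procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_similar_words(vectors, threshold=0.9):
    """Map each word to the first earlier representative with a similar vector.

    `vectors` maps word -> embedding (toy stand-ins for CBOW vectors).
    The greedy rule and the threshold are illustrative assumptions.
    """
    representatives = []
    mapping = {}
    for word, vec in vectors.items():
        for rep in representatives:
            if cosine(vec, vectors[rep]) >= threshold:
                mapping[word] = rep
                break
        else:
            # No similar representative found: the word starts a new cluster.
            representatives.append(word)
            mapping[word] = word
    return mapping


def perplexity(token_log_probs):
    """exp of the negative mean per-token log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 against a hand-labelled gold set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy 2-D vectors, made up for illustration only.
toy_vectors = {
    "java": (1.0, 0.0),
    "jvm": (0.98, 0.05),
    "javascript": (0.0, 1.0),
    "js": (0.03, 0.99),
}
mapping = cluster_similar_words(toy_vectors)
# The topic model would then see cluster representatives, not raw words.
reduced_vocab = set(mapping.values())
```

In this toy run the four-word vocabulary collapses to two representatives, which is the kind of input-dimension reduction the abstract describes. A lower perplexity on held-out text means the model assigns higher probability to that text, which is why the reported drop of about 4.87% is read as an improvement.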