Abstract
A Python Implementation of Truth Discovery Algorithms for Multi-Source Data
With the development of Internet technology, disseminating and accessing information has become more convenient. The World Wide Web offers an ever-growing number of information sources, but it also raises problems of authenticity and timeliness. Among these, conflicts between different websites describing the same object are particularly prominent: for example, different bookselling websites list different information for the same book, and different sites report different values for the height of Mount Everest. Such conflicts may arise from data-entry errors, differing semantic interpretations of the information, bugs in information extraction, and other causes. They can mislead users and even cause huge losses.
How to find the correct information among conflicting claims has become an urgent problem, known as the truth discovery problem. To resolve conflicts among multiple data sources, researchers have proposed a number of truth discovery algorithms in recent years. This paper analyzes and compares the principles, accuracy, and performance of three typical algorithms, TruthFinder, CRH, and KDEm, and implements them in Python. The accuracy of the algorithms is then compared and verified through experiments. The experiments show that KDEm achieves high accuracy but is relatively complicated to implement. This work lays a foundation for truth discovery applications such as wireless sensor applications and mobile crowdsensing.
Keywords: truth discovery; TruthFinder; CRH; KDEm
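Contents
1 Introduction
1.1 Background and Significance
1.2 Overview of Research in China and Abroad
1.3 Organization of This Paper
2 Data Conflicts and Evaluation Methods
2.1 Data Conflicts
2.2 Truth Discovery
2.3 Evaluation Methods
3 Principles of the TruthFinder, CRH, and KDEm Algorithms
3.1 The TruthFinder Algorithm
3.1.1 Symbol Definitions
3.1.2 The TruthFinder Model
3.2 The CRH Algorithm
3.2.1 Symbol Definitions
3.2.2 The CRH Model
3.3 The KDEm Algorithm
3.3.1 Related Concepts
3.3.2 The KDEm Model
4 Python Implementation of the Algorithms
4.1 Overall System Design
4.1.1 Acquiring Multi-Source Data
4.1.2 Discovering Truth Values
4.1.3 Analyzing the Output
4.2 Algorithm Implementation
4.2.1 Data Acquisition Module
4.2.2 The TruthFinder Algorithm
4.2.3 The CRH Algorithm
4.2.4 The KDEm Algorithm
5 Analysis of Experimental Data
5.1 Description of the Datasets
5.2 Dataset Processing and Algorithm Comparison
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
Acknowledgments
References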