Web-Based Social Network Information Extraction System: Graduation Thesis

 2021-04-10 10:04

Abstract

With the development of society, and in particular the growth of the Internet in recent years, people's social circles keep expanding and the volume of social information they generate keeps increasing. Information on the web is growing explosively, which leaves it cluttered and voluminous. Accurately finding the information we need in such an environment is therefore critical. This is the starting point of our research on information extraction, and it is the task that a social network information extraction system should accomplish.

The main goal of this thesis is to extract information from multiple heterogeneous social networks: a web crawler fetches the information published on social networks, and this unstructured information is then extracted and parsed according to the user's needs into structured information that people can use more quickly and effectively. The process has three parts. The first is crawling, in which a crawler retrieves the information from the social networks. The second is information extraction: because Chinese text is difficult to recognize directly, this thesis uses pinyin4j to convert the Chinese text into Latin-alphabet (pinyin) text, applies regular expressions to match and extract the relevant fields, and then converts the extracted results back into Chinese text and stores them in a database for easy querying. The third is the design of the user interface and its connection and interaction with the database and the back-end program. Sampled results show that the extraction approach adopted here can extract and store information from most social networks and that the extracted results are fairly accurate; coverage, however, falls short and needs improvement.

Keywords: web; web crawler; database; information extraction
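The second stage described in the abstract (romanize the Chinese text, match it with a regular expression, then recover the Chinese text) can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class `PinyinRegexSketch`, the lookup table `TOY_PINYIN`, the pattern `NICKNAME`, and the sample record 昵称:小明;性别:男 ("nickname:Xiaoming;gender:male") are all hypothetical. A small hard-coded table stands in for pinyin4j so that the example runs with the standard library alone, and the way matches are mapped back to the Chinese source (by remembering where each character's pinyin token starts) is an assumption about how the round trip might work. The Chinese characters are written as Unicode escapes so that the source stays ASCII.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PinyinRegexSketch {
    // Toy stand-in for a pinyin library such as pinyin4j (assumption: a real
    // system would look each character up through that library instead).
    static final Map<Character, String> TOY_PINYIN = new HashMap<>();
    static {
        TOY_PINYIN.put('\u6635', "ni");     // first character of "nickname"
        TOY_PINYIN.put('\u79f0', "cheng");  // second character of "nickname"
        TOY_PINYIN.put('\u5c0f', "xiao");
        TOY_PINYIN.put('\u660e', "ming");
        TOY_PINYIN.put('\u6027', "xing");   // first character of "gender"
        TOY_PINYIN.put('\u522b', "bie");    // second character of "gender"
        TOY_PINYIN.put('\u7537', "nan");    // "male"
    }

    // Hypothetical pattern: capture everything between the "ni cheng :" label
    // and the next ';' separator in the romanized text.
    static final Pattern NICKNAME = Pattern.compile("ni cheng : (.+?) ;");

    // Returns the nickname as it appears in the original Chinese text,
    // or null if the label is not found.
    static String extractNickname(String text) {
        StringBuilder pinyin = new StringBuilder();
        int[] starts = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Remember where the token for character i begins.
            starts[i] = pinyin.length();
            // Unknown characters (punctuation, digits) pass through unchanged.
            pinyin.append(TOY_PINYIN.getOrDefault(c, String.valueOf(c))).append(' ');
        }

        Matcher m = NICKNAME.matcher(pinyin);
        if (!m.find()) {
            return null;
        }

        // Map the matched span in the romanized text back to character
        // indexes in the original text.
        int from = 0;
        int to = 0;
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] <= m.start(1)) {
                from = i;
            }
            if (starts[i] < m.end(1)) {
                to = i + 1;
            }
        }
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        // Sample record: "nickname:Xiaoming;gender:male" in Chinese.
        String record = "\u6635\u79f0:\u5c0f\u660e;\u6027\u522b:\u7537";
        System.out.println(extractNickname(record));
    }
}
```

In a real setup the table lookup would presumably be replaced by a call into pinyin4j, which exposes `PinyinHelper.toHanyuPinyinStringArray(char)` for per-character lookup. That method can return several readings with tone numbers for one character, which this sketch does not handle. Keeping an offset table avoids converting pinyin back to characters, a step that is ambiguous because many characters share one reading.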

Contents

Chapter 1 Introduction

1.1 Development Background

1.1.1 Research Purpose and Significance

1.1.2 Current State of Research in China and Abroad

Chapter 2 Overview of Related Technologies

2.1 URL

2.2 HTTP

2.2.1 Introduction to HTTP

2.2.2 Request Methods

2.2.3 HTTP Status Codes

2.3 Swing Programming

2.3.1 Introduction to Swing

2.3.2 The Swing Model

2.4 Crawlers

2.4.1 Basic Principles of Web Crawlers

2.4.2 Crawling Strategies

Chapter 3 System Design

3.1 Crawler Design

3.1.1 Crawler Functions and Significance

3.1.2 Crawler Design Flowchart

3.2 Database Design

3.2.1 Design Principles

3.2.2 E-R Diagram of the Weibo Table Structure

3.2.3 E-R Diagram of the Zhihu Table Structure

3.3 Analysis of Current Mainstream Technologies

3.3.1 DOM-Based Web Information Extraction Systems

3.3.2 Ajax-Based Information Extraction Systems

3.3.3 Analysis and Comparison

Chapter 4 System Implementation

4.1 Development Platform

4.1.1 Hardware Platform

4.1.2 Software Platform

4.2 Crawler

4.2.1 Crawler Package Structure

4.2.2 Crawler Modules

4.3 Database Implementation

4.3.1 Weibo Tables

4.3.2 Zhihu Tables

4.4 Final Results of the System Implementation

4.4.1 Information Extraction from Weibo Pages

4.4.2 Information Extraction from Zhihu Pages

Conclusion

Acknowledgments

References

Appendix

Chapter 1 Introduction
