Using Deep Learning to Extract the Content of Web Pages
Abstract: The problem of content extraction has been a subject of study since the development of the World Wide Web. Its goal is to separate the main content of a web page, such as the text of an article, from the noisy content, such as advertisements and navigation links. Most content extraction approaches operate at the block level: the web page is segmented into blocks, and each block is then classified as belonging to either the main content or the noisy content of the page. In this project, we apply content extraction at a deeper level, namely to individual HTML elements. In this thesis, we investigate the notion of main content more closely, build a dataset of web pages whose elements have been manually labeled, with the help of web scraping, as part of either the main content or the noisy content, and then apply deep learning (a convolutional neural network) to this dataset in order to induce a model that separates the main content from the noisy content. Finally, the induced model is evaluated on a different dataset of web pages, also manually labeled with the help of web scraping.
Keywords: content extraction, deep learning, convolutional neural network (CNN), web scraping, main content, noisy content.