In this project we collect and preprocess news articles, train a Word2Vec model, classify texts with an SVM, and create a Telegram bot for news relevance prediction, visualization, and topic-based news identification.

moxeeem/NewsBot

📰 News Analytics Bot

Project Description

In this project we implement a Telegram bot that uses NLP to provide analytics for a news site.

The news source is fontanka.ru.

Dataset

The dataset used to build the models was created by parsing news articles from fontanka.ru. The news is divided into topics, which serve as the classes. The classes are perfectly balanced.

Most of the news posts are from 2023, but the dataset also contains records from 2018-2022. In total it contains 26,719 records.

Target Variable

  • topic (categorical, string): the topic of the news post

All features

| Feature | Description | Type |
| --- | --- | --- |
| date | The date of the news post | Datetime |
| title | The title of the news post | String |
| topic | The topic of the news post | String |
| url | URL of the news post | String |
| time | The time of the news post | Datetime |
| comm_num | Number of comments | Integer |
| author | Author of the news post | String |
| views | Number of views | Integer |
| content | Content of the news post | String |
| year | The year of the news post | Datetime |
| month | The month of the news post | Datetime |
| weekday | The weekday of the news post | Datetime |
| log_views | Log of the number of views | Float |
| len_title | Length of the title | Integer |
| len_content | Length of the content | Integer |
| lifetime | Lifetime of the news post | Float |
| views_by_minutes | Views per minute | Float |
| log_comm | Log of the number of comments | Float |

Parsing

During the initial data collection, we ran into many problems with the parser in fontanka_parsing.ipynb. We therefore rewrote it, extending its functionality and improving its logic. The corrected parser is parser.ipynb.

The parser relies on regular expressions and avoids referring to HTML tags directly, because tag names on fontanka.ru change often.

Logging is implemented with the loguru library.
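
The idea of matching on the shape of the content rather than on tag names can be sketched as follows. This is a minimal illustration, not the code from parser.ipynb: the HTML fragment, the URL layout, the date format, and the names `parse_post`, `URL_RE`, `DATETIME_RE`, and `TITLE_RE` are all assumptions made for the example.

```python
import re

# Hypothetical HTML fragment; the real fontanka.ru markup differs and changes over time.
SAMPLE_HTML = """
<div class="x1"><a href="/2023/08/15/72601234/">City opens new bridge</a>
<span class="y2">15.08.2023 14:05</span></div>
"""

# Patterns describe what the data looks like (assumed URL layout and date format),
# so they keep working when tag or class names are renamed.
URL_RE = re.compile(r'href="(/\d{4}/\d{2}/\d{2}/\d+/)"')
DATETIME_RE = re.compile(r"(\d{2}\.\d{2}\.\d{4})\s+(\d{2}:\d{2})")
TITLE_RE = re.compile(r'href="/\d{4}/\d{2}/\d{2}/\d+/"[^>]*>([^<]+)<')


def parse_post(html: str) -> dict:
    """Extract url, date, time, and title; missing fields become None."""
    url = URL_RE.search(html)
    stamp = DATETIME_RE.search(html)
    title = TITLE_RE.search(html)
    return {
        "url": url.group(1) if url else None,
        "date": stamp.group(1) if stamp else None,
        "time": stamp.group(2) if stamp else None,
        "title": title.group(1).strip() if title else None,
    }


post = parse_post(SAMPLE_HTML)
```

Returning None for a missing field, rather than raising, lets a crawler log the problem and continue with the next page.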

Exploratory Data Analysis

Before building the model, we performed exploratory data analysis to understand the main characteristics of the data. Articles are evenly distributed across topics, and most are from 2023. Publication dates peak in August and September, with fewer articles in winter. More news is published on weekdays than on weekends.

We also examined the average lengths of titles and articles and explored keywords. Keywords matter because articles from different topics share dominant keywords, which affects the model.

The number of views is important but depends on article age. We therefore introduced the average growth rate of views, but found that it does not follow a lognormal distribution. The number of comments is not lognormal either and more closely resembles an exponential distribution.
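
As a rough sketch, the derived features from the feature table (log_views, log_comm, lifetime, views_by_minutes) could be computed as below. The exact definitions are not given here, so everything in this example is an assumption: lifetime is taken as the minutes between publication and data collection, the logarithm is `log1p` so that zero counts are handled, and the function name `derive_features` is hypothetical.

```python
import math
from datetime import datetime


def derive_features(views: int, comm_num: int,
                    published: datetime, collected: datetime) -> dict:
    """Compute age-adjusted and log-scaled features for one news post."""
    # Assumption: lifetime is measured in minutes up to the moment of collection.
    lifetime = (collected - published).total_seconds() / 60
    return {
        # Assumption: log1p (log(1 + x)) is used so that 0 views/comments is valid.
        "log_views": math.log1p(views),
        "log_comm": math.log1p(comm_num),
        "lifetime": lifetime,
        # Average growth rate of views; guarded against a zero lifetime.
        "views_by_minutes": views / lifetime if lifetime > 0 else 0.0,
    }


features = derive_features(
    views=1200,
    comm_num=10,
    published=datetime(2023, 8, 15, 12, 0),
    collected=datetime(2023, 8, 15, 14, 0),
)
```

Dividing views by lifetime makes a two-hour-old post comparable with a two-year-old one, which raw view counts are not.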

Classification problem

We preprocessed the texts using the Natasha library. The steps were:

  • lowercasing
  • tokenization
  • lemmatization
  • symbol removal
  • stop-word removal
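
The steps above can be sketched in plain Python. This is not the Natasha-based pipeline from the project: the tokenizer is a simple regular expression, the lemmatizer is passed in as a function (identity by default), and the stop-word set, the toy lemma dictionary, and the English sample sentence are illustrative assumptions.

```python
import re
from typing import Callable, FrozenSet, List

# Tiny illustrative stop-word set; a real pipeline would use a Russian list.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "in", "a", "of"})

# Words become tokens, and each punctuation mark becomes its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess(text: str,
               lemmatize: Callable[[str], str] = lambda token: token,
               stop_words: FrozenSet[str] = STOP_WORDS) -> List[str]:
    """Lowercase, tokenize, lemmatize, drop symbols, drop stop words."""
    tokens = TOKEN_RE.findall(text.lower())          # lowercasing + tokenization
    lemmas = [lemmatize(token) for token in tokens]  # lemmatization
    words = [t for t in lemmas if t.isalpha()]       # drops punctuation and numbers
    return [t for t in words if t not in stop_words]  # stop-word removal


# Toy lemmatizer standing in for a real morphological analyzer.
TOY_LEMMAS = {"bridges": "bridge", "opened": "open"}
result = preprocess("The city opened 2 bridges!",
                    lemmatize=lambda t: TOY_LEMMAS.get(t, t))
```

Passing the lemmatizer as a parameter keeps the sketch runnable without extra dependencies while leaving room to plug in a real one.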

We also trained a Word2Vec model on our news data and obtained reasonable results.

The project classifies texts with an SVM on top of a MeanEmbeddingVectorizer. This model reaches an accuracy of 0.77 (we also explain why we rely on this metric).
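
The project's own MeanEmbeddingVectorizer is not shown here, so the following is only a sketch of the usual idea behind a vectorizer with this name: a document is represented by the average of the embeddings of its known words. The constructor argument `word_vectors` (a plain dict from word to vector) and the zero-vector fallback are assumptions.

```python
from typing import Dict, List, Sequence


class MeanEmbeddingVectorizer:
    """Represent each tokenized document as the mean of its word vectors."""

    def __init__(self, word_vectors: Dict[str, Sequence[float]]):
        self.word_vectors = word_vectors
        # Embedding size is inferred from any stored vector.
        self.dim = len(next(iter(word_vectors.values()))) if word_vectors else 0

    def fit(self, documents, labels=None):
        # Nothing to learn: the embeddings are already trained.
        return self

    def transform(self, documents: List[List[str]]) -> List[List[float]]:
        result = []
        for tokens in documents:
            known = [self.word_vectors[t] for t in tokens if t in self.word_vectors]
            if not known:
                # Assumption: documents with no known words map to a zero vector.
                result.append([0.0] * self.dim)
                continue
            # Average each embedding dimension over the known words.
            result.append([sum(column) / len(known) for column in zip(*known)])
        return result


vectorizer = MeanEmbeddingVectorizer({"city": [1.0, 0.0], "bridge": [0.0, 1.0]})
doc_vectors = vectorizer.transform([["city", "bridge", "unknown"], ["unknown"]])
```

The resulting fixed-length vectors can be passed to any classifier that accepts numeric features, such as an SVM.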

Although XGBoost with MeanEmbeddingVectorizer performed slightly better (accuracy = 0.78), we chose SVM so as not to sacrifice memory and speed.

When this classifier makes a mistake, it most likely confuses the class Общество (Society) with Город (City) or Политика (Politics). This is not a serious problem, because these topics are closely related.
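
One way to find which classes a classifier confuses most often is to count the (true, predicted) pairs among its errors. The sketch below uses made-up labels purely for illustration; they are not results from this project, and the function name `top_confusions` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def top_confusions(y_true: List[str], y_pred: List[str],
                   n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n most frequent (true, predicted) pairs among misclassifications."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)


# Made-up toy labels, not project data.
toy_true = ["society", "society", "city", "politics", "society", "society"]
toy_pred = ["city", "society", "city", "politics", "politics", "city"]
confusions = top_confusions(toy_true, toy_pred, n=1)
```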

Bot Functionality

The Telegram bot offers three main features:

  • Predictions: The bot uses the previously trained classifier to predict how the news of a particular day relates to a particular topic.
  • Visualizations: The bot can plot pie charts and word clouds for the news of a particular day.
  • News by topic: The bot finds the news most relevant to the user for a particular day by comparing the user's topic vectors with news vectors using cosine similarity.
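
Ranking news by cosine similarity to a topic vector can be sketched as follows. The vectors, titles, and function names (`cosine_similarity`, `rank_news`) are illustrative assumptions; how the bot actually builds its topic and news vectors is not shown here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_news(topic_vector: Sequence[float],
              news_vectors: Dict[str, Sequence[float]],
              top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the top_n (title, similarity) pairs, most similar first."""
    scored = [(title, cosine_similarity(topic_vector, vector))
              for title, vector in news_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Toy two-dimensional vectors; real embeddings have many more dimensions.
news = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_news([1.0, 0.0], news, top_n=2)
```

Cosine similarity compares direction rather than magnitude, so long and short articles with similar wording score alike.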

Deployment

The final step was to deploy the project to a VPS, which we completed successfully.

Credits

Author

This project was completed as part of the "Основы нейронных сетей и NLP" ("Fundamentals of Neural Networks and NLP") course offered by AI Education.

License

This project is licensed under the MIT license. For more information, see the LICENSE file.
