diff --git a/.gitignore b/.gitignore
index 8cd10b4..e6ee2b0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -12,6 +12,7 @@ conf/**/*credentials*
 # ignore everything in the following folders
 data/**
 logs/**
+anonymized/**
 # except their sub-folders
 !data/**/
diff --git a/README.md b/README.md
index 9695021..2dc92b0 100644
--- a/README.md
+++ b/README.md
@@ -1,179 +1,6 @@
+# The short video for the Streamlit dashboard
-# Network Analysis Python Package
-
-## Overview
-
-This Python package, `network_analysis`, is designed for conducting network analysis tasks. It provides tools and utilities to analyze network data, with a focus on handling Slack messages from a previous 10 Academy training program.
-
-## What to do
-
-Several code snippets have been provided to serve as a starting point for your project. However, it's essential to note that the code has not undergone thorough testing, and errors are expected. Your task is to identify and rectify errors, remove unnecessary components, and incorporate any missing elements.
-
-Consider this initial code as a foundation for your solution, but do not rely on it in its current state. It's provided to give you a starting point, but you should be prepared to modify and enhance it to meet the specific requirements of your system.
-
-As you commence your work, focus on exploring the dataset to gain a deep understanding of its structure and contents. Attempt to answer various intriguing questions that arise during your exploration.
-
-For guidance on the specific questions to address, refer to the notebooks/parse_slack_data.ipynb notebook, where you'll find empty cells designed for your responses. Utilize these cells to document your findings, insights, and any challenges encountered.
-
-Remember, this is an iterative process, and refining your code and analyses is a crucial part of the learning experience. Regularly post questions on Slack, and don't hesitate to reach out to tutors if you encounter difficulties. 
Best of luck with your exploration and analysis! - -## Table of Contents - -- [Installation](#installation) - - [Creating a Virtual Environment](#virtual-env) - - [Clone this package](#clone) -- [Usage](#usage) - - [Configuration](#configuration) - - [Data Loading](#data-loading) - - [Utilities](#utilities) -- [Testing](#testing) -- [Documentation](#documentation) -- [Notebooks](#notebooks) -- [Contributing](#contributing) -- [License](#license) - -## Installation - -### Creating a Virtual Environment - -#### Using Conda - -If you prefer Conda as your package manager: - -1. Open your terminal or command prompt. - -2. Navigate to your project directory. - -3. Run the following command to create a new Conda environment: - - ```bash - conda create --name your_env_name python=3.12 - ``` - Replace `your_env_name` with the desired name for your environment e.g. week0 and `3.12` with your preferred Python version. - -4. Activate the environment: - - ```bash - conda activate your_env_name - ``` - -#### Using Virtualenv - -If you prefer using `venv`, Python's built-in virtual environment module: - -1. Open your terminal or command prompt. - -2. Navigate to your project directory. - -3. Run the following command to create a new virtual environment: - - ```bash - python -m venv your_env_name - ``` - - Replace `your_env_name` with the desired name for your environment. - -4. Activate the environment: - - - On Windows: - - ```bash - .\your_env_name\scripts\activate - ``` - - - On macOS/Linux: - - ```bash - source your_env_name/bin/activate - ``` - -Now, your virtual environment is created and activated. You can install packages and run your Python scripts within this isolated environment. Don't forget to install required packages using `pip` or `conda` once the environment is activated. - -### Clone this package - -To install the `network_analysis` package, follow these steps: - -1. 
Clone the repository: - ```bash - git clone https://github.com/your-username/network_analysis.git - ``` -2. Navigate to the project directory: - ```bash - cd network_analysis - ``` - -3. Install the required dependencies: - ```bash - pip install -r requirements.txt - ``` - -Please be aware that the existing requirements.txt file includes only a limited set of packages at the moment, and it might not encompass all the necessary packages for your analysis. Make sure to supplement it with any additional packages you plan to install. - -## Usage -### Configuration -Configure the package by modifying the `src/config.py` file. Adjust parameters such as file paths, API keys, or any other configuration settings relevant to your use case. - -### Data Loading -The package provides a data loader module (`loader.py`) in the src directory. Use this module to load your network data into a format suitable for analysis. - -Example: - -```python -from src.loader import DataLoader - -# Initialize DataLoader -data_loader = DataLoader() - -# Load data from a Slack channel -slack_data = data_loader.load_slack_data("path/to/slack_channel_data") -``` - -## Utilities -Explore the various utilities available in the `src/utils.py` module. This module contains functions for common tasks such as data cleaning, preprocessing, and analysis. - -Example: - -```python -from src.utils import clean_data, visualize_network - -# Clean the loaded data -cleaned_data = clean_data(slack_data) - -# Visualize the network -visualize_network(cleaned_data) -``` - -## Testing -Run tests using the following command: - -```bash -make test -``` - -This will execute the unit tests located in the tests directory. - -## Documentation -Visit the docs directory for additional documentation and resources. The documentation covers important aspects such as code structure, best practices, and additional usage examples. 
-
-## Notebooks
-The notebooks directory contains Jupyter notebooks that demonstrate specific use cases and analyses. Refer to these notebooks for hands-on examples.
-
-## Contributing
-Contributions are welcome! Before contributing, please review our contribution guidelines.
-
-## License
-This project is licensed under the MIT License.
-
-## Network Analysis
-
-This is a starter python package to analyze the slack data to learn about
-
-* Patterns of users' messaging behaviour
-* Patterns of replies and reactions of users to messages posted both by peers and admins
-* Discover sub-communities by building network graphs of message senders and those who reply or react to those messages
-
-
-
-
-
+https://github.com/aronsinkie/Semitic-NLP/assets/74707268/bfa3588c-b0d4-4890-abae-5b04addd797b
+https://github.com/aronsinkie/Semitic-NLP/files/13539983/streamlit.pdf
diff --git a/add_data.py b/add_data.py
new file mode 100644
index 0000000..5581786
--- /dev/null
+++ b/add_data.py
@@ -0,0 +1,219 @@
+import os
+import pandas as pd
+import mysql.connector as mysql
+from mysql.connector import Error
+
+def DBConnect(dbName=None):
+    """Open a connection to the local MySQL server.
+
+    Parameters
+    ----------
+    dbName : str, optional
+        Database to select (default: None, no database selected).
+
+    Returns
+    -------
+    tuple
+        An open connection and a buffered cursor.
+    """
+    # Read the password from the environment instead of hard-coding
+    # credentials in the source.
+    conn = mysql.connect(host='localhost',
+                         user='root',
+                         password=os.getenv('MYSQL_PASSWORD', ''),
+                         database=dbName,
+                         buffered=True)
+    cur = conn.cursor()
+    return conn, cur
+
+def emojiDB(dbName: str) -> None:
+    """Switch the database to utf8mb4 so that emojis can be stored."""
+    conn, cur = DBConnect(dbName)
+    dbQuery = f"ALTER DATABASE {dbName} CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;"
+    cur.execute(dbQuery)
+    conn.commit()
+
+def createDB(dbName: str) -> None:
+    """Create the database if it does not already exist.
+
+    Parameters
+    ----------
+    dbName : str
+        Name of the database to create.
+    """
+    conn, cur = DBConnect()
+    cur.execute(f"CREATE DATABASE IF NOT EXISTS {dbName};")
+    conn.commit()
+    cur.close()
+
+def createTables(dbName: str) -> None:
+    """
+
+    Parameters 
+    ----------
+    dbName : str
+        Database in which to create the tables.
+    """
+    conn, cur = DBConnect(dbName)
+    sqlFile = 'day5_schema.sql'
+    with open(sqlFile, 'r') as fd:
+        readSqlFile = fd.read()
+
+    sqlCommands = readSqlFile.split(';')
+
+    for command in sqlCommands:
+        try:
+            cur.execute(command)
+        except Exception as ex:
+            print("Command skipped: ", command)
+            print(ex)
+    conn.commit()
+    cur.close()
+
+def preprocess_df(df: pd.DataFrame) -> pd.DataFrame:
+    """Drop columns that are not stored in the database and fill NaNs with 0.
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        Raw tweet data.
+
+    Returns
+    -------
+    pd.DataFrame
+        The cleaned data frame.
+    """
+    cols_2_drop = ['Unnamed: 0', 'timestamp', 'sentiment', 'possibly_sensitive', 'original_text']
+    try:
+        df = df.drop(columns=cols_2_drop)
+        df = df.fillna(0)
+    except KeyError as e:
+        print("Error:", e)
+
+    return df
+
+
+def insert_to_tweet_table(dbName: str, df: pd.DataFrame, table_name: str) -> None:
+    """Insert every row of the data frame into the given table.
+
+    Parameters
+    ----------
+    dbName : str
+        Target database.
+    df : pd.DataFrame
+        Cleaned tweet data.
+    table_name : str
+        Target table.
+    """
+    conn, cur = DBConnect(dbName)
+
+    df = preprocess_df(df)
+
+    for _, row in df.iterrows():
+        sqlQuery = f"""INSERT INTO {table_name} (created_at, source, clean_text, polarity, subjectivity, language,
+                       favorite_count, retweet_count, original_author, screen_count, followers_count, friends_count,
+                       hashtags, user_mentions, place, place_coordinate)
+                       VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s);"""
+        # Positional access via iloc avoids pandas' deprecated integer lookup.
+        data = tuple(row.iloc[:16])
+
+        try:
+            # Execute the SQL command
+            cur.execute(sqlQuery, data)
+            # Commit your changes in the database
+            conn.commit()
+            print("Data Inserted Successfully")
+        except Exception as e:
+            conn.rollback()
+            print("Error: ", 
e)
+    return
+
+def db_execute_fetch(*args, many=False, tablename='', rdf=True, **kwargs) -> pd.DataFrame:
+    """Execute a query and fetch its results.
+
+    Parameters
+    ----------
+    *args :
+        Query (and parameters) passed to the cursor.
+    many : bool
+        Use executemany instead of execute (default: False).
+    tablename : str
+        Table name used in the log message (default: '').
+    rdf : bool
+        Return a pd.DataFrame instead of a list of tuples (default: True).
+    **kwargs :
+        Forwarded to DBConnect (e.g. dbName).
+
+    Returns
+    -------
+    pd.DataFrame or list
+        The fetched rows.
+    """
+    connection, cursor1 = DBConnect(**kwargs)
+    if many:
+        cursor1.executemany(*args)
+    else:
+        cursor1.execute(*args)
+
+    # get column names
+    field_names = [i[0] for i in cursor1.description]
+
+    # get column values
+    res = cursor1.fetchall()
+
+    # get row count and show info
+    nrow = cursor1.rowcount
+    if tablename:
+        print(f"{nrow} records fetched from {tablename} table")
+
+    cursor1.close()
+    connection.close()
+
+    # return result
+    if rdf:
+        return pd.DataFrame(res, columns=field_names)
+    else:
+        return res
+
+
+if __name__ == "__main__":
+    createDB(dbName='tweets')
+    emojiDB(dbName='tweets')
+    createTables(dbName='tweets')
+
+    df = pd.read_csv(r'C:\Users\CIAD\Downloads\Compressed\Week-0-20220429T071543Z-001\Week-0\Tuesday\cleaned_fintech_data.csv')
+
+    insert_to_tweet_table(dbName='tweets', df=df, table_name='TweetInformation')
diff --git a/bandicam 2023-12-03 08-37-51-790.mp4 b/bandicam 2023-12-03 08-37-51-790.mp4
new file mode 100644
index 0000000..40d733e
--- /dev/null
+++ b/bandicam 2023-12-03 08-37-51-790.mp4
@@ -0,0 +1 @@
+bandicam 2023-12-03 08-37-51-790.mp4
diff --git a/day5.py b/day5.py
new file mode 100644
index 0000000..83d38ef
--- /dev/null
+++ b/day5.py
@@ -0,0 +1,93 @@
+import numpy as np
+import pandas as pd
+import streamlit as st
+import altair as alt
+from wordcloud import WordCloud
+import plotly.express as px
+from add_data import db_execute_fetch
+
+st.set_page_config(page_title="Day 5", layout="wide")
+
+def loadData():
+    query = "select * from TweetInformation"
+    df = db_execute_fetch(query, dbName="tweets", rdf=True)
+    return df
+
+def selectHashTag():
+    df = loadData()
+    hashTags = st.multiselect("choose combination of hashtags", 
list(df['hashtags'].unique())) + if hashTags: + df = df[df['hashtags'].apply(lambda x: any(tag in x for tag in hashTags))] + st.write(df) + +def selectLocAndAuth(): + df = loadData() + location = st.multiselect("choose Location of tweets", list(df['place_coordinate'].unique())) + lang = st.multiselect("choose Language of tweets", list(df['language'].unique())) + + if location and not lang: + df = df[df['place_coordinate'].isin(location)] + st.write(df) + elif lang and not location: + df = df[df['language'].isin(lang)] + st.write(df) + elif lang and location: + df = df[df['place_coordinate'].isin(location) & df['language'].isin(lang)] + st.write(df) + else: + st.write(df) + +def barChart(data, title, X, Y): + title = title.title() + st.title(f'{title} Chart') + msgChart = (alt.Chart(data).mark_bar().encode(alt.X(f"{X}:N", sort=alt.EncodingSortField(field=f"{Y}", op="values", + order='ascending')), y=f"{Y}:Q")) + st.altair_chart(msgChart, use_container_width=True) + +def wordCloud(): + df = loadData() + cleanText = '' + for text in df['clean_text']: + tokens = str(text).lower().split() + cleanText += " ".join(tokens) + " " + + wc = WordCloud(width=650, height=450, background_color='white', min_font_size=5).generate(cleanText) + st.title("Tweet Text Word Cloud") + st.image(wc.to_array()) + +def stBarChart(): + df = loadData() + dfCount = pd.DataFrame({'Tweet_count': df.groupby(['original_author'])['clean_text'].count()}).reset_index() + dfCount["original_author"] = dfCount["original_author"].astype(str) + dfCount = dfCount.sort_values("Tweet_count", ascending=False) + + num = st.slider("Select number of Rankings", 0, 50, 5) + title = f"Top {num} Ranking By Number of tweets" + barChart(dfCount.head(num), title, "original_author", "Tweet_count") + +def langPie(): + df = loadData() + dfLangCount = pd.DataFrame({'Tweet_count': df.groupby(['language'])['clean_text'].count()}).reset_index() + dfLangCount["language"] = dfLangCount["language"].astype(str) + dfLangCount = 
dfLangCount.sort_values("Tweet_count", ascending=False)
+    dfLangCount.loc[dfLangCount['Tweet_count'] < 10, 'language'] = 'Other languages'
+    st.title("Tweets Language Pie Chart")
+    fig = px.pie(dfLangCount, values='Tweet_count', names='language', width=500, height=350)
+    fig.update_traces(textposition='inside', textinfo='percent+label')
+
+    colB1, colB2 = st.columns([2.5, 1])
+
+    with colB1:
+        st.plotly_chart(fig)
+    with colB2:
+        st.write(dfLangCount)
+
+st.title("Slack Message Display")
+#selectHashTag()
+st.markdown("
<h3 style='text-align: center;'>Section Break</h3>
", unsafe_allow_html=True) +#selectLocAndAuth() +st.title("Data Visualizations") + +with st.expander("Show More Graphs"): + stBarChart() + langPie() \ No newline at end of file diff --git a/day5_schema.sql b/day5_schema.sql new file mode 100644 index 0000000..3126f16 --- /dev/null +++ b/day5_schema.sql @@ -0,0 +1,24 @@ + + +CREATE TABLE IF NOT EXISTS `TweetInformation` +( + `id` INT NOT NULL AUTO_INCREMENT, + `created_at` TEXT NOT NULL, + `source` VARCHAR(200) NOT NULL, + `clean_text` TEXT DEFAULT NULL, + `polarity` FLOAT DEFAULT NULL, + `subjectivity` FLOAT DEFAULT NULL, + `language` TEXT DEFAULT NULL, + `favorite_count` INT DEFAULT NULL, + `retweet_count` INT DEFAULT NULL, + `original_author` TEXT DEFAULT NULL, + `screen_count` INT NOT NULL, + `followers_count` INT DEFAULT NULL, + `friends_count` INT DEFAULT NULL, + `hashtags` TEXT DEFAULT NULL, + `user_mentions` TEXT DEFAULT NULL, + `place` TEXT DEFAULT NULL, + `place_coordinate` VARCHAR(100) DEFAULT NULL, + PRIMARY KEY (`id`) +) +ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE utf8mb4_unicode_ci; diff --git a/streamlit.pdf b/streamlit.pdf new file mode 100644 index 0000000..e1ba277 Binary files /dev/null and b/streamlit.pdf differ
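The cleaning step that `add_data.py` applies before inserting rows can be exercised without a MySQL server. Below is a minimal standalone sketch of that step on a toy frame; the column names mirror the script, while `errors='ignore'` is an addition over the original so the function tolerates frames that lack some of the listed columns:

```python
import pandas as pd

# Standalone sketch of the cleaning performed by preprocess_df in
# add_data.py: drop helper columns that the TweetInformation table
# does not store, then replace missing values with 0.
def preprocess_df(df: pd.DataFrame) -> pd.DataFrame:
    cols_2_drop = ['Unnamed: 0', 'timestamp', 'sentiment',
                   'possibly_sensitive', 'original_text']
    # errors='ignore' (an assumption added here) skips columns that
    # are absent instead of raising a KeyError.
    df = df.drop(columns=cols_2_drop, errors='ignore')
    return df.fillna(0)

raw = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'clean_text': ['hello world', None],
    'polarity': [0.5, None],
})
clean = preprocess_df(raw)
print(list(clean.columns))          # ['clean_text', 'polarity']
print(clean['polarity'].tolist())   # [0.5, 0.0]
```

Because the frame is cleaned before `insert_to_tweet_table` builds its parameter tuples, NaNs never reach the `%s` placeholders in the INSERT statement.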