12- Data Mining / Project 2 – Clustering Algorithms Exploration and Comparison

Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)

Important

⚠️ Heads Up

Projects and deliverables may be made publicly available whenever possible.
The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
All activities comply with the academic and ethical guidelines of PUC-SP.
Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.

🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix

📺 For better resolution, watch the video on YouTube.

Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

☞ Access Data Mining Main Repository

📚 Overview / Visão Geral

🇬🇧 This code performs exploratory and clustering analysis of the dataset "Grupo4.csv" for a classroom project. It cleans and preprocesses the data, applies three clustering algorithms (K-Means, Mean-Shift, Affinity Propagation), and visualizes and compares the results in Python.

🇧🇷 Este código realiza uma análise exploratória e de agrupamento no dataset "Grupo4.csv" para um projeto de sala de aula. Inclui limpeza e pré-processamento dos dados, além de aplicar três algoritmos de agrupamento (K-Means, Mean-Shift, Propagação por Afinidade), visualizando e comparando os resultados em Python.
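
As a rough illustration of the comparison described above, the sketch below fits the three algorithms and reports the number of clusters and the silhouette score for each. It is not the project's notebook code: the `make_blobs` data stands in for "Grupo4.csv", and `n_clusters=3`, the default Mean-Shift bandwidth, and the silhouette metric are all assumptions chosen for this example.

```python
from sklearn.cluster import AffinityPropagation, KMeans, MeanShift
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real dataset (assumption: 3 well-separated groups).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

# Hyperparameters are illustrative, not taken from the project.
models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Mean-Shift": MeanShift(),
    "Affinity Propagation": AffinityPropagation(random_state=42),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    n_clusters = len(set(labels))
    # Silhouette is only defined for 2 <= n_clusters <= n_samples - 1.
    score = silhouette_score(X, labels) if 1 < n_clusters < len(X) else float("nan")
    results[name] = {"labels": labels, "n_clusters": n_clusters, "silhouette": score}
    print(f"{name}: {n_clusters} clusters, silhouette = {score:.3f}")
```

K-Means needs the number of clusters up front, while Mean-Shift and Affinity Propagation infer it from the data, so the cluster counts can differ between the three even on the same input.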

🚦 Steps/Células do Código

1. Import pandas and load the dataset / Importar pandas e carregar o dataset

import pandas as pd
df = pd.read_csv('Grupo4.csv')
df.head()

2. Display dataset dimensions and statistics / Mostrar dimensões e estatísticas do dataset

num_rows, num_cols = df.shape
print(f"🇬🇧 Number of rows: {num_rows}, Number of columns: {num_cols}")
print(f"🇧🇷 Número de linhas: {num_rows}, Número de colunas: {num_cols}")
display(df.describe())

3. Remove 'Unnamed: 0' column (if exists) / Remover coluna 'Unnamed: 0' (se existir)

if 'Unnamed: 0' in df.columns:
    df.drop('Unnamed: 0', axis=1, inplace=True)
    print("🇬🇧 'Unnamed: 0' column dropped. 🇧🇷 Coluna 'Unnamed: 0' removida.")
else:
    print("🇬🇧 'Unnamed: 0' column not found. 🇧🇷 Coluna 'Unnamed: 0' não encontrada.")

4. Show missing values before filling / Mostrar valores faltantes antes de preencher

print("🇬🇧 Missing values per column before filling:")
print("🇧🇷 Valores faltantes por coluna antes do preenchimento:")
print(df.isnull().sum())

5. Fill missing values with column median / Preencher valores faltantes com a mediana

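# numeric_only=True avoids a TypeError in pandas >= 2.0 when df has non-numeric columns
column_medians = df.median(numeric_only=True)
df.fillna(column_medians, inplace=True)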
print("🇬🇧 Missing values filled with medians. 🇧🇷 Valores faltantes preenchidos com as medianas.")
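
To see what the median fill does, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical and are not taken from "Grupo4.csv".

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one numeric gap and a text column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 100.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "label": ["x", "y", None, "z"],
})

# numeric_only=True skips the text column instead of raising on it.
medians = toy.median(numeric_only=True)

# fillna with a Series fills only the columns named in its index.
filled = toy.fillna(medians)

print(medians)
print(filled)
```

The gap in column `a` is filled with 3.0, the median of 1, 3 and 100; the mean would have been about 34.7, pulled up by the outlier. The missing value in the text column is left untouched because it has no median.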

6. Remove duplicate rows / Remover registros duplicados

initial_rows = df.shape[0]
df.drop_duplicates(inplace=True)
rows_after_duplicates = df.shape[0]
print(f"🇬🇧 Duplicates removed: {initial_rows - rows_after_duplicates}")
print(f"🇧🇷 Duplicados removidos: {initial_rows - rows_after_duplicates}")

7. Display the preprocessed DataFrame / Mostrar o DataFrame após processamento

display(df.head())
num_rows_preprocessed, num_cols_preprocessed = df.shape
print(f"🇬🇧 After preprocessing: {num_rows_preprocessed} rows, {num_cols_preprocessed} columns")
print(f"🇧🇷 Após o pré-processamento: {num_rows_preprocessed} linhas, {num_cols_preprocessed} colunas")
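
Steps 3 to 6 could also be bundled into one function that returns a cleaned copy instead of mutating `df` in place. The sketch below is one possible arrangement, not the notebook's code: the function name `preprocess`, the `reset_index` call, and the toy values are assumptions added for illustration.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: drop the index column, fill gaps, remove duplicates."""
    out = df.copy()
    # errors="ignore" makes the drop a no-op when the column is absent.
    out = out.drop(columns=["Unnamed: 0"], errors="ignore")
    # Fill numeric gaps with each column's median.
    out = out.fillna(out.median(numeric_only=True))
    # Drop exact duplicate rows and renumber the index (renumbering is an addition here).
    out = out.drop_duplicates().reset_index(drop=True)
    return out


# Hypothetical toy data: rows 0 and 2 become identical once the index column is dropped.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Coluna1": [1.0, 2.0, 1.0, None],
    "Coluna2": [5.0, 6.0, 5.0, 7.0],
})
clean = preprocess(raw)
print(clean)
```

Working on a copy keeps the raw data available for comparison, which helps when checking how many rows each cleaning step removed.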

8. Scatter plot (12x8, dark mode turquoise) / Gráfico de dispersão (12x8, modo escuro turquesa)

import matplotlib.pyplot as plt
import seaborn as sns

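# Dark background to match the "dark mode" named in the step title
plt.style.use('dark_background')

plt.figure(figsize=(12, 8))
# Turquoise markers; 'Coluna1' and 'Coluna2' must exist in df
sns.scatterplot(data=df, x='Coluna1', y='Coluna2', color='turquoise')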
plt.title('Scatter Plot of Coluna1 vs Coluna2 / Gráfico de Dispersão Coluna1 vs Coluna2')
plt.show()
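
Once cluster labels exist, the same scatter plot can be colored by cluster to compare results visually. This is a minimal sketch under stated assumptions: the toy values, `n_clusters=2`, and the output file name `clusters.png` are hypothetical, and K-Means is used only as an example labeler.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data; the real notebook would use the preprocessed df.
toy = pd.DataFrame({
    "Coluna1": [1.0, 1.2, 0.8, 8.0, 8.3, 7.9],
    "Coluna2": [1.0, 0.9, 1.1, 8.0, 8.2, 7.8],
})

# n_clusters=2 is an assumption that fits this toy data only.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = model.fit_predict(toy[["Coluna1", "Coluna2"]])

fig, ax = plt.subplots(figsize=(12, 8))
# Each point is colored by its cluster label.
ax.scatter(toy["Coluna1"], toy["Coluna2"], c=toy["cluster"], cmap="viridis")
ax.set_xlabel("Coluna1")
ax.set_ylabel("Coluna2")
ax.set_title("Coluna1 vs Coluna2, colored by cluster")
fig.savefig("clusters.png")
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` used in place of `fig.savefig`.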

21- Our Crew:

👨🏽‍🚀 Andson Ribeiro - Slide into my inbox
👩🏻‍🚀 Fabiana ⚡️ Campanari - Shoot me an email
👨🏽‍🚀 José Augusto de Souza Oliveira - email
🧑🏼‍🚀 Luan Fabiano - email
👨🏽‍🚀 Pedro Barrenco - email
🧑🏼‍🚀 Pedro Vyctor - Hit me up by email

💌 Let the data flow... Ping Me!

🛸 My Contacts Hub


Copyright 2025 Quantum Software Development. Code released under the MIT License.
