Merged
5 changes: 5 additions & 0 deletions .github/workflows/test.yaml
@@ -48,6 +48,11 @@ jobs:
uv run ruff check ./src ./app ./tests
uv run ruff format --check ./src ./app ./tests

# Run mypy type checking
- name: Run mypy type checking
run: |
uv run mypy src/ app/ tests/ --ignore-missing-imports

# Run pytest (excludes audio-dependent modules like speech_to_text)
- name: Run tests with pytest
run: |
1 change: 1 addition & 0 deletions .gitignore
@@ -195,3 +195,4 @@ src/models/*.json


app_simple.py
mypy_output.txt
8 changes: 8 additions & 0 deletions .pre-commit-config.yaml
@@ -25,6 +25,14 @@ repos:

- repo: local
hooks:
- id: mypy
name: mypy
entry: uv run mypy
language: system
types: [python]
files: ^(src/|app/|tests/).*\.py$
args: [--ignore-missing-imports]

- id: pytest
name: pytest
entry: uv run python -m pytest
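The hook's `files` regex is what scopes mypy to the project packages. A quick pure-Python sketch of how pre-commit would match file paths against that pattern (the helper name is illustrative, not part of pre-commit's API):

```python
import re

# The `files` pattern copied from the hook definition above
PATTERN = re.compile(r"^(src/|app/|tests/).*\.py$")

def hook_applies(path: str) -> bool:
    """Return True when this path would be handed to the mypy hook."""
    return PATTERN.match(path) is not None
```

Note the pattern is anchored at the start, so only paths relative to the repository root match.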
8 changes: 7 additions & 1 deletion Makefile
@@ -1,5 +1,5 @@
# Makefile for Sentiment Analysis
.PHONY: help install test lint format clean run
.PHONY: help install test lint format clean run type-check

help: ## Show available commands
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-15s\033[0m %s\n", $$1, $$2}'
@@ -20,6 +20,12 @@ lint: ## Check and fix code quality
uv run ruff check --fix ./src ./app ./tests
uv run ruff format ./src ./app ./tests

type-check: ## Run mypy type checking
uv run mypy src/ app/ tests/

type-check-strict: ## Run mypy with strict mode
uv run mypy --strict src/ app/

format: ## Format code only
uv run ruff format ./src ./app ./tests

22 changes: 22 additions & 0 deletions README.md
@@ -1,10 +1,21 @@
[![Linting: Ruff](https://img.shields.io/badge/linting-ruff-yellowgreen)](https://github.com/charliermarsh/ruff)
[![Type Checking: mypy](https://img.shields.io/badge/type%20checking-mypy-blue)](http://mypy-lang.org/)
[![CI: Passed](https://img.shields.io/badge/CI-Passed-brightgreen)](https://github.com/jvachier/Sentiment_Analysis/actions/workflows/test.yaml)
[![Tests: pytest](https://img.shields.io/badge/tests-pytest-orange)](https://docs.pytest.org/)
[![Pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)
[![Deep Learning](https://img.shields.io/badge/Deep%20Learning-TensorFlow-orange)](https://www.tensorflow.org/)
[![Keras](https://img.shields.io/badge/Keras-red)](https://keras.io/)
[![TensorFlow](https://img.shields.io/badge/TensorFlow-2.0%2B-orange)](https://www.tensorflow.org/)
[![Python](https://img.shields.io/badge/Python-3.11%2B-blue)](https://www.python.org/)
[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)
[![NLP](https://img.shields.io/badge/NLP-Natural%20Language%20Processing-green)](https://en.wikipedia.org/wiki/Natural_language_processing)
[![Transformers](https://img.shields.io/badge/Transformers-From%20Scratch-blueviolet)](https://arxiv.org/abs/1706.03762)
[![Neural Machine Translation](https://img.shields.io/badge/Neural-Machine%20Translation-purple)](https://en.wikipedia.org/wiki/Neural_machine_translation)
[![Sentiment Analysis](https://img.shields.io/badge/Sentiment-Analysis-pink)](https://en.wikipedia.org/wiki/Sentiment_analysis)
[![Speech Recognition](https://img.shields.io/badge/Speech-Recognition-cyan)](https://en.wikipedia.org/wiki/Speech_recognition)
[![Gradio](https://img.shields.io/badge/UI-Gradio-ff7c00)](https://gradio.app/)
[![Optuna](https://img.shields.io/badge/Hyperparameter-Optuna-lightblue)](https://optuna.org/)
[![Dash](https://img.shields.io/badge/Dashboard-Dash-blue)](https://dash.plotly.com/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

# Sentiment Analysis and Translation
@@ -292,6 +303,17 @@ Sentiment_Analysis/

This Kaggle notebook provides a detailed tutorial on the transformer architecture implemented in this repository.

---

## Live Demo

**HuggingFace Space: English-to-French Translator**
- Try the enhanced Transformer model live in your browser
- Real-time translation with greedy and beam search decoding
- No installation required - instant access
- [Launch Demo on HuggingFace](https://huggingface.co/spaces/Jvachier/transformer-nmt-en-fr)


---

## Customization
10 changes: 10 additions & 0 deletions gradio_apps/README.md
@@ -18,6 +18,16 @@ tags:
- tensorflow
- keras
- from-scratch
- nlp
- seq2seq
- attention-mechanism
- encoder-decoder
- deep-learning
- machine-translation
- multilingual
- text-generation
- custom-model
- educational
---

# English to French Enhanced Transformer
7 changes: 4 additions & 3 deletions gradio_apps/requirements.txt
@@ -1,3 +1,4 @@
gradio==4.0.0
tensorflow==2.19.0
numpy==1.26.0
huggingface-hub==0.25.1
tensorflow==2.20.0
numpy>=1.26.2
audioop-lts
40 changes: 40 additions & 0 deletions pyproject.toml
@@ -48,6 +48,8 @@ dev = [
"ruff>=0.2.1",
"scikit-optimize>=0.9.0",
"pre-commit>=4.0.0",
"mypy>=1.8.0",
"types-tensorflow>=2.16.0",
]
macos = [
"tensorflow-io-gcs-filesystem<0.35.0",
@@ -69,3 +71,41 @@ build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src", "app"]

[tool.mypy]
python_version = "3.11"
warn_return_any = false
warn_unused_configs = true
disallow_untyped_defs = false
disallow_incomplete_defs = false
check_untyped_defs = true
disallow_untyped_calls = false
warn_redundant_casts = true
warn_unused_ignores = true
strict_optional = true
no_implicit_optional = true
ignore_missing_imports = true
show_error_codes = true
pretty = true

[[tool.mypy.overrides]]
module = [
"tensorflow.*",
"keras.*",
"nltk.*",
"vosk.*",
"dash.*",
"optuna.*",
"transformers.*",
"polars.*",
]
ignore_missing_imports = true
ignore_errors = true

[[tool.mypy.overrides]]
module = [
"src.translation_french_english",
"src.modules.optuna_transformer",
"tests.test_transformer_model",
]
disable_error_code = ["dict-item", "misc"]
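Of the settings added above, `no_implicit_optional = true` is the one most likely to surface new errors in existing code: a `None` default no longer implies an `Optional` type. A minimal sketch of the pattern it flags and the fix (function name is illustrative):

```python
from typing import Optional

# Rejected under no_implicit_optional, because the annotation omits None:
#     def greet(name: str = None) -> str: ...

def greet(name: Optional[str] = None) -> str:
    # Spelling out Optional makes the None default explicit and satisfies mypy
    return f"Hello, {name or 'world'}"
```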
4 changes: 2 additions & 2 deletions src/modules/data_preprocess_nltk.py
@@ -78,7 +78,7 @@ def encode(self, text_tensor, label):
"""
text = text_tensor.numpy().decode("utf-8")
text = self.preprocess_text(text)
encoded_text = self.tokenizer.texts_to_sequences([text])[0]
encoded_text = self.tokenizer.texts_to_sequences([text])[0] # type: ignore[union-attr]
return encoded_text, label

def fit_tokenizer(self, ds_raw):
@@ -163,7 +163,7 @@ def encode(self, text_tensor):
"""
text = text_tensor.numpy().decode("utf-8")
text = self.preprocess_text(text)
encoded_text = self.tokenizer.texts_to_sequences([text])[0]
encoded_text = self.tokenizer.texts_to_sequences([text])[0] # type: ignore[union-attr]
return encoded_text

def fit_tokenizer(self, ds_raw):
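The `# type: ignore[union-attr]` comments above suggest `self.tokenizer` is typed as `Optional` until `fit_tokenizer` runs. An alternative that avoids the ignore comment is to narrow the type with an explicit guard; a hypothetical stand-in class (not the project's actual code, with a plain dict standing in for a Keras `Tokenizer`):

```python
from typing import Optional

class EncoderSketch:
    def __init__(self) -> None:
        # None until fit() is called, hence the Optional annotation
        self.tokenizer: Optional[dict[str, int]] = None

    def fit(self, vocab: list[str]) -> None:
        self.tokenizer = {word: i + 1 for i, word in enumerate(vocab)}

    def encode(self, text: str) -> list[int]:
        if self.tokenizer is None:
            raise RuntimeError("Call fit() before encode()")
        # After the guard, mypy narrows self.tokenizer to dict[str, int],
        # so no `# type: ignore[union-attr]` is needed here
        return [self.tokenizer[w] for w in text.split() if w in self.tokenizer]
```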
13 changes: 11 additions & 2 deletions src/modules/data_processor.py
@@ -2,7 +2,7 @@
import tensorflow as tf
import string
import re
from typing import Tuple, Dict
from typing import Tuple, Dict, Optional


class DatasetProcessor:
@@ -20,14 +20,20 @@ def __init__(self, file_path: str, delimiters: str = r"|"):
"""
self.file_path = file_path
self.delimiters = delimiters
self.split_df = None
self.split_df: Optional[pl.DataFrame] = None
self.df: Optional[pl.DataFrame] = None

def load_data(self) -> None:
"""Load the Parquet file using Polars."""
self.df = pl.read_parquet(self.file_path)

def process_data(self) -> None:
"""Process the dataset by splitting, cleaning, and tokenizing."""
if self.df is None:
raise ValueError(
"Data must be loaded before processing. Call load_data() first."
)

# Split the 'en' column into rows based on delimiters
if "en" in self.df.columns:
en_split = self.df.select(pl.col("en").str.split(self.delimiters)).explode(
@@ -76,6 +82,9 @@ def shuffle_and_split(
"""

# Calculate the number of samples for validation and test sets
if self.split_df is None:
raise ValueError("Data must be processed before splitting")

num_val_samples = int(val_split * len(self.split_df))
num_train_samples = len(self.split_df) - 2 * num_val_samples

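The guards added to `process_data` and `shuffle_and_split` follow a common pattern for stateful loaders: methods that depend on earlier steps fail fast with a clear message instead of an `AttributeError` on `None`. A compact sketch of the same pattern, with a list standing in for the polars DataFrame (names are illustrative):

```python
from typing import Optional

class ProcessorSketch:
    def __init__(self) -> None:
        self.df: Optional[list[str]] = None  # stand-in for a pl.DataFrame

    def load_data(self) -> None:
        self.df = ["row-1", "row-2"]

    def process_data(self) -> int:
        # Fail fast with an actionable message if steps run out of order
        if self.df is None:
            raise ValueError(
                "Data must be loaded before processing. Call load_data() first."
            )
        return len(self.df)
```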
4 changes: 2 additions & 2 deletions src/modules/load_data.py
@@ -1,6 +1,6 @@
import pandas as pd
import tensorflow as tf
from pydantic import BaseModel, FilePath, Field, ValidationError
from pydantic import BaseModel, Field, ValidationError
from src.modules.utils import DatasetPaths


@@ -9,7 +9,7 @@ class DataLoaderConfig(BaseModel):
Configuration for the DataLoader class.
"""

data_path: FilePath = Field(
data_path: str = Field(
default=DatasetPaths.RAW_DATA.value,
description="Path to the CSV file containing the dataset.",
)
4 changes: 3 additions & 1 deletion src/modules/model_bert_other.py
@@ -141,7 +141,9 @@ def build_model(self, num_classes):
dropout = tf.keras.layers.Dropout(0.3)(cls_token)
output = tf.keras.layers.Dense(1, activation="sigmoid")(dropout)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
model: tf.keras.Model = tf.keras.Model(
inputs=[input_ids, attention_mask], outputs=output
)
model.compile(
optimizer=tf.keras.optimizers.legacy.RMSprop(
learning_rate=self.learning_rate
39 changes: 20 additions & 19 deletions src/modules/model_sentiment_analysis.py
@@ -62,7 +62,10 @@ def train_and_evaluate(
callbacks=[early_stopping_callback, callbacks_model],
)
test_results = model.evaluate(test_data)
logging.info("Test Accuracy: {:.2f}%".format(test_results[1] * 100))
if isinstance(test_results, list):
logging.info("Test Accuracy: {:.2f}%".format(test_results[1] * 100))
else:
logging.info("Test Accuracy: {:.2f}%".format(test_results * 100))

def inference_model(
self, model: tf.keras.Model, text_vec: tf.keras.layers.TextVectorization
@@ -77,10 +80,10 @@ def inference_model(
Returns:
tf.keras.Model: An inference model for sentiment prediction.
"""
inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
inputs: tf.Tensor = tf.keras.Input(shape=(1,), dtype=tf.string)
process_inputs = text_vec(inputs)
outputs = model(process_inputs)
inference_model = tf.keras.Model(inputs=inputs, outputs=outputs)
inference_model: tf.keras.Model = tf.keras.Model(inputs=inputs, outputs=outputs)
return inference_model


@@ -121,7 +124,7 @@ def optimize(
test_data (tf.data.Dataset): Test dataset.
"""

def _objective(trial):
def _objective(trial: optuna.trial.Trial) -> float:
"""
Objective function for Optuna to optimize the model's hyperparameters.

@@ -141,7 +144,7 @@ def _objective(trial):
)
)
n_layers_bidirectional = trial.suggest_int("n_units_bidirectional", 1, 3)
for i in range(n_layers_bidirectional):
for i in range(n_layers_bidirectional): # type: int
num_hidden_bidirectional = trial.suggest_int(
f"n_units_bidirectional_l{i}", 64, 128, log=True
)
@@ -165,8 +168,8 @@

model.add(tf.keras.layers.Dropout(self.dropout_rate))
n_layers_nn = trial.suggest_int("n_layers_nn", 1, 2)
for i in range(n_layers_nn):
num_hidden_nn = trial.suggest_int(f"n_units_nn_l{i}", 64, 128, log=True)
for j in range(n_layers_nn):
num_hidden_nn = trial.suggest_int(f"n_units_nn_l{j}", 64, 128, log=True)
model.add(tf.keras.layers.Dense(num_hidden_nn, activation="gelu"))

model.add(tf.keras.layers.Dropout(self.dropout_rate))
@@ -199,7 +202,9 @@ def _objective(trial):
)
# Evaluate the model accuracy on the validation set.
score = model.evaluate(test_data, verbose=1)
return score[1]
if isinstance(score, list):
return float(score[1])
return float(score)

# Create an Optuna study
study = optuna.create_study(
@@ -267,7 +272,7 @@ def get_model_api(self) -> tf.keras.Model:
dropout_layer2 = tf.keras.layers.Dropout(self.dropout_rate)(dense_layer2)
dense_layer3 = tf.keras.layers.Dense(32, activation="gelu")(dropout_layer2)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(dense_layer3)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model: tf.keras.Model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(
optimizer=tf.keras.optimizers.RMSprop(),
loss=tf.keras.losses.BinaryCrossentropy(),
@@ -282,13 +287,9 @@ def get_config(self) -> dict:
Returns:
dict: A dictionary containing the model's configuration.
"""
config = super().get_config()
config.update(
{
"embedding_dim": self.embedding_dim,
"lstm_units": self.lstm_units,
"dropout_rate": self.dropout_rate,
"max_token": self.max_token,
}
)
return config
return {
"embedding_dim": self.embedding_dim,
"lstm_units": self.lstm_units,
"dropout_rate": self.dropout_rate,
"max_token": self.max_token,
}
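The `isinstance(test_results, list)` branches above exist because Keras `Model.evaluate` returns a bare scalar when the model has only a loss, and a list `[loss, *metrics]` when metrics are attached. A sketch of the same normalization as a typed helper (the function name is illustrative):

```python
from typing import Union

def extract_accuracy(result: Union[float, list[float]]) -> float:
    """Normalize evaluate() output: [loss, accuracy] list or a bare loss scalar."""
    if isinstance(result, list):
        return float(result[1])  # second entry is the first metric (accuracy)
    return float(result)
```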
4 changes: 3 additions & 1 deletion src/modules/optuna_transformer.py
@@ -61,7 +61,9 @@ def build_transformer_model(
dropout_outputs
)

transformer = tf.keras.Model([encoder_inputs, decoder_inputs], final_outputs)
transformer: tf.keras.Model = tf.keras.Model(
[encoder_inputs, decoder_inputs], final_outputs
)

# Compile the model
transformer.compile(
2 changes: 1 addition & 1 deletion src/modules/sentiment_analysis_utils.py
@@ -58,7 +58,7 @@ def create_or_load_inference_model(
return tf.keras.models.load_model(inference_model_path)

logging.info("Creating and saving the inference model.")
trainer = ModelTrainer()
trainer = ModelTrainer(config_path=ModelPaths.MODEL_TRAINER_CONFIG.value)
inference_model = trainer.inference_model(model, text_vec)
inference_model.save(inference_model_path)
return inference_model
2 changes: 1 addition & 1 deletion src/modules/speech_to_text.py
@@ -45,7 +45,7 @@ def __init__(self, model_path: str):
)
self.stream.start_stream()
self.rec = vosk.KaldiRecognizer(self.model, 16000)
self.recognized_text = []
self.recognized_text: list[str] = []
self.recording = False

def start_recording(self) -> None:
6 changes: 3 additions & 3 deletions src/modules/transformer_components.py
@@ -118,7 +118,7 @@ def build(self, input_shape):
self.layernorm_3 = tf.keras.layers.LayerNormalization()
super().build(input_shape)

def call(self, inputs, encoder_outputs, mask=None):
def call(self, inputs, encoder_outputs, mask=None): # type: ignore[override]
causal_mask = self.get_causal_attention_mask(inputs)
if mask is not None:
padding_mask = tf.cast(mask[:, tf.newaxis, tf.newaxis, :], dtype="float32")
@@ -214,8 +214,8 @@ def evaluate_bleu(
references.append([ref_sentence])

# Calculate BLEU score
bleu_score = corpus_bleu(
references, candidates, smoothing_function=smoothing_function
bleu_score = float(
corpus_bleu(references, candidates, smoothing_function=smoothing_function)
)
logging.info(f"BLEU score evaluation completed: {bleu_score:.4f}")
return bleu_score
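Wrapping `corpus_bleu` in `float()` pins down a return value mypy otherwise sees as loosely typed, and guarantees the annotated `-> float` holds at runtime. A sketch of the same idea, with `Fraction` standing in for a library call whose concrete return type is unknown to the checker (names are illustrative):

```python
from fractions import Fraction

def untyped_metric() -> Fraction:
    # Stand-in for a third-party scoring call mypy can't type precisely
    return Fraction(3, 4)

def evaluate_metric_sketch() -> float:
    # float() makes both the static annotation and the runtime type concrete
    return float(untyped_metric())
```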