lib-ml is a Python library designed for preprocessing text data, particularly for sentiment analysis models. It provides utilities to clean, tokenize, and preprocess text data efficiently.
preprocess.py provides methods that achieve the following:
- Text Cleaning: Removes unwanted characters, HTML tags, and special symbols.
- Lowercasing: Converts all text to lowercase for uniformity.
- Tokenization: Splits text into individual words (tokens).
- Stopword Removal: Filters out common English stopwords, with an exception for the word "not" to preserve negations.
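The steps above can be illustrated with a minimal, standard-library-only sketch. This is not the code in preprocess.py: the function name `sketch_preprocess`, the tiny `STOPWORDS` set, and the regular expressions are all assumptions made for illustration. The sample output later in this README suggests the library also applies stemming, which this sketch leaves out.

```python
import re

# Hypothetical, deliberately small stopword list for illustration only;
# the library presumably relies on a fuller English stopword list.
STOPWORDS = {"i", "this", "is", "a", "an", "the", "it", "and", "to", "of", "not"}

# Words that stay in the text even though they appear in the stopword list.
KEEP = {"not"}


def sketch_preprocess(text: str) -> list[str]:
    """Clean, lowercase, tokenize, and filter one review (assumed pipeline)."""
    # Text cleaning: drop anything that looks like an HTML tag.
    text = re.sub(r"<[^>]+>", " ", text)
    # Text cleaning: replace every non-letter character with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercasing and whitespace tokenization.
    tokens = text.lower().split()
    # Stopword removal, with "not" exempted to preserve negations.
    effective_stopwords = STOPWORDS - KEEP
    return [token for token in tokens if token not in effective_stopwords]


if __name__ == "__main__":
    print(sketch_preprocess("This is not good."))
    print(sketch_preprocess("<b>Great</b> movie, not boring!"))
```

Keeping "not" matters for sentiment analysis: without the exemption, "This is not good." would be reduced to "good" and lose its negative meaning.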
To install the library after a release is published, run the following command (the example uses v0.1.0; change the tag as needed):
pip install git+https://github.com/remla25-team6/lib-ml@v0.1.0
You can use the preprocess function as follows:
import pandas as pd
from lib_ml.preprocess import preprocess
# Example dataset
data = {'Review': ["I <3 love this product!", "This is not good.", "Enjoyable - experience"]}
dataset = pd.DataFrame(data)
# Preprocess the reviews
num_reviews = len(dataset)
corpus = preprocess(dataset, num_reviews)
print(corpus)
# Output: ['love product', 'not good', 'enjoy experi']
To set up a local development environment:
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # On Linux/macOS
venv\Scripts\activate # On Windows
# Upgrade pip and install dependencies
pip install --upgrade pip
pip install . # Installs from pyproject.toml
This project uses GitHub Actions for automated releases.
To publish an official release:
- Ensure all changes are committed and pushed to the desired release branch.
- Tag the commit with a version like v1.0.0 and push the tag:
  git tag v1.0.0
  git push origin v1.0.0
- This triggers the release.yml workflow, which:
  - Builds the package from main.
  - Updates the version in pyproject.toml.
  - Publishes the package as a GitHub release with the tag name.
To publish a pre-release:
- Push a commit to the main branch (i.e., merge a pull request into main).
- The prerelease.yml workflow runs automatically on every commit to main.
- It creates a pre-release using the current timestamp (e.g., 0.1.0-pre.20250625.123456).
- These packages are available via:
pip install git+https://github.com/remla25-team6/lib-ml@<pre-release-tag>
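To make the pre-release tag format concrete, here is a hedged sketch of how such a tag could be assembled. The function `build_prerelease_tag` is hypothetical and is not taken from prerelease.yml; it assumes the timestamp part of the example tag is laid out as YYYYMMDD.HHMMSS and that UTC is used, neither of which this README confirms.

```python
from datetime import datetime, timezone
from typing import Optional


def build_prerelease_tag(base_version: str, now: Optional[datetime] = None) -> str:
    """Return a tag shaped like '0.1.0-pre.20250625.123456' (assumed format)."""
    if now is None:
        # Assumption: the workflow stamps pre-releases in UTC.
        now = datetime.now(timezone.utc)
    # Assumed layout: date as YYYYMMDD, a dot, then time as HHMMSS.
    stamp = now.strftime("%Y%m%d.%H%M%S")
    return f"{base_version}-pre.{stamp}"


if __name__ == "__main__":
    # A fixed timestamp reproduces the example tag from this README.
    print(build_prerelease_tag("0.1.0", datetime(2025, 6, 25, 12, 34, 56)))
```

The resulting string is what would replace <pre-release-tag> in the install command above.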
Used ChatGPT-4o to refine README.