Skip to content
/ ml_project_template Public template

A scalable and modular template for machine learning projects, featuring CI/CD integration, configuration management, robust testing, Dockerization, and comprehensive documentation. Ideal for production-ready ML workflows.

License

Notifications You must be signed in to change notification settings

kjpou1/ml_project_template

Repository files navigation

ml_project_template

A scalable and modular template for machine learning projects, featuring CI/CD integration, configuration management, robust testing, Dockerization, and comprehensive documentation. Ideal for production-ready ML workflows.


Table of Contents


Description

This project provides a reusable template to kickstart machine learning workflows. Designed for scalability, modularity, and production readiness, it features:

  • Clear pipeline-based architecture for training and prediction.
  • REST API and web applications for serving predictions.
  • Robust testing for ensuring reliability.

Features

  • Modular pipelines for data ingestion, transformation, model selection, and prediction.
  • REST API and Flask-based web UI for prediction services.
  • Logging and error handling with customizable configurations.
  • Comprehensive test suite with coverage reports.

Datasets

  • Source: [Provide dataset source here, e.g., Hugging Face Dataset Library]
  • Preprocessing: Includes normalization, handling missing values, and feature engineering.
  • Licensing: [Specify dataset licensing details here]

Model

  • Algorithms: Implements Random Forest, Linear Regression, XGBoost, CatBoost, and more.
  • Hyperparameters: Configurable via YAML/JSON files.
  • Evaluation: Selects the best model based on R² scores across various algorithms.

Results

  • Performance Metrics:
    • Random Forest: R² = 0.845
    • Linear Regression: R² = 0.885
    • XGBoost: R² = 0.875
  • Artifacts:
    • Best Model: Saved as model.pkl in the artifacts/ directory.
    • Preprocessor: Saved as preprocessor.pkl.

Installation

  1. Clone the repository:
    git clone [repository-url]
    cd ml_project_template
  2. Install dependencies:
    pip install -r requirements.txt
  3. (Optional) Set up a virtual environment:
    python -m venv venv  
    source venv/bin/activate  # On Linux/Mac  
    venv\Scripts\activate     # On Windows  

Usage

The project supports two primary workflows:

  1. Data Ingestion (Download & Prepare Datasets)
  2. Model Training (Train ML Models)

These workflows can be executed via command-line arguments.


1. Running Data Ingestion

To ingest data, use the following command:

python launch.py ingest --config config/ingestion_config.yaml --debug

Available Options

Argument Description Default
--config (Optional) Path to ingestion configuration file None
--debug (Optional) Enable debug mode False

2. Running Model Training

To train a model, run:

python launch.py train --config config/model_config.yaml --debug

Available Options

Argument Description Default
--config (Optional) Path to training configuration file config/model_config.yaml
--debug (Optional) Enable debug mode False
--model-type (Optional) Specify one or more models to train (e.g., "RandomForest DecisionTree") Runs all models if not provided
--best-of-all (Optional) If set, overrides --model-type and trains all models to find the best one False
--save-best (Optional) If set, saves the best-performing model after training False

Example Runs

  • Run data ingestion with a custom config file:
    python launch.py ingest --config my_custom_ingestion.yaml
  • Run model training in debug mode:
    python launch.py train --debug
  • Train a specific model (e.g., CatBoost and Linear Regression):
    python launch.py train --model-type "CatBoosting Regressor" "Linear Regression"
  • Train all models and pick the best one automatically:
    python launch.py train --best-of-all
  • Train all models and save the best one:
    python launch.py train --best-of-all --save-best

Notes

  • If no --config is provided, default configurations will be used.
  • If --model-type is not provided, all models will be trained.
  • Using --best-of-all will override --model-type and automatically determine the best model.
  • The --save-best flag ensures that the best model is stored after training.
  • The script logs all activity for debugging and monitoring.

Configuration

  • Logging: Configurable via environment variables (LOG_LEVEL, LOG_JSON).
  • Hyperparameters: Adjustable in config/params.yaml.
  • Artifacts Directory: Defined in config/pipeline.yaml.

Project Structure

ml_project_template/  
├── artifacts/              # Model and preprocessor artifacts  
├── config/                 # Configuration files  
├── src/                    # Core project source code  
│   ├── pipeline/           # Pipelines for training and prediction  
│   ├── services/           # Modular services for ingestion, transformation, etc.  
│   ├── utils/              # Utility functions for file handling and ML helpers  
├── tests/                  # Test cases  
├── requirements.txt        # Python dependencies  
└── README.md               # Project documentation  

Technologies Used

  • Python
  • Scikit-learn, XGBoost, CatBoost
  • FastAPI, Flask
  • Pytest
  • Docker (optional)

Automated Test Suite Documentation

For a detailed overview of the testing framework, including test categories, execution instructions, and coverage reports, refer to the Test Suite Documentation. This document provides insights into how the system is validated for robustness, correctness, and reliability across different components.


Sample Apps

For a comprehensive guide on the available sample applications, including their functionality, usage instructions, and integration details, refer to the Sample Apps Readme. This document provides an in-depth overview of each application, explaining how they interact with the system and facilitate predictions through different interfaces.


Contributing

  1. Fork the repository.
  2. Create a new feature branch:
    git checkout -b feature/your-feature-name  
  3. Commit your changes:
    git commit -m "Add your message"  
  4. Push to the branch:
    git push origin feature/your-feature-name  
  5. Submit a pull request.

License

[Specify the license here, e.g., MIT License.]


Acknowledgements

  • Scikit-learn documentation for algorithm support.
  • Community contributors for feedback and improvements.

About

A scalable and modular template for machine learning projects, featuring CI/CD integration, configuration management, robust testing, Dockerization, and comprehensive documentation. Ideal for production-ready ML workflows.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published