CyberBrain is an AI project designed for training language models on devices with limited hardware capabilities. This repository provides tools and scripts for fine-tuning large language models efficiently with minimal resources, whether you're using a low-end CPU or a GPU with limited VRAM.
In this project, we use technical content extracted from books as the primary training data. The raw text from these books is processed into instruction-response pairs tailored for fine-tuning models on ethical cyber security. You can access the books file used for training here.
```
Create-DataSet/    # Dataset creation scripts and the raw books file
LICENSE            # License information for the project
LoRA.py            # Applies Low-Rank Adaptation (LoRA) for efficient fine-tuning
README.md          # This file
load-CPU.py        # Loads the model for CPU-based training
load-GPU.py        # Loads the model for GPU-based training
loadToLoRA.py      # Converts the model into LoRA format for fine-tuning
requirements.txt   # Required dependencies for the project
trainer-v2.py      # Advanced trainer script for fine-tuning the model
trainer.py         # Basic trainer script for fine-tuning the model
```
```shell
git clone https://github.com/0PeterAdel/CyberBrain.git
cd CyberBrain
```
Create a new Conda environment with Python 3.11:
```shell
conda create -n deepseek python=3.11
conda activate deepseek
```
Install the necessary dependencies (including GCC for compatibility and cudatoolkit for GPU training):
```shell
conda install -c conda-forge gcc_linux-64 gxx_linux-64
conda install -c conda-forge cudatoolkit=11.7
pip install -r requirements.txt
```
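After installation, it can help to confirm that PyTorch actually sees your GPU before choosing between `load-GPU.py` and `load-CPU.py`. A small check along these lines (not part of the repository's scripts) works on any machine:

```python
def gpu_status() -> str:
    """Report whether a CUDA-capable GPU is visible to PyTorch."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed; run `pip install -r requirements.txt` first."
    if torch.cuda.is_available():
        return f"CUDA GPU detected: {torch.cuda.get_device_name(0)}"
    return "No CUDA GPU detected; use load-CPU.py for CPU-based training."

print(gpu_status())
```

If this reports no GPU, the CPU-based path still works; it is simply slower.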
To ensure smooth performance, here is the recommended hardware for training models efficiently:
| Hardware | Minimum Specification | Recommended Specification |
|---|---|---|
| GPU | No GPU (CPU-based training) | RTX 3090/4090 (24 GB VRAM) or equivalent |
| RAM | 16 GB | 64 GB+ |
| Storage | 100 GB available disk space | 1 TB+ SSD |
Important:
For best performance, a capable GPU is highly recommended. Before starting GPU-based training, make sure your graphics card is detected and correctly configured. For instance, on a GPU like the NVIDIA Quadro M620 Mobile (with around 1.956 GB of VRAM), special settings are required, such as offloading certain model components to the CPU. Our configuration keeps critical components (the embedding layers and the LM head) on the GPU while offloading the heavy transformer layers to the CPU, which is essential when GPU memory is limited.
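The split described here can be sketched as a custom `device_map` passed to Hugging Face `transformers`. The module names below follow the common Llama-style layout and the layer count is a placeholder; the project's actual map lives in `load-GPU.py`:

```python
def build_device_map(num_layers: int) -> dict:
    """Keep the embedding layer, final norm, and LM head on GPU 0,
    and offload every transformer block to the CPU."""
    device_map = {
        "model.embed_tokens": 0,  # embeddings stay on the GPU
        "model.norm": 0,
        "lm_head": 0,             # LM head stays on the GPU
    }
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = "cpu"  # heavy blocks go to the CPU
    return device_map

# Illustrative usage (requires transformers and the model weights):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "your-model-name",                 # placeholder model ID
#     device_map=build_device_map(32),   # e.g. 32 blocks for a 7B model
# )
```

Keeping the embeddings and LM head on the GPU matters because they are touched on every token, while the offloaded blocks trade speed for fitting into limited VRAM.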
- Source: The training data is derived from technical books. The raw texts from these books are processed into instruction-response pairs focused on ethical cyber security.
- Books File: The raw books file is available here.
- Processing: Scripts in the `Create-DataSet/` folder extract, clean, and format the text into a dataset ready for fine-tuning.
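As an illustration of the processing step, turning cleaned book text into instruction-response pairs might look like the sketch below. The chunking rule, instruction wording, and JSONL layout are assumptions, not the exact logic in `Create-DataSet/`:

```python
import json

def build_pairs(text: str, chunk_size: int = 4) -> list[dict]:
    """Split cleaned text into paragraph chunks and wrap each chunk
    in an instruction-response pair for supervised fine-tuning."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pairs = []
    for i in range(0, len(paragraphs), chunk_size):
        chunk = "\n\n".join(paragraphs[i : i + chunk_size])
        pairs.append({
            "instruction": "Explain the following cyber security topic.",
            "response": chunk,
        })
    return pairs

def write_jsonl(pairs: list[dict], path: str) -> None:
    """Write one JSON object per line, the common format for fine-tuning datasets."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```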
- GPU Setup: Before training, ensure that your GPU is detected and properly defined. For GPU-based training, our scripts (e.g., `load-GPU.py`) configure a custom device map to load critical parts (embeddings, LM head) on the GPU and offload heavy transformer layers to the CPU. This is crucial for running on GPUs with limited memory.
- Max Sequence Length: The `max_seq_length` parameter controls the context length for the model. A larger value (e.g., 2048) provides more context but uses more GPU memory. If your GPU has limited VRAM, consider reducing it (e.g., to 1024 or even 512) to balance context length and resource usage.
- Number of Epochs (`num_train_epochs`): The number of complete passes over the training dataset. For initial experiments, 1–3 epochs are recommended to limit training time and resource consumption.
- Gradient Accumulation (`gradient_accumulation_steps`): Simulates a larger batch size by accumulating gradients over several steps before performing an optimization update. For example, using 8 accumulation steps helps maintain training stability without increasing the memory load per step.
- Batch Size (`per_device_train_batch_size`): A small batch size (e.g., 1) conserves GPU memory, especially when fine-tuning large models on GPUs with limited VRAM.
- Overall: These settings are adjustable based on your hardware specifications. If you encounter memory issues, try reducing `max_seq_length`, lowering the number of epochs, or decreasing the gradient accumulation steps.
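Put together, the hyperparameters above might look like the following sketch. The values are the examples from this section (the learning rate is an assumed, commonly used LoRA value, not taken from the trainer scripts); with Hugging Face `transformers`, the dictionary can be unpacked into `TrainingArguments`:

```python
# Low-VRAM training configuration using the example values from this section.
training_config = {
    "num_train_epochs": 3,             # 1-3 passes for initial experiments
    "per_device_train_batch_size": 1,  # small batch to conserve GPU memory
    "gradient_accumulation_steps": 8,  # accumulate before each optimizer step
    "learning_rate": 2e-4,             # assumed value, common for LoRA runs
}
max_seq_length = 1024  # reduced from 2048 for GPUs with limited VRAM

# Effective batch size seen by the optimizer:
effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])

# Illustrative use with Hugging Face transformers:
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **training_config)
```

With a per-device batch of 1 and 8 accumulation steps, the optimizer effectively updates on 8 examples at a time while only one example's activations are in memory per step.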
- Trainer Scripts: The repository includes `trainer-v2.py` (advanced) and `trainer.py` (basic). Both scripts use the Hugging Face Trainer API to fine-tune the model on the processed dataset.
- Workflow:
  1. Load the model (with LoRA applied) using the appropriate loading script (`load-GPU.py` for GPU or `load-CPU.py` for CPU).
  2. Preprocess and tokenize the dataset.
  3. Start fine-tuning using the Trainer script with your custom settings.
  4. Evaluate and save the fine-tuned model for inference.
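For the LoRA step itself, a minimal sketch with the `peft` library is shown below; the rank, alpha, dropout, and target modules are assumptions, not necessarily the values used in `LoRA.py` or `loadToLoRA.py`. The helper underneath shows why LoRA is cheap: a rank-r adapter on a d_out x d_in weight trains only r * (d_out + d_in) parameters.

```python
# peft-based LoRA setup (illustrative; requires `pip install peft`):
# from peft import LoraConfig, get_peft_model
# lora_config = LoraConfig(
#     r=16, lora_alpha=32, lora_dropout=0.05,  # assumed values
#     target_modules=["q_proj", "v_proj"],     # assumed attention projections
#     task_type="CAUSAL_LM",
# )
# model = get_peft_model(model, lora_config)

def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters added by a rank-r LoRA adapter on a d_out x d_in weight:
    matrix B (d_out x r) plus matrix A (r x d_in)."""
    return r * (d_out + d_in)

# A 4096x4096 attention projection at rank 16 adds about 131k trainable
# parameters, versus roughly 16.8M in the frozen full weight.
```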
Once you have set up the environment and prepared your dataset, start the fine-tuning process by running one of the trainer scripts. For example, for GPU-based training with advanced settings, run:
```shell
python trainer-v2.py
```
Make sure to adjust paths and parameters in the scripts according to your hardware and dataset specifications.
This project is licensed under the MIT License – see the LICENSE file for details.
For questions or contributions, feel free to open an issue or contact us directly through GitHub.
- Portfolio: peteradel.netlify.app
- LinkedIn: linkedin.com/in/1peteradel
If you find this project useful or interesting, please give it a star! Your support helps improve the project and motivates further development.
🤍 Thank you for checking out CyberBrain! Happy training!