Full Parameter
Last updated
After selecting Add Task and assigning a task name, the training workflow begins. The process is divided into four main steps: Settings, Training, Validation, and Finished.
In this step, users configure the model and dataset for training, as well as define the training parameters.
Select the model to fine-tune from the list.
If all models are grayed out, they have not yet been downloaded. To download a model, use the Model Management interface to download the desired model.
Choose an existing dataset from the list.
If the dataset list is empty, create a dataset first: in the Dataset Management interface, you can upload a dataset or use the LLM to generate one from your input files (e.g., PDF or Word documents).
Configure the following parameters:
Batch Size
Meaning: The number of data samples the model processes at a time during training. It’s like studying 10 pages of a book in one sitting; those 10 pages are the Batch Size.
Key Considerations:
Too small: The model may not learn efficiently, and training can become unstable.
Too large: It requires more memory (e.g., GPU VRAM) and might slow down training.
Total Batch Size
Meaning: If you’re training with multiple GPUs, this is the sum of Batch Sizes across all GPUs. For example, if each GPU processes 32 samples and you have 4 GPUs, the Total Batch Size is 32 × 4 = 128.
Key Considerations:
Overall size impacts learning: Larger sizes can stabilize training but might require adjustments to other parameters like the Learning Rate.
Maximum Sequence Length
Meaning: The maximum number of tokens (words, subwords, or characters) the model processes in one input. Think of it as the maximum length of a sentence or paragraph the model can read at once.
Key Considerations:
Longer sequences: Provide more context but require more memory and computational power.
Shorter sequences: Are faster to process but may lose important context.
Learning Rate
Meaning: How much the model adjusts its parameters with each training step. It’s like deciding how big of a step you take when walking toward your goal.
Key Considerations:
Too high: The model might overshoot the optimal solution, leading to instability.
Too low: Training becomes slow, and the model might get stuck at a suboptimal solution.
Epoch
Meaning: One complete pass through the entire training dataset. If you have a book with 100 pages, reading all 100 pages once is one Epoch.
Key Considerations:
Too few epochs: The model may underfit (not learn enough from the data).
Too many epochs: The model may overfit (memorize the data instead of generalizing well).
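To see how these parameters fit together, here is a minimal sketch in plain Python, using hypothetical values chosen for illustration (not defaults of this product), that derives the resulting training schedule:

```python
import math

# Hypothetical configuration values for illustration only.
batch_size_per_gpu = 32      # Batch Size: samples each GPU processes per step
num_gpus = 4
total_batch_size = batch_size_per_gpu * num_gpus  # 32 x 4 = 128
dataset_size = 10_000        # number of samples in the training dataset
epochs = 3                   # complete passes through the dataset
learning_rate = 2e-5         # step size for each parameter update

# Steps needed to see every sample once (one Epoch), and overall.
steps_per_epoch = math.ceil(dataset_size / total_batch_size)
total_steps = steps_per_epoch * epochs

print(total_batch_size)   # 128
print(steps_per_epoch)    # ceil(10000 / 128) = 79
print(total_steps)        # 79 * 3 = 237
```

This also shows why Total Batch Size and Learning Rate interact: doubling the number of GPUs halves the steps per epoch, so each step covers more data and the learning rate may need adjusting.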
Once all configurations are complete, proceed to the next step by clicking Start Training.
A progress bar indicates the real-time status of training, along with the elapsed time (e.g., 15h 17m 13s). Users can click the Stop button to immediately halt the training process.
GPU Utilization: Includes real-time and maximum values for GPU usage (e.g., 0% out of 100%).
VRAM Usage: Indicates both current (e.g., 0.89%) and peak (e.g., 93%) memory usage.
Temperature: Tracks the current (e.g., 31°C) and peak (e.g., 71°C) GPU temperature.
Fan Speed: Displays real-time fan speed as a percentage of maximum capacity (e.g., 30%) and its peak (e.g., 40%).
CPU Utilization: Shows real-time (e.g., 0%) and peak (e.g., 93%) CPU usage.
Memory Utilization: Shows current (e.g., 2%) and peak (e.g., 17%) system memory usage.
AI SSD Usage: Monitors SSD usage specifically allocated for AI operations (e.g., 21%) and its peak (e.g., 20%).
A dynamic graph tracks the loss rate over epochs and prominently highlights loss improvements. This visualization allows users to quickly assess training effectiveness and convergence trends.
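Assuming per-step losses are available, the per-epoch points such a graph plots could be derived as below. This is a hypothetical sketch with made-up loss values, not this tool's code:

```python
# Hypothetical per-step training losses, grouped into epochs of 4 steps each.
step_losses = [2.0, 1.5, 1.5, 1.0,    # epoch 1
               1.0, 1.0, 0.5, 0.5,    # epoch 2
               0.5, 0.5, 0.25, 0.25]  # epoch 3
steps_per_epoch = 4

# Average the step losses within each epoch to get one point per epoch.
epoch_losses = [
    sum(step_losses[i:i + steps_per_epoch]) / steps_per_epoch
    for i in range(0, len(step_losses), steps_per_epoch)
]
print(epoch_losses)  # [1.5, 0.75, 0.375]
```

A steadily decreasing sequence like this is what a healthy convergence trend looks like on the graph; a flat or rising tail over later epochs would suggest underfitting or overfitting, as described under Epoch above.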
Real-time logs provide granular information about training iterations, including specific timestamps and operations performed (e.g., Forward, Backward, Save Model_Checkpoint).
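As an illustration of consuming such logs programmatically, the sketch below splits a log line into a timestamp and an operation. The exact log format shown here is an assumption for the example, not documented behavior:

```python
import re

# Assumed line format for illustration: "<date> <time> <operation>".
LOG_PATTERN = re.compile(r"^(?P<timestamp>\S+ \S+)\s+(?P<operation>.+)$")

lines = [
    "2024-01-01 12:00:01 Forward",
    "2024-01-01 12:00:02 Backward",
    "2024-01-01 12:00:15 Save Model_Checkpoint",
]

events = []
for line in lines:
    match = LOG_PATTERN.match(line)
    if match:
        events.append((match.group("timestamp"), match.group("operation")))

print(events[-1])  # ('2024-01-01 12:00:15', 'Save Model_Checkpoint')
```

Parsing the downloaded log files this way makes it easy to, for example, count checkpoint saves or measure the time between Forward and Backward operations.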
Validation Overview
Model Validation offers tools for side-by-side comparison of multiple Large Language Models (LLMs), including fine-tuned models. It evaluates model performance across different training stages (e.g., epochs) by analyzing responses to a given set of questions.
For detailed instructions and examples on using this feature effectively, see the Validation operation page, which provides step-by-step procedures, best practices, and troubleshooting tips.
Upon successful model validation and quantization, you will be redirected to either the model repository (ollama) or your designated workspace.
Click the icon next to "Model" to open the Model Management window.
Click the icon next to "Dataset" to open the Dataset Management window.
Users can click the icon next to the Log section to instantly download detailed log files.