Convert

Model Conversion and Quantization

This document provides technical guidance on converting and quantizing large language models (LLMs) so they can be deployed and run efficiently on diverse hardware, including NVIDIA Jetson, AMD, Intel, and Qualcomm platforms. It covers both foundation LLMs and fine-tuned LLMs.

Model Sources

  • Foundation LLMs

  • Fine-tuned LLMs

Target Platforms and Formats:

The following target platforms and formats are covered:

  • GGUF: A format for efficient on-device execution of models, used primarily with the llama.cpp library (see the conversion sketch after this list).

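As a concrete example, GGUF conversion is typically performed with the conversion script that ships with llama.cpp. The sketch below invokes it from Python; the paths, model name, and exact script name are assumptions and may differ between llama.cpp versions.

```python
# Minimal sketch: convert a Hugging Face checkpoint to an unquantized GGUF file
# using the conversion script bundled with llama.cpp, invoked from Python.
# Paths and the script name are assumptions -- check your llama.cpp checkout.
import subprocess
from pathlib import Path

hf_model_dir = Path("models/my-finetuned-llm")        # hypothetical source checkpoint (HF format)
gguf_out = Path("models/my-finetuned-llm-f16.gguf")   # GGUF output, still at 16-bit precision

subprocess.run(
    [
        "python",
        "llama.cpp/convert_hf_to_gguf.py",  # script shipped with llama.cpp (name varies by version)
        str(hf_model_dir),
        "--outfile", str(gguf_out),
        "--outtype", "f16",                 # keep 16-bit weights here; quantize in a later step
    ],
    check=True,
)
```
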
Conversion Process:

  1. Select Source Model: Choose from the available foundation LLMs or fine-tuned LLMs.

  2. Model Quantization (optional): Apply quantization techniques to reduce model size and improve inference speed.

Quantization Parameters:

Quantization is the process of converting model weights from higher-precision floating-point formats (e.g., FP32 or FP16) to lower-precision representations (e.g., 8-bit or 4-bit integers). This can significantly reduce model size and memory use and improve inference speed, at the cost of a small reduction in accuracy.

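A minimal sketch of the idea, assuming NumPy is available: map FP32 weights onto 8-bit integers with a single per-tensor scale, then dequantize and measure the error. Real schemes (including the k-quants listed below) use per-block scales and additional tricks, so this is only an illustration of the size/accuracy trade-off.

```python
# Illustrative symmetric INT8 quantization of a weight matrix.
import numpy as np

weights = np.random.randn(4096, 4096).astype(np.float32)  # stand-in for one weight matrix

scale = np.abs(weights).max() / 127.0                      # symmetric per-tensor scale
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale                     # reconstruction used at inference time

print("FP32 size (MiB):", weights.nbytes / 2**20)          # ~64 MiB
print("INT8 size (MiB):", q.nbytes / 2**20)                # ~16 MiB, a 4x reduction
print("Mean abs error: ", np.abs(weights - dequant).mean())
```
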
Common quantization types include (a usage sketch follows this list):

  • q4_k_m: A 4-bit k-quant scheme that keeps a few tensors at higher precision; a common balance between file size and output quality.

  • q6_k: A 6-bit k-quant scheme; larger files, with quality closer to the unquantized model.

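The sketch below shows how these two types are typically produced with llama.cpp's quantization tool, again invoked from Python. The binary name and paths are assumptions; in recent llama.cpp builds the tool is usually called llama-quantize.

```python
# Minimal sketch: produce q4_k_m and q6_k GGUF files from an unquantized GGUF.
# Binary name and paths are assumptions -- adjust to your llama.cpp build.
import subprocess

src = "models/my-finetuned-llm-f16.gguf"  # hypothetical unquantized GGUF from the conversion step

for quant_type in ("Q4_K_M", "Q6_K"):
    out = f"models/my-finetuned-llm-{quant_type.lower()}.gguf"
    subprocess.run(["llama.cpp/llama-quantize", src, out, quant_type], check=True)
```
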
Instructions:

  1. Name: Enter a name for the converted model (letters, numbers, . - _ only).

  2. Description: Provide an optional description of the model (20-character limit).

  3. Source Model: Select the base model from the dropdown menu.

  4. Quantization Type: Select the desired quantization type from the dropdown menu.

  5. Convert: Click the "Convert" button to begin the conversion process.
