
Commit 3b4a531

Merge 'origin/master' into hipblas
2 parents: a1caa48 + 0b2da20

File tree

10 files changed (+862, -56 lines)

.devops/tools.sh (+1 -1)

@@ -23,7 +23,7 @@ elif [[ $arg1 == '--all-in-one' || $arg1 == '-a' ]]; then
             echo "Skip model quantization, it already exists: ${i/f16/q4_0}"
         else
             echo "Converting PTH to GGML: $i into ${i/f16/q4_0}..."
-            ./quantize "$i" "${i/f16/q4_0}" 2
+            ./quantize "$i" "${i/f16/q4_0}" q4_0
         fi
     done
 else

.gitignore (+1)

@@ -15,6 +15,7 @@ build-em/
 build-debug/
 build-release/
 build-static/
+build-cublas/
 build-no-accel/
 build-sanitize-addr/
 build-sanitize-thread/

README.md (+104 -19)

@@ -7,31 +7,27 @@
 
 Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
 
-**Warnings**
-
-- `Q4_2` and `Q4_3` are still in development. Do not expect any kind of backward compatibility until they are finalized
-
 **Hot topics:**
 
+- [New quantization methods](https://github.com/ggerganov/llama.cpp#quantization)
 - [Added LoRA support](https://github.com/ggerganov/llama.cpp/pull/820)
 - [Add GPU support to ggml](https://github.com/ggerganov/llama.cpp/discussions/915)
 - [Roadmap Apr 2023](https://github.com/ggerganov/llama.cpp/discussions/784)
 
 ## Description
 
-The main goal of llama.cpp is to run the llama model using 4-bit quantization on a MacBook.
+The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quantization on a MacBook
 
 - Plain C/C++ implementation without dependencies
 - Apple silicon first-class citizen - optimized via ARM NEON and Accelerate framework
 - AVX2 support for x86 architectures
 - Mixed F16 / F32 precision
-- 4-bit quantization support
+- 4-bit integer quantization support
 - Runs on the CPU
 
-This was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022) - I have no idea if it works correctly.
-Please do not make conclusions about the models based on the results from this implementation.
-For all I know, it can be completely wrong. This project is for educational purposes.
-New features will probably be added mostly through community contributions.
+The original implementation of `llama.cpp` was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022).
+Since then, the project has improved significantly thanks to many contributions. This project is for educational purposes and serves
+as the main playground for developing new features for the [ggml](https://github.com/ggerganov/ggml) library.
 
 **Supported platforms:**
 
@@ -167,15 +163,27 @@ cd llama.cpp
 
 ### Build
 
-Note: For Windows, CMake or Zig can be used.
+In order to build llama.cpp you have three different options.
 
-1. Use `make`
+- Using `make`:
+  - On Linux or MacOS:
 
-    ```bash
-    make
-    ```
+      ```bash
+      make
+      ```
 
-1. Use CMake
+  - On Windows:
+
+    1. Download the latest fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
+    2. Extract `w64devkit` on your pc.
+    3. Run `w64devkit.exe`.
+    4. Use the `cd` command to reach the `llama.cpp` folder.
+    5. From here you can run:
+        ```bash
+        make
+        ```
+
+- Using `CMake`:
 
     ```bash
     mkdir build
@@ -184,12 +192,71 @@ Note: For Windows, CMake or Zig can be used.
     cmake --build . --config Release
     ```
 
-1. Use Zig
+- Using `Zig`:
 
    ```bash
    zig build -Drelease-fast
    ```
 
+### BLAS Build
+
+Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). BLAS doesn't affect the normal generation performance. There are currently three different implementations of it:
+
+- Accelerate Framework:
+
+  This is only available on Mac PCs and it's enabled by default. You can just build using the normal instructions.
+
+- OpenBLAS:
+
+  This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.
+
+  - Using `make`:
+    - On Linux:
+      ```bash
+      make LLAMA_OPENBLAS=1
+      ```
+      Note: In order to build on Arch Linux with OpenBLAS support enabled you must edit the Makefile adding at the end of the line 105: `-lcblas`
+
+    - On Windows:
+
+      1. Download the latest fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
+      2. Download the latest version of [OpenBLAS for Windows](https://github.com/xianyi/OpenBLAS/releases).
+      3. Extract `w64devkit` on your pc.
+      4. From the OpenBLAS zip that you just downloaded copy `libopenblas.a`, located inside the `lib` folder, inside `w64devkit\x86_64-w64-mingw32\lib`.
+      5. From the same OpenBLAS zip copy the content of the `include` folder inside `w64devkit\x86_64-w64-mingw32\include`.
+      6. Run `w64devkit.exe`.
+      7. Use the `cd` command to reach the `llama.cpp` folder.
+      8. From here you can run:
+
+          ```bash
+          make LLAMA_OPENBLAS=1
+          ```
+
+  - Using `CMake` on Linux:
+
+      ```bash
+      mkdir build
+      cd build
+      cmake .. -DLLAMA_OPENBLAS=ON
+      cmake --build . --config Release
+      ```
+
+- cuBLAS
+
+  This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
+  - Using `make`:
+    ```bash
+    make LLAMA_CUBLAS=1
+    ```
+  - Using `CMake`:
+
+    ```bash
+    mkdir build
+    cd build
+    cmake .. -DLLAMA_CUBLAS=ON
+    cmake --build . --config Release
+    ```
+
 ### Prepare Data & Run
 
 ```bash
@@ -203,8 +270,8 @@ python3 -m pip install -r requirements.txt
 # convert the 7B model to ggml FP16 format
 python3 convert.py models/7B/
 
-# quantize the model to 4-bits (using method 2 = q4_0)
-./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
+# quantize the model to 4-bits (using q4_0 method)
+./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
 
 # run the inference
 ./main -m ./models/7B/ggml-model-q4_0.bin -n 128
@@ -223,6 +290,24 @@ As the models are currently fully loaded into memory, you will need adequate dis
 | 30B | 60 GB | 19.5 GB |
 | 65B | 120 GB | 38.5 GB |
 
+### Quantization
+
+Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
+
+Model | F16 | Q4_0 | Q4_1 | Q4_2 | Q4_3 | Q5_0 | Q5_1 | Q8_0
+-- | -- | -- | -- | -- | -- | -- | -- | --
+7B (ppl) | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0617 | 6.0139 | 5.9934 | 5.9571
+7B (size) | 13.0G | 4.0G | 4.8G | 4.0G | 4.8G | 4.4G | 4.8G | 7.1G
+7B (ms/tok @ 4th) | 128 | 56 | 61 | 84 | 91 | 91 | 95 | 75
+7B (ms/tok @ 8th) | 128 | 47 | 55 | 48 | 53 | 53 | 59 | 75
+7B (bpw) | 16.0 | 5.0 | 6.0 | 5.0 | 6.0 | 5.5 | 6.0 | 9.0
+-- | -- | -- | -- | -- | -- | -- | -- | --
+13B (ppl) | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.3234 | 5.2768 | 5.2582 | 5.2458
+13B (size) | 25.0G | 7.6G | 9.1G | 7.6G | 9.1G | 8.4G | 9.1G | 14G
+13B (ms/tok @ 4th) | 239 | 104 | 113 | 160 | 175 | 176 | 185 | 141
+13B (ms/tok @ 8th) | 240 | 85 | 99 | 97 | 114 | 108 | 117 | 147
+13B (bpw) | 16.0 | 5.0 | 6.0 | 5.0 | 6.0 | 5.5 | 6.0 | 9.0
+
 ### Interactive mode
 
 If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
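The bpw (bits per weight) rows in the new Quantization table follow directly from the block layouts: every format packs 32 weights into a fixed-size block. Below is a minimal sketch of that arithmetic, not part of the commit; the q5_0/q5_1/q8_0 sizes match the structs added in ggml-cuda.cu further down, while the q4_0/q4_1 layouts (fp32 scale, and min for q4_1) are assumed here for illustration.

```cpp
// Sketch (not from this commit): derive the bpw column from block sizes.
#include <cstdio>

int main() {
    const int weights_per_block = 32; // QK4_0 == QK5_0 == QK5_1 == QK8_0 == 32

    struct fmt { const char * name; int block_bytes; };
    const fmt formats[] = {
        {"q4_0", 4 + 16},         // fp32 delta + 32 * 4-bit quants              -> 5.0 bpw
        {"q4_1", 4 + 4 + 16},     // fp32 delta + fp32 min + 4-bit quants        -> 6.0 bpw
        {"q5_0", 2 + 4 + 16},     // fp16 delta + 32 packed 5th bits + nibbles   -> 5.5 bpw
        {"q5_1", 2 + 2 + 4 + 16}, // fp16 delta + fp16 min + 5th bits + nibbles  -> 6.0 bpw
        {"q8_0", 4 + 32},         // fp32 delta + 32 * 8-bit quants              -> 9.0 bpw
    };

    for (const fmt & f : formats) {
        printf("%s: %.1f bpw\n", f.name, 8.0 * f.block_bytes / weights_per_block);
    }
    return 0;
}
```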

examples/quantize/quantize.cpp (+26 -6)

@@ -2,8 +2,19 @@
 #include "llama.h"
 
 #include <cstdio>
+#include <map>
 #include <string>
 
+static const std::map<std::string, enum llama_ftype> LLAMA_FTYPE_MAP = {
+    {"q4_0", LLAMA_FTYPE_MOSTLY_Q4_0},
+    {"q4_1", LLAMA_FTYPE_MOSTLY_Q4_1},
+    {"q4_2", LLAMA_FTYPE_MOSTLY_Q4_2},
+    {"q4_3", LLAMA_FTYPE_MOSTLY_Q4_3},
+    {"q5_0", LLAMA_FTYPE_MOSTLY_Q5_0},
+    {"q5_1", LLAMA_FTYPE_MOSTLY_Q5_1},
+    {"q8_0", LLAMA_FTYPE_MOSTLY_Q8_0},
+};
+
 // usage:
 //  ./quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
 //
@@ -12,11 +23,9 @@ int main(int argc, char ** argv) {
 
     if (argc < 4) {
         fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type [nthread]\n", argv[0]);
-        fprintf(stderr, "  type = %d - q4_0\n", LLAMA_FTYPE_MOSTLY_Q4_0);
-        fprintf(stderr, "  type = %d - q4_1\n", LLAMA_FTYPE_MOSTLY_Q4_1);
-        fprintf(stderr, "  type = %d - q4_2\n", LLAMA_FTYPE_MOSTLY_Q4_2);
-        fprintf(stderr, "  type = %d - q4_3\n", LLAMA_FTYPE_MOSTLY_Q4_3);
-        fprintf(stderr, "  type = %d - q8_0\n", LLAMA_FTYPE_MOSTLY_Q8_0);
+        for (auto it = LLAMA_FTYPE_MAP.begin(); it != LLAMA_FTYPE_MAP.end(); it++) {
+            fprintf(stderr, "  type = \"%s\" or %d\n", it->first.c_str(), it->second);
+        }
         return 1;
     }
 
@@ -30,7 +39,18 @@ int main(int argc, char ** argv) {
     const std::string fname_inp = argv[1];
     const std::string fname_out = argv[2];
 
-    const enum llama_ftype ftype = (enum llama_ftype)atoi(argv[3]);
+    enum llama_ftype ftype;
+    if (argv[3][0] == 'q') {
+        auto it = LLAMA_FTYPE_MAP.find(argv[3]);
+        if (it == LLAMA_FTYPE_MAP.end()) {
+            fprintf(stderr, "%s: unknown ftype '%s'\n", __func__, argv[3]);
+            return 1;
+        }
+        ftype = it->second;
+    } else {
+        ftype = (enum llama_ftype)atoi(argv[3]);
+    }
+
     int nthread = argc > 4 ? atoi(argv[4]) : 0;
 
     const int64_t t_main_start_us = ggml_time_us();
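The net effect of this change is that the `type` argument now accepts either a quantization name such as `q5_1` or, for backward compatibility, the old numeric value. A standalone sketch of that same name-or-number dispatch follows; the enum values here are placeholders for illustration, not the real `llama_ftype` constants.

```cpp
// Sketch (not from this commit): name-or-number parsing as used by quantize.cpp.
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>

enum my_ftype { MY_FTYPE_Q4_0 = 2, MY_FTYPE_Q5_1 = 9 }; // placeholder values

static const std::map<std::string, my_ftype> MY_FTYPE_MAP = {
    {"q4_0", MY_FTYPE_Q4_0},
    {"q5_1", MY_FTYPE_Q5_1},
};

// Returns true and sets `out` if `arg` is a known name ("q5_1") or a number ("2").
static bool parse_ftype(const std::string & arg, my_ftype & out) {
    if (!arg.empty() && arg[0] == 'q') {
        auto it = MY_FTYPE_MAP.find(arg);
        if (it == MY_FTYPE_MAP.end()) {
            return false;
        }
        out = it->second;
    } else {
        out = (my_ftype) atoi(arg.c_str());
    }
    return true;
}

int main() {
    my_ftype t;
    if (parse_ftype("q5_1", t)) printf("q5_1 -> %d\n", t); // prints 9
    if (parse_ftype("2", t))    printf("2    -> %d\n", t); // prints 2
    if (!parse_ftype("q9_9", t)) printf("q9_9 -> unknown\n");
    return 0;
}
```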

ggml-cuda.cu (+85)

@@ -41,6 +41,23 @@ typedef struct {
 } block_q4_3;
 static_assert(sizeof(block_q4_3) == 2 * sizeof(ggml_fp16_t) + QK4_3 / 2, "wrong q4_3 block size/padding");
 
+#define QK5_0 32
+typedef struct {
+    __half  d;              // delta
+    uint8_t qh[4];          // 5-th bit of quants
+    uint8_t qs[QK5_0 / 2];  // nibbles / quants
+} block_q5_0;
+static_assert(sizeof(block_q5_0) == sizeof(ggml_fp16_t) + sizeof(uint32_t) + QK5_0 / 2, "wrong q5_0 block size/padding");
+
+#define QK5_1 32
+typedef struct {
+    __half   d;             // delta
+    __half   m;             // min
+    uint32_t qh;            // 5-th bit of quants
+    uint8_t  qs[QK5_1 / 2]; // nibbles / quants
+} block_q5_1;
+static_assert(sizeof(block_q5_1) == 2 * sizeof(ggml_fp16_t) + sizeof(uint32_t) + QK5_1 / 2, "wrong q5_1 block size/padding");
+
 #define QK8_0 32
 typedef struct {
     float d; // delta
@@ -142,6 +159,64 @@ static __global__ void dequantize_block_q4_3(const void * vx, float * y) {
     }
 }
 
+static __global__ void dequantize_block_q5_0(const void * vx, float * y) {
+    const block_q5_0 * x = (const block_q5_0 *) vx;
+
+    const int i = blockIdx.x;
+
+    const float d = x[i].d;
+
+    const uint8_t * pp = x[i].qs;
+
+    uint32_t qh;
+    memcpy(&qh, x[i].qh, sizeof(qh));
+
+    for (int l = 0; l < QK5_0; l += 2) {
+        const uint8_t vi = pp[l/2];
+
+        const int8_t vh0 = ((qh & (1 << (l + 0))) >> (l + 0)) << 4;
+        const int8_t vh1 = ((qh & (1 << (l + 1))) >> (l + 1)) << 4;
+
+        const int8_t vi0 = ((vi & 0xf) | vh0);
+        const int8_t vi1 = ((vi >> 4) | vh1);
+
+        const float v0 = (vi0 - 16)*d;
+        const float v1 = (vi1 - 16)*d;
+
+        y[i*QK5_0 + l + 0] = v0;
+        y[i*QK5_0 + l + 1] = v1;
+    }
+}
+
+static __global__ void dequantize_block_q5_1(const void * vx, float * y) {
+    const block_q5_1 * x = (const block_q5_1 *) vx;
+
+    const int i = blockIdx.x;
+
+    const float d = x[i].d;
+    const float m = x[i].m;
+
+    const uint8_t * pp = x[i].qs;
+
+    const uint32_t qh = x[i].qh;
+
+    for (int l = 0; l < QK5_1; l += 2) {
+        const uint8_t vi = pp[l/2];
+
+        const int8_t vh0 = ((qh & (1 << (l + 0))) >> (l + 0)) << 4;
+        const int8_t vh1 = ((qh & (1 << (l + 1))) >> (l + 1)) << 4;
+
+        const int8_t vi0 = (vi & 0xf) | vh0;
+        const int8_t vi1 = (vi >> 4) | vh1;
+
+        const float v0 = vi0*d + m;
+        const float v1 = vi1*d + m;
+
+        y[i*QK5_1 + l + 0] = v0;
+        y[i*QK5_1 + l + 1] = v1;
+    }
+}
+
 static __global__ void dequantize_block_q8_0(const void * vx, float * y) {
     const block_q8_0 * x = (const block_q8_0 *) vx;
 
@@ -178,6 +253,16 @@ void dequantize_row_q4_3_cuda(const void * vx, float * y, int k, cudaStream_t st
     dequantize_block_q4_3<<<nb, 1, 0, stream>>>(vx, y);
 }
 
+void dequantize_row_q5_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
+    const int nb = k / QK5_0;
+    dequantize_block_q5_0<<<nb, 1, 0, stream>>>(vx, y);
+}
+
+void dequantize_row_q5_1_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
+    const int nb = k / QK5_1;
+    dequantize_block_q5_1<<<nb, 1, 0, stream>>>(vx, y);
+}
+
 void dequantize_row_q8_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
     const int nb = k / QK8_0;
     dequantize_block_q8_0<<<nb, 1, 0, stream>>>(vx, y);
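For readers less familiar with the 5-bit formats: each q5_0 quant stores its low four bits in `qs` (two quants per byte) and its fifth bit in the packed `qh` word; the kernel reassembles the 5-bit value and re-centers it from [0, 31] to [-16, 15] before scaling by the fp16 delta. A plain CPU reference of the same math is sketched below for illustration only (not code from this commit; fp16 decoding is elided by passing `d` in as a float).

```cpp
// CPU reference sketch of the q5_0 dequantization math used by dequantize_block_q5_0.
#include <cstdint>
#include <cstring>

#define QK5_0 32

struct block_q5_0_ref {
    uint16_t d_bits;        // fp16 delta, kept as raw bits in this sketch
    uint8_t  qh[4];         // 5-th bit of each of the 32 quants, packed LSB-first
    uint8_t  qs[QK5_0 / 2]; // low 4 bits, two quants per byte
};

static void dequantize_block_q5_0_ref(const block_q5_0_ref * x, float d, float * y) {
    uint32_t qh;
    memcpy(&qh, x->qh, sizeof(qh)); // gather the 32 high bits into one word

    for (int l = 0; l < QK5_0; l += 2) {
        const uint8_t vi = x->qs[l/2];

        // fifth bit of each of the two quants packed in this byte, moved to bit 4
        const int8_t vh0 = ((qh & (1u << (l + 0))) >> (l + 0)) << 4;
        const int8_t vh1 = ((qh & (1u << (l + 1))) >> (l + 1)) << 4;

        // 5-bit unsigned value in [0, 31], then re-centered to [-16, 15]
        const int8_t vi0 = (vi & 0x0F) | vh0;
        const int8_t vi1 = (vi >>   4) | vh1;

        y[l + 0] = (vi0 - 16) * d;
        y[l + 1] = (vi1 - 16) * d;
    }
}
```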

ggml-cuda.h (+2)

@@ -77,6 +77,8 @@ void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t st
 void dequantize_row_q4_1_cuda(const void * vx, float * y, int k, cudaStream_t stream);
 void dequantize_row_q4_2_cuda(const void * vx, float * y, int k, cudaStream_t stream);
 void dequantize_row_q4_3_cuda(const void * vx, float * y, int k, cudaStream_t stream);
+void dequantize_row_q5_0_cuda(const void * vx, float * y, int k, cudaStream_t stream);
+void dequantize_row_q5_1_cuda(const void * vx, float * y, int k, cudaStream_t stream);
 void dequantize_row_q8_0_cuda(const void * vx, float * y, int k, cudaStream_t stream);
 
 #ifdef __cplusplus
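The new entry points follow the same calling convention as the existing ones: `vx` points to `k / 32` packed blocks in device memory, `y` receives `k` floats, and the launch is queued on `stream`. A minimal host-side usage sketch is shown below (not code from this commit; it assumes the q5_0 data has already been uploaded to the device).

```cpp
// Host-side usage sketch for the new q5_0 entry point.
// Assumes `dev_q5_0` already holds k/32 packed block_q5_0 structs in device memory.
#include <cuda_runtime.h>
#include "ggml-cuda.h"

void dequantize_q5_0_example(const void * dev_q5_0, int k /* multiple of 32 */) {
    float * dev_out = nullptr;
    cudaMalloc((void **) &dev_out, k * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // queues one thread block per q5_0 block, as the other dequantize_row_*_cuda calls do
    dequantize_row_q5_0_cuda(dev_q5_0, dev_out, k, stream);

    cudaStreamSynchronize(stream);
    // ... copy dev_out back to the host or feed it to the next kernel ...

    cudaStreamDestroy(stream);
    cudaFree(dev_out);
}
```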
