### GPU

GPU support via CUDA is available for both Hugging Face models and llama.cpp models.

#### GPU (CUDA)

For help installing the CUDA Toolkit, see [CUDA Toolkit](INSTALL.md#installing-cuda-toolkit).
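
As a quick sanity check before installing anything, one can confirm that the NVIDIA driver and CUDA compiler are visible (a minimal sketch; exact versions vary by system, but the compiler should roughly match the `cu117` wheels used below):
```bash
# Show the NVIDIA driver version and visible GPUs
nvidia-smi
# Show the CUDA compiler version
nvcc --version
```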

```bash
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=True
```
Then point your browser at http://0.0.0.0:7860 (Linux) or http://localhost:7860 (Windows/Mac), or at the public live URL printed by the server (disable the shared link with `--share=False`). For 4-bit or 8-bit support, older GPUs may require an older bitsandbytes, installed via `pip uninstall bitsandbytes -y ; pip install bitsandbytes==0.38.1`. For production use, we recommend at least the 12B model, run as:
```bash
python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b --load_8bit=True
```
One can also pass `--h2ocolors=False` to get soft blue-gray colors instead of the H2O.ai colors. [Here](FAQ.md#what-envs-can-i-pass-to-control-h2ogpt) is a list of environment variables that control some aspects of `generate.py`.
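
As a hypothetical example combining the flags documented above (the 12B model, 8-bit loading, soft colors, and no public share link), a launch could look like:
```bash
python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b \
    --load_8bit=True \
    --h2ocolors=False \
    --share=False
```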

Note that if you download the model yourself and point `--base_model` to that location, you also need to specify the `prompt_type` by running:
```bash
python generate.py --base_model=<user path> --load_8bit=True --prompt_type=human_bot
```
for some user path `<user path>`, where the `prompt_type` must match the model, a new prompt type created in `prompter.py`, or one added in the UI/CLI via `prompt_dict`.

For quickly using a private document collection for Q/A, place documents (PDFs, text files, etc.) into a folder called `user_path` and run:
```bash
pip install -r reqs_optional/requirements_optional_langchain.txt
python -m nltk.downloader all  # for supporting the unstructured package
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=True --langchain_mode=UserData --user_path=user_path
```
For more ways to ingest documents from the CLI and to control ingestion, see the [LangChain Readme](README_LangChain.md). For example, for improved PDF handling via PyMuPDF (GPL) and support for docx, ppt, OCR, and ArXiv, run:
```bash
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
pip install -r reqs_optional/requirements_optional_langchain.gpllike.txt
```

For 4-bit support, the latest dev versions of transformers, accelerate, and peft are required; these can be installed by running:
```bash
pip uninstall peft transformers accelerate -y
pip install -r reqs_optional/requirements_optional_4bit.txt
```
The uninstall is required in case, e.g., peft was previously installed from GitHub. Then, when running `generate.py`, pass `--load_4bit=True`, which is only supported for certain [architectures](https://github.com/huggingface/peft#models-support-matrix) like GPT-NeoX-20B, GPT-J, LLaMa, etc.
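
As a sketch, a 4-bit launch of one of the h2oGPT models used above (assuming its architecture is in the support matrix and the dev installs above succeeded) would be:
```bash
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_4bit=True
```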

Any other instruct-tuned base models can be used, including non-h2oGPT ones. [Larger models require more GPU memory](FAQ.md#larger-models-require-more-gpu-memory).

#### GPU with LLaMa

* Install LangChain, GPT4All, and Python LLaMa dependencies:
```bash
pip install -r reqs_optional/requirements_optional_langchain.txt
pip install -r reqs_optional/requirements_optional_gpt4all.txt
```
then compile llama-cpp-python with CUDA support:
```bash
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit  # maybe optional
pip uninstall -y llama-cpp-python
export LLAMA_CUBLAS=1
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
export CUDA_HOME=$HOME/miniconda3/envs/h2ogpt
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.68 --no-cache-dir --verbose
```
and uncomment `# n_gpu_layers=20` in `.env_gpt4all`, e.g. as sketched just below. If `/usr/bin/nvcc` is mentioned in errors, remove that file, as it would likely conflict with the version installed for conda.
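
A sketch of that edit from the shell, assuming `.env_gpt4all` contains the literal commented line `# n_gpu_layers=20` (editing the file by hand works just as well):
```bash
# Uncomment the GPU offload setting so 20 layers are placed on the GPU;
# a backup of the original file is kept as .env_gpt4all.bak
sed -i.bak 's/^# n_gpu_layers=20/n_gpu_layers=20/' .env_gpt4all
```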
Then run:
```bash
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
```
when loading, you should see something like:
```text
Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
load INSTRUCTOR_Transformer
max_seq_length 512
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loaded 0 sources for potentially adding to UserData
ggml_init_cublas: found 2 CUDA devices:
 Device 0: NVIDIA GeForce RTX 3090 Ti
 Device 1: NVIDIA GeForce RTX 2080
llama.cpp: loading model from WizardLM-7B-uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 1792
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090 Ti) as main device
llama_model_load_internal: mem required = 4518.85 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 368 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 repeating layers to GPU
llama_model_load_internal: offloaded 20/35 layers to GPU
llama_model_load_internal: total VRAM used: 4470 MB
llama_new_context_with_model: kv self size = 896.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'wizard2', 'prompt_dict': {'promptA': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'promptB': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'PreInstruct': '\n### Instruction:\n', 'PreInput': None, 'PreResponse': '\n### Response:\n', 'terminate_response': ['\n### Response:\n'], 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '\n### Instruction:\n', 'botstr': '\n### Response:\n', 'generates_leading_space': False}}
Running on local URL: http://0.0.0.0:7860
Running on public URL: https://1ccb24d03273a3d085.gradio.live
```
and you should see GPU usage during generation. Note that once `llama-cpp-python` is compiled with CUDA support, it no longer works in CPU mode, so one would have to reinstall it without the above options to recover CPU mode, or keep a separate h2oGPT environment for CPU mode.
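
A sketch of that reinstall, assuming one wants to return this environment to CPU-only `llama-cpp-python` (the pinned version matches the install above):
```bash
pip uninstall -y llama-cpp-python
# Reinstall without the cuBLAS CMake flags to restore CPU-only operation
pip install llama-cpp-python==0.1.68 --no-cache-dir --verbose
```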