
Commit c9f457d

Improve readme by putting into separate files

1 parent cc87185, commit c9f457d

12 files changed: +351, -346 lines

README.md (+9, -338)

Large diffs are not rendered by default.

docs/FAQ.md (+3, -3)

@@ -3,7 +3,7 @@
 One can choose any huggingface model, just pass the name after `--base_model=`, but a `prompt_type` is required if we don't already have support.
 E.g. for vicuna models, a typical prompt_type is used and we support that already automatically for specific models,
 but if you pass `--prompt_type=instruct_vicuna` with any other Vicuna model, we'll use it assuming that is the correct prompt type.
-See models that are currently supported in this automatic way, and the same dictionary shows which prompt types are supported: [prompter](prompter.py).
+See models that are currently supported in this automatic way, and the same dictionary shows which prompt types are supported: [prompter](../prompter.py).

 ### Low-memory mode

@@ -82,7 +82,7 @@ but nothing prevents gradio from working without this. So a simple firewall blo

 ### Isolated LangChain Usage:

-See [tests/test_langchain_simple.py](tests/test_langchain_simple.py)
+See [tests/test_langchain_simple.py](../tests/test_langchain_simple.py)

 ### ValueError: ...offload....

@@ -115,7 +115,7 @@ etc.

 For GPT4All based models, require AVX2, unless one recompiles that project on your system. Until then, use llama.cpp models instead.

-So we recommend downloading models from [TheBloke](https://huggingface.co/TheBloke) that are version 3 quantized ggml files to work with latest llama.cpp. See main [README.md](README.md#cpu).
+So we recommend downloading models from [TheBloke](https://huggingface.co/TheBloke) that are version 3 quantized ggml files to work with latest llama.cpp. See main [README.md](README_CPU.md).

 The below example is for base LLaMa model, not instruct-tuned, so is not recommended for chatting. It just gives an example of how to quantize if you are an expert.

docs/LINKS.md (+1, -1)

@@ -86,7 +86,7 @@
 [DataSet Viewer](https://huggingface.co/datasets/viewer/?dataset=squad)<br />
 [Anthropic RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)<br />
 [WebGPT_Comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)<br />
-[Self_instruct](yizhongw/self_instruct)<br />
+[Self_instruct](https://github.com/yizhongw/self-instruct)<br />
 [20BChatModelData](https://github.com/togethercomputer/OpenDataHub)<br />

 ### Apache2/MIT/BSD-3 Summarization Data

docs/README_CLI.md (new file, +24)

### CLI chat

The CLI can be used instead of gradio by running `generate.py` in CLI mode for some base model, e.g.:
```bash
python generate.py --base_model=gptj --cli=True
```
For LangChain support, run:
```bash
python make_db.py --user_path=user_path --collection_name=UserData
python generate.py --base_model=gptj --cli=True --langchain_mode=UserData
```
with documents in the `user_path` folder, or directly run:
```bash
python generate.py --base_model=gptj --cli=True --langchain_mode=UserData --user_path=user_path
```
which will build the database the first time. One can also use any other model, e.g.:
```bash
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --cli=True --langchain_mode=UserData --user_path=user_path
```
or for WizardLM:
```bash
python generate.py --base_model='llama' --prompt_type=wizard2 --cli=True --langchain_mode=UserData --user_path=user_path
```

docs/README_CLIENT.md (new file, +38)

### Client APIs

A Gradio API and an OpenAI-compliant API are supported.

##### Gradio Client API

`generate.py` by default runs a gradio server, which also gives access to a client API using the gradio client. One can use it with h2oGPT, or independently of the h2oGPT repository, by installing an env:
```bash
conda create -n gradioclient -y
conda activate gradioclient
conda install python=3.10 -y
pip install gradio_client
python checkclient.py
```
then running client code:
```python
from gradio_client import Client
import ast

HOST_URL = "http://localhost:7860"
client = Client(HOST_URL)

# string of dict for input
kwargs = dict(instruction_nochat='Who are you?')
res = client.predict(str(dict(kwargs)), api_name='/submit_nochat_api')

# string of dict for output
response = ast.literal_eval(res)['response']
print(response)
```
For other ways to use the gradio client, see the example [test code](../client_test.py) or other tests in our [tests](https://github.com/h2oai/h2ogpt/blob/main/tests/test_client_calls.py).

Any element in [gradio_runner.py](../gradio_runner.py) with `api_name` defined can be accessed via the gradio client.
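
To discover which named endpoints a running server exposes, one can ask the gradio client itself. This is a minimal sketch using the standard `gradio_client` API (not part of h2oGPT); the host URL is whatever your server prints:
```python
from gradio_client import Client

client = Client("http://localhost:7860")  # replace with your server URL

# Prints all endpoints that have api_name set in gradio_runner.py,
# along with their expected parameters and return values.
client.view_api(all_endpoints=True)
```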

##### OpenAI Python Client Library

An OpenAI-compliant client is available. Refer to the [README](../client/README.md) for more details.
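
As a rough illustration only (not the documented h2oGPT client API), querying an OpenAI-compatible endpoint with the standard `openai` package looks roughly like the sketch below. The base URL, port, and model name are placeholders that depend on your deployment; consult the linked README for actual usage:
```python
import openai

openai.api_key = "EMPTY"                      # placeholder; your server may not require a real key
openai.api_base = "http://localhost:5000/v1"  # placeholder; point at your OpenAI-compatible endpoint

response = openai.ChatCompletion.create(
    model="h2oai/h2ogpt-oig-oasst1-512-6_9b",  # placeholder model name
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response["choices"][0]["message"]["content"])
```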

docs/README_CPU.md (new file, +59)

### CPU

CPU support is obtained after installing two optional requirements.txt files. This does not preclude GPU support, it just adds CPU support:

* Install base, langchain, GPT4All, and python LLaMa dependencies:
```bash
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
pip install -r requirements.txt  # only do if didn't already do for GPU support, since windows needs --extra-index-url line
pip install -r reqs_optional/requirements_optional_langchain.txt
python -m nltk.downloader all  # for supporting unstructured package
pip install -r reqs_optional/requirements_optional_gpt4all.txt
```
See [GPT4All](https://github.com/nomic-ai/gpt4all) for detailed installation instructions if any issues are encountered.

* Change the `.env_gpt4all` model name if desired.
```.env_gpt4all
model_path_llama=WizardLM-7B-uncensored.ggmlv3.q8_0.bin
model_path_gptj=ggml-gpt4all-j-v1.3-groovy.bin
model_name_gpt4all_llama=ggml-wizardLM-7B.q4_2.bin
```
For `gptj` and `gpt4all_llama`, you can choose a different model than our default choice by going to the GPT4All Model explorer [GPT4All-J compatible model](https://gpt4all.io/index.html). One does not need to download manually; the gpt4all package will download the model at runtime and put it into `.cache` like Hugging Face would. However, the `gptj` model often gives [no output](FAQ.md#gpt4all-not-producing-output), even outside h2oGPT.

So, for chatting, a better instruct fine-tuned LLaMa-based model for llama.cpp can be downloaded from [TheBloke](https://huggingface.co/TheBloke). For example, [13B WizardLM Quantized](https://huggingface.co/TheBloke/wizardLM-13B-1.0-GGML) or [7B WizardLM Quantized](https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML). TheBloke offers a variety of model types, quantization bit depths, and memory consumption levels. Choose what is best for your system's specs. However, be aware that LLaMa-based models are not [commercially viable](FAQ.md#commercial-viability).

For the 7B case, download [WizardLM-7B-uncensored.ggmlv3.q8_0.bin](https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/blob/main/WizardLM-7B-uncensored.ggmlv3.q8_0.bin) into a local path. Then set `model_path_llama` in `.env_gpt4all`, which is currently the default.

* Run generate.py

For LangChain support using documents in the `user_path` folder, run h2oGPT like:
```bash
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
```
See the [LangChain Readme](README_LangChain.md) for more details.
For no langchain support (still uses the LangChain package as a model wrapper), run as:
```bash
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None
```

When using `llama.cpp` based CPU models, for computers with low system RAM or slow CPUs, we recommend adding to `.env_gpt4all`:
```.env_gpt4all
use_mlock=False
n_ctx=1024
```
where `use_mlock=True` is the default to avoid slowness and `n_ctx=2048` is the default for large context handling. For computers with plenty of system RAM, we recommend adding to `.env_gpt4all`:
```.env_gpt4all
n_batch=1024
```
for faster handling. On some systems this has no strong effect, but on others it may increase speed quite a bit.

Also, for slow and low-memory systems, we recommend using a smaller embedding by passing to `generate.py`:
```bash
python generate.py ... --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2
```
where `...` means any other options one should add, like `--base_model` etc. This simpler embedding is about half the size of the default `instruct-large`, and so uses less disk, CPU memory, and GPU memory if using GPUs.
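
To sanity-check the smaller embedding model outside of h2oGPT, a minimal sketch using the standard `sentence-transformers` API (independent of how h2oGPT wires embeddings internally) is:
```python
from sentence_transformers import SentenceTransformer

# Downloads the small MiniLM model on first use and caches it under ~/.cache
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode a couple of sentences into 384-dimensional vectors
embeddings = model.encode(["What is h2oGPT?", "A private document Q/A system."])
print(embeddings.shape)  # (2, 384)
```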

See also [Low Memory](FAQ.md#low-memory-mode) for more information about low-memory recommendations.

docs/README_GPU.md (new file, +108)

### GPU

GPU via CUDA is supported via Hugging Face type models and LLaMa.cpp models.

#### GPU (CUDA)

For help installing the CUDA toolkit, see [CUDA Toolkit](INSTALL.md#installing-cuda-toolkit).

```bash
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=True
```
Then point your browser at http://0.0.0.0:7860 (linux) or http://localhost:7860 (windows/mac), or the public live URL printed by the server (disable the shared link with `--share=False`). For 4-bit or 8-bit support, older GPUs may require older bitsandbytes installed as `pip uninstall bitsandbytes -y ; pip install bitsandbytes==0.38.1`. For production uses, we recommend at least the 12B model, run as:
```
python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b --load_8bit=True
```
and one can use `--h2ocolors=False` to get soft blue-gray colors instead of H2O.ai colors. [Here](FAQ.md#what-envs-can-i-pass-to-control-h2ogpt) is a list of environment variables that can control some things in `generate.py`.
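
If 8-bit loading fails or models fall back to CPU, it can help to first confirm that PyTorch sees the GPU. This is a minimal check independent of h2oGPT, assuming the CUDA-enabled torch wheel installed above:
```python
import torch

# Should print True and your CUDA device name if the cu117 wheel installed correctly
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.version.cuda)  # CUDA version torch was built against, e.g. 11.7
```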

Note that if you download the model yourself and point `--base_model` to that location, you'll need to specify the `prompt_type` as well by running:
```
python generate.py --base_model=<user path> --load_8bit=True --prompt_type=human_bot
```
for some user path `<user path>`, and the `prompt_type` must match the model or a new version created in `prompter.py` or added in UI/CLI via `prompt_dict`.

For quickly using a private document collection for Q/A, place documents (PDFs, text, etc.) into a folder called `user_path` and run
```bash
pip install -r reqs_optional/requirements_optional_langchain.txt
python -m nltk.downloader all  # for supporting unstructured package
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=True --langchain_mode=UserData --user_path=user_path
```
For more ways to ingest on CLI and control, see the [LangChain Readme](README_LangChain.md). For example, for improved pdf handling via pymupdf (GPL) and support for docx, ppt, OCR, and ArXiv, run:
```bash
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
pip install -r reqs_optional/requirements_optional_langchain.gpllike.txt
```

For 4-bit support, the latest dev versions of transformers, accelerate, and peft are required, which can be installed by running:
```bash
pip uninstall peft transformers accelerate -y
pip install -r reqs_optional/requirements_optional_4bit.txt
```
where the uninstall is required in case, e.g., peft was installed from GitHub previously. Then when running generate, pass `--load_4bit=True`, which is only supported for certain [architectures](https://github.com/huggingface/peft#models-support-matrix) like GPT-NeoX-20B, GPT-J, LLaMa, etc.
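
As a rough way to verify the 4-bit stack works in your environment, here is a minimal sketch using standard transformers/bitsandbytes loading (not how h2oGPT itself loads models; the model name is just an example):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "h2oai/h2ogpt-oig-oasst1-512-6_9b"  # example; any supported architecture
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_4bit relies on bitsandbytes and the dev transformers/accelerate/peft installed above
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="auto")

inputs = tokenizer("Who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```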

Any other instruct-tuned base models can be used, including non-h2oGPT ones. [Larger models require more GPU memory](FAQ.md#larger-models-require-more-gpu-memory).

#### GPU with LLaMa

* Install langchain, GPT4All, and python LLaMa dependencies:
```bash
pip install -r reqs_optional/requirements_optional_langchain.txt
pip install -r reqs_optional/requirements_optional_gpt4all.txt
```
then compile llama-cpp-python with CUDA support:
```bash
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit  # maybe optional
pip uninstall -y llama-cpp-python
export LLAMA_CUBLAS=1
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
export CUDA_HOME=$HOME/miniconda3/envs/h2ogpt
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.68 --no-cache-dir --verbose
```
and uncomment `# n_gpu_layers=20` in `.env_gpt4all`. If one sees `/usr/bin/nvcc` mentioned in errors, that file needs to be removed, as it would likely conflict with the version installed for conda. Then run:
```bash
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
```
When loading, you should see something like:
```text
Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
load INSTRUCTOR_Transformer
max_seq_length 512
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loaded 0 sources for potentially adding to UserData
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti
Device 1: NVIDIA GeForce RTX 2080
llama.cpp: loading model from WizardLM-7B-uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 1792
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090 Ti) as main device
llama_model_load_internal: mem required = 4518.85 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 368 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 repeating layers to GPU
llama_model_load_internal: offloaded 20/35 layers to GPU
llama_model_load_internal: total VRAM used: 4470 MB
llama_new_context_with_model: kv self size = 896.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'wizard2', 'prompt_dict': {'promptA': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'promptB': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'PreInstruct': '\n### Instruction:\n', 'PreInput': None, 'PreResponse': '\n### Response:\n', 'terminate_response': ['\n### Response:\n'], 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '\n### Instruction:\n', 'botstr': '\n### Response:\n', 'generates_leading_space': False}}
Running on local URL: http://0.0.0.0:7860
Running on public URL: https://1ccb24d03273a3d085.gradio.live
```
and you should see GPU usage while the model is in use. Note that once `llama-cpp-python` is compiled to support CUDA, it no longer works for CPU mode, so one would have to reinstall it without the above options to recover CPU mode, or keep a separate h2oGPT env for CPU mode.
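
To confirm that the rebuilt `llama-cpp-python` actually offloads layers to the GPU outside of h2oGPT, a minimal sketch using the `llama_cpp` package directly (the model path is whatever ggml file you downloaded) is:
```python
from llama_cpp import Llama

# Loading with n_gpu_layers > 0 should print the same
# "offloading ... layers to GPU" / "BLAS = 1" lines as above.
llm = Llama(
    model_path="WizardLM-7B-uncensored.ggmlv3.q8_0.bin",  # path to your downloaded ggml model
    n_gpu_layers=20,
    n_ctx=1024,
)
out = llm("### Instruction:\nWho are you?\n### Response:\n", max_tokens=64)
print(out["choices"][0]["text"])
```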

docs/README_GRADIOUI.md (new file, +14)

### Gradio UI

`generate.py` by default runs a gradio server with a [UI (click for help with UI)](FAQ.md#explain-things-in-ui). Key benefits of the UI include:
* Save, export, and import chat histories, and undo or regenerate the last query-response pair
* Upload and control documents of various kinds for document Q/A
* Choose which specific collection to query, or just chat with the LLM
* Choose specific documents out of a collection for asking questions
* Side-by-side 2-model comparison view
* RLHF response score evaluation for every query-response

To see how we compare to other tools like PrivateGPT, see our comparisons at [h2oGPT LangChain Integration FAQ](README_LangChain.md#what-is-h2ogpts-langchain-integration-like).

We disable background uploads by disabling telemetry for Hugging Face, gradio, and chroma, and one can additionally avoid downloads (of fonts) by running `generate.py` with `--gradio_offline_level=2`. See [Offline Documentation](FAQ.md#offline-mode) for details.

docs/README_InferenceServers.md (+7, -4)

@@ -1,3 +1,7 @@
+# Inference Servers
+
+One can connect to a Hugging Face text generation inference server, gradio servers running h2oGPT, or OpenAI servers.
+
 ## Hugging Face Text Generation Inference Server-Client

 ### Local Install
@@ -44,7 +48,7 @@ NCCL_SHM_DISABLE=1 CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id h2

 ### Docker Install

-#### **Recommended** (instead of Local Install)
+#### **Recommended**

 ```bash
 # https://docs.docker.com/engine/install/ubuntu/
@@ -75,7 +79,7 @@ Reboot or run:
 ```bash
 newgrp docker
 ```
-in order to login to this user.
+in order to log in to this user.

 Then run:
 ```bash
@@ -200,7 +204,7 @@ If you run in bash and need to use an authentication for the Hugging Face text g
 ```
 i.e. 4 spaces between each IP, USER, and AUTH. USER should be the user and AUTH be the token.

-When bringing up `generate.py` with any inference server, one can set `REQUEST_TIMEOUT` ENV to smaller value than default of 60 seconds to get server up faster if have many inaccessible endpoints you don't mind skipping. E.g. set `REQUEST_TIMEOUT=5`. One can also choose the timeout overall for each chat turn using env `REQUEST_TIMEOUT_FAST` that defaults to 10 seconds.
+When bringing up `generate.py` with any inference server, one can set the `REQUEST_TIMEOUT` ENV to a smaller value than the default of 60 seconds to get the server up faster if one has many inaccessible endpoints one doesn't mind skipping. E.g. set `REQUEST_TIMEOUT=5`. One can also choose the overall timeout for each chat turn using the env `REQUEST_TIMEOUT_FAST`, which defaults to 10 seconds.

 Note: The client API calls for chat APIs (i.e. `instruction` type for `instruction`, `instruction_bot`, `instruction_bot_score`, and similar for `submit` and `retry` types) require managing all chat sessions via API. However, the `nochat` APIs only use the first model in the list of chats or model_lock list.
@@ -210,7 +214,6 @@ Note: The client API calls for chat APIs (i.e. `instruction` type for `instructi
 ### System info from gradio server

 ```python
-import os
 import json
 from gradio_client import Client
 ADMIN_PASS = ''

docs/README_MACOS.md (new file, +15)

### MACOS

First install [Rust](https://www.geeksforgeeks.org/how-to-install-rust-in-macos/):
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
Enter a new shell and test: `rustc --version`

When running a Mac with Intel hardware (not M1), you may run into `clang: error: the clang compiler does not support '-march=native'` during pip install.
If so, set your archflags during pip install, e.g.: `ARCHFLAGS="-arch x86_64" pip3 install -r requirements.txt`

If you encounter an error while building a wheel during the `pip install` process, you may need to install a C++ compiler on your computer.

Now go back to the normal [CPU](README_CPU.md) installation.

docs/README_WHEEL.md (new file, +21)

#### Python Wheel

The wheel adds all dependencies, including optional dependencies like 4-bit and flash-attention. To build, do:
```bash
python setup.py sdist bdist_wheel
```
To install the default dependencies, do:
```bash
pip install dist/h2ogpt-0.1.0-py3-none-any.whl
```
replacing `0.1.0` with the actual version built if there is more than one.
To install additional dependencies, for instance for faiss on GPU, do:
```bash
pip install dist/h2ogpt-0.1.0-py3-none-any.whl
pip install dist/h2ogpt-0.1.0-py3-none-any.whl[FAISS]
```
Once the `whl` file is installed, two new scripts will be added to the current environment: `h2ogpt_finetune` and `h2ogpt_generate`.

The wheel is not required to use h2oGPT locally from the repo, but it makes h2oGPT portable with all required dependencies.

See [setup.py](../setup.py) for controlling other options via `extras_require`.
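
For context, a setuptools `extras_require` block generally looks like the hypothetical sketch below (illustrative only; the actual extra names and pinned packages live in the repo's [setup.py](../setup.py)):
```python
from setuptools import setup, find_packages

setup(
    name="h2ogpt",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["torch", "transformers"],          # default dependencies (illustrative)
    extras_require={
        "FAISS": ["faiss-gpu"],                          # hypothetical extra enabling `pip install h2ogpt[FAISS]`
        "4bit": ["bitsandbytes", "accelerate", "peft"],  # hypothetical extra for 4-bit support
    },
)
```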
