### GPU

GPU support via CUDA is available for both Hugging Face models and llama.cpp models.

#### GPU (CUDA)

For help installing the CUDA Toolkit, see [CUDA Toolkit](INSTALL.md#installing-cuda-toolkit).
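
As a quick sanity check before installing anything, one can confirm that the NVIDIA driver and CUDA compiler are visible (a minimal sketch; exact versions vary by system, but the compiler should roughly match the `cu117` wheels used below):
```bash
# Show the NVIDIA driver version and visible GPUs
nvidia-smi
# Show the CUDA compiler version
nvcc --version
```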

```bash
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=True
```
Then point your browser at http://0.0.0.0:7860 (Linux) or http://localhost:7860 (Windows/Mac), or at the public live URL printed by the server (disable the shared link with `--share=False`). For 4-bit or 8-bit support, older GPUs may require an older bitsandbytes, installed via `pip uninstall bitsandbytes -y ; pip install bitsandbytes==0.38.1`. For production use, we recommend at least the 12B model, run as:
```bash
python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b --load_8bit=True
```
One can also pass `--h2ocolors=False` to get soft blue-gray colors instead of the H2O.ai colors. [Here](FAQ.md#what-envs-can-i-pass-to-control-h2ogpt) is a list of environment variables that control some aspects of `generate.py`.
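
As a hypothetical example combining the flags documented above (the 12B model, 8-bit loading, soft colors, and no public share link), a launch could look like:
```bash
python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b \
    --load_8bit=True \
    --h2ocolors=False \
    --share=False
```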

Note that if you download the model yourself and point `--base_model` to that location, you also need to specify the `prompt_type` by running:
```bash
python generate.py --base_model=<user path> --load_8bit=True --prompt_type=human_bot
```
for some user path `<user path>`, where the `prompt_type` must match the model, a new prompt type created in `prompter.py`, or one added in the UI/CLI via `prompt_dict`.

For quickly using a private document collection for Q/A, place documents (PDFs, text files, etc.) into a folder called `user_path` and run:
```bash
pip install -r reqs_optional/requirements_optional_langchain.txt
python -m nltk.downloader all  # for supporting the unstructured package
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=True --langchain_mode=UserData --user_path=user_path
```
For more ways to ingest documents from the CLI and to control ingestion, see the [LangChain Readme](README_LangChain.md). For example, for improved PDF handling via PyMuPDF (GPL) and support for docx, ppt, OCR, and ArXiv, run:
```bash
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
pip install -r reqs_optional/requirements_optional_langchain.gpllike.txt
```

For 4-bit support, the latest dev versions of transformers, accelerate, and peft are required; these can be installed by running:
```bash
pip uninstall peft transformers accelerate -y
pip install -r reqs_optional/requirements_optional_4bit.txt
```
The uninstall is required in case, e.g., peft was previously installed from GitHub. Then, when running `generate.py`, pass `--load_4bit=True`, which is only supported for certain [architectures](https://github.com/huggingface/peft#models-support-matrix) like GPT-NeoX-20B, GPT-J, LLaMa, etc.
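
As a sketch, a 4-bit launch of one of the h2oGPT models used above (assuming its architecture is in the support matrix and the dev installs above succeeded) would be:
```bash
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_4bit=True
```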

Any other instruct-tuned base models can be used, including non-h2oGPT ones. [Larger models require more GPU memory](FAQ.md#larger-models-require-more-gpu-memory).

#### GPU with LLaMa

* Install LangChain, GPT4All, and Python LLaMa dependencies:
```bash
pip install -r reqs_optional/requirements_optional_langchain.txt
pip install -r reqs_optional/requirements_optional_gpt4all.txt
```
then compile llama-cpp-python with CUDA support:
```bash
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit  # maybe optional
pip uninstall -y llama-cpp-python
export LLAMA_CUBLAS=1
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
export CUDA_HOME=$HOME/miniconda3/envs/h2ogpt
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.68 --no-cache-dir --verbose
```
and uncomment `# n_gpu_layers=20` in `.env_gpt4all`, e.g. as sketched just below. If `/usr/bin/nvcc` is mentioned in errors, remove that file, as it would likely conflict with the version installed for conda.
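
A sketch of that edit from the shell, assuming `.env_gpt4all` contains the literal commented line `# n_gpu_layers=20` (editing the file by hand works just as well):
```bash
# Uncomment the GPU offload setting so 20 layers are placed on the GPU;
# a backup of the original file is kept as .env_gpt4all.bak
sed -i.bak 's/^# n_gpu_layers=20/n_gpu_layers=20/' .env_gpt4all
```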
Then run:
```bash
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
```
when loading, you should see something like:
```text
Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
load INSTRUCTOR_Transformer
max_seq_length 512
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loaded 0 sources for potentially adding to UserData
ggml_init_cublas: found 2 CUDA devices:
 Device 0: NVIDIA GeForce RTX 3090 Ti
 Device 1: NVIDIA GeForce RTX 2080
llama.cpp: loading model from WizardLM-7B-uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 1792
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090 Ti) as main device
llama_model_load_internal: mem required = 4518.85 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 368 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 repeating layers to GPU
llama_model_load_internal: offloaded 20/35 layers to GPU
llama_model_load_internal: total VRAM used: 4470 MB
llama_new_context_with_model: kv self size = 896.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'wizard2', 'prompt_dict': {'promptA': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'promptB': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'PreInstruct': '\n### Instruction:\n', 'PreInput': None, 'PreResponse': '\n### Response:\n', 'terminate_response': ['\n### Response:\n'], 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '\n### Instruction:\n', 'botstr': '\n### Response:\n', 'generates_leading_space': False}}
Running on local URL: http://0.0.0.0:7860
Running on public URL: https://1ccb24d03273a3d085.gradio.live
```
and you should see GPU usage during generation. Note that once `llama-cpp-python` is compiled with CUDA support, it no longer works in CPU mode, so one would have to reinstall it without the above options to recover CPU mode, or keep a separate h2oGPT environment for CPU mode.
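
A sketch of that reinstall, assuming one wants to return this environment to CPU-only `llama-cpp-python` (the pinned version matches the install above):
```bash
pip uninstall -y llama-cpp-python
# Reinstall without the cuBLAS CMake flags to restore CPU-only operation
pip install llama-cpp-python==0.1.68 --no-cache-dir --verbose
```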