
Commit 4bb65f2

kaiyux, megha95, and Shixiaowei02 authored

Update TensorRT-LLM (NVIDIA#1274)

* Update TensorRT-LLM

---------

Co-authored-by: meghagarwal <[email protected]>
Co-authored-by: Shixiaowei02 <[email protected]>

1 parent 728cc00 commit 4bb65f2

File tree: 488 files changed, +23178 -10463 lines


.clang-format (+1)

@@ -59,6 +59,7 @@ PenaltyBreakString: 1000
 PenaltyExcessCharacter: 1000000
 PenaltyReturnTypeOnItsOwnLine: 60
 PointerAlignment: Left
+QualifierAlignment: Right
 ReflowComments: true
 SeparateDefinitionBlocks: Always
 SortIncludes: CaseSensitive
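
The added line switches clang-format from its default (leaving qualifiers where they are written) to right-aligned qualifiers, the "east const" style: `const` and `volatile` are placed to the right of the type they qualify. A minimal illustration of what the formatter now rewrites (sample code, not from the commit):

```
#include <string>

// Written with the qualifier on the left, clang-format with
// QualifierAlignment: Right rewrites this declaration ...
//     void greet(const std::string& name);
// ... into the east-const form:
void greet(std::string const& name);

// Both declare the same parameter type; only the token order changes.
```

The `catch (std::exception const& e)` hunk in benchmarks/cpp/bertBenchmark.cpp at the end of this commit is exactly this rule applied to existing code.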

.gitignore (+10)

@@ -17,6 +17,16 @@ venv/
 .local/
 .hypothesis/
 .idea/
+dump*/
+.trt-internal
+*.dot
+*.prof
+*.log
+*.pkl
+*.hdf5
+*.lock
+config.json
+/*.svg
 cpp/cmake-build-*
 cpp/.ccache/
 tensorrt_llm/libs

README.md (+3)

@@ -355,6 +355,9 @@ however, that it is recommended to use the C++ version.

 ## Troubleshooting

+* If you encounter accuracy issues in the generated text, you may want to increase
+  the internal precision in the attention layer. For that, pass `--context_fmha_fp32_acc enable` to
+  `trtllm-build`.

 * It's recommended to add options `--shm-size=1g --ulimit memlock=-1` to the
   docker or nvidia-docker run command. Otherwise you may see NCCL errors when
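
The new troubleshooting entry concerns the accumulator used inside the fused attention kernels: with FP16 models the attention dot products are, by default, accumulated at reduced precision, and `--context_fmha_fp32_acc enable` widens that accumulation to FP32. As a rough, standalone sketch of why accumulator width matters (not TensorRT-LLM code; `float` and `double` stand in for FP16 and FP32):

```
// Standalone sketch: summing many small terms, as attention does when
// reducing over a long context. The narrow accumulator loses the later
// contributions once its running value dwarfs each addend; the wide one
// does not.
#include <cstdio>

int main()
{
    int const n = 1 << 24;      // ~16.8M small contributions
    float const term = 1.0e-4f; // exact total: n * term ~= 1677.72

    float accNarrow = 0.0f;     // stand-in for an FP16 accumulator
    double accWide = 0.0;       // stand-in for an FP32 accumulator
    for (int i = 0; i < n; ++i)
    {
        accNarrow += term;
        accWide += term;
    }
    std::printf("narrow accumulator: %.4f\n", accNarrow); // drifts, then stalls near 2048
    std::printf("wide accumulator:   %.4f\n", accWide);   // stays close to 1677.72
    return 0;
}
```

FP16 carries only a 10-bit mantissa, so the same effect bites far earlier than `float` does here, which is what can surface as degraded generated text on long prompts.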

benchmarks/cpp/README.md (-5)

@@ -39,7 +39,6 @@ Take GPT-350M as an example for single GPU

 ```
 ./benchmarks/gptSessionBenchmark \
-    --model gpt_350m \
     --engine_dir "../../benchmarks/gpt_350m/" \
     --batch_size "1" \
     --input_output_len "60,20"
@@ -50,7 +49,6 @@ Take GPT-350M as an example for single GPU
 Take GPT-175B as an example for multiple GPUs
 ```
 mpirun -n 8 ./benchmarks/gptSessionBenchmark \
-    --model gpt_175b \
     --engine_dir "../../benchmarks/gpt_175b/" \
     --batch_size "1" \
     --input_output_len "60,20"
@@ -125,7 +123,6 @@ cd cpp/build
 Take GPT-350M as an example for single GPU V1 batching
 ```
 ./benchmarks/gptManagerBenchmark \
-    --model gpt \
     --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
     --type V1 \
     --dataset ../../benchmarks/cpp/preprocessed_dataset.json
@@ -135,7 +132,6 @@ Take GPT-350M as an example for single GPU V1 batching
 Take GPT-350M as an example for 2-GPU inflight batching
 ```
 mpirun -n 2 ./benchmarks/gptManagerBenchmark \
-    --model gpt \
     --engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
     --type IFB \
     --dataset ../../benchmarks/cpp/preprocessed_dataset.json
@@ -165,7 +161,6 @@ Given a `static_emulated_batch_size` of `n` the server will wait for `n` requests
 Take GPT-350M as an example for single GPU with static batching
 ```
 ./benchmarks/gptManagerBenchmark \
-    --model gpt \
     --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
     --type IFB \
     --static_emulated_batch_size 32 \

benchmarks/cpp/bertBenchmark.cpp (+1 -1)

@@ -237,7 +237,7 @@ int main(int argc, char* argv[])
         benchmarkBert(result["model"].as<std::string>(), result["engine_dir"].as<std::string>(), batchSizes, inLens,
             logger, result["warm_up"].as<int>(), result["num_runs"].as<int>(), result["duration"].as<int>());
     }
-    catch (const std::exception& e)
+    catch (std::exception const& e)
     {
         TLLM_LOG_ERROR(e.what());
         return 1;
