
CUBLAS compilation issue on 4090 with make : "Unsupported gpu architecture 'compute_89'" Works with cmake or without -arch=native #1420

Closed
TheBloke opened this issue May 12, 2023 · 7 comments

Comments

@TheBloke
Contributor

TheBloke commented May 12, 2023

Current Behavior

When building llama.cpp with LLAMA_CUBLAS=1 make on a system with a 4090 or L40 GPU, I get the following failure:
nvcc fatal : Unsupported gpu architecture 'compute_89'

However, if I remove -arch=native from the Makefile, it compiles fine.

Do I just need to update the CUDA toolkit? But then why does it compile and work without that flag? CUDA 11.x is listed as compatible with compute 8.x.
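For reference, a quick way to compare what the installed nvcc supports with what the GPU reports (the nvidia-smi compute_cap query assumes a reasonably recent driver):

# virtual architectures this nvcc can target
nvcc --list-gpu-arch
# compute capability of the installed GPU (8.9 for a 4090 or L40)
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

If compute_89 is missing from the first list, -arch=native will fail exactly as shown below.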

Compilation failure:

g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal   : Unsupported gpu architecture 'compute_89'
make: *** [Makefile:124: ggml-cuda.o] Error 1

I can build it instead with: mkdir build && cd build && cmake -DLLAMA_CUBLAS=1 .. && cmake --build . --config Release and this works.

Workaround / fix

If I remove -arch=native from the Makefile line NVCCFLAGS = --forward-unknown-to-host-compiler -arch=native, then it compiles.

But will this result in a less optimised executable?
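A possible middle ground, instead of deleting the flag entirely, is to pin an architecture that this nvcc does understand; this is a sketch only, assuming the existing NVCCFLAGS line and a CUDA 11.x toolkit (sm_86 is an Ampere target that CUDA 11.6 accepts):

# Makefile: target Ampere explicitly; the embedded compute_86 PTX can be
# JIT-compiled for the 4090 (sm_89) by the driver at run time
NVCCFLAGS = --forward-unknown-to-host-compiler -arch=sm_86

This avoids the compute_89 error while keeping a newer code target than the nvcc default.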

Environment and Context

I'm using a Docker container based on Ubuntu 20.04 with CUDA 11.6.

CUDA is in PATH:

root@4f2326844e8c:~# echo $PATH
/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

Steps to Reproduce


  1. On a system with an NVIDIA 4090 or L40 GPU, run make clean && LLAMA_CUBLAS=1 make and observe the failure.
  2. On the same system, try CMake (mkdir build && cd build && cmake -DLLAMA_CUBLAS=1 .. && cmake --build . --config Release) and it works.
  3. Remove -arch=native from NVCCFLAGS = --forward-unknown-to-host-compiler -arch=native and try step 1 again; it will now compile.

Failure Logs

nvcc

root@4f2326844e8c:~/llama.cpp# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

Log of failed compile with make:

root@4f2326844e8c:~/llama.cpp# make clean LLAMA_CUBLAS=1 make
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
make: *** No rule to make target 'make'.  Stop.
root@4f2326844e8c:~/llama.cpp# make clean && LLAMA_CUBLAS=1 make
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
llama.cpp: In function ‘size_t llama_set_state_data(llama_context*, const uint8_t*)’:
llama.cpp:2624:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2624 |             kin3d->data = (void *) in;
      |                                    ^~
llama.cpp:2628:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2628 |             vin3d->data = (void *) in;
      |                                    ^~
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal   : Unsupported gpu architecture 'compute_89'
make: *** [Makefile:124: ggml-cuda.o] Error 1

Log of compiling successfully with CMake:

root@4f2326844e8c:~/llama.cpp# rm -rf build && mkdir build && cd build && cmake -DLLAMA_CUBLAS=1 .. && cmake --build . --config Release
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.25.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.6.124")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.6.124
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- GGML CUDA sources found, configuring CUDA architecture
-- Configuring done (2.3s)
-- Generating done (0.0s)
-- Build files have been written to: /root/llama.cpp/build
[  3%] Built target BUILD_INFO
[  6%] Building C object CMakeFiles/ggml.dir/ggml.c.o
[  9%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda.cu.o
[  9%] Built target ggml
[ 12%] Building CXX object CMakeFiles/llama.dir/llama.cpp.o
/root/llama.cpp/llama.cpp: In function ‘size_t llama_set_state_data(llama_context*, const uint8_t*)’:
/root/llama.cpp/llama.cpp:2624:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2624 |             kin3d->data = (void *) in;
      |                                    ^~
/root/llama.cpp/llama.cpp:2628:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2628 |             vin3d->data = (void *) in;
      |                                    ^~
[ 15%] Linking CXX static library libllama.a
[ 15%] Built target llama
[ 18%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/test-quantize-fns.cpp.o
[ 21%] Linking CXX executable ../bin/test-quantize-fns
[ 21%] Built target test-quantize-fns
[ 25%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/test-quantize-perf.cpp.o
[ 28%] Linking CXX executable ../bin/test-quantize-perf
[ 28%] Built target test-quantize-perf
[ 31%] Building CXX object tests/CMakeFiles/test-sampling.dir/test-sampling.cpp.o
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_top_k(const std::vector<float>&, const std::vector<float>&, int)’:
/root/llama.cpp/tests/test-sampling.cpp:22:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
   22 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_top_p(const std::vector<float>&, const std::vector<float>&, float)’:
/root/llama.cpp/tests/test-sampling.cpp:46:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
   46 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_tfs(const std::vector<float>&, const std::vector<float>&, float)’:
/root/llama.cpp/tests/test-sampling.cpp:71:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
   71 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_typical(const std::vector<float>&, const std::vector<float>&, float)’:
/root/llama.cpp/tests/test-sampling.cpp:94:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
   94 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_repetition_penalty(const std::vector<float>&, const std::vector<int>&, const std::vector<float>&, float)’:
/root/llama.cpp/tests/test-sampling.cpp:119:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
  119 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_frequency_presence_penalty(const std::vector<float>&, const std::vector<int>&, const std::vector<float>&, float, float)’:
/root/llama.cpp/tests/test-sampling.cpp:148:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
  148 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
[ 34%] Linking CXX executable ../bin/test-sampling
[ 34%] Built target test-sampling
[ 37%] Building CXX object tests/CMakeFiles/test-tokenizer-0.dir/test-tokenizer-0.cpp.o
/root/llama.cpp/tests/test-tokenizer-0.cpp:19:2: warning: extra ‘;’ [-Wpedantic]
   19 | };
      |  ^
[ 40%] Linking CXX executable ../bin/test-tokenizer-0
[ 40%] Built target test-tokenizer-0
[ 43%] Building CXX object examples/CMakeFiles/common.dir/common.cpp.o
[ 43%] Built target common
[ 46%] Building CXX object examples/main/CMakeFiles/main.dir/main.cpp.o
[ 50%] Linking CXX executable ../../bin/main
[ 50%] Built target main
[ 53%] Building CXX object examples/quantize/CMakeFiles/quantize.dir/quantize.cpp.o
[ 56%] Linking CXX executable ../../bin/quantize
[ 56%] Built target quantize
[ 59%] Building CXX object examples/quantize-stats/CMakeFiles/quantize-stats.dir/quantize-stats.cpp.o
[ 62%] Linking CXX executable ../../bin/quantize-stats
[ 62%] Built target quantize-stats
[ 65%] Building CXX object examples/perplexity/CMakeFiles/perplexity.dir/perplexity.cpp.o
[ 68%] Linking CXX executable ../../bin/perplexity
[ 68%] Built target perplexity
[ 71%] Building CXX object examples/embedding/CMakeFiles/embedding.dir/embedding.cpp.o
[ 75%] Linking CXX executable ../../bin/embedding
[ 75%] Built target embedding
[ 78%] Building CXX object examples/save-load-state/CMakeFiles/save-load-state.dir/save-load-state.cpp.o
[ 81%] Linking CXX executable ../../bin/save-load-state
[ 81%] Built target save-load-state
[ 84%] Building CXX object examples/benchmark/CMakeFiles/benchmark.dir/benchmark-matmult.cpp.o
[ 87%] Linking CXX executable ../../bin/benchmark
[ 87%] Built target benchmark
[ 90%] Building CXX object pocs/vdot/CMakeFiles/vdot.dir/vdot.cpp.o
[ 93%] Linking CXX executable ../../bin/vdot
[ 93%] Built target vdot
[ 96%] Building CXX object pocs/vdot/CMakeFiles/q8dot.dir/q8dot.cpp.o
[100%] Linking CXX executable ../../bin/q8dot
[100%] Built target q8dot

Log of successful compile with make after removing -arch=native

root@4f2326844e8c:~/llama.cpp# make clean && LLAMA_CUBLAS=1 make
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
removed 'common.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
removed 'build-info.h'
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
llama.cpp: In function ‘size_t llama_set_state_data(llama_context*, const uint8_t*)’:
llama.cpp:2624:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2624 |             kin3d->data = (void *) in;
      |                                    ^~
llama.cpp:2628:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2628 |             vin3d->data = (void *) in;
      |                                    ^~
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/main/main.cpp ggml.o llama.o common.o ggml-cuda.o -o main  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize/quantize.cpp ggml.o llama.o ggml-cuda.o -o quantize  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-cuda.o -o quantize-stats  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-cuda.o -o perplexity  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-cuda.o -o embedding  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include pocs/vdot/vdot.cpp ggml.o ggml-cuda.o -o vdot  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
@TheBloke TheBloke changed the title CUBLAS compilation issue on 4090 with make - "Unsupported gpu architecture 'compute_89'" ; works with cmake CUBLAS compilation issue on 4090 with make : "Unsupported gpu architecture 'compute_89'" Works with cmake or without -arch=native May 12, 2023
@slaren
Member

slaren commented May 12, 2023

It seems that your nvcc doesn't support compute_89, which is the compute capability of your 4090. Updating your CUDA toolkit should fix this.

@TheBloke
Contributor Author

> It seems that your nvcc doesn't support compute_89, which is the compute capability of your 4090. Updating your CUDA toolkit should fix this.

But why does it work fine when I use cmake, or remove -arch=native?

@slaren
Member

slaren commented May 12, 2023

That option is meant to configure the architectures for which the CUDA code will be compiled. For local use, native should be best, since it will use the compute capability of your GPU. If you remove that option, nvcc will default to some older architecture, which will still work with your GPU since its capability is higher than anything your nvcc can compile to anyway, but performance may be lower. In practice the difference is probably going to be small, however: the code will still be JIT-compiled to your GPU architecture when you run the program, but it may not use all the capabilities of your GPU.

I am not sure why the result is different with cmake; it uses CUDA_SELECT_NVCC_ARCH_FLAGS "Auto", which should do the same as -arch=native (this should probably be changed for the CI builds). You would need to check the flags that cmake passes to nvcc to understand why it works.
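For anyone who wants to see those flags, a verbose rebuild prints the full nvcc command line; this assumes the Makefile generator used above (cmake --build --verbose requires CMake 3.14 or newer):

cd build
cmake --build . --config Release --clean-first --verbose 2>&1 | grep -E "nvcc|gencode"
# or, with the generated Makefiles directly:
make clean && make VERBOSE=1 2>&1 | grep -E "nvcc|gencode"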

@TheBloke
Contributor Author

OK, thank you very much for the detailed explanation. That makes sense now. I thought an unsupported compute arch would always be a hard failure, and therefore that I must have the right toolkit if it compiled with some arguments.

I will try to upgrade my Docker to CUDA 11.8 or 12.x for future compilations.
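For the record, a minimal sketch of that upgrade, assuming the container is built from NVIDIA's public CUDA images (the tag here is an example, not necessarily the one used in this setup); as far as I know, CUDA 11.8 is the first toolkit whose nvcc accepts compute_89:

# Dockerfile: base image whose nvcc knows the Ada (compute_89) architecture
FROM nvidia/cuda:11.8.0-devel-ubuntu20.04
# ... rest of the image unchanged ...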

@SigmaTAMU

Could you please tell me where the file "Makefile" is? I do not know which file needs to be modified.

@Backendmagier

Backendmagier commented Aug 26, 2024

fixed it for me: export TORCH_CUDA_ARCH_LIST=8.7
source: https://forums.developer.nvidia.com/t/nvcc-fatal-unsupported-gpu-architecture-compute-89/257060
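Note that TORCH_CUDA_ARCH_LIST only affects PyTorch CUDA extension builds, not the llama.cpp Makefile; a sketch of the usual pattern (the values and install command are illustrative):

# cap the arch list to something the installed toolkit supports,
# and include PTX so newer GPUs can still JIT the kernels
export TORCH_CUDA_ARCH_LIST="8.6+PTX"
pip install -v .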

@shivanraptor

shivanraptor commented Sep 23, 2024

> fixed it for me: export TORCH_CUDA_ARCH_LIST=8.7 source: https://forums.developer.nvidia.com/t/nvcc-fatal-unsupported-gpu-architecture-compute-89/257060

This didn't work for me. The core issue is the nvcc version: 11.x releases before 11.8 do not support compute_89 (i.e. the 4090's architecture). You need to update to CUDA Toolkit 12.6 and update the environment paths in .bashrc or .zshrc:

export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda-12.6
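
After updating the paths, a quick sanity check (assuming the cuda-12.6 install location above):

source ~/.bashrc
which nvcc       # should print /usr/local/cuda-12.6/bin/nvcc
nvcc --version   # should report release 12.6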
