- Allocate device memory dynamically on demand to fully utilize device memory; there is no longer a preset scratch size or memory size.
- Drop Baichuan/InternLM support since they have been integrated into llama.cpp.
- API changes:
  - CMake CUDA option: `-DGGML_CUBLAS` was changed to `-DGGML_CUDA` (see the build example below).
  - CMake CUDA architecture option: `-DCUDA_ARCHITECTURES` was changed to `-DCMAKE_CUDA_ARCHITECTURES`.
  - `num_threads` in `GenerationConfig` was removed: the optimal thread settings will be automatically selected.
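As a sketch, a CUDA build with the renamed options might look like the following; the architecture value `80` is only an example target, so adjust it for your GPU:

```sh
# Old flags (no longer recognized):
#   cmake -B build -DGGML_CUBLAS=ON -DCUDA_ARCHITECTURES=80

# New flags:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80
cmake --build build -j --config Release
```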