llama.cpp

Commit Graph

Author	SHA1	Message	Date
slaren	3d59769c3b	Show perplexity ETA in hours and minutes (#1096 )	1 year ago
Georgi Gerganov	d40fded93e	llama : fix comment for "output.weight" tensor	1 year ago
Stephan Walter	2510c1831f	Add ggml-model-*.bin checksums for 7B, 13B, 30B, 65B (#1088 ) * Add ggml-model-*.bin checksums for 7B, 13B, 30B * Add ggml-model-*.bin checksums for 65B --------- Co-authored-by: Pavol Rusnak <pavol@rusnak.io>	1 year ago
Georgi Gerganov	12b5900dbc	ggml : sync ggml (add GPT-NeoX RoPE implementation)	1 year ago
Georgi Gerganov	9ff334f3c9	ggml : fix bug in ggml_compute_forward_dup_f32()	1 year ago
slaren	2005469ea1	Add Q4_3 support to cuBLAS (#1086 )	1 year ago
Georgi Gerganov	8a1756abdf	ggml : do not break cuBLAS build (Q4_3 is not yet implemented)	1 year ago
Georgi Gerganov	66aab46079	ggml : fix Q4_3 quantization Broke it during conflict resolution in last PR	1 year ago
Kawrakow	38de86a711	llama : multi-threaded quantization (#1075 ) * Multi-threading quantization. Not much gain for simple quantizations, but it will be important for quantizations that require more CPU cycles. * Multi-threading for quantize-stats It now does the job in ~14 seconds on my Mac for Q4_0, Q4_1 and Q4_2. Single-threaded it was taking more than 2 minutes after adding the more elaborate version of Q4_2. * Reviewer comments * Avoiding compiler confusion After changing chunk_size to const int as suggested by @ggerganov, clang and GCC started to warn me that I don't need to capture it in the lambda. So, I removed it from the capture list. But that makes the MSVC build fail. So, making it a constexpr to make every compiler happy. * Still fighting with lambda captures in MSVC --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	1 year ago
Georgi Gerganov	e0305ead3a	ggml : add Q4_3 quantization (#1082 )	1 year ago
Ivan Komarov	6a9661ea5a	ci : remove the LLAMA_ACCELERATE matrix dimension from Ubuntu builds in the CI (#1074 ) [Accelerate](https://developer.apple.com/documentation/accelerate) is an Apple framework which can only be used on macOS, and the CMake build [ignores](https://github.com/ggerganov/llama.cpp/blob/master/CMakeLists.txt#L102) the `LLAMA_ACCELERATE` variable when run on non-Apple platforms. This implies setting `LLAMA_ACCELERATE` is a no-op on Ubuntu and can be removed. This will reduce visual noise in CI check results (in addition to reducing the number of checks we have to run for every PR). Right now every sanitized build is duplicated twice for no good reason (e.g., we have `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, ON)` and `CI / ubuntu-latest-cmake-sanitizer (ADDRESS, Debug, OFF)`).	1 year ago
源文雨	5addcb120c	fix: LLAMA_CUBLAS=1 undefined reference 'shm_open' (#1080 )	1 year ago
Stephan Walter	c8c2c52482	AVX2 optimization for vec_dot_q4_2_q8_0 (#1068 )	1 year ago
slaren	02d6988121	Improve cuBLAS performance by dequantizing on the GPU (#1065 )	1 year ago
CRD716	834695fe3a	Minor: Readme fixed grammar, spelling, and misc updates (#1071 )	1 year ago
Kawrakow	f7d05095b4	Q4_2 quantization with rmse-optimized scale and quants (#1062 ) * Q4_2 quantization with rmse-optimized scale and quants For quantize-stats we get q4_2: rmse 0.00159301, maxerr 0.17480469, 95pct<0.0030, median<0.0012 For 7B perplexity with BLAS enabled we get 6.2038 after 655 chunks. Quantization is slow (~90 seconds on my Mac for 7B) as not multi-threaded as in PR #896. * ggml : satisfy the sanitizer builds Not sure why this makes them fail * Better follow ggml conventions for function names * Fixed type as per reviewer comment --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	1 year ago
Georgi Gerganov	884e7d7a2b	ggml : use 8-bit precision for Q4_1 intermediate results (#1047 ) * ggml : use 8-bit precision for Q4_1 intermediate results (ARM) * ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmalq_n_f32 56 ms/token with Q4_1 ! * ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051) * gitignore : ignore ppl-*.txt files --------- Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>	1 year ago
Georgi Gerganov	7cd5c4a3e9	readme : add warning about Q4_2 and Q4_3	1 year ago
Stephan Walter	f3d4edf504	ggml : Q4 cleanup - remove 4-bit dot product code (#1061 ) * Q4 cleanup * Remove unused AVX512 Q4_0 code	1 year ago
slaren	8944a13296	Add NVIDIA cuBLAS support (#1044 )	1 year ago
slaren	6667401238	Multi-threaded ggml_cpy (#1035 ) * Multi-threaded ggml_cpy * Update ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Also fix wdata offset in ggml_compute_forward_add_q_f32 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	1 year ago
Georgi Gerganov	77a73403ca	ggml : add new Q4_2 quantization (ARM only) (#1046 ) * ggml : Q4_2 ARM * ggml : add ggml_is_quantized() * llama : update llama_type_name() with Q4_2 entry * ggml : speed-up q4_2 - 4 threads: ~100ms -> ~90ms - 8 threads: ~55ms -> ~50ms * ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32	1 year ago
Georgi Gerganov	50a8a2af97	ggml : scratch that - vmlaq_n_f32 is always better Had a background process that was messing with the timings	1 year ago
Georgi Gerganov	4caebf6d40	gitignore : vdot	1 year ago
Georgi Gerganov	dcdd65e296	ggml : optimize ggml_vec_dot_q4_0_q8_0() using vectorized accumulators	1 year ago
Kawrakow	5ecff35151	Adding a simple program to measure speed of dot products (#1041 ) On my Mac, the direct Q4_1 product is marginally slower (~69 vs ~55 us for Q4_0). The SIMD-ified ggml version is now almost 2X slower (~121 us). On a Ryzen 7950X CPU, the direct product for Q4_1 quantization is faster than the AVX2 implementation (~60 vs ~62 us). --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	1 year ago
Georgi Gerganov	7faa7460f0	readme : update hot topics about new LoRA functionality	1 year ago
Georgi Gerganov	5af8e32238	ci : do not run on drafts	1 year ago
Ivan Komarov	42747220b4	Do not close file after mmap (Windows version) (#1034 )	1 year ago
Atsushi Tatsuma	e9298af389	readme : add Ruby bindings (#1029 )	1 year ago
Cameron	4ad73137a1	add 4_0 to default outfile namestr dict (#1031 ) this came up when trying to convert the gpt4all-lora-unfiltered-quantized.bin file	1 year ago
slaren	315a95a4d3	Add LoRA support (#820 )	1 year ago
Arik Poznanski	efd05648c8	llama : well-defined static initialization of complex objects (#927 ) * Replaced static initialization of complex objects with an initialization on first use. This prevents undefined behavior at program run, for example, a crash in a Release build while the Debug build works * replaced use of auto with exact type to avoid using -std=c++14 * Made the accessor functions for static maps be static const	1 year ago
Georgi Gerganov	eb17a026fd	quantize-stats : fix bug in --type argument	1 year ago
Georgi Gerganov	69b740289f	ggml : avoid using ggml_fp16_to_fp32() and ggml_fp32_to_fp16() in ggml.c	1 year ago
Ivan Komarov	f266259ad9	Speedup the AVX-512 implementation of ggml_vec_dot_q4_0() (#933 )	1 year ago
slaren	47f61aaa5f	Fix: do not close file on mmap (#1017 )	1 year ago
Georgi Gerganov	3173a62eb9	stdout : vertical align outputs for better readability	1 year ago
Pavol Rusnak	489537e6cf	examples: add missing <ctime> include for time() (#1011 )	1 year ago
nanahi	2d3481c721	Fix msys2 build error and warnings (#1009 )	1 year ago
comex	74f5899df4	convert.py: Fix loading safetensors and ggml format on Windows (#991 ) Calling `mmap.mmap` on Windows apparently resets the file offset of the raw file object (and makes the BufferedReader return a negative file offset). For safetensors, avoid using the file offset after calling mmap. For GGML format, explicitly save and restore the offset. Fixes #966.	1 year ago
Stephan Walter	2f7c8e014e	Fix potential int8 overflow in non-SIMD vec_dot (#986 )	1 year ago
Stephan Walter	0ad964631f	Refactor ggml.c for future tensor types (#1001 )	1 year ago
Georgi Gerganov	e95b6554b4	ggml : add Q8_0 quantization for intermediate results (#951 ) * ggml : add Q8_0 quantization for intermediate results * quantize-stats : fix test + add it to Makefile default * Q8: use int8_t, AVX/AVX2 optimizations * ggml : fix quantize_row_q8_0() ARM_NEON rounding * minor : updates after rebase to latest master * quantize-stats : delete obsolete strings * ggml : fix q4_1 dot func --------- Co-authored-by: Stephan Walter <stephan@walter.name>	1 year ago
Georgi Gerganov	aa485cee33	ggml : use posix_memalign on non-Windows env	1 year ago
Ivan Komarov	c12b14b77f	benchmark : fix result validation in benchmark-q4_0-matmult (#987 )	1 year ago
katsu560	106faaf297	cmake : add finding the OpenBLAS header file (#992 )	1 year ago
Pavol Rusnak	c85e03d12e	Revert "main : alternative instruct mode (Vicuna support, etc.) (#863 )" (#982 ) This reverts commit `f4d277ae17`.	1 year ago
Pavol Rusnak	489093548c	py : bump sentencepiece to 0.1.98 to support Python 3.11 (#976 )	1 year ago
Stephan Walter	93265e988a	make : fix dependencies, use auto variables (#983 )	1 year ago
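
Commit 74f5899df4 above notes that calling `mmap.mmap` on Windows apparently resets the file offset of the raw file object, and that the GGML loader now saves and restores the offset explicitly. The following is only a minimal sketch of that save-and-restore idea, not the actual `convert.py` code: the helper `map_preserving_offset`, the `demo` function, and the 4-byte-magic file layout are all hypothetical names and data chosen for illustration.

```python
import mmap
import os
import tempfile


def map_preserving_offset(fp):
    """Memory-map fp read-only and leave its file offset where it was.

    Hypothetical helper: illustrates "save and restore the offset",
    it is not taken from convert.py.
    """
    saved = fp.tell()  # remember the position before mapping
    mapped = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    fp.seek(saved)  # restore it, in case mmap moved the offset
    return mapped


def demo():
    # Assumed toy layout: a 4-byte magic followed by payload bytes.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"ggml" + b"payload")
        path = tmp.name
    try:
        with open(path, "rb") as fp:
            magic = fp.read(4)  # header is read through the file object
            mapped = map_preserving_offset(fp)
            offset_after = fp.tell()  # still points just past the header
            payload = bytes(mapped[offset_after:])
            mapped.close()
    finally:
        os.remove(path)
    return magic, offset_after, payload
```

Calling `demo()` reads the header through the file object, maps the file, and then slices the payload out of the mapping using the preserved offset. On platforms where `mmap.mmap` leaves the offset alone, the extra `seek` is harmless.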


500 Commits (45d94c8f6f6552c2f5b3fe78b37a4d865318c4ac)
