ollama/fs/ggml
Gabe Goodhart 7b91c9ce51
Hybrid and recurrent memory estimates (#12186)
This PR updates the memory size estimate logic to better handle recurrent and hybrid-recurrent models, whose memory requirements are currently significantly overestimated because the default logic assumes full attention for all layers.

The sizing logic for the recurrent layers follows the llama.cpp implementation:

        // per-layer recurrent state tensors: r holds the conv/shift state and
        // s the SSM state; mem_size counts cache cells, not context tokens
        ggml_tensor * r = ggml_new_tensor_1d(ctx, type_r, hparams.n_embd_r()*mem_size);
        ggml_tensor * s = ggml_new_tensor_1d(ctx, type_s, hparams.n_embd_s()*mem_size);
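
A minimal Go sketch of the per-layer distinction this implies, under stated assumptions: the type name `hparams`, its fields, and the helper `kvCacheBytes` are hypothetical illustrations, not the actual ollama/fs/ggml API, and f16 (2 bytes) element size is assumed throughout.

	package main

	import "fmt"

	// hparams is a hypothetical container for the fields the estimate needs.
	type hparams struct {
		nLayer      int
		nEmbdR      uint64 // recurrent "r" state elements per sequence (conv/shift state)
		nEmbdS      uint64 // recurrent "s" state elements per sequence (SSM state)
		nEmbdK      uint64 // K cache elements per token for an attention layer
		nEmbdV      uint64 // V cache elements per token for an attention layer
		isRecurrent func(layer int) bool
	}

	// kvCacheBytes estimates total cache size. Recurrent layers hold a
	// fixed-size state per sequence (mirroring llama.cpp's
	// n_embd_r()*mem_size and n_embd_s()*mem_size), while attention layers
	// scale with the full context length, which is why treating every layer
	// as attention overestimates recurrent and hybrid models.
	func kvCacheBytes(hp hparams, ctxLen, nSeqs uint64) uint64 {
		const elemSize = 2 // f16
		var total uint64
		for l := 0; l < hp.nLayer; l++ {
			if hp.isRecurrent(l) {
				// state per sequence, independent of context length
				total += (hp.nEmbdR + hp.nEmbdS) * nSeqs * elemSize
			} else {
				// full attention: one K and one V entry per cached token
				total += (hp.nEmbdK + hp.nEmbdV) * ctxLen * elemSize
			}
		}
		return total
	}

	func main() {
		hp := hparams{
			nLayer: 32,
			nEmbdR: 4096, nEmbdS: 65536,
			nEmbdK: 1024, nEmbdV: 1024,
			// e.g. every fourth layer uses attention in a hybrid model
			isRecurrent: func(l int) bool { return l%4 != 0 },
		}
		fmt.Printf("estimated cache: %d bytes\n", kvCacheBytes(hp, 8192, 1))
	}

With the example numbers above, the 24 recurrent layers cost a small constant per sequence, while only the 8 attention layers pay the ctxLen-proportional cost; the old all-attention assumption would charge all 32 layers at the attention rate.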

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-09-08 14:53:22 -07:00
ggml.go Hybrid and recurrent memory estimates (#12186) 2025-09-08 14:53:22 -07:00
ggml_test.go ggml: fix crash for array head counts 2025-04-27 11:38:06 -07:00
gguf.go convert: fix tensor sorting (#12015) 2025-08-26 13:57:46 -07:00
gguf_test.go convert: fix tensor sorting (#12015) 2025-08-26 13:57:46 -07:00
type.go convert(gptoss): mxfp4 to ggml layout to avoid jit conversion (#12018) 2025-08-26 16:41:02 -07:00