April 2026 TLDR Setup for Ollama and Gemma 4 26B on a Mac mini
greenstevester
[dead]
krzyk
By desk you mean that "Mac mini"? Because it is pricey. In my country it is 1000 USD (from Apple for basic M4 with 24GB). My desk was 1/5th of that price.
And considering that this Mac mini won't be doing anything else is there a reason why not just buy subscription from Claude, OpenAI, Google, etc.?
Are those open models more performant compared to Sonnet 4.5/4.6? Or have at least bigger context?
redrove
There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.
Ollama is slower and they started out as a shameless llama.cpp ripoff without giving credit and now they “ported” it to Go which means they’re just vibe code translating llama.cpp, bugs included.
iLoveOncall
> There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.
Hmm, the fact that Ollama is open-source, can run in Docker, etc.?
alifeinbinary
I really like LM Studio when I can use it under Windows but for people like me with Intel Macs + AMD gpu ollama is the only option because it can leverage the gpu using MoltenVK aka Vulkan, unofficially. We're still testing it, hoping to get the Vulkan support in the main branch soon. It works perfectly for single GPUs but some edge cases when using multiple GPUs are unsupported until upstream support from MoltenVK comes through. But yeah, I agree, it wasn't cool to repackage Georgi's work like that.
lousken
LM Studio is not open source, and you can't run it on a server and connect clients to it?
meltyness
I feel like the READMEs for these 3 large popular packages already illustrate the tradeoffs better than a Hacker News argument
gen6acd60af
LM Studio is closed source.
And didn't Ollama independently ship a vision pipeline for some multimodal models months before llama.cpp supported it?
faitswulff
Does LM Studio have an equivalent to the ollama launch command? i.e. `ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4`
logicallee
>Ollama is slower
I've benchmarked this on an actual Mac Mini M4 with 24 GB of RAM, and averaged 24.4 t/s on Ollama and 19.45 t/s on LM Studio for the same ~10 GB model (gemma4:e4b), a difference which was repeated across three runs and with both models warmed up beforehand. Unless there is an error in my methodology, which is easy to repeat[1], it means Ollama is a full 25% faster. That's an enormous difference. Try it for yourself before making such claims.
[1] script at: https://pastebin.com/EwcRqLUm but it warms up both and keeps them in memory, so you'll want to close almost all other applications first. Install both ollama and LM Studio and download the models, change the path to where you installed the model. Interestingly I had to go through 3 different AI's to write this script: ChatGPT (on which I'm a Pro subscriber) thought about doing so then returned nothing (shenanigans since I was benchmarking a competitor?), I had run out of my weekly session limit on Pro Max 20x credits on Claude (wonder why I need a local coding agent!) and then Google rose to the challenge and wrote the benchmark for me. I didn't try writing a benchmark like this locally, I'll try that next and report back.
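The methodology described above can be sketched roughly as follows. This is a hedged illustration, not the linked pastebin script: the default local ports, the `gemma4:e4b` tag, and the prompt are assumptions. Ollama's `/api/generate` reports `eval_count` and `eval_duration` (in nanoseconds) in its response, while for LM Studio's OpenAI-compatible endpoint we fall back to wall-clock timing:

```python
import json
import time
import urllib.request

def tokens_per_sec(token_count, duration_ns):
    """Throughput from a token count and a duration in nanoseconds."""
    return token_count / (duration_ns / 1e9)

def bench_ollama(model="gemma4:e4b", prompt="Write a haiku about RAM."):
    # Ollama's /api/generate reports eval_count and eval_duration itself.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    resp = json.load(urllib.request.urlopen(req))
    return tokens_per_sec(resp["eval_count"], resp["eval_duration"])

def bench_lmstudio(model="gemma4:e4b", prompt="Write a haiku about RAM."):
    # LM Studio exposes an OpenAI-compatible endpoint on port 1234 by
    # default; it doesn't report timing, so use the wall clock.
    req = urllib.request.Request(
        "http://localhost:1234/v1/chat/completions",
        data=json.dumps({"model": model,
                         "messages": [{"role": "user",
                                       "content": prompt}]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    resp = json.load(urllib.request.urlopen(req))
    elapsed_ns = (time.monotonic() - start) * 1e9
    return tokens_per_sec(resp["usage"]["completion_tokens"], elapsed_ns)
```

As in the comment above, both servers should be warmed up (run each benchmark once and discard the result) before comparing numbers, since the first call includes model load time.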
jrm4
Do y'all mean backend or the Ollama frontend or both? I find it trivially easy to sub in my local Ollama api thing in virtually all of the interesting frontend things. I'm quite curious about the "why not Ollama" here.
easygenes
Why is ollama so many people’s go-to? Genuinely curious, I’ve tried it but it feels overly stripped down / dumbed down vs nearly everything else I’ve used.
Lately I’ve been playing with Unsloth Studio and think that’s probably a much better “give it to a beginner” default.
polotics
Ollama got some first-mover advantage at the time when actually building and git pulling llama.cpp was a bit of a moat. The devs' docker past probably made them overestimate how much they could lay claim to mindshare. However, no one really could have known how quickly things would evolve... Now I mostly recommend LM-studio to people.
What does unsloth-studio bring on top?
diflartle
Ollama is good enough to dabble with, and getting a model is as easy as ollama pull <model name> vs figuring it out by yourself on hugging face and trying to make sense on all the goofy letters and numbers between the forty different names of models, and not needing a hugging face account to download.
So you start there and eventually you want to get off the happy path, then you need to learn more about the server and it's all so much more complicated than just using ollama. You just want to try models, not learn the intricacies of hosting LLMs.
DiabloD3
Advertising, mostly.
Ollama's org had people flood various LLM/programming related Reddits and Discords and elsewhere, claiming it was an 'easy frontend for llama.cpp', and tricked people.
Only way to win is to uninstall it and switch to llama.cpp.
wolvoleo
For me it's just the server. I use openwebui as interface. I don't want it all running on the same machine.
jrm4
Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.
And as someone running a 16gb card, I'm especially curious as to whether I'm missing out on better performance?
linolevan
What I really don't get is why more people don't talk about LMStudio, I switched to it months ago and it seems like a straight upgrade.
robotswantdata
Why are you using Ollama? Just use llama.cpp
brew install llama.cpp
Use the inbuilt CLI, server, or chat interface, and hook it up to any other app.
Bigsy
For MLX I'd guess.
boutell
Last night I had to install the v0.20 pre-release of ollama to use this model. So I'm wondering if these instructions are accurate.
logicallee
In case someone would like to know what these are like on this hardware, I tested Gemma 4 32b (the ~20 GB model, the largest Gemma model Google published) and gemma4:e4b (the ~10 GB model) on this exact setup (Mac Mini M4 with 24 GB of RAM using Ollama), and I livestreamed it:
The ~10 GB model is super speedy, loading in a few seconds and giving responses almost instantly. If you just want to see its performance, it says hello around the 2 minute mark in the video (and fast!) and the ~20 GB model says hello around 5 minutes 45 seconds in the video. You can see the difference in their loading times and speed, which is a substantial difference. I also had each of them complete a difficult coding task, they both got it correct but the 20 GB model was much slower. It's a bit too slow to use on this setup day to day, plus it would take almost all the memory. The 10 GB model could fit comfortably on a Mac Mini 24 GB with plenty of RAM left for everything else, and it seems like you can use it for small-size useful coding tasks.
aetherspawn
Which harness (IDE) works with this if any? Can I use it for local coding right now?
lambda
Yes, you can use it for local coding. Most harnesses can be pointed at a local endpoint which provides an OpenAI compatible API, though I've had some trouble using recent versions of Codex with llama.cpp due to an API incompatibility (Codex uses the newer "responses" API, but in a way that llama.cpp hasn't fully supported).
I personally prefer Pi as I like the fact that it's minimalist and extensible. But some people just use Claude Code, some OpenCode, there are a ton of options out there and most of them can be used with local models.
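As a concrete illustration of "pointed at a local endpoint which provides an OpenAI compatible API", here is a minimal stdlib-only sketch. The ports in the comments are the usual defaults and the model tag is an assumption; your local server may use different ones:

```python
import json
import urllib.request

def chat_request(base_url, model, user_msg):
    """Build an OpenAI-style chat completion request for a local server."""
    body = {"model": model,
            "messages": [{"role": "user", "content": user_msg}]}
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

def ask(base_url, model, user_msg):
    """Send the request and return the assistant's reply text."""
    resp = json.load(urllib.request.urlopen(
        chat_request(base_url, model, user_msg)))
    return resp["choices"][0]["message"]["content"]

# The same call shape works against llama.cpp's llama-server
# (http://localhost:8080/v1), Ollama (http://localhost:11434/v1),
# or LM Studio (http://localhost:1234/v1); only the base URL differs, e.g.:
#   ask("http://localhost:11434/v1", "gemma4:e4b", "Say hello in one word.")
```

This is essentially what most harnesses do under the hood when you configure a custom base URL, which is why swapping the backend is usually a one-line config change.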
kristopolous
It needs to support tool calling and many of the quantized ggufs don't so you have to check.
I've got a workaround for that called petsitter where it sits as a proxy between the harness and inference engine and emulates additional capabilities through clever prompt engineering and various algorithms.
They're abstractly called "tricks" and you can stack them as you please.
You can run the quantized model on ollama, put petsitter in front of it, put the agent harness in front of that and you're good to go
If you have trouble, file bugs. Please!
Thank you
milchek
I tested briefly with a MacBook Pro M4 with 36gb. Ran it in LM Studio with OpenCode as the frontend and it failed over and over on tool calls. Switched back to Qwen. Anyone else on a similar setup have better luck?
internet101010
I failed to run it in LM Studio on an M5 with 32gb at even half max context. It literally locked up the computer and I had to reboot.
Ran gemma-4-26B-A4B-it-GGUF:Q4_K_M just fine with llama.cpp though. First time in a long time that I have been impressed by a local model. Both speed (~38t/s) and quality are very nice.
jasonjmcghee
Haven't had time to try yet, but heard from others that they needed to update both the main and runtime versions for things to work.
Aurornis
Tool calls failing is a problem with the inference engine’s implementation and/or the quant. Update and try again in a few days.
This is how all open weight model launches go.
volume_tech
[dead]
mark_l_watson
The article has a few good tips for using Ollama. Perhaps it should note that the Gemma 4 models are not really trained for strong performance with coding agents like OpenCode, Claude Code, pi, etc. The Gemma 4 models are excellent for applications requiring tool use, data extraction to JSON, etc. I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training. This makes sense, and is something that I do: use strong models to build effective applications using small efficient models.
Aurornis
> I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training.
The Gemma models were literally released yesterday. You can’t ask LLMs for advice on these topics and get accurate information.
Please don’t repeat LLM-sourced answers as canonical information
renewiltord
Oh yeah absolute genius. I asked GPT-2 about Claude Opus 4.6 and it said “this is not a recommendation. You might get some benefits from Opus… but this is not what you want”. Damn, real wisdom from the OG there. What a legend
anonyfox
M5 air here with 32gb ram and 10/10 cores. Anyone got some luck with mlx builds on oMLX so far? Not at my machine right now and would love to know if these models already work including tool calling
smith7018
I know that someone got Gemma 4 E4B working with MLX [1] but I don't know much more than that.
The latest release v0.3.2 has partial support, generation is supported but not all special tokens are handled. I've done some personal testing to add tool calling and <|channel> thinking support. https://github.com/Yukon/omlx
renewiltord
Just told Claude to sort it out and it ran it. 26 tok/s on the Mac mini I use for personal claw type program. Unusable for local agent but it’s okay.
zozbot234
Isn't 26 tok/s quite usable for a claw-like agent though? You can chat with it on an IM platform and get notified as soon as it replies; you're not dependent on real-time quick interaction.
Aurornis
If this is your first time using open weight models right after release, know that there are always bugs in the early implementations and even quantizations.
Every project races to have support on launch day so they don’t lose users, but the output you get may not be correct. There are already several problems being discovered in tokenizer implementations and quantizations may have problems too if they use imatrix.
So you’re going to see a lot of “I tried it but it sucks because it can’t even do tool calls” and other reports about how the models don’t work at all in the coming weeks from people who don’t realize they were using broken implementations.
If you want to try cutting edge open models you need to be ready to constantly update your inference engine and check your quantization for updates and re-download when it’s changed. The mad rush to support it on launch day means everything gets shipped as soon as it looks like it can produce output tokens, not when it’s tested to be correct.
colechristensen
You seem like you know what you're talking about... what inference engine should I use? (linux, 4090)
I keep having "I tried it but it sucks" issues mostly around tool calling and it's not clear if it's the model or ollama. And not one model in particular, any of them really.
kristopolous
Are you getting tool call and multimodal working? I don't see it in the quantized unsloth ggufs...
zachperkel
how many TPS does a build like this achieve on gemma 4 26b?
kanehorikawa
[dead]
neo_doom
Huge Claude user here… can someone help me set some realistic expectations if I bought a Mac mini and spun one up? I use Claude primarily for dev work and Home Lab projects. Are the open models good enough to run locally and replace the Claude workload? Or am I better off with my $20/mo Claude subscription?
NietTim
They are good for small tasks but you would not be able to use it like you use Claude and most likely be disappointed. But also, I do not know how you use claude.
There are many services online which offer hosted versions of these models; my advice for anyone thinking about buying hardware to self-host is to try those first. That way you can get an impression of the capabilities and limitations of those models before you commit to buying hardware.
alfiedotwtf
So far, I’ve found gpt-oss-20B to be pretty good agentic wise, but it’s nothing like Claude Code using its paid models.
(I haven’t tried the 120B, which I’ve read is significantly better than 20B)
hamdingers
Best way to find out is to buy $10 of OpenRouter credits and try the models for yourself.
From my experience doing this, they're nowhere close, but it's entertaining to check in once in a while.
MrScruff
I've been playing with the open models since the original llama leak. They're getting better over time, are useful for tasks of moderate complexity and it's just cool to have a binary blob of knowledge that you can run locally without an internet connection.
However you should manage your expectations. Whatever the benchmarks say, you'll quickly realise they're not at all competing with Sonnet let alone Opus. Even the largest open weights models aren't really doing that.
techpulselab
[dead]
jiusanzhou
[dead]
spencer-p
Weird that the steps are for "Gemma 4 12b", which does not exist, and then switches to 26b midway through.
There's also a step to verify that it doesn't fit on the GPU with ollama ps showing "14%/86% CPU/GPU". Doesn't this mean you'll have really bad performance?
Schiendelman
The Mac mini doesn't have different memory for the CPU and GPU, so maybe that's ignorable?
aplomb1026
[dead]
jasonriddle
Slightly off topic, but question for folks.
I'm hoping to replace coding with Claude Sonnet 4.5 with an open source or open weights model. Are any of the models in Ollama.com's cloud offering (https://ollama.com/search?c=cloud) or any of the models on OpenRouter.ai a close replacement? I know that no model right now matches the full performance and capabilities of Claude Sonnet 4.5, but I want to know how close I can get and with which model(s).
If there is a model you say can replace it, talk about how long you have been using it, what harness (Claude Code, OpenCode, etc.), and some strengths and weaknesses you have noticed. I'm not interested in what benchmarks say; I want to hear about real world use from programmers using these models.
scottcha
Yes, GLM5 and Kimi K2.5 are pretty close replacements for Sonnet.
dimgl
In short: no.
Nothing comes close, in my opinion. Sonnet and Opus are still the best models. The Codex variants of the GPT models are also great. I've tried MiniMax, GLM, Qwen and Kimi and for anything even remotely complex these models seriously struggle.
greenstevester
[dead]
krzyk
By desk you mean that "Mac mini"? Because it is pricey. In my country it is 1000 USD (from Apple for basic M4 with 24GB). My desk was 1/5th of that price.
And considering that this Mac mini won't be doing anything else is there a reason why not just buy subscription from Claude, OpenAI, Google, etc.?
Are those open models more performant compared to Sonnet 4.5/4.6? Or have at least bigger context?
redrove
There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.
Ollama is slower and they started out as a shameless llama.cpp ripoff without giving credit and now they “ported” it to Go which means they’re just vibe code translating llama.cpp, bugs included.
iLoveOncall
> There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.
Hmm, the fact that Ollama is open-source, can run in Docker, etc.?
alifeinbinary
I really like LM Studio when I can use it under Windows but for people like me with Intel Macs + AMD gpu ollama is the only option because it can leverage the gpu using MoltenVK aka Vulkan, unofficially. We're still testing it, hoping to get the Vulkan support in the main branch soon. It works perfectly for single GPUs but some edge cases when using multiple GPUs are unsupported until upstream support from MoltenVK comes through. But yeah, I agree, it wasn't cool to repackage Georgi's work like that.
lousken
lm studio is not opensource and you can't use it on the server and connect clients to it?
meltyness
I feel like the READMEs for these 3 large popular packages already illustrate tradeoffs better than hacker news argument
gen6acd60af
LM Studio is closed source.
And didn't Ollama independently ship a vision pipeline for some multimodal models months before llama.cpp supported it?
faitswulff
Does LM Studio have an equivalent to the ollama launch command? i.e. `ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4`
logicallee
>Ollama is slower
I've benchmarked this on an actual Mac Mini M4 with 24 GB of RAM, and averaged 24.4 t/s on Ollama and 19.45 t/s on LM Studio for the same ~10 GB model (gemma4:e4b), a difference which was repeated across three runs and with both models warmed up beforehand. Unless there is an error in my methodology, which is easy to repeat[1], it means Ollama is a full 25% faster. That's an enormous difference. Try it for yourself before making such claims.
[1] script at: https://pastebin.com/EwcRqLUm but it warms up both and keeps them in memory, so you'll want to close almost all other applications first. Install both ollama and LM Studio and download the models, change the path to where you installed the model. Interestingly I had to go through 3 different AI's to write this script: ChatGPT (on which I'm a Pro subscriber) thought about doing so then returned nothing (shenanigans since I was benchmarking a competitor?), I had run out of my weekly session limit on Pro Max 20x credits on Claude (wonder why I need a local coding agent!) and then Google rose to the challenge and wrote the benchmark for me. I didn't try writing a benchmark like this locally, I'll try that next and report back.
jrm4
Do y'all mean backend or the Ollama frontend or both? I find it trivially easy to sub in my local Ollama api thing in virtually all of the interesting frontend things. I'm quite curious about the "why not Ollama" here.
easygenes
Why is ollama so many people’s go-to? Genuinely curious, I’ve tried it but it feels overly stripped down / dumbed down vs nearly everything else I’ve used.
Lately I’ve been playing with Unsloth Studio and think that’s probably a much better “give it to a beginner” default.
polotics
Ollama got some first-mover advantage at the time when actually building and git pulling llama.cpp was a bit of a moat. The devs' docker past probably made them overestimate how much they could lay claim to mindshare. However, no one really could have known how quickly things would evolve... Now I mostly recommend LM-studio to people.
What does unsloth-studio bring on top?
diflartle
Ollama is good enough to dabble with, and getting a model is as easy as ollama pull <model name> vs figuring it out by yourself on hugging face and trying to make sense on all the goofy letters and numbers between the forty different names of models, and not needing a hugging face account to download.
So you start there and eventually you want to get off the happy path, then you need to learn more about the server and it's all so much more complicated than just using ollama. You just want to try models, not learn the intricacies of hosting LLMs.
DiabloD3
Advertising, mostly.
Ollama's org had people flood various LLM/programming related Reddits and Discords and elsewhere, claiming it was an 'easy frontend for llama.cpp', and tricked people.
Only way to win is to uninstall it and switch to llama.cpp.
wolvoleo
For me it's just the server. I use openwebui as interface. I don't want it all running on the same machine.
jrm4
Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.
And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?
linolevan
What I really don't get is why more people don't talk about LMStudio, I switched to it months ago and it seems like a straight upgrade.
robotswantdata
Why are you using Ollama? Just use llama.cpp
brew install llama.cpp
use the inbuilt CLI, Server or Chat interface. + Hook it up to any other app
Bigsy
For MLX I'd guess.
boutell
Last night I had to install the VO.20 pre-release of ollama to use this model. So I'm wondering if these instructions are accurate.
logicallee
In case someone would like to know what these are like on this hardware, I tested Gemma 4 32b (the ~20 GB model, the largest Gemma model Google published) and Gemma 4 gemma4:e4b (the ~10 GB model) on this exact setup (Mac Mini M4 with 24 GB of RAM using Ollama), I livestreamed it:
The ~10 GB model is super speedy, loading in a few seconds and giving responses almost instantly. If you just want to see its performance, it says hello around the 2 minute mark in the video (and fast!) and the ~20 GB model says hello around 5 minutes 45 seconds in the video. You can see the difference in their loading times and speed, which is a substantial difference. I also had each of them complete a difficult coding task, they both got it correct but the 20 GB model was much slower. It's a bit too slow to use on this setup day to day, plus it would take almost all the memory. The 10 GB model could fit comfortably on a Mac Mini 24 GB with plenty of RAM left for everything else, and it seems like you can use it for small-size useful coding tasks.
aetherspawn
Which harness (IDE) works with this if any? Can I use it for local coding right now?
lambda
Yes, you can use it for local coding. Most harnesses can be pointed at a local endpoint which provides an OpenAI compatible API, though I've had some trouble using recent versions of Codex with llama.cpp due to an API incompatibility (Codex uses the newer "responses" API, but in a way that llama.cpp hasn't fully supported).
I personally prefer Pi as I like the fact that it's minimalist and extensible. But some people just use Claude Code, some OpenCode, there are a ton of options out there and most of them can be used with local models.
kristopolous
It needs to support tool calling and many of the quantized ggufs don't so you have to check.
I've got a workaround for that called petsitter where it sits as a proxy between the harness and inference engine and emulates additional capabilities through clever prompt engineering and various algorithms.
They're abstractly called "tricks" and you can stack them as you please.
You can run the quantized model on ollama, put petsitter in front of it, put the agent harness in front of that and you're good to go
If you have trouble, file bugs. Please!
Thank you
milchek
I tested briefly with a MacBook Pro m4 with 36gb. Run in LM Studio with open code as the frontend and it failed over and over on tool calls. Switched back to qwen. Anyone else on similar setup have better luck?
internet101010
I failed to run in LM Studio on M5 with 32gb at even half max context. Literally locked up computer and had to reboot.
Ran gemma-4-26B-A4B-it-GGUF:Q4_K_M just fine with llama.cpp though. First time in a long time that I have been impressed by a local model. Both speed (~38t/s) and quality are very nice.
jasonjmcghee
Haven't had time to try yet, but heard from others that they needed to update both the main and runtime versions for things to work.
Aurornis
Tool calls falling is a problem with the inference engine’s implementation and/or the quant. Update and try again in a few days.
This is how all open weight model launches go.
volume_tech
[dead]
mark_l_watson
The article has a few good tips for using Ollama. Perhaps it should note that the Gemma 4 models are not really trained for strong performance with coding agents like OpenCode, Claude Code, pi, etc. The Gemma 4 models are excellent for applications requiring tool use, data extraction to JSON, etc. I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training. This makes sense, and is something that I do: use strong models to build effective applications using small efficient models.
Aurornis
> I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training.
The Gemma models were literally released yesterday. You can’t ask LLMs for advice on these topics and get accurate information.
Please don’t repeat LLM-sourced answers as canonical information
renewiltord
Oh yeah absolute genius. I asked GPT-2 about Claude Opus 4.6 and it said “this is not a recommendation. You might get some benefits from Opus… but this is not what you want”. Damn, real wisdom from the OG there. What a legend
anonyfox
M5 air here with 32gb ram and 10/10 cores. Anyone got some luck with mlx builds on oMLX so far? Not at my machine right now and would love to know if these models already work including tool calling
smith7018
I know that someone got Gemma 4 E4B working with MLX [1] but I don't know much more than that.
The latest release v0.3.2 has partial support, generation is supported but not all special tokens are handled. I've done some personal testing to add tool calling and <|channel> thinking support. https://github.com/Yukon/omlx
renewiltord
Just told Claude to sort it out and it ran it. 26 tok/s on the Mac mini I use for personal claw type program. Unusable for local agent but it’s okay.
zozbot234
Isn't 26 tok/s quite usable for a claw-like agent though? You can chat with it on a IM platform and get notified as soon as it replies, you're not dependent on real-time quick interaction.
Aurornis
If this is your first time using open weight models right after release, know that there are always bugs in the early implementations and even quantizations.
Every project races to have support on launch day so they don’t lose users, but the output you get may not be correct. There are already several problems being discovered in tokenizer implementations and quantizations may have problems too if they use imatrix.
So you’re going to see a lot of “I tried it but it sucks because it can’t even do tool calls” and other reports about how the models don’t work at all in the coming weeks from people who don’t realize they were using broken implementations.
If you want to try cutting edge open models you need to be ready to constantly update your inference engine and check your quantization for updates and re-download when it’s changed. The mad rush to support it on launch day means everything gets shipped as soon as it looks like it can produce output tokens, not when it’s tested to be correct.
colechristensen
You seem like you know what you're talking about... what inference engine should I use? (linux, 4090)
I keep having "I tried it but it sucks" issues mostly around tool calling and it's not clear if it's the model or ollama. And not one model in particular, any of them really.
kristopolous
Are you getting tool call and multimodal working? I don't see it in the quantized unsloth ggufs...
zachperkel
how many TPS does a build like this achieve on gemma 4 26b?
kanehorikawa
[dead]
neo_doom
Huge Claude user here… can someone help me set some realistic expectations if I bought a Mac mini and spun one up? I use Claude primarily for dev work and Home Lab projects. Are the open models good enough to run locally and replace the Claude workload? Or am I better off with my $20/mo Claude subscription?
NietTim
They are good for small tasks but you would not be able to use it like you use Claude and most likely be disappointed. But also, I do not know how you use claude.
There are many services online which offer hosted services for these models, my advice for anyone who is thinking about buying hardware to self host this is to try those first, that way you can get an impression of the capabilities and limitations of those models before you commit to buying hardware
alfiedotwtf
So far, I’ve found gpt-oss-20B to be pretty good agentic wise, but it’s nothing like Claude Code using its paid models.
(I haven’t tried the 120B, which I’ve read is significantly better than 20B)
hamdingers
Best way to find out is to buy $10 of OpenRouter credits and try the models for yourself.
From my experience doing this, they're nowhere close, but it's entertaining to check in once in a while.
MrScruff
I've been playing with the open models since the original llama leak. They're getting better over time, are useful for tasks of moderate complexity and it's just cool to have a binary blob of knowledge that you can run locally without an internet connection.
However you should manage your expectations. Whatever the benchmarks say, you'll quickly realise they're not at all competing with Sonnet let alone Opus. Even the largest open weights models aren't really doing that.
techpulselab
[dead]
jiusanzhou
[dead]
spencer-p
Weird that the steps are for "Gemma 4 12b", which does not exist, and then switches to 26b midway through.
There's also a step to verify that it doesn't fit on the GPU with ollama ps showing "14%/86% CPU/GPU". Doesn't this mean you'll have really bad performance?
Schiendelman
The Mac mini doesn't have different memory for the CPU and GPU, so maybe that's ignorable?
aplomb1026
[dead]
jasonriddle
Slightly off topic, but question for folks.
I'm hoping to replace coding with Claude Sonnet 4.5 with a model with an open source or open weights model. Are any of the models on Ollama.com cloud offering (https://ollama.com/search?c=cloud) or any of the models on OpenRouter.ai a close replacement? I know that no model right now matches the full performance and capabilities of Claude Sonnet 4.5, but I want to know how close I can get and with which model(s).
If there is a model you say can replace it, talk about how long you have been using it for, and using what harness (Claude code, opencode, etc), and some strengths and weakness you have noticed. I'm not interested in what benchmarks say, I want to hear about real world use from programmers using these models.
scottcha
Yes GLM5 and KimiK2.5 are pretty close replacements for sonnet.
dimgl
In short: no.
Nothing comes close, in my opinion. Sonnet and Opus are still the best models. The Codex variants of the GPT models are also great. I've tried MiniMax, GLM, Qwen and Kimi and for anything even remotely complex these models seriously struggle.
>Ollama is slower
I've benchmarked this on an actual Mac Mini M4 with 24 GB of RAM, and averaged 24.4 t/s on Ollama and 19.45 t/s on LM Studio for the same ~10 GB model (gemma4:e4b), a difference which was repeated across three runs with both models warmed up beforehand. Unless there is an error in my methodology, which is easy to repeat[1], Ollama is a full 25% faster. That's an enormous difference. Try it for yourself before making such claims.
[1] script at: https://pastebin.com/EwcRqLUm. It warms up both and keeps them in memory, so you'll want to close almost all other applications first. Install both Ollama and LM Studio, download the models, and change the path to where you installed the model. Interestingly, I had to go through three different AIs to write this script: ChatGPT (on which I'm a Pro subscriber) thought about doing so then returned nothing (shenanigans since I was benchmarking a competitor?), I had run out of my weekly session limit on Pro Max 20x credits on Claude (wonder why I need a local coding agent!), and then Google rose to the challenge and wrote the benchmark for me. I didn't try writing a benchmark like this locally; I'll try that next and report back.
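For anyone wanting to reproduce this kind of comparison without the pastebin, the core measurement can be sketched in a few lines. This is a hedged sketch, not the commenter's actual script: it assumes both servers expose OpenAI-compatible chat endpoints on their default local ports (11434 for Ollama, 1234 for LM Studio), and the model name is illustrative.

```python
import json
import time
import urllib.request

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """The throughput metric under comparison."""
    return completion_tokens / elapsed_s

def bench(base_url: str, model: str, prompt: str) -> float:
    """Time one non-streaming completion against an OpenAI-compatible
    server and return generation throughput in tokens/s."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.monotonic() - start
    return tokens_per_second(data["usage"]["completion_tokens"], elapsed)

def main() -> None:
    # Assumed default local ports: Ollama on 11434, LM Studio on 1234.
    # Warm each model up and close other apps before trusting the numbers.
    for name, url in [("ollama", "http://localhost:11434"),
                      ("lmstudio", "http://localhost:1234")]:
        rates = [bench(url, "gemma4:e4b", "Say hello.") for _ in range(3)]
        print(name, round(sum(rates) / len(rates), 2), "t/s")
```

Call `main()` with both servers running; averaging over three warmed-up runs mirrors the methodology described above.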
Do y'all mean backend or the Ollama frontend or both? I find it trivially easy to sub in my local Ollama api thing in virtually all of the interesting frontend things. I'm quite curious about the "why not Ollama" here.
Why is ollama so many people’s go-to? Genuinely curious, I’ve tried it but it feels overly stripped down / dumbed down vs nearly everything else I’ve used.
Lately I’ve been playing with Unsloth Studio and think that’s probably a much better “give it to a beginner” default.
Ollama got some first-mover advantage at the time when actually building and git pulling llama.cpp was a bit of a moat. The devs' docker past probably made them overestimate how much they could lay claim to mindshare. However, no one really could have known how quickly things would evolve... Now I mostly recommend LM-studio to people.
What does unsloth-studio bring on top?
Ollama is good enough to dabble with, and getting a model is as easy as `ollama pull <model name>`, versus figuring it out by yourself on Hugging Face and trying to make sense of all the goofy letters and numbers across the forty different names of a model, and not needing a Hugging Face account to download.
So you start there, and eventually you want to get off the happy path; then you need to learn more about the server, and it's all so much more complicated than just using Ollama. You just want to try models, not learn the intricacies of hosting LLMs.
Advertising, mostly.
Ollama's org had people flood various LLM/programming related Reddits and Discords and elsewhere, claiming it was an 'easy frontend for llama.cpp', and tricked people.
Only way to win is to uninstall it and switch to llama.cpp.
For me it's just the server. I use openwebui as interface. I don't want it all running on the same machine.
Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.
And as someone running a 16 GB card, I'm especially curious as to whether I'm missing out on better performance?
What I really don't get is why more people don't talk about LM Studio; I switched to it months ago and it seems like a straight upgrade.
Why are you using Ollama? Just use llama.cpp
`brew install llama.cpp`
Use the built-in CLI, server, or chat interface, and hook it up to any other app.
For MLX I'd guess.
Last night I had to install the v0.20 pre-release of Ollama to use this model. So I'm wondering if these instructions are accurate.
In case someone would like to know what these are like on this hardware, I tested Gemma 4 32b (the ~20 GB model, the largest Gemma model Google published) and Gemma 4 E4B (`gemma4:e4b`, the ~10 GB model) on this exact setup (Mac Mini M4 with 24 GB of RAM using Ollama), and I livestreamed it:
https://www.youtube.com/live/G5OVcKO70ns
The ~10 GB model is super speedy, loading in a few seconds and giving responses almost instantly. If you just want to see its performance, it says hello around the 2 minute mark in the video (and fast!), and the ~20 GB model says hello around 5 minutes 45 seconds in. You can see the substantial difference in their loading times and speed. I also had each of them complete a difficult coding task; both got it correct, but the 20 GB model was much slower. It's a bit too slow to use on this setup day to day, plus it would take almost all the memory. The 10 GB model fits comfortably on a Mac Mini with 24 GB, with plenty of RAM left for everything else, and it seems like you can use it for useful small coding tasks.
Which harness (IDE) works with this if any? Can I use it for local coding right now?
Yes, you can use it for local coding. Most harnesses can be pointed at a local endpoint that provides an OpenAI-compatible API, though I've had some trouble using recent versions of Codex with llama.cpp due to an API incompatibility (Codex uses the newer "responses" API, but in a way that llama.cpp hasn't fully supported).
I personally prefer Pi as I like the fact that it's minimalist and extensible. But some people just use Claude Code, some OpenCode, there are a ton of options out there and most of them can be used with local models.
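As a sketch of what "pointing a harness at a local endpoint" boils down to, assuming Ollama's default port and an illustrative model name (any server exposing the OpenAI-compatible chat API works the same way):

```python
import json
import urllib.request

OLLAMA_V1 = "http://localhost:11434/v1"  # assumed default Ollama port

def first_message(response: dict) -> str:
    """Pull the assistant text out of a chat-completions response."""
    return response["choices"][0]["message"]["content"]

def chat(prompt: str, model: str = "gemma4:e4b",
         base_url: str = OLLAMA_V1) -> str:
    """One round-trip against a local OpenAI-compatible endpoint; a
    harness does essentially this when you override its base URL."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return first_message(json.load(resp))
```

Most harnesses only need the base URL (here `http://localhost:11434/v1`) and a dummy API key to switch from a hosted model to a local one.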
It needs to support tool calling, and many of the quantized GGUFs don't, so you have to check.
I've got a workaround for that called Petsitter: a proxy that sits between the harness and the inference engine and emulates additional capabilities through clever prompt engineering and various algorithms.
They're abstractly called "tricks" and you can stack them as you please.
https://github.com/day50-dev/Petsitter
You can run the quantized model on Ollama, put Petsitter in front of it, put the agent harness in front of that, and you're good to go.
If you have trouble, file bugs. Please!
Thank you
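Petsitter's actual internals aren't described above, but the general trick — emulating tool calls in plain text for models or quants without native tool-call support — can be sketched hypothetically like this (the prompt format and function names are invented for illustration):

```python
import json
import re

def inject_tools(system_prompt: str, tools: list) -> str:
    """Describe the tools in plain text and ask the model to reply with
    a JSON object when it wants to call one -- the kind of prompt-level
    emulation a proxy can do for quants lacking tool-call tokens."""
    specs = "\n".join(f"- {t['name']}: {t['description']}" for t in tools)
    return (f"{system_prompt}\n\nYou can use these tools:\n{specs}\n"
            'To call one, reply ONLY with: {"tool": "<name>", "args": {...}}')

def extract_tool_call(reply: str):
    """Parse the model's plain-text reply back into a structured call,
    or return None if it answered normally."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and "tool" in obj:
        return {"name": obj["tool"], "arguments": obj.get("args", {})}
    return None
```

A proxy would apply `inject_tools` on the way in and `extract_tool_call` on the way out, translating between the harness's native tool-call format and plain text.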
I tested briefly with a MacBook Pro M4 with 36 GB. Ran it in LM Studio with OpenCode as the frontend and it failed over and over on tool calls. Switched back to Qwen. Anyone else on a similar setup have better luck?
I failed to run it in LM Studio on an M5 with 32 GB at even half the max context. It literally locked up the computer and I had to reboot.
Ran gemma-4-26B-A4B-it-GGUF:Q4_K_M just fine with llama.cpp though. First time in a long time that I have been impressed by a local model. Both speed (~38t/s) and quality are very nice.
Haven't had time to try yet, but heard from others that they needed to update both the main and runtime versions for things to work.
Tool calls failing is a problem with the inference engine's implementation and/or the quant. Update and try again in a few days.
This is how all open weight model launches go.
[dead]
The article has a few good tips for using Ollama. Perhaps it should note that the Gemma 4 models are not really trained for strong performance with coding agents like OpenCode, Claude Code, pi, etc. The Gemma 4 models are excellent for applications requiring tool use, data extraction to JSON, etc. I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training. This makes sense, and is something that I do: use strong models to build effective applications using small efficient models.
> I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training.
The Gemma models were literally released yesterday. You can’t ask LLMs for advice on these topics and get accurate information.
Please don’t repeat LLM-sourced answers as canonical information
Oh yeah absolute genius. I asked GPT-2 about Claude Opus 4.6 and it said “this is not a recommendation. You might get some benefits from Opus… but this is not what you want”. Damn, real wisdom from the OG there. What a legend
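The data-extraction-to-JSON use case mentioned a few comments up is where small local models tend to shine. A hedged sketch against Ollama's chat API, which accepts a `"format": "json"` option to constrain output (the model name and extracted fields here are illustrative):

```python
import json
import urllib.request

def parse_fields(reply: str) -> dict:
    """Keep only the expected keys so a chatty model can't sneak in extras."""
    obj = json.loads(reply)
    return {k: obj.get(k) for k in ("vendor", "total", "currency")}

def extract_invoice(text: str,
                    base_url: str = "http://localhost:11434") -> dict:
    """Ask a small local model to pull structured fields out of free text."""
    body = json.dumps({
        "model": "gemma4:e4b",   # illustrative; any instruct model works
        "format": "json",        # Ollama constrains the reply to valid JSON
        "stream": False,
        "messages": [{
            "role": "user",
            "content": "Extract vendor, total and currency as JSON from: " + text,
        }],
    }).encode()
    req = urllib.request.Request(f"{base_url}/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]["content"]
    return parse_fields(reply)
```

The whitelist in `parse_fields` is the important part: it keeps downstream code stable even when the model adds unexpected keys.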
M5 Air here with 32 GB RAM and 10/10 cores. Anyone had any luck with MLX builds on oMLX so far? Not at my machine right now and would love to know if these models already work, including tool calling.
I know that someone got Gemma 4 E4B working with MLX [1] but I don't know much more than that.
1: https://github.com/bolyki01/localllm-gemma4-mlx
The latest release v0.3.2 has partial support, generation is supported but not all special tokens are handled. I've done some personal testing to add tool calling and <|channel> thinking support. https://github.com/Yukon/omlx
Just told Claude to sort it out and it ran it. 26 tok/s on the Mac mini I use for a personal claw-type program. Unusable for a local agent, but it's okay.
Isn't 26 tok/s quite usable for a claw-like agent though? You can chat with it on an IM platform and get notified as soon as it replies; you're not dependent on quick real-time interaction.
If this is your first time using open weight models right after release, know that there are always bugs in the early implementations and even quantizations.
Every project races to have support on launch day so they don’t lose users, but the output you get may not be correct. There are already several problems being discovered in tokenizer implementations and quantizations may have problems too if they use imatrix.
So you’re going to see a lot of “I tried it but it sucks because it can’t even do tool calls” and other reports about how the models don’t work at all in the coming weeks from people who don’t realize they were using broken implementations.
If you want to try cutting edge open models you need to be ready to constantly update your inference engine and check your quantization for updates and re-download when it’s changed. The mad rush to support it on launch day means everything gets shipped as soon as it looks like it can produce output tokens, not when it’s tested to be correct.
You seem like you know what you're talking about... what inference engine should I use? (linux, 4090)
I keep having "I tried it but it sucks" issues mostly around tool calling and it's not clear if it's the model or ollama. And not one model in particular, any of them really.
Are you getting tool call and multimodal working? I don't see it in the quantized unsloth ggufs...
how many TPS does a build like this achieve on gemma 4 26b?
[dead]
Huge Claude user here… can someone help me set some realistic expectations if I bought a Mac mini and spun one up? I use Claude primarily for dev work and Home Lab projects. Are the open models good enough to run locally and replace the Claude workload? Or am I better off with my $20/mo Claude subscription?
They are good for small tasks but you would not be able to use it like you use Claude and most likely be disappointed. But also, I do not know how you use claude.
There are many services online which offer hosted versions of these models. My advice for anyone thinking about buying hardware to self-host is to try those first; that way you can get an impression of the capabilities and limitations of those models before you commit to buying hardware.
So far, I’ve found gpt-oss-20B to be pretty good agentic wise, but it’s nothing like Claude Code using its paid models.
(I haven’t tried the 120B, which I’ve read is significantly better than 20B)
Best way to find out is to buy $10 of OpenRouter credits and try the models for yourself.
From my experience doing this, they're nowhere close, but it's entertaining to check in once in a while.
I've been playing with the open models since the original llama leak. They're getting better over time, are useful for tasks of moderate complexity and it's just cool to have a binary blob of knowledge that you can run locally without an internet connection.