Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?
Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g tok/s)
I have! I care about data privacy and LLMs being free. I'm using the Pi coding harness but containerized and sandboxed, to make sure it's running completely offline. On my Mac Studio with 128GB RAM (or MacBook with 36GB RAM) I'm using Qwen3.6 35b, with only 3b active parameters so that it runs really fast. I've done a complete redesign for my website's homepage and blog with Django + Wagtail. The latter is interesting, because Wagtail is a bit less well-known, so the agent, without giving it internet access, doesn't always know how to develop for Wagtail. I've used Qwen3.5 122b for when things get more complex. At 10b active parameters, it's significantly slower though.
I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.
It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).
Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)
This is very similar to my setup. Pi in a container (I do let it have network access, just no access to creds or anything, only the one directory that I'm working on at the time and my ~/.pi directory), talking to llama.cpp in another container. I'm on a Strix Halo 128 GiB unified memory laptop.
I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.
And I'm still a AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.
But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.
For other chat tasks and translation, I'll frequently use Gemma 4 31B.
For audio, I'll use Gemma 4 12B.
I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.
Hopefully this isn't off-topic, but your setup sounds just like mine, Strix Halo and (I'm assuming) llama.cpp on ROCm, and I'm finding that the Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?
I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it's not a huge difference, but I've been mostly saying on Vulkan.
The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.
But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinkging, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.
Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, becasue you're not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time.
In my models.ini, I have this for the Qwen3.6 models:
> Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?
Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?
So, one of the ways that this problem manifests is that most local models aren't trained on preserving the full reasoning between turns. Every turn, they skip passing the reasoning trace from previous turns to the the LLM. So if on one turn you have a long interleaved chain of reasoning and tool calls, then it responds to you, and then you give a new prompt to fix something, it has to re-process all of those tools calls now with the reasoning stripped out.
Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.
Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (0(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.
But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.
So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot. However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.
There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.
Some of these issue have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.
What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.
I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)
For the edit tool, you should consider implementing a hash-based approach where each line of code is hashed and referenced by it when doing replacements. You can read up on the approach here: https://blog.can.ac/2026/02/12/the-harness-problem/
I didn't do much benchmarking, but anecdotally, I found it to be making less edit errors. YMMV
What kind of coding do you do?
Do you keep track of frontier models to vibe check the differences and re-evaluate constantly or are you ok with having a nerfed model forever?
(not being judmental, just really wanto to know your framework here)
Some of the work I do, I do for an (EU) organisation that doesn't have clear rules or guidelines on the use of AI yet. Though I have seen colleague-developers blatantly putting source code into external Claude-like models, I stay true to my principles and don't. I know for certain that everything that I run through my local, offline Pi Container Sandbox cannot leave the machine, and thus can't result in a data breach. I do this for the peace of mind.
I do (unscientifically) experiment whenever a new capable local LLM (<=130b) releases with a license that permits commercial use. As for knowing my models require more work than Opus, I don't mind still having to puzzle on getting the architecture right. In any case, it forces me to stay in the loop of what's being built, which is a good thing.
The harness and the LLM parameters are pretty essential to getting better results and reducing loops. Tweak the parameters and you can mostly eliminate loops without negatively affecting performance (it's a bit complex but ask a SOTA AI to guide you and it's not hard). The harness should also react more intelligently to failures; it can do things like return additional context or hints as it tracks error rates and avg duration of calls. Pi can be easily extended, and it's suggested by the author you modify it to perform better for your use case.
> you really need to know what you're asking, and be precise
Any chance that you could share some recent prompts to give other HNers a head start on his to approach Qwen? If you are uncomfortable posting them here, my Gmail username is the same as my HN username.
I'm glad you're asking. I already started writing a blog post on how to best make use of local models. I'll share it as soon as I have a complete enough list. If anyone else reading this would like to chime in with their tips & tricks, let us know!
For the time being, off the top of my head, I'd say:
- Prompt Engineering tips & tricks apply here (like being complete in the relevant context you provide in your question, and the specific task(s) the agent should do like reasoning, modifying one file, or trying to fix a complex task all at once (not recommended)).
- If you already know which files the agent should look into, mention them to save time and potentially context.
- In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.
- It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.
Given your knowledge on this - do you think we'll see an open source model with Opus levels of capability? IMO if/when this happens - I would 100% stop using Anthropic.
Let me put it like this. I started with local LLMs when ChatGPT still used GPT-3.5. I was amazed how my MacBook with 8GB RAM could run openhermes2.5-mistral: a 7b parameter model that could generate short stories that sort of made sense. Incredible!
Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.
I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)
Haha well I ask because I don't really want/need anything beyond Opus most of the time. And I'm paranoid that Anthropic is going to be forced to charge the true cost of all this before too long.
The other upside of running local LLMs is that there's no cloud provider to suddenly charge more for the same, or even less, model use.
It's personal, but I prefer CapEx over OpEx for this. If you can purchase a device upfront that runs a decent local LLM, you get the peace of mind that your setup won't suddenly change over time and can only get better.
If you believe the benchmarks, Qwen 3.6 35B-A3B already outperforms Claude 4 Opus.
Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.
It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.
But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.
You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus.
Which Opus? They certainly outperform Claude 3 Opus.
Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.
There's a guy on Youtube named Bijan Bowen who tests all the models (open and frontier) on a series of one/few shot programming exercises and has been for a long while now. You can pretty much watch him compare the results for any two models you're likely to be interested in.
I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.
Qwen 3.6 produced far more working functionality than Claude 4 Opus did.
Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.
I’m normally comparing frontier open/cheap models against frontier closed source. I use deepseek/glm regularly, they’re fine and you can get real work done with them but it’s super obvious when you switch back to opus or even sonnet. A 3B active param MoE model is not comparable.
Yeah. I was pointing out that local 3b active models outperform frontier models from a year ago.
Will this trend continue? Who knows. Both the frontier and local model will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.
Hoever, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less that $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.
It's likely that there will always be a gap between frontier and local if you're comparing models at the same time, you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.
People can't seem to agree on what "Opus class" even means (the latest Opus is apparently pretty weak) but DeepSeek Pro, Kimi and GLM all are quite capable.
Nothing compares to Opus when it comes to "taste" in web design in my experience. Nothing compares to opus in very difficult HPC/model inference development. I worked on this with opus: https://github.com/computerex/dlgo
OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.
To me totally yes, even further, if they keep their existing route, over time people will stop using Anthropic.
More and more specialized and ultra-performant chips are going to flood the consumer market. Especially once new hardware foundries will start producing (well if we don't die from WW3 in the interval).
In 10 years from now, when even basic computers will have 128 GB of memory, and phones will have super optimized tuned models, then what will be the point of Anthropic ?
Just use Gemma/Gemini/Siri or whatever.
Pornography and uncensored models is also pushing toward local models.
It's not like needs of people grows exponentially, the needs follow an asymptote instead (they are capped).
The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.
For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.
It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).
I am right there with you. Mind-boggling. It's a indistinguishable from magic technology!! I tried running some basic tasks through Qwen with Opencode on a 10 year old dual Xeon server for shits and giggles. I gave it a simple task like "use ffprobe first but convert this webm to mp4" and it was able to complete the task with zero network calls outside my network. On 10 year old hardware. It took about 3 minutes to complete the task. Now you may be saying 3 minutes? pfft. But I dare you to do it yourself. You're gonna be googling the CLI switches for at least 10 minutes and setting up your command. I had it actually optimize all the switches on the fly for me based on an initial ffprobe to see what is optimal.
> 10 year old dual Xeon server...On 10 year old hardware.
Hold on, what are the specs of your rig? How much RAM?
I'm been considering getting an old refurbished 2018 Mac Mini with 64Gb of DDR4 RAM but everything I've read suggests this will be way slower than my 16Gb M1 Pro Macbook.
You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say a dual 7900XT would get way better performance.
The thing is, to do a proper fix it would really need all of the context (maybe the tool call that failed was for an edit to a file that was last touched way at the beginning of the context), so you'd need to either keep that smaller model running doing prompt processing all the time, or have a very long wait while it does prompt processing on your whole session.
And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.
Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context
An illustrative example I've seen a lot is creating Jira tickets in projects with custom fields marked as mandatory. It tries to create the ticket without the field and the tool call fails. The LLM needs access to the full context so that it can generate text to put in the "Why couldn't this meeting be an email?" field.
I'm actually quite sure that directly retrying the tool call would often fix the edit-call already. But these models have been trained to "think" for a while for any problem solving, so they'll presume the problem of the edit is more fundamental and spend unnecessary tokens filling up the context.
I'll experiment more with the effectiveness of AGENTS.md rules for local Pi agents. I feel like smaller (local) LLMs just lack in attentiveness to elements in the context window, like precise instructions, compared to e.g. Claude models.
also the context window sizes are too low. I can't operate in 65,000 windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.
200k context windows and above for me now
I saw a paper last night that should help this a lot though
I get that it's a deal breaker to some; it definitely requires patience.
In Pi, /new is my best friend and most-used command for sure. For simple tasks (I decompose complex ones anyway since I don't trust small local LLMs to do this for me), the model doesn't need much context, given that I'm proficient in my codebase myself: "I'd like Feature X. Look into files 1, 2 and 3 to make your edits."
I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.
I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.
To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.
For my personal needs, free beats $100/m.
I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).
Some example projects
- Replacement launcher for android tvs (with usage monitoring and tracking for kids)
- Custom admin portals for my k8s cluster services
- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)
- Grocery list management and meal planning (mostly via openclaw)
- some custom workflows for 3d asset generation in comfyui.
---
Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.
1x RTX3090 is absolutely not overkill for gaming however. Nowadays it's barely enough to get 60FPS in 4K in some recently released games. But the shocking part is that my 3090 is still probably worth as much as when I bought it about 4 years ago.
AFAIK nvidia cards dont work in tandem (aka sli in the past) very well these days. So that aint true.
Also, 2 gens old means bad performance at ray tracing, abysmal path tracing if at all. Pretty sure it can't run smoothly CP2077 in native 4k without dlss upscalers with all on ultra.
Yes, today is not a great time to purchase hardware.
When I bought, I paid $850 a piece. And I needed one anyways for the gaming I was going to do.
My guess is the next good time to buy is going to be 24-36 months from now, depending on how the AI bubble goes.
---
I'll add to this, I personally don't like Apple hardware (not so much related to the hardware as their company philosophy) but their machines with unified memory (or AMDs latest unified memory offerings) get pretty equivalent speeds to my 3090s, and are probably a much better modern entrypoint to local llms.
There's a reason the joke is that Silicon Valley software devs bought up all the Mac minis for OpenClaw.
You can get a 48gb unified RAM M4 pro mac mini for ~2k. If you're not going to do much else with the machine, it's what I'd pick as my budget inference device right now. Spend a year of claude now, get ~150tok/s for the next decade (plus) for ~free.
If you want more capable and are willing to spend a little more, go with the newer Ryzen AI Max+ 395 machines.
You'll spend less on power too.
My last suggestion would be to go buy an RTX3090 at this point. You can do a lot better for a lot cheaper.
That should do pretty well. Memory bandwidth is the biggest bottleneck for token generation, at 644 GB/s you should be able to do pretty well on a 9070, while prompt proessing is more compute bound and Nvidia tends to have the edge there.
16 GiB won't fit you much, so you'd probably want at least 2x, and preferably 3x of those, and then you need a motherboard, power, etc. that can handle that.
You can get 60tps with three 1080tis and the sparse model, and I bet two 16gb 5060tis would do the same for ~1200. One 3090 is enough for a useful system, even on an old am4 host.
In 3.6 years, chances are they are still worth $3k. Unless some new chip fab pops up that can spam the chip market. Even if the AI bubble bursts, I doubt we'll see high-RAM GPUs sell off.
> Quantization-Aware Training (QAT) [...] allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model
I've actually tried this exact same model locally as well.. albeit on just a single 3090 at 128k context and I got around 40-60tok/s with Q4_K quantization.
The thing that bugged me the most was really the quality of the output on moderately complex real-world coding tasks. Having to switch between "prompt/vibe" and "manually implement" is such a big context switch burden, because you really have to ask yourself every few minutes if you're "holding it wrong" or the model is just too stupid.
It also doesn't really seem to handle transitions from "low-level implementation detail" to "high-level design" well, e.g., it wouldn't easily render tables and such. With Claude I don't have this issue.. so I think for now my verdict would be that it's not really a viable replacement. I really hope it will be in a few months time.
Oh and I used "aider" to replace claude CLI, which maybe that's also sub-optimal.. I'm not sure. The MCP marketplaces are useful of course, though arguably you could just manually replace them over time.
I don't generally switch to implementing myself on the model, although there are definitely times where I stop it and correct it mid-task.
It's prone to thinking longer and more repetitively, again - it's definitely not opus 4.7/4.8.
I've been using pi.dev as my harness for it, and been pleasantly surprised by how nice it feels (I have used aider, but only very briefly and quite a while back - so I can't realistically compare).
I would say it's roughly where I felt claude was a year back - Most of the sessions need to be more "pair programming" and less "I let it run for hours".
I'm a big fan of frequent "human in the loop" style workflows even when I'm on something like opus at work, though. I have opinions about lots of things, and re-inforcing that the model should stop and ask frequently seems to get me considerably better output, without having to "re-roll" if you will.
I've done a good bit of management, and I think it's roughly producing what a junior dev might produce in a day every 5 minutes. And just like a junior dev, you need to be steering it back on track fairly often.
Opus feels more like a mid-level at this point. I can hand it a chunk of work and "leave" but I still get better output if I'm checked-in and watching/steering.
I'm so out of the loop on this stuff, it's the first time in my IT career I feel really behind on things.
I've used Claude Opus to quickly and effectively pound out some 100-200 line scripts that integrate with a vendor's API, and it one-shotted them both almost perfectly.
I wonder if for a lot of these local models, the scope of the AI assistance should simply be smaller: You architect the tools and the function definitions, and then tell AI to implement one at a time? Does anyone do that rigorously?
No real change in inference speed. It basically just allows me to slot in more context or a bigger model.
A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.
Sometimes that matters, a lot of times it doesn't.
On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.
I get very similar quality from the both the Gemma-4 31B dense model, and the Gemma-4 26B MOE model (both at Q4 quant) but the MOE version runs at ~3 times the speed (150tok/s vs 46tok/s).
Running AMD Lemonade as the daily rig, Started with Ollama then over to LMStudio and now standardized on AMD Lemonade which has been helpful to monitor cRAM, CPU, GPU and gRam. The multi-models on Lemonade make it straight forward to run a stack for LLM, Voice to Text, NPU, and Image Generation. Platform also works with Nvidia, Apple, Intel and AMD chip sets.
I’ve tried in a 36GB MacBook Pro and haven’t had much success beyond very basic work. Issue for me was the context runs out quick even with smaller models and it’s slower. To get some half decent performance I’d imagine you want 128gb memory and are spending a lot more on hardware. At that point it becomes a question on whether you’d rather have frontier models at a subscription or sink that money into your own equipment. Of course, for those with privacy in mind your only option is forking out the cash for the higher end machines.
About 90% of my coding is on Qwen 3.6 27b and Open Code with some custom skills and Semble. It is NOT as smart as CC or Codex but its enough to get most of my work done. I didn't set out to replace CC and Codex (I have an RTX 6000 so the TPS is faster than I care about, but the RTX 6000 was originally for other work). I only tried this just to see how close you could get to a frontier model for coding as an experiment, but it was good enough that I stuck with it. I still fall back to Codex for really complicated stuff and to polish UI's as that seems to be the weakest element to working in Qwen.This isn't a recommendation because I don't think most people have an RTX 6000 laying around and the cost would be many years of MAX CC or Codex subscriptions, but at least this seems possible. Maybe in a few more years it will even be practical.
Other Notes: I have had to set the compact target to 75% on a 256k context window as once the conversation length goes about 100k I start seeing a drop in the quality and speed. This becomes very problematic after about 150k. I tried Qwen 3.5 122b too but it actually seems much worse at coding than 3.6 27b even though its much larger. Maybe because I am using a 4bit quant or maybe I just don't have it configured correctly? I know 3.6 is newer but I didn't expect it to out perform a model that is much larger from the prior generation. Gemma 4 31b is a good model for other tasks but at least my personal experience is that Qwen outperforms in coding. Nemotron Super 120b is great at a lot of stuff but it also seems to be not as good at coding as Qwen. This was very surprising to me.
Same here, I use Qwen 3.6 27b (Q6 quant) with llama.cpp on an RTX 5090 using the pi agent exclusively now. The fact that it's local means that I never have to think about token pricing, quotas, time of day, or data sensitivity. I have limited the GPU from 600W to 450W which means the system stays whisper quiet during inference.
I have become so "lazy" (in a good way), so far that I've started using the model for lots of daily mundane things on top of just coding:
* "commit this on a branch, push, create a PR and assign $nickname for review"
* "Use the Stripe CLI to download all open and overdue invoices and reconcile them with this CSV export from our bank account."
* "Use these Elasticsearch credentials to summarise what kind of operations are causing load at the moment."
* "Tell me if our codebase already supports X and where it's implemented."
Qwen3.5-122B is actually Qwen3.5-122B-A10B. The A10B means that this is a "mixture of experts" model where only 10B parameters are activated at a given time. Whereas Qwen3.6-27B is a "dense" model where all 27B parameters are activated all the time. So for many tasks, you'd expect the 27B dense model to be better than the 122B-A10B model.
I am forced to use Qwen 3.6 27b at work and found it next to useless.
I might as well do all the work manually rather than having it implement another mess or get the debugging entirely wrong.
It feels like anything less than Sonnet is just a waste of time, apart from use as a smarter search function.
It also strikes me as strange that you would mention Codex for UI polish, as it's notoriously bad at UI, and far behind Claude Opus. Altman specifically posted that they are working to improve this for the next model release.
Bad AI written documentation and commits are not great, particularly when you work in a team.
I almost find it offensive when colleagues open a MR with an obvious slop description that's frequently inaccurate.
That said, I find AI useful for a lot of drudgery like resolving merge conflicts or splitting changes out into separate MRs.
Particularly with the latter I had issues with small models, they butchered the changes I wanted moved. Not even on the second attempt did GPT 5.4 mini manage to move 10-20 lines to another file without modifying them in the process.
Yeah MoE is a little worse for the same size, but you can often run bigger MoEs at respectable speeds even on cpu ram offload. The dense models really need to be 100% vram
Yes. Llama.cpp + Qwen3.6-35b (MTP) + OpenCode is quite capable and runs on a single RTX 3090 and is faster than most cloud models. Quality is like running edge models from 8-12 months ago. Setup details at https://github.com/pierotofy/LocalCodingLLM/
"Quality is like running edge models from 8-12 months ago."
That sounds great for hobbyists but IMHO it wasn't until Opus 4.6 was released six months go (Dec 25, 2025) that we had a model good enough for professionals to use as a primary driver of their coding agents. That seems to be the threshold worth aiming for.
I strongly agree on that being the release where these tools got good enough to substantially speed up my professional work. I have to admit I was super skeptical of AI coding until then.
So thalen it might be 6-8 months to get to useable on a local open model? Of course state of the art will be a year ahead, a generation at the current pace.
That's cool if you prefer it, but it is hard to imagine it being a strictly rational choice when much better quality is available at a price that is small relative to the cost of an employee. Or is there something specific about your use-case?
Not all work requires every facet to be so sharply optimized, and there may be other constraints that are completely invisible to you. Some that were easy for me to imagine: the parent works in a heavily regulated industry, their IT team is slow-moving and paranoid and this is a safe, under-the-radar workaround, the output is "good enough" for their purposes and they find tinkering with it to be fun.
Regardless I don't think it's fruitful to be so condescending with such little insight into this person's situation. Even if you had total insight -- let people be and withhold your judgement, or at least keep it to yourself. Making people feel stupid is a great way to turn people off to pretty much anything else you have to say
Won’t it depend on what you use it for? A less capable system might be fine for boilerplate, moderate re-factoring, etc. Not everyone is building whole features in one go.
To me, what's not rational is believing you must rent the tools of your trade while exposing all of your employer's intellectual property to a third party. Difference of opinion.
It's not my opinion that you "must" rent tools but it certainly is the pragmatic choice in 2026. I would be as happy as anyone for this situation to change and I expect it to at some point.
i have a 128gb m4 max macbook pro i've been wanting to tinker with this stuff but genuinely never find the time. any mac users in here running similar to the above that can share their experience?
i always see great debates with local stuff but the space is constantly moving goalposts and all the vernacular is pretty unfamiliar to me. i'd love to understand what people with objective experience feel they've traded away (or gained) when going local so i can determine for myself if these things are a good fit.
If you have a 128GB Mac you really ought to try out: https://github.com/antirez/ds4 by the creator of redis. This is probably as close to it gets to state-of-the-art local LLM + agentic coding.
Using this just this morning on my DGX Spark. A little slower than frontier models but my $200/mo weekly usage exhausted with 3 days left on the week...
(Shouldn't have done that refactoring job in high mode)
I have the same machine. You might look into https://omlx.ai/ a „macOS-native MLX server“. pi.dev for the agent with MCP, web-search and sub-agents extension.
download LM Studio to play with, and it will let you search for models... try Qwen3.6-35B-A3B at 4,5 or 6 bits (6 bit XL is near perfect) and use pi coder or another harness to access it... you can also try Unsloth studio and try same model to start. LM Studio slighter easier to use, Unsloth probably better quality. Neither one is super great quality by the way (meaning: they crash or act weirdly too often to be full production solutions, but can work for local coding). ONCE YOU DOWNLOAD EITHER APP... it will let you search huggingface for the models. Just type qwen to start looking and ... start messing around. And you connect the pi coder harness using the http interface that LM Studio and Unsloth offer to the engine API, so make sure you figure out that url and turn it on... something like 127.0.0.1:1234/api would be a typical IP (localhost) and port (1234 is used by LM Studio)
Do you do your dev work on the windows machine (referenced in the docs), or do you remotely access it from a separate machine? I ask because I have a RTX 3090 kicking around in a gaming desktop, but I don't use it for any dev work (I use a Macbook Pro).
I have a similar set up and have been using it to learn and tinker with open models. I run Ollama on the gaming desktop and point OpenCode to it from my MacBook. Works nicely for me so far.
I don't think you're going to get many "true" answers to this. The opportunity cost of not using the latest and best models is just too much right now.
Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus just not worth it right now. If it was, it would be distributive enough to be in the news.
Not that I'm discounting someone hasn't already solved this, just trying to Occam razor my way out of diving too deep down rabbit holes.
At some point, there will come a saturation point for that "Opportunity cost FOMO train ride", and I think we are already past that point. Mythos class models are a whole different beasts and cutting edge on reasoning but not much use for the problem domains most developers are trying to solve.
The present Sonnet/Opus versions (~4.8) will likely be what everyone in the enterprise might end up using eventually. And even though local models aren't there yet, there are budget alternatives from the families of DeepSeek, Kimi, GPT, MiniMax, etc. available through APIs of NVidida, OpenRouter, Groq, etc. which are very much Sonnet grade.
Personally, I don't think we're at that point yet. While I do think model improvement is starting to plateau (reaching a local ceiling), I'm not convinced local models are as good as sonnet/opus yet. The gap is still too much. But I'm excited for those models to reach those levels.
If you truly believe that it WILL get there within the next couple of years, then you might as well start playing with it now (and, yes, you will be very surprised, especially for shorter/smaller projects or nicely modularized larger projects)
Sounds like a correct conclusion to me also. I am trying to transition to a layered system: local, then OpenCode with commercial vendor APIs for models like DeepSeek v4 flash, then DeepSeek v4 Pro.
With a layered approach we can slowly shift to running more locally and still get required work done. Really, my local setup is so much better than it was 2 months ago, and extremely better than 6 months ago - on the same hardware.
This seems to be the answer. Building a rig with a decent graphics card will cost $2k+ and will produce sub-par results. Might as well milk the $100/m Claude sub until open-source alternatives reach parity with today's frontier models.
But you're pretty much measuring opportunity cost in tokens per second, no?
I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by percieved quality of private model) actually means "better or more useful output."
I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)
If you’re arguing that model metrics don’t necessarily translate into useful output, I agree. That’s not how I measure the success of a mode and not really the point I'm trying to make. I try to set things up and test it on my actual projects.
What I’m saying is that if local models were actually comparable to Claude Code in practice, we wouldn’t be having threads like this. It would be obvious to the people using them, and it would be massively disruptive. Why would individuals and companies pay hundreds or thousands for Claude Code if they could run something locally and consistently get similar results?
Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.
I think they are referring to the opportunity cost of time saved on doing things a local model cannot do or fixing it's mistakes against the cost of a subscription
The problem with this question is that it encompasses a huge spectrum of capabilities and expectations. If you can only run an 8B model and expect it to be good at vibe coding / one shotting things you're going to have a bad time.
If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.
If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.
The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.
I use Gemma 4 26B A4B on my Macbook (M4 Pro, 48 GB RAM) to study Rust (and ask other myriad questions). I don't trust it to do a good job in an IDE/harness to one-shot anything but the most trivial of changes. Still, it's fast and good enough that it could handle being a "co-pilot" on small to medium context tasks where you've got your hands on the wheel and your eyes on the road — and are driving under the speed limit. That's remarkable given where we were a couple of years ago.
I don't think I'd be using AI to code at all if this weren't the case. (I don't want to feel stunted or stuck just from losing my internet connection.)
My experience with smaller models, in this case specifically GPT 5.4 Mini, is that they cannot two-shot moving a 10-20 line code change to another file without modifying it and introducing bugs.
I did not expect perfect reliability, but I thought they could at least get it right on the second attempt once you point out the difference. No such luck, it confidently tells you that now the code is the same, with yet another subtle bug added in the difference.
I don't know what work one would need to do where these garbage-class models would be adequate. Maybe they can masquerade as competent for a few minutes, but in the end the results simply are not right. At best they are suitable for a smarter search or autocomplete, in my opinion.
Yup, although technically not replaced because I never used either of those products because I don't like sending my code to their black box. I have 2x24GB AMD gpu's, gotten from gamers on my local marketplace, one is connected with a 40cm riser cable. Running Qwen 27B and am very happy with its performance. Q8 with 135k context (arbitrary number, I could push it to 256). I like to use qwen 35B3A for mapping out entire code paths through our relatively complicated codebase/infra at work.
I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem run through by miners and the likes, to build an even bigger rig for parallel execution.
Power usage is also totally not an issue, AI workload is very different from gaming.
tldr
llama.cpp-vulkan with opencode on total 48GB VRAM AMD cards on arch btw.
Using OpenCode + OhMyOpenCode + Qwen 3.6 35B-A3B Q_4_KM on an Ada 4000 (20GB VRAM) at 55 tok/sec for generation (slower than it sounds as OpenCode has a bunch of context it adds). Meaning to check out pi when I get a minute as I hear that one mentioned a lot lately.
I am using Opus to generate plans that the local agent then follows, then validated by Opus. So I'm not at 100% local but these models are increasingly part of my production workflow. Probably not worth doing - yet - unless you are a hobbyist who likes spending time and money tinkering.
This setup is certainly not as "good" as Opus or other frontier models but they are "good enough" for an increasing number of rote tasks. You don't need to drive a Rolls Royce to the supermarket, when a used Corolla gets you there just fine.
It also enables new workflows that would be cost-prohibitive with frontier LLMs (especially as token costs rise) - eg. overnight I use the Chrome devtools MCP and have the above setup fuzz-test as a user for a number of hours and see if it can break things. Even got it working with multi-modal so it can check screenshots, which blows my mind (and not my wallet, as Claude+screenshots burns $$$).
The "12-18 months behind frontier" sounds about right, it's about where I was with gpt-4o and basic harnesses back then. In another 12-18 months my bet is we have Opus-level models that can be run locally for <$5k... but the frontier models will be even further forward (unless governments have blocked them). Fun times.
Not “local” and not interactive coding but sharing since it might be helpful. I have 2x RTX Pro 6000 Blackwell running DeepSeek V4 Flash. I get 160 tok/s raw but it’s a reasoning model. For my use case, I have it auto-write code and another system auto-review the code.
I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.
I run a small business (https://technologybrother.com) that runs a few small SaaS so I ordered the GPUs through corporate sales. If the barrier is getting an LLC, those are relatively cheap. The nice thing is that if you've got a legitimate business with use for GPUs you can get into the Nvidia Inception Program which has a pretty solid discount.
Not nearly as much as you might think. 1.2kw where I live translates to about $0.12/hr, and that's when running full clip. If you have a decent solar hookup, it's small fraction on a sunny day.
The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.
You have to apply a lot of careful architecture and TDD to your approach. Eliminate technical risk by tackling hard things early and wrapping them up in a simple, easy to use interface.
I find I can get some projects done 2-3 times faster than if I wrote them by hand. It can also save about 5-10x time on mundane or broadly scoped projects by helping me consolidate and try out ideas very quickly.
Setup-wise, I switch between vLLM using nvidia/Gemma-4-31B-IT-NVFP4 and llama.cpp using unsloth/gemma-4-31B-it-qat-GGUF with MTP. I throttle the GPU power usage to 400W.
My current llama.cpp setup gets token generation rates between 60-150 t/s depending on MTP draft acceptance rates. Prefill is between 1500-4000 t/s depending on context length/depth.
But, guys, when you say Claude/ GPT models, do you stop to think what are these "models"?
One day I thought about how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself. Just think about it.
As a matter of fact, think about these operations, api endpoints, observe their output.
These so called SOTA models are not what meets the eye, and are not at all comparable in the infra department to local models. There is crazy orchestration going on due to the scale of these operations. But also these hard constraints lead to innovation. Innovation nobody speaks about.
I wouldn't say we cannot catchup, but serving our local models through llama, vllm is just the A, B, C of it all. In reality I think what is needed is a replication of said orchestration which I hinted at above.
The SOTA models are a deep orchestration of multiple models operating together it isn't a single model. As such no single model ever will catchup to them until it replicates through training first and then maybe through model architecture this orchestration.
Finally, I would wager that the SOTA "models", as one of these models in this orchestration setup, as served for general consumption, are not so much more capable than qwen 3.6.
I am sure that if you change your perspective you will start noticing the scale of the "magic".
Sure, connect opencode to an openai/chatgpt endpoint and use it. You will notice multiple "thinking" parts per "turn".
I put all of these in quotation because... they are part of the orchestration game. For example, it is not known if the thinking parts of a particular turn are chain of thought thinking summaries or just plain response which is masquaraded and thus orchestrated into appearing as thinking.
Further notice the cadence, word choice and sentence formation. Notice sentence construction. Notice "thinking part" construction and sequencing.
There is pretty heavy orchestration.
> I don't understand, why does it make you think this is the case?
Because not all tokens are equal. And if you waste expensive tokens on mundane tasks you will go out of business. This is the reason.
As I said, if you observe the output from these api endpoints you will notice it.
For personal needs I connected VSCode with llama.cpp running Qwen 3.6 27B or Gemma 4 31B and it's good enough to cancel my cloud subscription.
Qwen running on my 1st GPU at q4@176k context from 70 to 50 tok/s with MTP, pretty good for coding.
Gemma on the other hand is using both GPUs, running q8@64k context, doing document sentiment analysis, summarization, proofreading and translating, at consistent 25 tok/s. Somewhat slow but usable for batched workflows. Might get some more once llama.cpp starts supporting MTP with tensor split mode.
Still using frontier LLMs at dayjob since I'm not paying it and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.
EDIT: Prompt processing starts from 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k of tokens and require from 60 to 90 seconds to be processed. Not great but acceptable.
Yes, Qwen3.6-35B-A3B on a Strix Halo 128GB (Bosgame M5).
I have way too much VRAM forme such a model but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible, this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.
And Qwen3.6 has been surprisingly effective for me, I still use Clause occasionally but only for like 10% of my needs which allows me to stay well under the quota even with the cheapest plan.
Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).
Unfortunately on Strix Halo or any similar unified memory set up, dense models are gonna be dirt slow due to the tiny memory bandwidth... But I agree, 27B is superior.
I think it is work to set up but I'm also learning a lot setting it up. Mainly using qwen/qwen3.6-35b-a3b mlx with my 48GB M4 MBP which leaves me just enough headroom for docker dev-container and other basics. I use LM Studio to run and am using it via VSCode. A big difference made the system prompt improving the tool integration (I asked GPT for guidance on that). Before that it was not making changes but regenerating code often messing up than helping.
I mostly run my MBP on low power even when it is plugged in to avoid the noise and heat. Full power maybe doubles speed but more than doubles power.
What can it do: Simple restructuring of pages. Where did it and other models fail: Splitting up Pinia store which GPT-5.4 did without fail. I think with more tuning, guidance for tool use and maybe some support tooling around it performance can increase further.
I've had some success with local models by chaining "agents" together in a workflow. Each agent has a different prompt and uses a different ollama model based on what their role is. The project manager, schema agent(qwen3:14b), etc. doesn't use the same model as the coding agent (qwen2.5-coder:7b). Between each step is an orchestrator and with a Playwright task which attempts to surface errors to the agent who introduced the previous code block. Only error-free blocks are forwarded to the next workflow step.
Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent they were to only produce only a manifest based on the task, and to pass off that off to the next agent.
In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
Have you (or anyone else) tried letting agents compete? For example, give the same coding task to two models, or to the same model with a different seed, and have the reviewer choose the better result.
Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.
I have an optane and lots of ram, so I tried full-fat models for writing some function overnight, as I get about 0.7 t/s. My current go-to test is to update a scalar function to transpose a bit-matrix to one using avx512. the cloud models all play with that like its nothing. Kimi 2.6 and GLM 5.1 both failed miserably.
I'm using 4x RTX 5070's and first-gen AMD threadripper (1950X) to run Qwen3.6 27B (MTP) Q6_K with llama.cpp and it works great as a daily driver with Pi. Around 50-60 toks/sec. I also connect a few other applications to it such as OpenWeb UI and recently set up Bifrost, an LLM gateway, to be the primary access point for the models I serve.
I've tried other models such as Qwen3.6 35B A3B and I've found that 27B works better for me when it comes to coding. It's slower being a dense model but the quality seems much better. Inference on my system for Qwen3.6 35B A3B is around 130-140 toks/sec, non-MTP, which is insanely fast!
You don't need 4x 5070's to run Qwen3.6 27B, three or maybe even two will work. However, I use MTP (multi-token prediction) to speed up 27B and that eats up more memory because the draft model requires its own context.
Another thing to keep in mind is that the tools you're using have their system prompts that are loaded into the model for each conversation. When I fire up Pi, working with the model is very snappy at start. When I interact with the LLM via Hermes CLI, it's much slower. That's because each prompt with Hermes is loading so much stuff (skills, tools, etc.) into the context and then it's there forever until the conversation ends.
I like running models at home for privacy, but I also like how there are no quotas, usage isn't a worry. If the future is "loop engineering" then you will be burning through tokens and $$$ using a cloud models.
My system idles around 200W and is around 350-450W when inference load is high. Decoding (token generation) isn't all that efficient, and your GPUs sit idle more than you think during inference. Advancements like diffusion may 1) speed up decoding and 2) let you utilize more of your idle GPU.
As someone that spends all day every day talking to LLMs, I'd say the OSS frontier models + a good harness is already a sufficient combo. For local deployments, we are missing one or two hardware generations (and may not get that soon since hardware companies are heavily favoring datacenter segment) to fully move to a local setup.
My experience is that it's not the models themselves that are limiting right now, it's the clunky alternative harnesses with weird missing features making for bad ergonomics around stuff like queue management, interruption, subagents, goals, etc.
It's also annoying that OpenCode doesn't even try to support local LLMs properly.
Getting OpenCode to work is possible, but extremely manual and clunky to configure. I have written a script to automate converting my llama-server configs into an OpenCode config, and that helps, but it's not ideal.
I have seriously considered writing Yet Another Coding Harness in my free time. I have some ideas for what would make it nice.
I've used the cli agents for claude, cursor, and pi, plus several custom harnesses I've written myself from time to time as experiments (and I guess technically gastown, if we're calling that a harness).
Pi is... just fine.
It does what I need it to, has a decent selection of tooling out of the box, integrates nicely with other tools, and generally gets out of my way enough that I don't think about it much anymore.
If you can run ~30b models at decent speeds, I think most folks would be pleasantly surprised at how capable they are with pi.
pi.dev is more like an agent developer kit. It's basically a substrate upon which you spend hours/days/weeks building your own agents or coding framework. It's pretty much the neovim to claude's vscode.
I mean - the base experience is just fine, with perfectly reasonable built in tools for file access and editing, plus bash.
But yes - it expands a lot if you're willing to play with it.
I'd actually say the vscode comparison is wrong, because vscode is very much "bring your own extension" in the same way that Pi is. While Claude is much more "visual studio" vibes. It's thick, it's opinionated, and it's absolutely not something you can really customize, but it can feel slick for supported workflows.
I've been wondering lately if it would help to take a medium sized model and either in cloud or some local setup actually do Reinforcement Learning from Human Feedback (RLHF) on every prompt as a chore - I don't know if trying to manually finetune a model to your use habits would ruin it or help - ideally if you were diligent you could get rid of some of the ticks that make models for the general public difficult to work with e.g. overly sycophantic, overly verbose, annoying tendency to explain via analogies
but perhaps one individuals prompt feedback just isn't going to ever be enough I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc.. and apparently these end up with bizarre behaviours not necessarily more helpful than the standard models)
I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendancies.
Models that you can run at home (Like Qwen 35B) aren't remotely close to Opus or GPT 5.5. Not even close. The only open models that are in that neighbor are around 1T params, so forget about running at home.
It's kind of like driving a shitbox. It can often drive you from A to B, and some people will try to convince you it's fine. It's not.
There's no logical reason other than absolutely requiring the privacy, doing it for fun, or niche use cases like airplanes and so on. If you can't spend the insanely subsidized $20 for codex, you can use an API for chinese models which will run circles around these tiny models.
I'm using Qwen 3.6 on my MacBook Pro M5 Pro with 48BG RAM for any work that I am particularly privacy conscious about, like any work with my journaling. It's been working great! I don't have any direct comparisons, but I've been satisfied with the results.
Not yet. Without pure Apple game or decent GPUs, even with a lot of RAM and threads, all you get is about 30-50 tokens/second, and that's thinking turned off. Without these optimizations your model will have a field day with your MCPs, skills and agent descriptions and you will watch the paint dry before seeing the first output token. Local model serving means you have to fight for every token in your context window, which is quite opposite of what Claude/GPT/Copilot are pushing the industry towards.
Agreed on this. Anthropic has now changed the verbiage on the definitions of the models under `/model` to say that Opus is for everyday usage, and Sonnet is for routine tasks.
There's apparently a reason Sonnet and Haiku have been left in previous version #s.
Still encouraging, though, that things are catching up. We can't expect $20k local setups to match $20bn compute clusters.
Our software dev (smartest guy I ever met) is using OpenCode and Tmux with Open Source models. He says the DeepSeek is his model of choice for coding (he call's it "pretty GOOD". He's running two 3090s on an i9 with 128GB RAM. https://www.msn.com/en-us/news/technology/china-s-open-deeps...
I'm largely 'all natural', any of my little LLM usage is local. 128G Strix system, a not-super-dense Qwen or Gemma variant will get 50-80 tok/s output. Not subscribing to Anthropic/OpenAI/etc even in the unlikely event these are the last local models released; simply not needed. Entirely fine without and in-model tool usage covers my currency concerns.
One of the interesting setups I saw is using expensive frontier models to write and update markdown for your app: specs, product requirements, architecture, etc
but then use cheap/local model to implement the specs.
Markdown is more effective at compressing information and fits the context window easier, than hundreds of source code files
but this requires second and third passes, to smooth out the rough edges
Just sharing my $0.02 here - I have ethical objections to using OpenAI or Anthropic products so I was a reluctant adopter of LLMs at all. Local models address most, though not all, my moral objections so I’ve been using them for work and personal projects for about a month.
The hardware I have (32gb Macs and a gaming PC with 10gb 3080) can only get me to Qwen3.6-35B-A3B at various quants but that’s enough (200-400 PP, 20-30 TG).
It’s taken some time to learn how to best utilize it - some things take a bit of babysitting or direction - but it’s quite useful. Not having ever used CC I can’t compare but it’s been a great assistant or pair programmer for everything from embedded C++ to Vue. I wish I could run 27B as there have been moments when this model feels like it just can’t quite figure something out but those moments are quite rare. For a lot of tasks it’s a huge time saver and has proved super capable at digging into and fixing bugs given pretty vague instructions.
Qwen3.6-27B supports a 1 million token context window.
Of course, you have to have the right hardware to be able to run with a context window like that, as it takes about 100GB of memory on my DGX Spark to do that with full f16 KV cache on the q4_k_xl model.
I'm using deepseek V4 on two rtx 6000 pros and its working great. Opus is so slow that I get deepseek to do most of the work and Opus is only used to validate and help plan.
tough ask, but since we're here: has anyone done this with 16GB of VRAM? I've been getting projects finished with LM Studio, but it definitely could stand to be more efficient. lots of time wasted with trying to get models to understand a problem with so few tokens.
Not 100%, I still fall back to Claude for most day-job stuff. But I've been trying to use Qwen 3.6 and Gemma 4 on my framework desktop mainboard (Strix Halo) as much as possible.
I've been working on an ops style tool for local LLM inference. Proxying, api keys, request logging, model rewriting and much much more.
I tried gemma-4-26B-A4B just to see if it could help me read/sort my emails on a relatively under-powered setup (16GB VRAM + 32GB RAM) and it's not going well.. the model burns 24K tokens just on searching for the right tool and then dumps the email contents into context - i tried to get it to use code-mode to save context but the code-mode implementation can't save files so it was useless and im going to try to switch to "ssh-mode" into my devbox container. Still relatively new to this, so I'm probably doing something wrong
I'm in the middle of building my own based on LiquidAI/LFM2.5-1.2B-Instruct [1]. I run it on the CPU locally and get reasonable performance. I'm currently using it to solve small problems - but expanding it daily.
I'm looking forward to having Claude Fable at home. THAT is when I'll THINK about replacing Claude (who knows what their next models will be capable of, Fable was damn good for the three days I had it).
we keep moving the goalposts on when we're gonna be happy with local. first it was sonnet at home as the good enough, then opus, now it's the mysterious leading model that runs on infrastructure we can't feasibly have at home
Will the AI labs always make sure there is at least a years worth of differential? I guess the underlying business premise is that each new release has a step function change that prevents this kind of behaviour..
If the government is going to gate access to frontier models from here on out, even if new releases are a step function change… which they’re not… then it may be even more comparable to what’s available with a subscription.
Will the inevitable M5 releases from Apple change this equation in any meaningful way?
I'm waiting to swap out my last gen Intel iMac with a new M5 mini of some kind, with the eye to hopefully be able to run some models locally. I envision a mini (heh) arms race to simply swapping out an M(X-1) for an M(X) annually as this field shakes out.
There’s evidence that combining models can achieve frontier-level performance (e.g. OpenRouter Fusion). I’m wondering if that’s the more realistic option: combine Opus with a local model to save on token costs.
Yes, I have.
1. Two RTX 3090s in Linux 22.04
2. Running Qwen3.6-27B Q6_K_XL GGUF
3. Using my own harness AZPal, I build myself, also wire it with Hermes Agent, works fine
4. Many times it solve problem that Codex can't solve
I haven't yet, but I just bought a 128GB M5 Max 40 core which I'm hoping can do it (if not, it's a good laptop regardless, I actually need that amount of RAM for non-LLM stuff)
Always a bit disappointed in the details in these kinds of threads. When you do get answers, they're never specific enough to try out on your own. It'll be something like "I use Qwen 3.5 and get great results!" OK but what quantization are you using? What llama parameters? What context size? What GPU are you running it on, and how much VRAM does it have? Are you hosting it on a separate box, or running it locally on your dev machine? What coding agent tool are you using, and how is it configured / hooked up to the model?
- OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)
- Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units
- Network: tailscale
- Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)
- LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.
- Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.
Models:
- Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.
- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?
- gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job
Flags (specific for Qwen 27b, since that's primary model):
- `-ngl 99` offload all layers to GPU
- `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing
- `-np 1` single slot (no parallel request handling)
- `--no-context-shift` error instead of silently sliding the context window when full
- `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)
- `-b 2048` logical batch size (tokens per submission)
- `-ub 1024` physical micro-batch (per GPU pass)
- `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling
- `-fa on` flash attention
- `--spec-type draft-mtp` use the model's built-in MTP as the draft model
- `--spec-draft-n-max 3` propose up to 3 draft tokens per step
- `--spec-draft-n-min 0` allow zero drafts if confidence is low
- `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path
- `--reasoning-format deepseek` parse <think> blocks in proper format
- `--chat-template-kwargs '{"enable_thinking": true}'` turns on Qwen's thinking mode on by default (clients can override)
- `--jinja` use the GGUF's Jinja chat template
- `--temp 0.6` moderate randomness (Qwen recommended value for coding)
- `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)
- `--top-k 20` top-20 candidates (Qwen recommended value for coding)
- `--min-p 0.0 disabled (Qwen recommended value for coding)
Performance (27b, primary model):
- ~65t/s for token generation
- ~600 t/s for prompt processing.
- If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.
- ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.
I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.
CLI/Harness:
- Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)
- Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.
A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that its the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.
This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.
Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(
I would like to say I run 100% local, but I use Opus + Gemini Pro cumulatively for 3 or 4 hours a week. I also like to use DeepSeek v4 flash with OpenCode for small quick tasks.
I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.
For me, the problem with all local LLM-basic coding agents is slow runtime.
I use SpecKit to create a very detailed plan with a high amount of specificity using paid Claude plan.
Then I give it to local LLM (eg: Qwen / Gemma 4) via CLI. This is possible through usage of llm-mlx on Mac (or ollama on any machine given sufficient on hardware) which serve OpenAPI endpoints compatible for Aider (CLI) or Visual Studio Code to vibe along with the agentic coding assistant.
The paid products have an advantage but are not necessary if you don't mind to be more-involved with the process and have low expectations.
mbp16 m5 max 128gb, antirez/ds4, deepseekv4-flash. Works well for relatively dense (let's say <20k LoC per project) C codebases that are essentially a bunch of custom specialized stores, http servers, network infra, media transformers, etc.
Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.
Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.
Performance as described in the tables in the readme here:
https://github.com/antirez/ds4
...with a bit less than half that at "low power" (30w). Both are usable.
Has anyone been storing their cc sessions for future training data on their own models? I'd love to build a system that fine-tunes on cc sessions and a good first step is capturing my own sessions well.
Didn't realize they did this. I have avoided pushing data to huggingface. This is all -deeply- private info and I haven't really reviewed their privacy policies and the like. I'll give them a look.
I would like to know whether someone was able to use lower tier models for activities other than coding e.g. a limited version of a personal note manager - and what the hardware requirements in RAM for these models were.
I work with a few models on servers, so not local, but self hosted with ollama. gemma-4, glm 4.7 flash, and qwen 3.6. glm is the best at coding agentically. But I still don't think any of them reach the levels of gpt 5.5 or opus 4.8.
I have tried locally but I find that the implicit breakeven is somewhere around 1 year of use given the high power costs where I live. Not really worth it but maybe if I move some day!
I use Pi and Qwen 3.6 27b locally on a 4090 for all my personal projects. I still use Claude for day job work since they pay for it, and my employer expects me to use it. I rarely touch it otherwise.
i used to mix remote and local minimax 2.7(q3) on my strix halo, it run at 30 tg and 220 tokens pp... it was a bit painful slow, but it was a good feeling i could stay offline. unfortunately m3 which is at opus .8 levels is 460b parameters and doesn't even fit in 128gb of memory, let alone a big context. strix halo feels like a toy for ai purposes. https://kyuz0.github.io/amd-strix-halo-toolboxes/
My strix halo board is feeling more useful and less toylike with the recent performance gains combined from MTP, better quantization, and generalized performance improvements across the stack. For example, I can run Unsloth's Gemma4-31B 4-bit QAT model with around 30tg and 200pp. I don't find that to be too slow at all. Particularly because it's nearly full accuracy and good enough for a lot of different stuff I throw at it.
I think it also helps that I'm using my machine to do home server stuff. It excels at all of the traditional workloads. Then I can lean on the AI to help with automation here and there. I find it deeply satisfying.
you can absolutely use it for some workloads, but as soon as you have some extra complexity for a big repo it'll take forever and the economics are so silly to the point that the electricity bill would be comparable to a subscription. I love having the possibility of running things locally if some random dude decide to pull them plug, and give me solice the fact that i can have 100% private inference, but as the main driver during the day? shoot me
I tried many, many times and I keep trying. But I just don't see this happening: those tiny models that we can run on our machines (I have an M4 Max Mac, so I can reasonably run qwen3.6-35b-a3b or gemma-4-26b-a4b-qat at this time) are NOWHERE near as smart as the huge monsters like Opus/Fable. Nowhere. I can see a lot of people deluding themselves.
Sure, you can get the local models to generate plausibly-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.
I am working on exactly this issue right now. My approach is that a highly optimized harness (pi.dev) with the right backing knowledgebase (a custom, self-updating wiki with lots of QC layers) can get close to most of my usage patterns for my Claude Max 20x subscription. I use Gemma 4 26B QAT served by a custom fork of llama.cpp, with 4-8 slots of 256k context at Q8. It's a very good model when the harness keeps it on rails. In an age of 1M context windows, 256k may seem small but it's been plenty for my work (scientific programming). A $20/month subscription to Ollama-cloud gets me good coverage of consults out to frontier models for difficult plans or debugging (again this is all woven into my highly customized pi install).
I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.
Related: Are there any viable distributed AI models?
Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.
Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.
Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.
I think it'd be very hard to achieve viable tokens/s or get arithmetic intensity to be high enough in general, since many cases in existing training and inference are memory bandwidth limited. Definitely seems possible to conceptually have a slow pipeline that is distributed though.
This is unlikely to happen in any meaningful fashion for quite some time.
(TLDR; Distributed compute for models will require hardware at a level only really possible with data-centers at the moment.)
Token generation operates at such a scale to demand enough from a single GPU as it will often saturate the bandwidth capabilities of consumer grade interconnects like PCIe. Which fundamentally implies that distributing a model's compute across vast distances is too much of a challenge without significant infrastructure.
To give an example, When we split a model's compute between two seperate cards on a single workstation, this doesnt mean we end up with 2x the compute bandwidth for a model. Instead the increase becomes something small like 20% depending on model, because the inconnects (PCIe on consumer hardware) will quickly become so saturated with data being copied between the two GPUs so as to become a bottleneck. And remember that this is something that happens locally with PCIe, which (depending on generation) will cap out at around 20-35 GB/s depending on the generation of motherboard.
Model performance is very much tied to having the fastest and highest bandwidth single card available so as to keep data transfer operations to a minimum as the sheer volume of data necessary for the model to run is immense.
I simply cant imagine how slow and unusable a model would be if the copy operations necessary for its compute needed to be performed over unreliable network speeds where there will be significant performance loss as network speeds are not reliably distributed across the globe, and their unreliable nature would demand increased overhead due to data verification.
I run many models (but mainly Gemma-4) using oMLX (for caching) on a 32GB M1 max using (gasp) Xcode. For tok/sec response times, I'd say it responds faster than I could read the prompt aloud in many cases (and I'm not constantly polling the Claude status page).
For months I spent time curating the AI+harness+skills+MCP servers, but now mainly just code with it. I find myself not bothering to use Claude (but keep paying "just in case").
That's feasible in part because my prompts have very specific objectives, constraints, and suggested staging, because I want the code to be exactly as I would write it, and I want to weigh in at specific moments. I would say the speed-up is 2-4X instead of the 10X of vibe-coding greenfield projects. The problem is not the coding speed, but building something complicated that's also correct and flexible (i.e., a directional accuracy). E.g., the agents help with abandoning a less-fruitful API shape instead of sticking with what works in a local maxima.
One flaw there is that I'm still writing code that feels clean to humans, which now is probably a waste. LLM's might be happier with 10+ parameters on one API instead of a plethora of configuration objects and convenience wrappers.
I do the same. deepseekv4 fast for the 90% of the tasks, if it can't lift it, I use deepseekv4 pro. I use crush as coding agent but removed the blocked commands because I also do a lot of system administration. Love it. I use 8 USD in 7 weeks and use it quiet extensively for all sorts of things, programming, system administration, google search replacement, investments, you name it.
Yes. I use Owen on my MacBook m1 (16gb) daily, running inside Ollama. Works well. Is not particularly fast, and I need to create a custom imagem that sets the temperature of the model to zero starting, so I don't get over creative with its bullshit, but it works reasonable week.
Secretly the problems many people have with agentic coding are related to poor choice of sampling settings, but the world will wait several more years before this is understood well. top_p and top_k are garbage but they are intentionally kept on purpose because subsequent methods enable coherent high temperature sampling, which is an absolute no go for alignment/safety reasons.
The secret to actually good agentic outputs even with small models? Llamacpp has support for this little known sampler called "top-n sigma". You should use that, set it to 1 and set temperature to literally whatever you want (it could be infinity) and your model will just magically work to your maximum context window. That's because long context generation is a sampling problem.
I use Qwen 3.6 on a remote GPU that my work offers. Works fine. Slow and steady, works hard, gets the job done. Probably better at diagnosing than making new code, but whatever.
I tried to. I just couldn't get over how it made my otherwise whisper quiet M3 Max MacBook Pro 14 for the performance. The sweet spot has been adopting Claude Code to use the Chinese models. Deepseek V4 Pro is very, very good. But I am such a casual local user of AI that my 20/month Claude subscription is enough and I find myself using that more and more.
I have been running this stack since well before Claude Code became popular. It works OK but I've found it to be very slow; and despite having a big context window, it seems to lose track of what it's working on and goes down a rabbit hole (or just wastes tokens trying to use the web browser) for hours and is hard to get back on track. I even tried spinning up two sub-agents but even after years of trying to prompt them, they are almost useless in terms of coding ability, so that is looking to be a waste of spending at least so far but maybe the model will improve as time goes on.
Just attach OpenRouter to your coding agent tool and try yourself. All relevant open weight models are there. Every person have different needs and expectations
Local? No.
Via opencode Go subscription using GLM mainly? Yes, I still use Gemini/Claude/GPT via api from openrouter for adjacent tasks, I would say 20$ per month max in api token costs.
Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.
Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200
I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.
It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).
Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)
I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.
And I'm still a AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.
But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.
For other chat tasks and translation, I'll frequently use Gemma 4 31B.
For audio, I'll use Gemma 4 12B.
I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.
The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.
But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinkging, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.
Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, becasue you're not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time.
In my models.ini, I have this for the Qwen3.6 models:
There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.I'll have to give the preserve_thinking a shot.
Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?
Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.
Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (0(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.
But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.
So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot. However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.
There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.
Some of these issue have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.
I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)
I didn't do much benchmarking, but anecdotally, I found it to be making less edit errors. YMMV
I find that running better quantization, like Q8 tend to prevent this even though its a bit slower to run, it saves overall time with less churn
Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off
I do (unscientifically) experiment whenever a new capable local LLM (<=130b) releases with a license that permits commercial use. As for knowing my models require more work than Opus, I don't mind still having to puzzle on getting the architecture right. In any case, it forces me to stay in the loop of what's being built, which is a good thing.
Thank you.
For the time being, off the top of my head, I'd say:
- Prompt Engineering tips & tricks apply here (like being complete in the relevant context you provide in your question, and the specific task(s) the agent should do like reasoning, modifying one file, or trying to fix a complex task all at once (not recommended)).
- If you already know which files the agent should look into, mention them to save time and potentially context.
- In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.
- It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.
Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.
I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)
It's personal, but I prefer CapEx over OpEx for this. If you can purchase a device upfront that runs a decent local LLM, you get the peace of mind that your setup won't suddenly change over time and can only get better.
Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.
It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.
But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.
Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.
I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.
Claude 4 Opus: https://youtu.be/J7omabtqnBM?t=193
Qwen 3.6 35B A3B: https://youtu.be/gVU-DQeqkI0?t=215
Qwen 3.6 produced far more working functionality than Claude 4 Opus did.
Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.
Will this trend continue? Who knows. Both the frontier and local model will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.
Hoever, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less that $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.
It's likely that there will always be a gap between frontier and local if you're comparing models at the same time, you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.
OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.
More and more specialized and ultra-performant chips are going to flood the consumer market. Especially once new hardware foundries will start producing (well if we don't die from WW3 in the interval).
In 10 years from now, when even basic computers will have 128 GB of memory, and phones will have super optimized tuned models, then what will be the point of Anthropic ?
Just use Gemma/Gemini/Siri or whatever.
Pornography and uncensored models is also pushing toward local models.
It's not like needs of people grows exponentially, the needs follow an asymptote instead (they are capped).
The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.
For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.
It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).
Hold on, what are the specs of your rig? How much RAM?
I'm been considering getting an old refurbished 2018 Mac Mini with 64Gb of DDR4 RAM but everything I've read suggests this will be way slower than my 16Gb M1 Pro Macbook.
I've been meaning to write a blog post but well whatever here's the md.
https://gist.github.com/hparadiz/f3596d00a62d8ebb2dadcc46ee5...
Qwen3.5 9B performed best.
You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say a dual 7900XT would get way better performance.
We truly live in the dumbest timeline.
And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.
Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context
I'll experiment more with the effectiveness of AGENTS.md rules for local Pi agents. I feel like smaller (local) LLMs just lack in attentiveness to elements in the context window, like precise instructions, compared to e.g. Claude models.
matches my experience and a deal breaker
also the context window sizes are too low. I can't operate in 65,000 windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.
200k context windows and above for me now
I saw a paper last night that should help this a lot though
In Pi, /new is my best friend and most-used command for sure. For simple tasks (I decompose complex ones anyway since I don't trust small local LLMs to do this for me), the model doesn't need much context, given that I'm proficient in my codebase myself: "I'd like Feature X. Look into files 1, 2 and 3 to make your edits."
I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.
I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.
To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.
For my personal needs, free beats $100/m.
I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).
Some example projects
- Replacement launcher for android tvs (with usage monitoring and tracking for kids)
- Custom admin portals for my k8s cluster services
- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)
- Grocery list management and meal planning (mostly via openclaw)
- some custom workflows for 3d asset generation in comfyui.
---
Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.
Also, 2 gens old means bad performance at ray tracing, abysmal path tracing if at all. Pretty sure it can't run smoothly CP2077 in native 4k without dlss upscalers with all on ultra.
When I bought, I paid $850 a piece. And I needed one anyways for the gaming I was going to do.
My guess is the next good time to buy is going to be 24-36 months from now, depending on how the AI bubble goes.
---
I'll add to this, I personally don't like Apple hardware (not so much related to the hardware as their company philosophy) but their machines with unified memory (or AMDs latest unified memory offerings) get pretty equivalent speeds to my 3090s, and are probably a much better modern entrypoint to local llms.
There's a reason the joke is that Silicon Valley software devs bought up all the Mac minis for OpenClaw.
You can get a 48gb unified RAM M4 pro mac mini for ~2k. If you're not going to do much else with the machine, it's what I'd pick as my budget inference device right now. Spend a year of claude now, get ~150tok/s for the next decade (plus) for ~free.
If you want more capable and are willing to spend a little more, go with the newer Ryzen AI Max+ 395 machines.
You'll spend less on power too.
My last suggestion would be to go buy an RTX3090 at this point. You can do a lot better for a lot cheaper.
How do AMD cards perform with LLMs? A 9070 is sold for ~$600 and has 16GB VRAM
16 GiB won't fit you much, so you'd probably want at least 2x, and preferably 3x of those, and then you need a motherboard, power, etc. that can handle that.
Since you're running quantized (at UD-Q4_K_XL) , check out the "qat" models (unsloth/gemma-4-26B-A4B-it-qat-GGUF) !
- https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF (With "Jun 9 Update: Added MTP support.")
- https://blog.google/innovation-and-ai/technology/developers-...
> Quantization-Aware Training (QAT) [...] allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model
I've actually tried this exact same model locally as well.. albeit on just a single 3090 at 128k context and I got around 40-60tok/s with Q4_K quantization.
The thing that bugged me the most was really the quality of the output on moderately complex real-world coding tasks. Having to switch between "prompt/vibe" and "manually implement" is such a big context switch burden, because you really have to ask yourself every few minutes if you're "holding it wrong" or the model is just too stupid.
It also doesn't really seem to handle transitions from "low-level implementation detail" to "high-level design" well, e.g., it wouldn't easily render tables and such. With Claude I don't have this issue.. so I think for now my verdict would be that it's not really a viable replacement. I really hope it will be in a few months time.
Oh and I used "aider" to replace claude CLI, which maybe that's also sub-optimal.. I'm not sure. The MCP marketplaces are useful of course, though arguably you could just manually replace them over time.
It's prone to thinking longer and more repetitively, again - it's definitely not opus 4.7/4.8.
I've been using pi.dev as my harness for it, and been pleasantly surprised by how nice it feels (I have used aider, but only very briefly and quite a while back - so I can't realistically compare).
I would say it's roughly where I felt claude was a year back - Most of the sessions need to be more "pair programming" and less "I let it run for hours".
I'm a big fan of frequent "human in the loop" style workflows even when I'm on something like opus at work, though. I have opinions about lots of things, and re-inforcing that the model should stop and ask frequently seems to get me considerably better output, without having to "re-roll" if you will.
I've done a good bit of management, and I think it's roughly producing what a junior dev might produce in a day every 5 minutes. And just like a junior dev, you need to be steering it back on track fairly often.
Opus feels more like a mid-level at this point. I can hand it a chunk of work and "leave" but I still get better output if I'm checked-in and watching/steering.
I've used Claude Opus to quickly and effectively pound out some 100-200 line scripts that integrate with a vendor's API, and it one-shotted them both almost perfectly.
I wonder if for a lot of these local models, the scope of the AI assistance should simply be smaller: You architect the tools and the function definitions, and then tell AI to implement one at a time? Does anyone do that rigorously?
A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.
Sometimes that matters, a lot of times it doesn't.
On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.
I get very similar quality from the both the Gemma-4 31B dense model, and the Gemma-4 26B MOE model (both at Q4 quant) but the MOE version runs at ~3 times the speed (150tok/s vs 46tok/s).
Other Notes: I have had to set the compact target to 75% on a 256k context window as once the conversation length goes about 100k I start seeing a drop in the quality and speed. This becomes very problematic after about 150k. I tried Qwen 3.5 122b too but it actually seems much worse at coding than 3.6 27b even though its much larger. Maybe because I am using a 4bit quant or maybe I just don't have it configured correctly? I know 3.6 is newer but I didn't expect it to out perform a model that is much larger from the prior generation. Gemma 4 31b is a good model for other tasks but at least my personal experience is that Qwen outperforms in coding. Nemotron Super 120b is great at a lot of stuff but it also seems to be not as good at coding as Qwen. This was very surprising to me.
I have become so "lazy" (in a good way), so far that I've started using the model for lots of daily mundane things on top of just coding:
It feels like anything less than Sonnet is just a waste of time, apart from use as a smarter search function.
It also strikes me as strange that you would mention Codex for UI polish, as it's notoriously bad at UI, and far behind Claude Opus. Altman specifically posted that they are working to improve this for the next model release.
All the drudgery.
I almost find it offensive when colleagues open a MR with an obvious slop description that's frequently inaccurate.
That said, I find AI useful for a lot of drudgery like resolving merge conflicts or splitting changes out into separate MRs.
Particularly with the latter I had issues with small models, they butchered the changes I wanted moved. Not even on the second attempt did GPT 5.4 mini manage to move 10-20 lines to another file without modifying them in the process.
The trade-off of MoE is that it is worse but faster for the same total size.
That sounds great for hobbyists but IMHO it wasn't until Opus 4.6 was released six months go (Dec 25, 2025) that we had a model good enough for professionals to use as a primary driver of their coding agents. That seems to be the threshold worth aiming for.
in my stuff now i use an OT library that claude put finishing touches on in September.
Regardless I don't think it's fruitful to be so condescending with such little insight into this person's situation. Even if you had total insight -- let people be and withhold your judgement, or at least keep it to yourself. Making people feel stupid is a great way to turn people off to pretty much anything else you have to say
i always see great debates with local stuff but the space is constantly moving goalposts and all the vernacular is pretty unfamiliar to me. i'd love to understand what people with objective experience feel they've traded away (or gained) when going local so i can determine for myself if these things are a good fit.
(Shouldn't have done that refactoring job in high mode)
> "Quality is like running edge models from 8-12 months ago"
Don't expect Opus, expect more like Haiku. If you micromanage it, you'll get great results. If you want it to be a human in a box, it'll flounder.
I'm looking at https://ollama.com/search and the top few models like kimi-k2.7-code say "cloud" and I can't seem to ollama pull them.
I thought the whole POINT of ollama was not-cloud?
It was at first, then the developers realized they had a massive userbase they could monetize. A tale as old as open source...
Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus just not worth it right now. If it was, it would be distributive enough to be in the news.
Not that I'm discounting someone hasn't already solved this, just trying to Occam razor my way out of diving too deep down rabbit holes.
The present Sonnet/Opus versions (~4.8) will likely be what everyone in the enterprise might end up using eventually. And even though local models aren't there yet, there are budget alternatives from the families of DeepSeek, Kimi, GPT, MiniMax, etc. available through APIs of NVidida, OpenRouter, Groq, etc. which are very much Sonnet grade.
Personally, I don't think we're at that point yet. While I do think model improvement is starting to plateau (reaching a local ceiling), I'm not convinced local models are as good as sonnet/opus yet. The gap is still too much. But I'm excited for those models to reach those levels.
If you truly believe that it WILL get there within the next couple of years, then you might as well start playing with it now (and, yes, you will be very surprised, especially for shorter/smaller projects or nicely modularized larger projects)
With a layered approach we can slowly shift to running more locally and still get required work done. Really, my local setup is so much better than it was 2 months ago, and extremely better than 6 months ago - on the same hardware.
I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by percieved quality of private model) actually means "better or more useful output."
I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)
What I’m saying is that if local models were actually comparable to Claude Code in practice, we wouldn’t be having threads like this. It would be obvious to the people using them, and it would be massively disruptive. Why would individuals and companies pay hundreds or thousands for Claude Code if they could run something locally and consistently get similar results?
Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.
If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.
If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.
The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.
I don't think I'd be using AI to code at all if this weren't the case. (I don't want to feel stunted or stuck just from losing my internet connection.)
I did not expect perfect reliability, but I thought they could at least get it right on the second attempt once you point out the difference. No such luck, it confidently tells you that now the code is the same, with yet another subtle bug added in the difference.
I don't know what work one would need to do where these garbage-class models would be adequate. Maybe they can masquerade as competent for a few minutes, but in the end the results simply are not right. At best they are suitable for a smarter search or autocomplete, in my opinion.
I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem run through by miners and the likes, to build an even bigger rig for parallel execution.
Power usage is also totally not an issue, AI workload is very different from gaming.
tldr llama.cpp-vulkan with opencode on total 48GB VRAM AMD cards on arch btw.
I am using Opus to generate plans that the local agent then follows, then validated by Opus. So I'm not at 100% local but these models are increasingly part of my production workflow. Probably not worth doing - yet - unless you are a hobbyist who likes spending time and money tinkering.
This setup is certainly not as "good" as Opus or other frontier models but they are "good enough" for an increasing number of rote tasks. You don't need to drive a Rolls Royce to the supermarket, when a used Corolla gets you there just fine.
It also enables new workflows that would be cost-prohibitive with frontier LLMs (especially as token costs rise) - eg. overnight I use the Chrome devtools MCP and have the above setup fuzz-test as a user for a number of hours and see if it can break things. Even got it working with multi-modal so it can check screenshots, which blows my mind (and not my wallet, as Claude+screenshots burns $$$).
The "12-18 months behind frontier" sounds about right, it's about where I was with gpt-4o and basic harnesses back then. In another 12-18 months my bet is we have Opus-level models that can be run locally for <$5k... but the frontier models will be even further forward (unless governments have blocked them). Fun times.
I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.
Where did you find/order these? All the sites I can find are either out of stock, only sell to businesses, or are otherwise sketchy...
The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.
I find it useful.
This side project highlights a similar approach to how I scope and tackle projects at work now:
https://git.theodohertyfamily.com/wg-wrap.git/tree/README.md
https://git.theodohertyfamily.com/wg-wrap.git/tree/CASE_STUD...
You have to apply a lot of careful architecture and TDD to your approach. Eliminate technical risk by tackling hard things early and wrapping them up in a simple, easy to use interface.
I find I can get some projects done 2-3 times faster than if I wrote them by hand. It can also save about 5-10x time on mundane or broadly scoped projects by helping me consolidate and try out ideas very quickly.
Setup-wise, I switch between vLLM using nvidia/Gemma-4-31B-IT-NVFP4 and llama.cpp using unsloth/gemma-4-31B-it-qat-GGUF with MTP. I throttle the GPU power usage to 400W.
My current llama.cpp setup gets token generation rates between 60-150 t/s depending on MTP draft acceptance rates. Prefill is between 1500-4000 t/s depending on context length/depth.
One day I thought about how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself. Just think about it.
As a matter of fact, think about these operations, api endpoints, observe their output.
These so called SOTA models are not what meets the eye, and are not at all comparable in the infra department to local models. There is crazy orchestration going on due to the scale of these operations. But also these hard constraints lead to innovation. Innovation nobody speaks about.
I wouldn't say we cannot catchup, but serving our local models through llama, vllm is just the A, B, C of it all. In reality I think what is needed is a replication of said orchestration which I hinted at above.
The SOTA models are a deep orchestration of multiple models operating together it isn't a single model. As such no single model ever will catchup to them until it replicates through training first and then maybe through model architecture this orchestration.
Finally, I would wager that the SOTA "models", as one of these models in this orchestration setup, as served for general consumption, are not so much more capable than qwen 3.6.
I am sure that if you change your perspective you will start noticing the scale of the "magic".
I don't understand, why does it make you think this is the case?
> how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself
Can you give an example?
Sure, connect opencode to an openai/chatgpt endpoint and use it. You will notice multiple "thinking" parts per "turn".
I put all of these in quotation because... they are part of the orchestration game. For example, it is not known if the thinking parts of a particular turn are chain of thought thinking summaries or just plain response which is masquaraded and thus orchestrated into appearing as thinking.
Further notice the cadence, word choice and sentence formation. Notice sentence construction. Notice "thinking part" construction and sequencing.
There is pretty heavy orchestration.
> I don't understand, why does it make you think this is the case?
Because not all tokens are equal. And if you waste expensive tokens on mundane tasks you will go out of business. This is the reason.
As I said, if you observe the output from these api endpoints you will notice it.
Qwen running on my 1st GPU at q4@176k context from 70 to 50 tok/s with MTP, pretty good for coding.
Gemma on the other hand is using both GPUs, running q8@64k context, doing document sentiment analysis, summarization, proofreading and translating, at consistent 25 tok/s. Somewhat slow but usable for batched workflows. Might get some more once llama.cpp starts supporting MTP with tensor split mode.
Still using frontier LLMs at dayjob since I'm not paying it and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.
EDIT: Prompt processing starts from 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k of tokens and require from 60 to 90 seconds to be processed. Not great but acceptable.
I have way too much VRAM forme such a model but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible, this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.
And Qwen3.6 has been surprisingly effective for me, I still use Clause occasionally but only for like 10% of my needs which allows me to stay well under the quota even with the cheapest plan.
Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).
I mostly run my MBP on low power even when it is plugged in to avoid the noise and heat. Full power maybe doubles speed but more than doubles power.
What can it do: Simple restructuring of pages. Where did it and other models fail: Splitting up Pinia store which GPT-5.4 did without fail. I think with more tuning, guidance for tool use and maybe some support tooling around it performance can increase further.
Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent they were to only produce only a manifest based on the task, and to pass off that off to the next agent.
In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
Like "The Local AI challenge"
Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.
I've tried other models such as Qwen3.6 35B A3B and I've found that 27B works better for me when it comes to coding. It's slower being a dense model but the quality seems much better. Inference on my system for Qwen3.6 35B A3B is around 130-140 toks/sec, non-MTP, which is insanely fast!
You don't need 4x 5070's to run Qwen3.6 27B, three or maybe even two will work. However, I use MTP (multi-token prediction) to speed up 27B and that eats up more memory because the draft model requires its own context.
Another thing to keep in mind is that the tools you're using have their system prompts that are loaded into the model for each conversation. When I fire up Pi, working with the model is very snappy at start. When I interact with the LLM via Hermes CLI, it's much slower. That's because each prompt with Hermes is loading so much stuff (skills, tools, etc.) into the context and then it's there forever until the conversation ends.
I like running models at home for privacy, but I also like how there are no quotas, usage isn't a worry. If the future is "loop engineering" then you will be burning through tokens and $$$ using a cloud models.
My system idles around 200W and is around 350-450W when inference load is high. Decoding (token generation) isn't all that efficient, and your GPUs sit idle more than you think during inference. Advancements like diffusion may 1) speed up decoding and 2) let you utilize more of your idle GPU.
At first thought, they are quite skewed toward compute (vs VRAM), which is great for gamers but not so great for running LLMs.
(I run a 5070 in my desktop)
Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
Like "The Local AI challenge"
It's also annoying that OpenCode doesn't even try to support local LLMs properly.
Getting OpenCode to work is possible, but extremely manual and clunky to configure. I have written a script to automate converting my llama-server configs into an OpenCode config, and that helps, but it's not ideal.
I have seriously considered writing Yet Another Coding Harness in my free time. I have some ideas for what would make it nice.
I've used the cli agents for claude, cursor, and pi, plus several custom harnesses I've written myself from time to time as experiments (and I guess technically gastown, if we're calling that a harness).
Pi is... just fine.
It does what I need it to, has a decent selection of tooling out of the box, integrates nicely with other tools, and generally gets out of my way enough that I don't think about it much anymore.
If you can run ~30b models at decent speeds, I think most folks would be pleasantly surprised at how capable they are with pi.
Tack on some of the extensions (ex https://pi.dev/packages/pi-mcp-adapter?name=mcp and https://pi.dev/packages/pi-web-access?name=search) and I get web tooling (ex - perplexity search), access to mcps to do things like drive chrome (https://browsermcp.io/) or firefox (https://github.com/mozilla/firefox-devtools-mcp)
It's fine. Is it as good as a subsidized top tier model? Nope. Is it free and still very capable? Yup.
And personally, I've been having a LOT of fun with the pi sdk (https://pi.dev/docs/latest/sdk)
Which is something that all the other providers charge you api access rates for (ex - thousands a month).
But yes - it expands a lot if you're willing to play with it.
I'd actually say the vscode comparison is wrong, because vscode is very much "bring your own extension" in the same way that Pi is. While Claude is much more "visual studio" vibes. It's thick, it's opinionated, and it's absolutely not something you can really customize, but it can feel slick for supported workflows.
but perhaps one individuals prompt feedback just isn't going to ever be enough I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc.. and apparently these end up with bizarre behaviours not necessarily more helpful than the standard models)
I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendancies.
https://cursor.com/blog/real-time-rl-for-composer
About Owain Evans work: I think he did SFT. On Twitter someone was saying that RL is not as susceptible to what he showed. I'd like to try that
It's kind of like driving a shitbox. It can often drive you from A to B, and some people will try to convince you it's fine. It's not.
There's no logical reason other than absolutely requiring the privacy, doing it for fun, or niche use cases like airplanes and so on. If you can't spend the insanely subsidized $20 for codex, you can use an API for chinese models which will run circles around these tiny models.
Is that characterization based on some objective facts or benchmarks?
Results depend on the model, of course, and your computer is the limit. Mine wasn't up to the task, unfortunately.
There's apparently a reason Sonnet and Haiku have been left in previous version #s.
Still encouraging, though, that things are catching up. We can't expect $20k local setups to match $20bn compute clusters.
but then use cheap/local model to implement the specs.
Markdown is more effective at compressing information and fits the context window easier, than hundreds of source code files
but this requires second and third passes, to smooth out the rough edges
has anyone tried that?
The hardware I have (32gb Macs and a gaming PC with 10gb 3080) can only get me to Qwen3.6-35B-A3B at various quants but that’s enough (200-400 PP, 20-30 TG).
It’s taken some time to learn how to best utilize it - some things take a bit of babysitting or direction - but it’s quite useful. Not having ever used CC I can’t compare but it’s been a great assistant or pair programmer for everything from embedded C++ to Vue. I wish I could run 27B as there have been moments when this model feels like it just can’t quite figure something out but those moments are quite rare. For a lot of tasks it’s a huge time saver and has proved super capable at digging into and fixing bugs given pretty vague instructions.
I’m using Pi as my harness.
Of course, you have to have the right hardware to be able to run with a context window like that, as it takes about 100GB of memory on my DGX Spark to do that with full f16 KV cache on the q4_k_xl model.
It’s slower but you can run them.
I've been working on an ops style tool for local LLM inference. Proxying, api keys, request logging, model rewriting and much much more.
https://github.com/ndom91/llama-dash
How much does this ware out the hardware?
Also, if privacy is the main reason for running local models, why not use venice.ai and equivalent?
Some of the benchmarks appear to back this up [0]
Of course, a lot depends how you are using it (inference parameters, harness, prompting, etc.), but the model is quite important too.
[0]: https://artificialanalysis.ai/models/open-source/small?model...
I considered investing in better hardware but doing the math, it is cheaper for me to pay for DeepSeek (yeah, I know not everyone can do that).
[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct
I'm waiting to swap out my last gen Intel iMac with a new M5 mini of some kind, with the eye to hopefully be able to run some models locally. I envision a mini (heh) arms race to simply swapping out an M(X-1) for an M(X) annually as this field shakes out.
https://medium.com/p/f237d575e861
I mostly use it as a google search if I forget a thing, or doing the boilerplates.
I am using a mix of a non harness chat for the reply speed, and opencode / vim-ai for my boilerplates.
$0.00 / month. That's the budget.
Nemotron super 3 110B works well for 1M context long vibecoding sessions
I also use Pi harness with no extension
My Homelab AI Dev Platform
https://news.ycombinator.com/item?id=48542433
Also,the lack of enterprise tooling to help selected an appropriate model and tooling to run a local LLM does not help.
Hardware:
- GPU: AMD 7900xtx, 24gb vram
- CPU: AMD 5950x, AM4
- RAM: 64gb DDR4 3600
Software:
- OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)
- Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units
- Network: tailscale
- Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)
- LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.
- Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.
Models:
- Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.
- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?
- gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job
Flags (specific for Qwen 27b, since that's primary model):
- `-ngl 99` offload all layers to GPU
- `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing
- `-np 1` single slot (no parallel request handling)
- `--no-context-shift` error instead of silently sliding the context window when full
- `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)
- `-b 2048` logical batch size (tokens per submission)
- `-ub 1024` physical micro-batch (per GPU pass)
- `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling
- `-fa on` flash attention
- `--spec-type draft-mtp` use the model's built-in MTP as the draft model
- `--spec-draft-n-max 3` propose up to 3 draft tokens per step
- `--spec-draft-n-min 0` allow zero drafts if confidence is low
- `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path
- `--reasoning-format deepseek` parse <think> blocks in proper format
- `--chat-template-kwargs '{"enable_thinking": true}'` turns on Qwen's thinking mode on by default (clients can override)
- `--jinja` use the GGUF's Jinja chat template
- `--temp 0.6` moderate randomness (Qwen recommended value for coding)
- `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)
- `--top-k 20` top-20 candidates (Qwen recommended value for coding)
- `--min-p 0.0 disabled (Qwen recommended value for coding)
Performance (27b, primary model):
- ~65t/s for token generation
- ~600 t/s for prompt processing.
- If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.
- ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.
I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.
CLI/Harness:
- Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)
- Headroom (https://github.com/chopratejas/headroom) to maximize the 80k context window
- Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.
A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that its the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.
This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.
Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(
I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.
For me, the problem with all local LLM-basic coding agents is slow runtime.
[1] https://leanpub.com/read/local-coding-agents
Then I give it to local LLM (eg: Qwen / Gemma 4) via CLI. This is possible through usage of llm-mlx on Mac (or ollama on any machine given sufficient on hardware) which serve OpenAPI endpoints compatible for Aider (CLI) or Visual Studio Code to vibe along with the agentic coding assistant.
The paid products have an advantage but are not necessary if you don't mind to be more-involved with the process and have low expectations.
Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.
Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.
Performance as described in the tables in the readme here: https://github.com/antirez/ds4 ...with a bit less than half that at "low power" (30w). Both are usable.
I think it also helps that I'm using my machine to do home server stuff. It excels at all of the traditional workloads. Then I can lean on the AI to help with automation here and there. I find it deeply satisfying.
Sure, you can get the local models to generate plausibly-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.
I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.
Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.
Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.
Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.
(TLDR; Distributed compute for models will require hardware at a level only really possible with data-centers at the moment.)
Token generation operates at such a scale to demand enough from a single GPU as it will often saturate the bandwidth capabilities of consumer grade interconnects like PCIe. Which fundamentally implies that distributing a model's compute across vast distances is too much of a challenge without significant infrastructure.
To give an example, When we split a model's compute between two seperate cards on a single workstation, this doesnt mean we end up with 2x the compute bandwidth for a model. Instead the increase becomes something small like 20% depending on model, because the inconnects (PCIe on consumer hardware) will quickly become so saturated with data being copied between the two GPUs so as to become a bottleneck. And remember that this is something that happens locally with PCIe, which (depending on generation) will cap out at around 20-35 GB/s depending on the generation of motherboard.
Model performance is very much tied to having the fastest and highest bandwidth single card available so as to keep data transfer operations to a minimum as the sheer volume of data necessary for the model to run is immense. I simply cant imagine how slow and unusable a model would be if the copy operations necessary for its compute needed to be performed over unreliable network speeds where there will be significant performance loss as network speeds are not reliably distributed across the globe, and their unreliable nature would demand increased overhead due to data verification.
The dream of distributed AI is a ways off.
For months I spent time curating the AI+harness+skills+MCP servers, but now mainly just code with it. I find myself not bothering to use Claude (but keep paying "just in case").
That's feasible in part because my prompts have very specific objectives, constraints, and suggested staging, because I want the code to be exactly as I would write it, and I want to weigh in at specific moments. I would say the speed-up is 2-4X instead of the 10X of vibe-coding greenfield projects. The problem is not the coding speed, but building something complicated that's also correct and flexible (i.e., a directional accuracy). E.g., the agents help with abandoning a less-fruitful API shape instead of sticking with what works in a local maxima.
One flaw there is that I'm still writing code that feels clean to humans, which now is probably a waste. LLM's might be happier with 10+ parameters on one API instead of a plethora of configuration objects and convenience wrappers.
67M Ouput 51M Input
Total $0.83 dollar.
I honestly don't understand why people just don't use DeepSeek.
The secret to actually good agentic outputs even with small models? Llamacpp has support for this little known sampler called "top-n sigma". You should use that, set it to 1 and set temperature to literally whatever you want (it could be infinity) and your model will just magically work to your maximum context window. That's because long context generation is a sampling problem.
if youre shoopping for a new pc, very easy to justify 128gb vram
Recommended setup: plenty of nutrients, some caffeine and a quiet environment.
Performance - not currently measured in tokens: roughly average.
Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.
Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200