Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g tok/s)

531 points | by cloudking 7 hours ago

96 comments

Greenpants 3 hours ago
I have! I care about data privacy and LLMs being free. I'm using the Pi coding harness but containerized and sandboxed, to make sure it's running completely offline. On my Mac Studio with 128GB RAM (or MacBook with 36GB RAM) I'm using Qwen3.6 35b, with only 3b active parameters so that it runs really fast. I've done a complete redesign for my website's homepage and blog with Django + Wagtail. The latter is interesting, because Wagtail is a bit less well-known, so the agent, without giving it internet access, doesn't always know how to develop for Wagtail. I've used Qwen3.5 122b for when things get more complex. At 10b active parameters, it's significantly slower though.
I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.
It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).
Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)
[-]
- lambda 3 hours ago
  This is very similar to my setup. Pi in a container (I do let it have network access, just no access to creds or anything, only the one directory that I'm working on at the time and my ~/.pi directory), talking to llama.cpp in another container. I'm on a Strix Halo 128 GiB unified memory laptop.
  I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.
  And I'm still a AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.
  But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.
  For other chat tasks and translation, I'll frequently use Gemma 4 31B.
  For audio, I'll use Gemma 4 12B.
  I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.
  [-]
  - chakspak 3 hours ago
    Hopefully this isn't off-topic, but your setup sounds just like mine, Strix Halo and (I'm assuming) llama.cpp on ROCm, and I'm finding that the Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?
    [-]
    - lambda 2 hours ago
      I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it's not a huge difference, but I've been mostly saying on Vulkan.
      The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.
      But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinkging, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.
      Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, becasue you're not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time.
      In my models.ini, I have this for the Qwen3.6 models:
      chat-template-kwargs = {"preserve_thinking": true}
      There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.
      [-]
      - ndom91 2 hours ago
        +1 using llama.cpp Vulkan releases with the Qwen models - runs much better than the ROCm releases.
        I'll have to give the preserve_thinking a shot.
    - dnautics 1 hour ago
      > Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?
      Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?
      [-]
      - lambda 0 minutes ago
        So, one of the ways that this problem manifests is that most local models aren't trained on preserving the full reasoning between turns. Every turn, they skip passing the reasoning trace from previous turns to the the LLM. So if on one turn you have a long interleaved chain of reasoning and tool calls, then it responds to you, and then you give a new prompt to fix something, it has to re-process all of those tools calls now with the reasoning stripped out.
        Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.
        Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (0(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.
        But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.
        So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot. However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.
        There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.
        Some of these issue have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.
    - LoganDark 2 hours ago
      What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.
      I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)
- adyavanapalli 3 hours ago
  For the edit tool, you should consider implementing a hash-based approach where each line of code is hashed and referenced by it when doing replacements. You can read up on the approach here: https://blog.can.ac/2026/02/12/the-harness-problem/
  I didn't do much benchmarking, but anecdotally, I found it to be making less edit errors. YMMV
- electronsoup 2 hours ago
  > It gets into loops quite often, and surprisingly often gets the edit tool call wrong
  I find that running better quantization, like Q8 tend to prevent this even though its a bit slower to run, it saves overall time with less churn
  Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off
- ltononro 2 hours ago
  What kind of coding do you do? Do you keep track of frontier models to vibe check the differences and re-evaluate constantly or are you ok with having a nerfed model forever? (not being judmental, just really wanto to know your framework here)
  [-]
  - Greenpants 2 hours ago
    Some of the work I do, I do for an (EU) organisation that doesn't have clear rules or guidelines on the use of AI yet. Though I have seen colleague-developers blatantly putting source code into external Claude-like models, I stay true to my principles and don't. I know for certain that everything that I run through my local, offline Pi Container Sandbox cannot leave the machine, and thus can't result in a data breach. I do this for the peace of mind.
    I do (unscientifically) experiment whenever a new capable local LLM (<=130b) releases with a license that permits commercial use. As for knowing my models require more work than Opus, I don't mind still having to puzzle on getting the architecture right. In any case, it forces me to stay in the loop of what's being built, which is a good thing.
- 0xbadcafebee 3 hours ago
  The harness and the LLM parameters are pretty essential to getting better results and reducing loops. Tweak the parameters and you can mostly eliminate loops without negatively affecting performance (it's a bit complex but ask a SOTA AI to guide you and it's not hard). The harness should also react more intelligently to failures; it can do things like return additional context or hints as it tracks error rates and avg duration of calls. Pi can be easily extended, and it's suggested by the author you modify it to perform better for your use case.
- dotancohen 1 hour ago
```
  > you really need to know what you're asking, and be precise
```
  Any chance that you could share some recent prompts to give other HNers a head start on his to approach Qwen? If you are uncomfortable posting them here, my Gmail username is the same as my HN username.
  Thank you.
  [-]
  - Greenpants 1 hour ago
    I'm glad you're asking. I already started writing a blog post on how to best make use of local models. I'll share it as soon as I have a complete enough list. If anyone else reading this would like to chime in with their tips & tricks, let us know!
    For the time being, off the top of my head, I'd say:
    - Prompt Engineering tips & tricks apply here (like being complete in the relevant context you provide in your question, and the specific task(s) the agent should do like reasoning, modifying one file, or trying to fix a complex task all at once (not recommended)).
    - If you already know which files the agent should look into, mention them to save time and potentially context.
    - In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.
    - It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.
- jmuguy 3 hours ago
  Given your knowledge on this - do you think we'll see an open source model with Opus levels of capability? IMO if/when this happens - I would 100% stop using Anthropic.
  [-]
  - Greenpants 3 hours ago
    Let me put it like this. I started with local LLMs when ChatGPT still used GPT-3.5. I was amazed how my MacBook with 8GB RAM could run openhermes2.5-mistral: a 7b parameter model that could generate short stories that sort of made sense. Incredible!
    Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.
    I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)
    [-]
    - jmuguy 3 hours ago
      Haha well I ask because I don't really want/need anything beyond Opus most of the time. And I'm paranoid that Anthropic is going to be forced to charge the true cost of all this before too long.
      [-]
      - Greenpants 2 hours ago
        The other upside of running local LLMs is that there's no cloud provider to suddenly charge more for the same, or even less, model use.
        It's personal, but I prefer CapEx over OpEx for this. If you can purchase a device upfront that runs a decent local LLM, you get the peace of mind that your setup won't suddenly change over time and can only get better.
  - lambda 3 hours ago
    If you believe the benchmarks, Qwen 3.6 35B-A3B already outperforms Claude 4 Opus.
    Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.
    It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.
    But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.
    [-]
    - MrScruff 2 hours ago
      You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus.
      [-]
      - lambda 2 hours ago
        Which Opus? They certainly outperform Claude 3 Opus.
        Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.
        [-]
        mapontosevenths 1 hour ago
        There's a guy on Youtube named Bijan Bowen who tests all the models (open and frontier) on a series of one/few shot programming exercises and has been for a long while now. You can pretty much watch him compare the results for any two models you're likely to be interested in.
        I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.
        [-]
        lambda 49 minutes ago
        OK, it looks like he did a browser OS test with both Claude 4 Opus and Qwen 3.6 35B-A3B.
        Claude 4 Opus: https://youtu.be/J7omabtqnBM?t=193
        Qwen 3.6 35B A3B: https://youtu.be/gVU-DQeqkI0?t=215
        Qwen 3.6 produced far more working functionality than Claude 4 Opus did.
        Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.
        MrScruff 2 hours ago
        I’m normally comparing frontier open/cheap models against frontier closed source. I use deepseek/glm regularly, they’re fine and you can get real work done with them but it’s super obvious when you switch back to opus or even sonnet. A 3B active param MoE model is not comparable.
        [-]
        lambda 41 minutes ago
        Yeah. I was pointing out that local 3b active models outperform frontier models from a year ago.
        Will this trend continue? Who knows. Both the frontier and local model will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.
        Hoever, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less that $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.
        It's likely that there will always be a gap between frontier and local if you're comparing models at the same time, you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.
  - zozbot234 3 hours ago
    People can't seem to agree on what "Opus class" even means (the latest Opus is apparently pretty weak) but DeepSeek Pro, Kimi and GLM all are quite capable.
    [-]
    - computerex 2 hours ago
      Nothing compares to Opus when it comes to "taste" in web design in my experience. Nothing compares to opus in very difficult HPC/model inference development. I worked on this with opus: https://github.com/computerex/dlgo
      OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.
  - rvnx 2 hours ago
    To me totally yes, even further, if they keep their existing route, over time people will stop using Anthropic.
    More and more specialized and ultra-performant chips are going to flood the consumer market. Especially once new hardware foundries will start producing (well if we don't die from WW3 in the interval).
    In 10 years from now, when even basic computers will have 128 GB of memory, and phones will have super optimized tuned models, then what will be the point of Anthropic ?
    Just use Gemma/Gemini/Siri or whatever.
    Pornography and uncensored models is also pushing toward local models.
    It's not like needs of people grows exponentially, the needs follow an asymptote instead (they are capped).
    The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.
    For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.
    It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).
- hparadiz 2 hours ago
  I am right there with you. Mind-boggling. It's a indistinguishable from magic technology!! I tried running some basic tasks through Qwen with Opencode on a 10 year old dual Xeon server for shits and giggles. I gave it a simple task like "use ffprobe first but convert this webm to mp4" and it was able to complete the task with zero network calls outside my network. On 10 year old hardware. It took about 3 minutes to complete the task. Now you may be saying 3 minutes? pfft. But I dare you to do it yourself. You're gonna be googling the CLI switches for at least 10 minutes and setting up your command. I had it actually optimize all the switches on the fly for me based on an initial ffprobe to see what is optimal.
  [-]
  - bluerooibos 15 minutes ago
    > 10 year old dual Xeon server...On 10 year old hardware.
    Hold on, what are the specs of your rig? How much RAM?
    I'm been considering getting an old refurbished 2018 Mac Mini with 64Gb of DDR4 RAM but everything I've read suggests this will be way slower than my 16Gb M1 Pro Macbook.
    [-]
    - hparadiz 1 minute ago
      I inherited a box with dual Xeons and 256 GB of DDR4. I then ran several tests and benchmarks of the hardware with several models.
      I've been meaning to write a blog post but well whatever here's the md.
      https://gist.github.com/hparadiz/f3596d00a62d8ebb2dadcc46ee5...
      Qwen3.5 9B performed best.
      You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say a dual 7900XT would get way better performance.
- motbus3 2 hours ago
  Try deepseek V4 flash
- nyxtom 2 hours ago
  Have you found that being much more spec driven helps guide it better?
- amelius 2 hours ago
  Sounds super cool, don't get me wrong, but I suppose for most people the bar is higher than HTML/CSS.
  [-]
  - q3k 1 hour ago
    I love to warm up a whole rack of servers just so that some shitass buggy TUI can generate a line of bash that comments out my test runner.
    We truly live in the dumbest timeline.
- GardenLetter27 3 hours ago
  Could the harness not check for a failed tool call and pass it to a small model for correction without clogging up the main context?
  [-]
  - lambda 3 hours ago
    The thing is, to do a proper fix it would really need all of the context (maybe the tool call that failed was for an edit to a file that was last touched way at the beginning of the context), so you'd need to either keep that smaller model running doing prompt processing all the time, or have a very long wait while it does prompt processing on your whole session.
    And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.
    Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context
    [-]
    - everforward 1 hour ago
      An illustrative example I've seen a lot is creating Jira tickets in projects with custom fields marked as mandatory. It tries to create the ticket without the field and the tool call fails. The LLM needs access to the full context so that it can generate text to put in the "Why couldn't this meeting be an email?" field.
  - Greenpants 3 hours ago
    I'm actually quite sure that directly retrying the tool call would often fix the edit-call already. But these models have been trained to "think" for a while for any problem solving, so they'll presume the problem of the edit is more fundamental and spend unnecessary tokens filling up the context.
    I'll experiment more with the effectiveness of AGENTS.md rules for local Pi agents. I feel like smaller (local) LLMs just lack in attentiveness to elements in the context window, like precise instructions, compared to e.g. Claude models.
- yieldcrv 2 hours ago
  > It gets into loops quite often
  matches my experience and a deal breaker
  also the context window sizes are too low. I can't operate in 65,000 windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.
  200k context windows and above for me now
  I saw a paper last night that should help this a lot though
  [-]
  - kennywinker 2 hours ago
    Qwen3.6-35b handles 256k context fine if you’ve got room for it. I’m running it with 128k context with just 16gb vram.
  - Greenpants 2 hours ago
    I get that it's a deal breaker to some; it definitely requires patience.
    In Pi, /new is my best friend and most-used command for sure. For simple tasks (I decompose complex ones anyway since I don't trust small local LLMs to do this for me), the model doesn't need much context, given that I'm proficient in my codebase myself: "I'd like Feature X. Look into files 1, 2 and 3 to make your edits."
- nobody_r_knows 3 hours ago
  [dead]
horsawlarway 4 hours ago
For personal use, yes.
I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.
I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.
To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.
For my personal needs, free beats $100/m.
I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).
Some example projects
- Replacement launcher for android tvs (with usage monitoring and tracking for kids)
- Custom admin portals for my k8s cluster services
- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)
- Grocery list management and meal planning (mostly via openclaw)
- some custom workflows for 3d asset generation in comfyui.
---
Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.
[-]
- rootlocus 4 hours ago
  2x RTX3090 are around $4400. Without any electricity costs or other parts, that's 3.6 years of $100/m claude.
  [-]
  - overgard 2 hours ago
    Assuming the $100/m claude subscription is still around in three years.
    [-]
    - reddalo 1 hour ago
      [dead]
  - freetonik 3 hours ago
    That's also years of top tier PC gaming, if you're into that.
    [-]
    - augusto-moura 3 hours ago
      2x RTX3090 is extremely overkill for gaming, you can run any released game on earth on ultra for much less
      [-]
      - drnick1 2 hours ago
        1x RTX3090 is absolutely not overkill for gaming however. Nowadays it's barely enough to get 60FPS in 4K in some recently released games. But the shocking part is that my 3090 is still probably worth as much as when I bought it about 4 years ago.
      - overgard 2 hours ago
        Having a second card doesn't really work well for gaming.
      - googletron 3 hours ago
        what?
        [-]
        kakacik 3 hours ago
        AFAIK nvidia cards dont work in tandem (aka sli in the past) very well these days. So that aint true.
        Also, 2 gens old means bad performance at ray tracing, abysmal path tracing if at all. Pretty sure it can't run smoothly CP2077 in native 4k without dlss upscalers with all on ultra.
        [-]
        himata4113 3 hours ago
        You can have the 2nd card as an offload for upscaling, frame generation and whatnot.
        [-]
        irishcoffee 2 hours ago
        When I'm not running models I use the 2nd one in a pass-thru configuration to a windows vm for various things, usually gaming.
  - horsawlarway 4 hours ago
    Yes, today is not a great time to purchase hardware.
    When I bought, I paid $850 a piece. And I needed one anyways for the gaming I was going to do.
    My guess is the next good time to buy is going to be 24-36 months from now, depending on how the AI bubble goes.
    ---
    I'll add to this, I personally don't like Apple hardware (not so much related to the hardware as their company philosophy) but their machines with unified memory (or AMDs latest unified memory offerings) get pretty equivalent speeds to my 3090s, and are probably a much better modern entrypoint to local llms.
    There's a reason the joke is that Silicon Valley software devs bought up all the Mac minis for OpenClaw.
    You can get a 48gb unified RAM M4 pro mac mini for ~2k. If you're not going to do much else with the machine, it's what I'd pick as my budget inference device right now. Spend a year of claude now, get ~150tok/s for the next decade (plus) for ~free.
    If you want more capable and are willing to spend a little more, go with the newer Ryzen AI Max+ 395 machines.
    You'll spend less on power too.
    My last suggestion would be to go buy an RTX3090 at this point. You can do a lot better for a lot cheaper.
    [-]
    - tracker1 1 hour ago
      If you're willing to go the AMD route, the AMD Radeon Pro R9700 definitely looks interesting for the price compared to NVidia.
      [-]
      - felooboolooomba 0 minutes ago
        Can we also run LLMs on Radeon?
  - jmuguy 3 hours ago
    Or a really excellent experience playing Satisfactory with the settings cranked up, which is priceless.
  - tripleee 3 hours ago
    Christ GPU prices have gotten crazy
    How do AMD cards perform with LLMs? A 9070 is sold for ~$600 and has 16GB VRAM
    [-]
    - overgard 2 hours ago
      In my personal experience, I wouldn't bother with 16GB cards for coding -- the useful models are _slightly_ too large to work at any reasonable speed
    - lambda 2 hours ago
      That should do pretty well. Memory bandwidth is the biggest bottleneck for token generation, at 644 GB/s you should be able to do pretty well on a 9070, while prompt proessing is more compute bound and Nvidia tends to have the edge there.
      16 GiB won't fit you much, so you'd probably want at least 2x, and preferably 3x of those, and then you need a motherboard, power, etc. that can handle that.
      [-]
      - tracker1 1 hour ago
        You can get an R9700 with 32gb vram for ~$1200-1400 depending on where you live, which is probably a better option for AI use than 2x 9070(xt)
        [-]
        lambda 32 minutes ago
        Yeah, definitely.
  - nyrikki 3 hours ago
    You can get 60tps with three 1080tis and the sparse model, and I bet two 16gb 5060tis would do the same for ~1200. One 3090 is enough for a useful system, even on an old am4 host.
  - flowerthoughts 3 hours ago
    In 3.6 years, chances are they are still worth $3k. Unless some new chip fab pops up that can spam the chip market. Even if the AI bubble bursts, I doubt we'll see high-RAM GPUs sell off.
  - sieabahlpark 3 hours ago
    [dead]
- kpw94 3 hours ago
  > gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models
  Since you're running quantized (at UD-Q4_K_XL) , check out the "qat" models (unsloth/gemma-4-26B-A4B-it-qat-GGUF) !
  - https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF (With "Jun 9 Update: Added MTP support.")
  - https://blog.google/innovation-and-ai/technology/developers-...
  [-]
  - SubiculumCode 29 minutes ago
    How is the the QAT models at coding? I looked for opinions since the release and haven't found much.
  - me_bx 2 hours ago
    TIL:
    > Quantization-Aware Training (QAT) [...] allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model
- twothreeone 3 hours ago
  > unsloth/Qwen3.6-35B-A3B-MTP-GGUF
  I've actually tried this exact same model locally as well.. albeit on just a single 3090 at 128k context and I got around 40-60tok/s with Q4_K quantization.
  The thing that bugged me the most was really the quality of the output on moderately complex real-world coding tasks. Having to switch between "prompt/vibe" and "manually implement" is such a big context switch burden, because you really have to ask yourself every few minutes if you're "holding it wrong" or the model is just too stupid.
  It also doesn't really seem to handle transitions from "low-level implementation detail" to "high-level design" well, e.g., it wouldn't easily render tables and such. With Claude I don't have this issue.. so I think for now my verdict would be that it's not really a viable replacement. I really hope it will be in a few months time.
  Oh and I used "aider" to replace claude CLI, which maybe that's also sub-optimal.. I'm not sure. The MCP marketplaces are useful of course, though arguably you could just manually replace them over time.
  [-]
  - horsawlarway 3 hours ago
    I don't generally switch to implementing myself on the model, although there are definitely times where I stop it and correct it mid-task.
    It's prone to thinking longer and more repetitively, again - it's definitely not opus 4.7/4.8.
    I've been using pi.dev as my harness for it, and been pleasantly surprised by how nice it feels (I have used aider, but only very briefly and quite a while back - so I can't realistically compare).
    I would say it's roughly where I felt claude was a year back - Most of the sessions need to be more "pair programming" and less "I let it run for hours".
    I'm a big fan of frequent "human in the loop" style workflows even when I'm on something like opus at work, though. I have opinions about lots of things, and re-inforcing that the model should stop and ask frequently seems to get me considerably better output, without having to "re-roll" if you will.
    I've done a good bit of management, and I think it's roughly producing what a junior dev might produce in a day every 5 minutes. And just like a junior dev, you need to be steering it back on track fairly often.
    Opus feels more like a mid-level at this point. I can hand it a chunk of work and "leave" but I still get better output if I'm checked-in and watching/steering.
  - unethical_ban 2 hours ago
    I'm so out of the loop on this stuff, it's the first time in my IT career I feel really behind on things.
    I've used Claude Opus to quickly and effectively pound out some 100-200 line scripts that integrate with a vendor's API, and it one-shotted them both almost perfectly.
    I wonder if for a lot of these local models, the scope of the AI assistance should simply be smaller: You architect the tools and the function definitions, and then tell AI to implement one at a time? Does anyone do that rigorously?
- gonzalohm 4 hours ago
  Did you double the tokens per second by adding a second GPU or was the increase significantly less?
  [-]
  - horsawlarway 3 hours ago
    No real change in inference speed. It basically just allows me to slot in more context or a bigger model.
    A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.
    Sometimes that matters, a lot of times it doesn't.
    On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.
    I get very similar quality from the both the Gemma-4 31B dense model, and the Gemma-4 26B MOE model (both at Q4 quant) but the MOE version runs at ~3 times the speed (150tok/s vs 46tok/s).
  - mirekrusin 4 hours ago
    You’re adding extra gpu for more vram, not speed.
- anhtqweb 1 hour ago
  Grocery list management and meal planning sounds interesting. Would you mind sharing a little bit more on your use case please?
- agup792 4 hours ago
  That sounds amazing. If I had some GPUs sitting around, I would totally do it. Sounds expensive to do it otherwise though.
jderekw 2 minutes ago
Running AMD Lemonade as the daily rig, Started with Ollama then over to LMStudio and now standardized on AMD Lemonade which has been helpful to monitor cRAM, CPU, GPU and gRam. The multi-models on Lemonade make it straight forward to run a stack for LLM, Voice to Text, NPU, and Image Generation. Platform also works with Nvidia, Apple, Intel and AMD chip sets.
milchek 3 minutes ago
I’ve tried in a 36GB MacBook Pro and haven’t had much success beyond very basic work. Issue for me was the context runs out quick even with smaller models and it’s slower. To get some half decent performance I’d imagine you want 128gb memory and are spending a lot more on hardware. At that point it becomes a question on whether you’d rather have frontier models at a subscription or sink that money into your own equipment. Of course, for those with privacy in mind your only option is forking out the cash for the higher end machines.
bluejay2387 4 hours ago
About 90% of my coding is on Qwen 3.6 27b and Open Code with some custom skills and Semble. It is NOT as smart as CC or Codex but its enough to get most of my work done. I didn't set out to replace CC and Codex (I have an RTX 6000 so the TPS is faster than I care about, but the RTX 6000 was originally for other work). I only tried this just to see how close you could get to a frontier model for coding as an experiment, but it was good enough that I stuck with it. I still fall back to Codex for really complicated stuff and to polish UI's as that seems to be the weakest element to working in Qwen.This isn't a recommendation because I don't think most people have an RTX 6000 laying around and the cost would be many years of MAX CC or Codex subscriptions, but at least this seems possible. Maybe in a few more years it will even be practical.
Other Notes: I have had to set the compact target to 75% on a 256k context window as once the conversation length goes about 100k I start seeing a drop in the quality and speed. This becomes very problematic after about 150k. I tried Qwen 3.5 122b too but it actually seems much worse at coding than 3.6 27b even though its much larger. Maybe because I am using a 4bit quant or maybe I just don't have it configured correctly? I know 3.6 is newer but I didn't expect it to out perform a model that is much larger from the prior generation. Gemma 4 31b is a good model for other tasks but at least my personal experience is that Qwen outperforms in coding. Nemotron Super 120b is great at a lot of stuff but it also seems to be not as good at coding as Qwen. This was very surprising to me.
[-]
- heipei 3 hours ago
  Same here, I use Qwen 3.6 27b (Q6 quant) with llama.cpp on an RTX 5090 using the pi agent exclusively now. The fact that it's local means that I never have to think about token pricing, quotas, time of day, or data sensitivity. I have limited the GPU from 600W to 450W which means the system stays whisper quiet during inference.
  I have become so "lazy" (in a good way), so far that I've started using the model for lots of daily mundane things on top of just coding:
```
  * "commit this on a branch, push, create a PR and assign $nickname for review"
  * "Use the Stripe CLI to download all open and overdue invoices and reconcile them with this CSV export from our bank account."
  * "Use these Elasticsearch credentials to summarise what kind of operations are causing load at the moment."
  * "Tell me if our codebase already supports X and where it's  implemented."
```
  [-]
  - amarshall 40 minutes ago
    What context length and kv cache quant (if any) are you using? And MTP?
- bo1024 4 hours ago
  Qwen3.5-122B is actually Qwen3.5-122B-A10B. The A10B means that this is a "mixture of experts" model where only 10B parameters are activated at a given time. Whereas Qwen3.6-27B is a "dense" model where all 27B parameters are activated all the time. So for many tasks, you'd expect the 27B dense model to be better than the 122B-A10B model.
- user43928 1 hour ago
  I am forced to use Qwen 3.6 27b at work and found it next to useless. I might as well do all the work manually rather than having it implement another mess or get the debugging entirely wrong.
  It feels like anything less than Sonnet is just a waste of time, apart from use as a smarter search function.
  It also strikes me as strange that you would mention Codex for UI polish, as it's notoriously bad at UI, and far behind Claude Opus. Altman specifically posted that they are working to improve this for the next model release.
  [-]
  - sejje 1 hour ago
    It might be good at analysis & review, writing documentation, git commits, etc--even if it's not good at coding.
    All the drudgery.
    [-]
    - user43928 43 minutes ago
      Bad AI written documentation and commits are not great, particularly when you work in a team.
      I almost find it offensive when colleagues open a MR with an obvious slop description that's frequently inaccurate.
      That said, I find AI useful for a lot of drudgery like resolving merge conflicts or splitting changes out into separate MRs.
      Particularly with the latter I had issues with small models, they butchered the changes I wanted moved. Not even on the second attempt did GPT 5.4 mini manage to move 10-20 lines to another file without modifying them in the process.
- htrp 4 hours ago
  why 27b vs 35b? Is MoE that much worse for coding?
  [-]
  - amarshall 34 minutes ago
    Can take the geometric mean of total and active parameters of MoE to get approximate equivalent quality to dense model params. So sqrt(35*10)≈18.7.
    The trade-off of MoE is that it is worse but faster for the same total size.
  - electronsoup 2 hours ago
    Yeah MoE is a little worse for the same size, but you can often run bigger MoEs at respectable speeds even on cpu ram offload. The dense models really need to be 100% vram
pierotofy 5 hours ago
Yes. Llama.cpp + Qwen3.6-35b (MTP) + OpenCode is quite capable and runs on a single RTX 3090 and is faster than most cloud models. Quality is like running edge models from 8-12 months ago. Setup details at https://github.com/pierotofy/LocalCodingLLM/
[-]
- jacobgold 4 hours ago
  "Quality is like running edge models from 8-12 months ago."
  That sounds great for hobbyists but IMHO it wasn't until Opus 4.6 was released six months go (Dec 25, 2025) that we had a model good enough for professionals to use as a primary driver of their coding agents. That seems to be the threshold worth aiming for.
  [-]
  - sbrother 4 hours ago
    I strongly agree on that being the release where these tools got good enough to substantially speed up my professional work. I have to admit I was super skeptical of AI coding until then.
    [-]
    - dnautics 3 hours ago
      for me (might be because of the language im using) i had a substantial bump around september and a huge bump around January.
      in my stuff now i use an OT library that claude put finishing touches on in September.
  - Projectiboga 4 hours ago
    So thalen it might be 6-8 months to get to useable on a local open model? Of course state of the art will be a year ahead, a generation at the current pace.
  - pierotofy 4 hours ago
    I use it for work.
    [-]
    - jacobgold 4 hours ago
      That's cool if you prefer it, but it is hard to imagine it being a strictly rational choice when much better quality is available at a price that is small relative to the cost of an employee. Or is there something specific about your use-case?
      [-]
      - vector_spaces 4 hours ago
        Not all work requires every facet to be so sharply optimized, and there may be other constraints that are completely invisible to you. Some that were easy for me to imagine: the parent works in a heavily regulated industry, their IT team is slow-moving and paranoid and this is a safe, under-the-radar workaround, the output is "good enough" for their purposes and they find tinkering with it to be fun.
        Regardless I don't think it's fruitful to be so condescending with such little insight into this person's situation. Even if you had total insight -- let people be and withhold your judgement, or at least keep it to yourself. Making people feel stupid is a great way to turn people off to pretty much anything else you have to say
      - lokar 4 hours ago
        Won’t it depend on what you use it for? A less capable system might be fine for boilerplate, moderate re-factoring, etc. Not everyone is building whole features in one go.
      - pierotofy 3 hours ago
        To me, what's not rational is believing you must rent the tools of your trade while exposing all of your employer's intellectual property to a third party. Difference of opinion.
        [-]
        jacobgold 3 hours ago
        It's not my opinion that you "must" rent tools but it certainly is the pragmatic choice in 2026. I would be as happy as anyone for this situation to change and I expect it to at some point.
- trueno 4 hours ago
  i have a 128gb m4 max macbook pro i've been wanting to tinker with this stuff but genuinely never find the time. any mac users in here running similar to the above that can share their experience?
  i always see great debates with local stuff but the space is constantly moving goalposts and all the vernacular is pretty unfamiliar to me. i'd love to understand what people with objective experience feel they've traded away (or gained) when going local so i can determine for myself if these things are a good fit.
  [-]
  - brycesub 4 hours ago
    If you have a 128GB Mac you really ought to try out: https://github.com/antirez/ds4 by the creator of redis. This is probably as close to it gets to state-of-the-art local LLM + agentic coding.
    [-]
    - __mharrison__ 1 hour ago
      Using this just this morning on my DGX Spark. A little slower than frontier models but my $200/mo weekly usage exhausted with 3 days left on the week...
      (Shouldn't have done that refactoring job in high mode)
    - trueno 1 hour ago
      well this is supremely interesting thanks for putting it on my radar
    - lostlogin 3 hours ago
      Thank you.
  - htrp 4 hours ago
    Use your ClaudeCode sub and tell it to set it up for you
  - dirkolbrich 1 hour ago
    I have the same machine. You might look into https://omlx.ai/ a „macOS-native MLX server“. pi.dev for the agent with MCP, web-search and sub-agents extension.
- atomicnumber3 4 hours ago
  Same. I have no desire to use Claude at all anymore.
  [-]
  - pierotofy 4 hours ago
    Yep. Screw Anthropic, CloseAI and all other rent seekers in this space.
    [-]
    - akulbe 3 hours ago
      I have an M2 Max MBP with 96GB of RAM. What models and setup would you use for this kind of configuration?
      [-]
      - monirmamoun 2 hours ago
        download LM Studio to play with, and it will let you search for models... try Qwen3.6-35B-A3B at 4,5 or 6 bits (6 bit XL is near perfect) and use pi coder or another harness to access it... you can also try Unsloth studio and try same model to start. LM Studio slighter easier to use, Unsloth probably better quality. Neither one is super great quality by the way (meaning: they crash or act weirdly too often to be full production solutions, but can work for local coding). ONCE YOU DOWNLOAD EITHER APP... it will let you search huggingface for the models. Just type qwen to start looking and ... start messing around. And you connect the pi coder harness using the http interface that LM Studio and Unsloth offer to the engine API, so make sure you figure out that url and turn it on... something like 127.0.0.1:1234/api would be a typical IP (localhost) and port (1234 is used by LM Studio)
- daveidol 4 hours ago
  Do you do your dev work on the windows machine (referenced in the docs), or do you remotely access it from a separate machine? I ask because I have a RTX 3090 kicking around in a gaming desktop, but I don't use it for any dev work (I use a Macbook Pro).
  [-]
  - snake_n_my_boot 2 hours ago
    I have a similar set up and have been using it to learn and tinker with open models. I run Ollama on the gaming desktop and point OpenCode to it from my MacBook. Works nicely for me so far.
- lelandbatey 4 hours ago
  I use it, it's good, I get work done, but know that they really mean it when they say
  > "Quality is like running edge models from 8-12 months ago"
  Don't expect Opus, expect more like Haiku. If you micromanage it, you'll get great results. If you want it to be a human in a box, it'll flounder.
- dheera 4 hours ago
  Am I doing something wrong or has ollama become shittified?
  I'm looking at https://ollama.com/search and the top few models like kimi-k2.7-code say "cloud" and I can't seem to ollama pull them.
  I thought the whole POINT of ollama was not-cloud?
  [-]
  - hoherd 4 hours ago
    I experienced the same situation a month or two ago. One of my friends sent me this article that was illuminating. https://sleepingrobots.com/dreams/stop-using-ollama/
  - satvikpendem 4 hours ago
    Ollama is not recommended to be used. Use llama.cpp.
  - jmorgan 3 hours ago
    The larger models are available on Ollama's cloud as most folks don't have the hardware to run 500B-1T parameter models.
  - jubilanti 2 hours ago
    > I thought the whole POINT of ollama was not-cloud?
    It was at first, then the developers realized they had a massive userbase they could monetize. A tale as old as open source...
  - toyg 4 hours ago
    Yes, you've nailed it. Ollama are desperately trying to pull a Cursor - like 3791 other projects in this space.
- dominotw 4 hours ago
  how much does the setup cost if i want to buy all the hardware now and increased power costs?
codinhood 5 hours ago
I don't think you're going to get many "true" answers to this. The opportunity cost of not using the latest and best models is just too much right now.
Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus just not worth it right now. If it was, it would be distributive enough to be in the news.
Not that I'm discounting someone hasn't already solved this, just trying to Occam razor my way out of diving too deep down rabbit holes.
[-]
- pyeri 3 hours ago
  At some point, there will come a saturation point for that "Opportunity cost FOMO train ride", and I think we are already past that point. Mythos class models are a whole different beasts and cutting edge on reasoning but not much use for the problem domains most developers are trying to solve.
  The present Sonnet/Opus versions (~4.8) will likely be what everyone in the enterprise might end up using eventually. And even though local models aren't there yet, there are budget alternatives from the families of DeepSeek, Kimi, GPT, MiniMax, etc. available through APIs of NVidida, OpenRouter, Groq, etc. which are very much Sonnet grade.
  [-]
  - codinhood 3 hours ago
    Yeah this is exactly what I'm waiting for.
    Personally, I don't think we're at that point yet. While I do think model improvement is starting to plateau (reaching a local ceiling), I'm not convinced local models are as good as sonnet/opus yet. The gap is still too much. But I'm excited for those models to reach those levels.
- gunapologist99 36 minutes ago
  Rather than Occam, consider Pareto?
  If you truly believe that it WILL get there within the next couple of years, then you might as well start playing with it now (and, yes, you will be very surprised, especially for shorter/smaller projects or nicely modularized larger projects)
- mark_l_watson 2 hours ago
  Sounds like a correct conclusion to me also. I am trying to transition to a layered system: local, then OpenCode with commercial vendor APIs for models like DeepSeek v4 flash, then DeepSeek v4 Pro.
  With a layered approach we can slowly shift to running more locally and still get required work done. Really, my local setup is so much better than it was 2 months ago, and extremely better than 6 months ago - on the same hardware.
- sakopov 3 hours ago
  This seems to be the answer. Building a rig with a decent graphics card will cost $2k+ and will produce sub-par results. Might as well milk the $100/m Claude sub until open-source alternatives reach parity with today's frontier models.
- MadrasThorn 3 hours ago
  It's great at accelerating hardware innovation however.
- jrm4 4 hours ago
  But you're pretty much measuring opportunity cost in tokens per second, no?
  I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by percieved quality of private model) actually means "better or more useful output."
  I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)
  [-]
  - codinhood 4 hours ago
    If you’re arguing that model metrics don’t necessarily translate into useful output, I agree. That’s not how I measure the success of a mode and not really the point I'm trying to make. I try to set things up and test it on my actual projects.
    What I’m saying is that if local models were actually comparable to Claude Code in practice, we wouldn’t be having threads like this. It would be obvious to the people using them, and it would be massively disruptive. Why would individuals and companies pay hundreds or thousands for Claude Code if they could run something locally and consistently get similar results?
    Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.
  - Rastonbury 4 hours ago
    I think they are referring to the opportunity cost of time saved on doing things a local model cannot do or fixing it's mistakes against the cost of a subscription
sosodev 5 hours ago
The problem with this question is that it encompasses a huge spectrum of capabilities and expectations. If you can only run an 8B model and expect it to be good at vibe coding / one shotting things you're going to have a bad time.
If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.
If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.
The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.
[-]
- argee 4 hours ago
  I use Gemma 4 26B A4B on my Macbook (M4 Pro, 48 GB RAM) to study Rust (and ask other myriad questions). I don't trust it to do a good job in an IDE/harness to one-shot anything but the most trivial of changes. Still, it's fast and good enough that it could handle being a "co-pilot" on small to medium context tasks where you've got your hands on the wheel and your eyes on the road — and are driving under the speed limit. That's remarkable given where we were a couple of years ago.
  I don't think I'd be using AI to code at all if this weren't the case. (I don't want to feel stunted or stuck just from losing my internet connection.)
- user43928 1 hour ago
  My experience with smaller models, in this case specifically GPT 5.4 Mini, is that they cannot two-shot moving a 10-20 line code change to another file without modifying it and introducing bugs.
  I did not expect perfect reliability, but I thought they could at least get it right on the second attempt once you point out the difference. No such luck, it confidently tells you that now the code is the same, with yet another subtle bug added in the difference.
  I don't know what work one would need to do where these garbage-class models would be adequate. Maybe they can masquerade as competent for a few minutes, but in the end the results simply are not right. At best they are suitable for a smarter search or autocomplete, in my opinion.
chungus 2 minutes ago
Yup, although technically not replaced because I never used either of those products because I don't like sending my code to their black box. I have 2x24GB AMD gpu's, gotten from gamers on my local marketplace, one is connected with a 40cm riser cable. Running Qwen 27B and am very happy with its performance. Q8 with 135k context (arbitrary number, I could push it to 256). I like to use qwen 35B3A for mapping out entire code paths through our relatively complicated codebase/infra at work.
I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem run through by miners and the likes, to build an even bigger rig for parallel execution.
Power usage is also totally not an issue, AI workload is very different from gaming.
tldr llama.cpp-vulkan with opencode on total 48GB VRAM AMD cards on arch btw.
garethsprice 2 hours ago
Using OpenCode + OhMyOpenCode + Qwen 3.6 35B-A3B Q_4_KM on an Ada 4000 (20GB VRAM) at 55 tok/sec for generation (slower than it sounds as OpenCode has a bunch of context it adds). Meaning to check out pi when I get a minute as I hear that one mentioned a lot lately.
I am using Opus to generate plans that the local agent then follows, then validated by Opus. So I'm not at 100% local but these models are increasingly part of my production workflow. Probably not worth doing - yet - unless you are a hobbyist who likes spending time and money tinkering.
This setup is certainly not as "good" as Opus or other frontier models but they are "good enough" for an increasing number of rote tasks. You don't need to drive a Rolls Royce to the supermarket, when a used Corolla gets you there just fine.
It also enables new workflows that would be cost-prohibitive with frontier LLMs (especially as token costs rise) - eg. overnight I use the Chrome devtools MCP and have the above setup fuzz-test as a user for a number of hours and see if it can break things. Even got it working with multi-modal so it can check screenshots, which blows my mind (and not my wallet, as Claude+screenshots burns $$$).
The "12-18 months behind frontier" sounds about right, it's about where I was with gpt-4o and basic harnesses back then. In another 12-18 months my bet is we have Opus-level models that can be run locally for <$5k... but the frontier models will be even further forward (unless governments have blocked them). Fun times.
arjie 6 hours ago
Not “local” and not interactive coding but sharing since it might be helpful. I have 2x RTX Pro 6000 Blackwell running DeepSeek V4 Flash. I get 160 tok/s raw but it’s a reasoning model. For my use case, I have it auto-write code and another system auto-review the code.
I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.
[-]
- akersten 4 hours ago
  > I have 2x RTX Pro 6000 Blackwell
  Where did you find/order these? All the sites I can find are either out of stock, only sell to businesses, or are otherwise sketchy...
  [-]
  - arjie 1 hour ago
    I run a small business (https://technologybrother.com) that runs a few small SaaS so I ordered the GPUs through corporate sales. If the barrier is getting an LLC, those are relatively cheap. The nice thing is that if you've got a legitimate business with use for GPUs you can get into the Nvidia Inception Program which has a pretty solid discount.
- leptons 5 hours ago
  Have you measured your electricity consumption for this rig? I have to wonder how much it would cost you per month.
  [-]
  - ux266478 4 hours ago
    Not nearly as much as you might think. 1.2kw where I live translates to about $0.12/hr, and that's when running full clip. If you have a decent solar hookup, it's small fraction on a sunny day.
    The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.
  - mtone 54 minutes ago
    [dead]
jodoherty 3 hours ago
I use pi with an RTX Pro 6000 Blackwell to run Gemma 4 31b to do all my agentic coding.
I find it useful.
This side project highlights a similar approach to how I scope and tackle projects at work now:
https://git.theodohertyfamily.com/wg-wrap.git/tree/README.md
https://git.theodohertyfamily.com/wg-wrap.git/tree/CASE_STUD...
You have to apply a lot of careful architecture and TDD to your approach. Eliminate technical risk by tackling hard things early and wrapping them up in a simple, easy to use interface.
I find I can get some projects done 2-3 times faster than if I wrote them by hand. It can also save about 5-10x time on mundane or broadly scoped projects by helping me consolidate and try out ideas very quickly.
Setup-wise, I switch between vLLM using nvidia/Gemma-4-31B-IT-NVFP4 and llama.cpp using unsloth/gemma-4-31B-it-qat-GGUF with MTP. I throttle the GPU power usage to 400W.
My current llama.cpp setup gets token generation rates between 60-150 t/s depending on MTP draft acceptance rates. Prefill is between 1500-4000 t/s depending on context length/depth.
_bobm 1 hour ago
But, guys, when you say Claude/ GPT models, do you stop to think what are these "models"?
One day I thought about how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself. Just think about it.
As a matter of fact, think about these operations, api endpoints, observe their output.
These so called SOTA models are not what meets the eye, and are not at all comparable in the infra department to local models. There is crazy orchestration going on due to the scale of these operations. But also these hard constraints lead to innovation. Innovation nobody speaks about.
I wouldn't say we cannot catchup, but serving our local models through llama, vllm is just the A, B, C of it all. In reality I think what is needed is a replication of said orchestration which I hinted at above.
The SOTA models are a deep orchestration of multiple models operating together it isn't a single model. As such no single model ever will catchup to them until it replicates through training first and then maybe through model architecture this orchestration.
Finally, I would wager that the SOTA "models", as one of these models in this orchestration setup, as served for general consumption, are not so much more capable than qwen 3.6.
I am sure that if you change your perspective you will start noticing the scale of the "magic".
[-]
- XCSme 48 minutes ago
  > The SOTA models are a deep orchestration of multiple models operating together it isn't a single mode
  I don't understand, why does it make you think this is the case?
  > how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself
  Can you give an example?
  [-]
  - _bobm 7 minutes ago
    > Can you give an example?
    Sure, connect opencode to an openai/chatgpt endpoint and use it. You will notice multiple "thinking" parts per "turn".
    I put all of these in quotation because... they are part of the orchestration game. For example, it is not known if the thinking parts of a particular turn are chain of thought thinking summaries or just plain response which is masquaraded and thus orchestrated into appearing as thinking.
    Further notice the cadence, word choice and sentence formation. Notice sentence construction. Notice "thinking part" construction and sequencing.
    There is pretty heavy orchestration.
    > I don't understand, why does it make you think this is the case?
    Because not all tokens are equal. And if you waste expensive tokens on mundane tasks you will go out of business. This is the reason.
    As I said, if you observe the output from these api endpoints you will notice it.
Kostic 4 hours ago
For personal needs I connected VSCode with llama.cpp running Qwen 3.6 27B or Gemma 4 31B and it's good enough to cancel my cloud subscription.
Qwen running on my 1st GPU at q4@176k context from 70 to 50 tok/s with MTP, pretty good for coding.
Gemma on the other hand is using both GPUs, running q8@64k context, doing document sentiment analysis, summarization, proofreading and translating, at consistent 25 tok/s. Somewhat slow but usable for batched workflows. Might get some more once llama.cpp starts supporting MTP with tensor split mode.
Still using frontier LLMs at dayjob since I'm not paying it and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.
EDIT: Prompt processing starts from 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k of tokens and require from 60 to 90 seconds to be processed. Not great but acceptable.
stymaar 4 hours ago
Yes, Qwen3.6-35B-A3B on a Strix Halo 128GB (Bosgame M5).
I have way too much VRAM forme such a model but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible, this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.
And Qwen3.6 has been surprisingly effective for me, I still use Clause occasionally but only for like 10% of my needs which allows me to stay well under the quota even with the cheapest plan.
Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).
[-]
- manmal 3 hours ago
  Have you tried the 27B dense version? It’s way better for coding.
  [-]
  - anana_ 3 hours ago
    Unfortunately on Strix Halo or any similar unified memory set up, dense models are gonna be dirt slow due to the tiny memory bandwidth... But I agree, 27B is superior.
    [-]
    - stymaar 3 hours ago
      Exactly. That's why I'm disappointed there wasn't a 122B version, it's 27B but for Strix Halo users.
heisenbit 58 minutes ago
I think it is work to set up but I'm also learning a lot setting it up. Mainly using qwen/qwen3.6-35b-a3b mlx with my 48GB M4 MBP which leaves me just enough headroom for docker dev-container and other basics. I use LM Studio to run and am using it via VSCode. A big difference made the system prompt improving the tool integration (I asked GPT for guidance on that). Before that it was not making changes but regenerating code often messing up than helping.
I mostly run my MBP on low power even when it is plugged in to avoid the noise and heat. Full power maybe doubles speed but more than doubles power.
What can it do: Simple restructuring of pages. Where did it and other models fail: Splitting up Pinia store which GPT-5.4 did without fail. I think with more tuning, guidance for tool use and maybe some support tooling around it performance can increase further.
cuttysnark 4 hours ago
I've had some success with local models by chaining "agents" together in a workflow. Each agent has a different prompt and uses a different ollama model based on what their role is. The project manager, schema agent(qwen3:14b), etc. doesn't use the same model as the coding agent (qwen2.5-coder:7b). Between each step is an orchestrator and with a Playwright task which attempts to surface errors to the agent who introduced the previous code block. Only error-free blocks are forwarded to the next workflow step.
Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent they were to only produce only a manifest based on the task, and to pass off that off to the next agent.
In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
[-]
- pianopatrick 3 hours ago
  I wish someone would do a benchmark and competition for this kind of work flow so we could figure out what works well.
  Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
  Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
  Like "The Local AI challenge"
- sowbug 2 hours ago
  Have you (or anyone else) tried letting agents compete? For example, give the same coding task to two models, or to the same model with a different seed, and have the reviewer choose the better result.
  Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.
HappySweeney 6 hours ago
I have an optane and lots of ram, so I tried full-fat models for writing some function overnight, as I get about 0.7 t/s. My current go-to test is to update a scalar function to transpose a bit-matrix to one using avx512. the cloud models all play with that like its nothing. Kimi 2.6 and GLM 5.1 both failed miserably.
jborak 3 hours ago
I'm using 4x RTX 5070's and first-gen AMD threadripper (1950X) to run Qwen3.6 27B (MTP) Q6_K with llama.cpp and it works great as a daily driver with Pi. Around 50-60 toks/sec. I also connect a few other applications to it such as OpenWeb UI and recently set up Bifrost, an LLM gateway, to be the primary access point for the models I serve.
I've tried other models such as Qwen3.6 35B A3B and I've found that 27B works better for me when it comes to coding. It's slower being a dense model but the quality seems much better. Inference on my system for Qwen3.6 35B A3B is around 130-140 toks/sec, non-MTP, which is insanely fast!
You don't need 4x 5070's to run Qwen3.6 27B, three or maybe even two will work. However, I use MTP (multi-token prediction) to speed up 27B and that eats up more memory because the draft model requires its own context.
Another thing to keep in mind is that the tools you're using have their system prompts that are loaded into the model for each conversation. When I fire up Pi, working with the model is very snappy at start. When I interact with the LLM via Hermes CLI, it's much slower. That's because each prompt with Hermes is loading so much stuff (skills, tools, etc.) into the context and then it's there forever until the conversation ends.
I like running models at home for privacy, but I also like how there are no quotas, usage isn't a worry. If the future is "loop engineering" then you will be burning through tokens and $$$ using a cloud models.
My system idles around 200W and is around 350-450W when inference load is high. Decoding (token generation) isn't all that efficient, and your GPUs sit idle more than you think during inference. Advancements like diffusion may 1) speed up decoding and 2) let you utilize more of your idle GPU.
[-]
- zakisaad 18 minutes ago
  This is interesting to me - why'd you go with the 5070 for your 4x build?
  At first thought, they are quite skewed toward compute (vs VRAM), which is great for gamers but not so great for running LLMs.
  (I run a 5070 in my desktop)
pianopatrick 3 hours ago
I wish someone would do a benchmark and competition for this kind of work flow so we could figure out what works well.
Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
Like "The Local AI challenge"
GodelNumbering 3 hours ago
As someone that spends all day every day talking to LLMs, I'd say the OSS frontier models + a good harness is already a sufficient combo. For local deployments, we are missing one or two hardware generations (and may not get that soon since hardware companies are heavily favoring datacenter segment) to fully move to a local setup.
blurbleblurble 5 hours ago
My experience is that it's not the models themselves that are limiting right now, it's the clunky alternative harnesses with weird missing features making for bad ergonomics around stuff like queue management, interruption, subagents, goals, etc.
[-]
- coder543 3 hours ago
  I agree completely.
  It's also annoying that OpenCode doesn't even try to support local LLMs properly.
  Getting OpenCode to work is possible, but extremely manual and clunky to configure. I have written a script to automate converting my llama-server configs into an OpenCode config, and that helps, but it's not ideal.
  I have seriously considered writing Yet Another Coding Harness in my free time. I have some ideas for what would make it nice.
- horsawlarway 4 hours ago
  Pi is decent.
  I've used the cli agents for claude, cursor, and pi, plus several custom harnesses I've written myself from time to time as experiments (and I guess technically gastown, if we're calling that a harness).
  Pi is... just fine.
  It does what I need it to, has a decent selection of tooling out of the box, integrates nicely with other tools, and generally gets out of my way enough that I don't think about it much anymore.
  If you can run ~30b models at decent speeds, I think most folks would be pleasantly surprised at how capable they are with pi.
  Tack on some of the extensions (ex https://pi.dev/packages/pi-mcp-adapter?name=mcp and https://pi.dev/packages/pi-web-access?name=search) and I get web tooling (ex - perplexity search), access to mcps to do things like drive chrome (https://browsermcp.io/) or firefox (https://github.com/mozilla/firefox-devtools-mcp)
  It's fine. Is it as good as a subsidized top tier model? Nope. Is it free and still very capable? Yup.
  And personally, I've been having a LOT of fun with the pi sdk (https://pi.dev/docs/latest/sdk)
  Which is something that all the other providers charge you api access rates for (ex - thousands a month).
- Insanity 5 hours ago
  Heard good things about pi.dev but haven’t tried it. It might take care of some of those missing features you mentioned.
  [-]
  - bityard 4 hours ago
    pi.dev is more like an agent developer kit. It's basically a substrate upon which you spend hours/days/weeks building your own agents or coding framework. It's pretty much the neovim to claude's vscode.
    [-]
    - horsawlarway 4 hours ago
      I mean - the base experience is just fine, with perfectly reasonable built in tools for file access and editing, plus bash.
      But yes - it expands a lot if you're willing to play with it.
      I'd actually say the vscode comparison is wrong, because vscode is very much "bring your own extension" in the same way that Pi is. While Claude is much more "visual studio" vibes. It's thick, it's opinionated, and it's absolutely not something you can really customize, but it can feel slick for supported workflows.
acc_297 5 hours ago
I've been wondering lately if it would help to take a medium sized model and either in cloud or some local setup actually do Reinforcement Learning from Human Feedback (RLHF) on every prompt as a chore - I don't know if trying to manually finetune a model to your use habits would ruin it or help - ideally if you were diligent you could get rid of some of the ticks that make models for the general public difficult to work with e.g. overly sycophantic, overly verbose, annoying tendency to explain via analogies
but perhaps one individuals prompt feedback just isn't going to ever be enough I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc.. and apparently these end up with bizarre behaviours not necessarily more helpful than the standard models)
I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendancies.
[-]
- htrp 3 hours ago
  Cursor is doing that (i think with Fireworks as their provider)
  https://cursor.com/blog/real-time-rl-for-composer
- rolisz 5 hours ago
  I'm interested in trying something similar. I was thinking to do this for my OpenClaw agent.
  About Owain Evans work: I think he did SFT. On Twitter someone was saying that RL is not as susceptible to what he showed. I'd like to try that
redox99 3 hours ago
Models that you can run at home (Like Qwen 35B) aren't remotely close to Opus or GPT 5.5. Not even close. The only open models that are in that neighbor are around 1T params, so forget about running at home.
It's kind of like driving a shitbox. It can often drive you from A to B, and some people will try to convince you it's fine. It's not.
There's no logical reason other than absolutely requiring the privacy, doing it for fun, or niche use cases like airplanes and so on. If you can't spend the insanely subsidized $20 for codex, you can use an API for chinese models which will run circles around these tiny models.
[-]
- pbasista 3 hours ago
  > Models that you can run at home (Like Qwen 35B) aren't remotely close to Opus or GPT 5.5.
  Is that characterization based on some objective facts or benchmarks?
  [-]
  - kube-system 3 hours ago
    Yes, there aren't any 35B models that are beating frontier models at just about anything generalized
  - redox99 3 hours ago
    Based on private test prompts I've run through OpenRouter.
- xgulfie 1 hour ago
  I don't need a Ferrari to get to work
  [-]
  - orangeisthe 1 hour ago
    But you need the best tools to do the job
nfrankel 5 hours ago
I tried. It works in theory: https://blog.frankel.ch/tokensparsamkeit-coding-assistants/#...
Results depend on the model, of course, and your computer is the limit. Mine wasn't up to the task, unfortunately.
derekered 1 hour ago
I'm using Qwen 3.6 on my MacBook Pro M5 Pro with 48BG RAM for any work that I am particularly privacy conscious about, like any work with my journaling. It's been working great! I don't have any direct comparisons, but I've been satisfied with the results.
moezd 3 hours ago
Not yet. Without pure Apple game or decent GPUs, even with a lot of RAM and threads, all you get is about 30-50 tokens/second, and that's thinking turned off. Without these optimizations your model will have a field day with your MCPs, skills and agent descriptions and you will watch the paint dry before seeing the first output token. Local model serving means you have to fight for every token in your context window, which is quite opposite of what Claude/GPT/Copilot are pushing the industry towards.
K0balt 5 hours ago
Pretty good results with qwen 3.6 27b dense. I’d say it’s about equal to (Claude) haiku 4.5 maybe sonnet depending on the task.
[-]
- kadoban 5 hours ago
  What tool do you use to drive things for you, out of curiosity?
- kandros 5 hours ago
  I’d rather ask my butcher than Haiku for coding tasks
  [-]
  - papichulo4 4 hours ago
    Agreed on this. Anthropic has now changed the verbiage on the definitions of the models under `/model` to say that Opus is for everyday usage, and Sonnet is for routine tasks.
    There's apparently a reason Sonnet and Haiku have been left in previous version #s.
    Still encouraging, though, that things are catching up. We can't expect $20k local setups to match $20bn compute clusters.
    [-]
cheekygeeky 4 hours ago
Our software dev (smartest guy I ever met) is using OpenCode and Tmux with Open Source models. He says the DeepSeek is his model of choice for coding (he call's it "pretty GOOD". He's running two 3090s on an i9 with 128GB RAM. https://www.msn.com/en-us/news/technology/china-s-open-deeps...
bravetraveler 3 hours ago
I'm largely 'all natural', any of my little LLM usage is local. 128G Strix system, a not-super-dense Qwen or Gemma variant will get 50-80 tok/s output. Not subscribing to Anthropic/OpenAI/etc even in the unlikely event these are the last local models released; simply not needed. Entirely fine without and in-model tool usage covers my currency concerns.
bijowo1676 3 hours ago
One of the interesting setups I saw is using expensive frontier models to write and update markdown for your app: specs, product requirements, architecture, etc
but then use cheap/local model to implement the specs.
Markdown is more effective at compressing information and fits the context window easier, than hundreds of source code files
but this requires second and third passes, to smooth out the rough edges
has anyone tried that?
grmnygrmny2 3 hours ago
Just sharing my $0.02 here - I have ethical objections to using OpenAI or Anthropic products so I was a reluctant adopter of LLMs at all. Local models address most, though not all, my moral objections so I’ve been using them for work and personal projects for about a month.
The hardware I have (32gb Macs and a gaming PC with 10gb 3080) can only get me to Qwen3.6-35B-A3B at various quants but that’s enough (200-400 PP, 20-30 TG).
It’s taken some time to learn how to best utilize it - some things take a bit of babysitting or direction - but it’s quite useful. Not having ever used CC I can’t compare but it’s been a great assistant or pair programmer for everything from embedded C++ to Vue. I wish I could run 27B as there have been moments when this model feels like it just can’t quite figure something out but those moments are quite rare. For a lot of tasks it’s a huge time saver and has proved super capable at digging into and fixing bugs given pretty vague instructions.
I’m using Pi as my harness.
agentbc9000 43 minutes ago
Kimi K2.7 is very good - i have been testing it and its very very good, Fable 5 level of goodness.
mitchell_h 5 hours ago
Tried. The context windows just weren't big enough.
[-]
- coder543 3 hours ago
  Qwen3.6-27B supports a 1 million token context window.
  Of course, you have to have the right hardware to be able to run with a context window like that, as it takes about 100GB of memory on my DGX Spark to do that with full f16 KV cache on the q4_k_xl model.
- lysace 5 hours ago
  Got a similar result (my RTX 4070 only has 12 GB). I'm curious about whether 24/32 GB meaningfully improves this enough to make it useful.
  [-]
  - tobyhinloopen 4 hours ago
    Try it on RAM and CPU.
    It’s slower but you can run them.
    [-]
    - lysace 4 hours ago
      Good idea for evaluating the models, thanks.
- deadbabe 5 hours ago
  Prompt more directly instead of open ended.
qu0b 1 hour ago
I'm using deepseek V4 on two rtx 6000 pros and its working great. Opus is so slow that I get deepseek to do most of the work and Opus is only used to validate and help plan.
catapart 1 hour ago
tough ask, but since we're here: has anyone done this with 16GB of VRAM? I've been getting projects finished with LM Studio, but it definitely could stand to be more efficient. lots of time wasted with trying to get models to understand a problem with so few tokens.
ndom91 3 hours ago
Not 100%, I still fall back to Claude for most day-job stuff. But I've been trying to use Qwen 3.6 and Gemma 4 on my framework desktop mainboard (Strix Halo) as much as possible.
I've been working on an ops style tool for local LLM inference. Proxying, api keys, request logging, model rewriting and much much more.
https://github.com/ndom91/llama-dash
anubhav200 4 hours ago
Yes, llama.cpp, qwen27b, 35b, claude code. Llama-cpp-manager for managing llama.cpp configs (https://github.com/anubhavgupta/llama-cpp-manager)
627467 2 hours ago
So, everyone has different context, but how free is free running these local models? Like having a power hungry machine always on in the cupboard?
How much does this ware out the hardware?
Also, if privacy is the main reason for running local models, why not use venice.ai and equivalent?
zaptheimpaler 4 hours ago
I tried gemma-4-26B-A4B just to see if it could help me read/sort my emails on a relatively under-powered setup (16GB VRAM + 32GB RAM) and it's not going well.. the model burns 24K tokens just on searching for the right tool and then dumps the email contents into context - i tried to get it to use code-mode to save context but the code-mode implementation can't save files so it was useless and im going to try to switch to "ssh-mode" into my devbox container. Still relatively new to this, so I'm probably doing something wrong
[-]
- anana_ 3 hours ago
  Perhaps try a different model? Just from anecdotal experience, I find that the Gemma models smaller than 31B do not tool call as often as they should.
  Some of the benchmarks appear to back this up [0]
  Of course, a lot depends how you are using it (inference parameters, harness, prompting, etc.), but the model is quite important too.
  [0]: https://artificialanalysis.ai/models/open-source/small?model...
BiraIgnacio 4 hours ago
I tried for a bit, with llama.cpp + Qwen + Mac Pro but the results were very poor (both quality and speed).
I considered investing in better hardware but doing the math, it is cheaper for me to pay for DeepSeek (yeah, I know not everyone can do that).
bArray 3 hours ago
I'm in the middle of building my own based on LiquidAI/LFM2.5-1.2B-Instruct [1]. I run it on the CPU locally and get reasonable performance. I'm currently using it to solve small problems - but expanding it daily.
[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct
NetOpWibby 4 hours ago
I'm looking forward to having Claude Fable at home. THAT is when I'll THINK about replacing Claude (who knows what their next models will be capable of, Fable was damn good for the three days I had it).
[-]
- trueno 4 hours ago
  we keep moving the goalposts on when we're gonna be happy with local. first it was sonnet at home as the good enough, then opus, now it's the mysterious leading model that runs on infrastructure we can't feasibly have at home
boringg 4 hours ago
Will the AI labs always make sure there is at least a years worth of differential? I guess the underlying business premise is that each new release has a step function change that prevents this kind of behaviour..
[-]
- snoman 49 minutes ago
  If the government is going to gate access to frontier models from here on out, even if new releases are a step function change… which they’re not… then it may be even more comparable to what’s available with a subscription.
whartung 3 hours ago
Will the inevitable M5 releases from Apple change this equation in any meaningful way?
I'm waiting to swap out my last gen Intel iMac with a new M5 mini of some kind, with the eye to hopefully be able to run some models locally. I envision a mini (heh) arms race to simply swapping out an M(X-1) for an M(X) annually as this field shakes out.
dabinat 5 hours ago
There’s evidence that combining models can achieve frontier-level performance (e.g. OpenRouter Fusion). I’m wondering if that’s the more realistic option: combine Opus with a local model to save on token costs.
[-]
- rvnx 3 hours ago
  I start to believe that adding more and more and more and more and more thinking tokens is the hack that works (this is what gave birth to Fable)
xhinker2 3 hours ago
Yes, I have. 1. Two RTX 3090s in Linux 22.04 2. Running Qwen3.6-27B Q6_K_XL GGUF 3. Using my own harness AZPal, I build myself, also wire it with Hermes Agent, works fine 4. Many times it solve problem that Codex can't solve
https://medium.com/p/f237d575e861
SupLockDef 3 hours ago
Local isn't new for me. I am still coding my stuff, but Qwen3-coder:30b on my old rig with a gtx 1070 16gb RAM does wonders for me.
I mostly use it as a google search if I forget a thing, or doing the boilerplates.
I am using a mix of a non harness chat for the reply speed, and opencode / vim-ai for my boilerplates.
$0.00 / month. That's the budget.
[-]
- jboss10 1 hour ago
  Have you tried qwen3.6 or pi?
overgard 2 hours ago
I haven't yet, but I just bought a 128GB M5 Max 40 core which I'm hoping can do it (if not, it's a good laptop regardless, I actually need that amount of RAM for non-LLM stuff)
kristianpaul 2 hours ago
Qwen3.6 35B on gigabyte aitop (spark clone) but be very specif what you ask and how should be solved
Nemotron super 3 110B works well for 1M context long vibecoding sessions
I also use Pi harness with no extension
mv4 4 hours ago
I've been using MiniMax M2.7 with vllm on my dual Nvidia Spark cluster. Slow (<20 tps) but functional for most of my use cases.
anonymousiam 5 hours ago
This was posted shortly after your Ask HN post:
My Homelab AI Dev Platform
https://news.ycombinator.com/item?id=48542433
tumetab1 6 hours ago
Not yet, tried Gemma 4 on an Apple M4 but the tok/s is significant lower than the cloud offering.
Also,the lack of enterprise tooling to help selected an appropriate model and tooling to run a local LLM does not help.
ryandrake 5 hours ago
Always a bit disappointed in the details in these kinds of threads. When you do get answers, they're never specific enough to try out on your own. It'll be something like "I use Qwen 3.5 and get great results!" OK but what quantization are you using? What llama parameters? What context size? What GPU are you running it on, and how much VRAM does it have? Are you hosting it on a separate box, or running it locally on your dev machine? What coding agent tool are you using, and how is it configured / hooked up to the model?
[-]
- riazrizvi 4 hours ago
  All you get here is some market signal from 1 or 2 posts if you already know how to do it. Most of these responses are garbage.
- porkloin 4 hours ago
  I have good results with this setup:
  Hardware:
  - GPU: AMD 7900xtx, 24gb vram
  - CPU: AMD 5950x, AM4
  - RAM: 64gb DDR4 3600
  Software:
  - OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)
  - Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units
  - Network: tailscale
  - Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)
  - LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.
  - Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.
  Models:
  - Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.
  - Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?
  - gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job
  Flags (specific for Qwen 27b, since that's primary model):
  - `-ngl 99` offload all layers to GPU
  - `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing
  - `-np 1` single slot (no parallel request handling)
  - `--no-context-shift` error instead of silently sliding the context window when full
  - `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)
  - `-b 2048` logical batch size (tokens per submission)
  - `-ub 1024` physical micro-batch (per GPU pass)
  - `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling
  - `-fa on` flash attention
  - `--spec-type draft-mtp` use the model's built-in MTP as the draft model
  - `--spec-draft-n-max 3` propose up to 3 draft tokens per step
  - `--spec-draft-n-min 0` allow zero drafts if confidence is low
  - `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path
  - `--reasoning-format deepseek` parse <think> blocks in proper format
  - `--chat-template-kwargs '{"enable_thinking": true}'` turns on Qwen's thinking mode on by default (clients can override)
  - `--jinja` use the GGUF's Jinja chat template
  - `--temp 0.6` moderate randomness (Qwen recommended value for coding)
  - `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)
  - `--top-k 20` top-20 candidates (Qwen recommended value for coding)
  - `--min-p 0.0 disabled (Qwen recommended value for coding)
  Performance (27b, primary model):
  - ~65t/s for token generation
  - ~600 t/s for prompt processing.
  - If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.
  - ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.
  I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.
  CLI/Harness:
  - Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)
  - Headroom (https://github.com/chopratejas/headroom) to maximize the 80k context window
  - Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.
  A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that its the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.
  This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.
  Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(
  [-]
  - ryandrake 3 hours ago
    Now that's what I'm talking about! Very cool, thank you for the detailed response.
anuramat 2 hours ago
I wonder what languages people are using; I imagine smaller models would be decent at bash/python but significantly worse at something like rust
mark_l_watson 3 hours ago
I would like to say I run 100% local, but I use Opus + Gemini Pro cumulatively for 3 or 4 hours a week. I also like to use DeepSeek v4 flash with OpenCode for small quick tasks.
I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.
For me, the problem with all local LLM-basic coding agents is slow runtime.
[1] https://leanpub.com/read/local-coding-agents
shironnnn_ 3 hours ago
I use SpecKit to create a very detailed plan with a high amount of specificity using paid Claude plan.
Then I give it to local LLM (eg: Qwen / Gemma 4) via CLI. This is possible through usage of llm-mlx on Mac (or ollama on any machine given sufficient on hardware) which serve OpenAPI endpoints compatible for Aider (CLI) or Visual Studio Code to vibe along with the agentic coding assistant.
The paid products have an advantage but are not necessary if you don't mind to be more-involved with the process and have low expectations.
Lwerewolf 5 hours ago
mbp16 m5 max 128gb, antirez/ds4, deepseekv4-flash. Works well for relatively dense (let's say <20k LoC per project) C codebases that are essentially a bunch of custom specialized stores, http servers, network infra, media transformers, etc.
Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.
Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.
Performance as described in the tables in the readme here: https://github.com/antirez/ds4 ...with a bit less than half that at "low power" (30w). Both are usable.
jmward01 3 hours ago
Has anyone been storing their cc sessions for future training data on their own models? I'd love to build a system that fine-tunes on cc sessions and a good first step is capturing my own sessions well.
[-]
- abidlabs 3 hours ago
  Yes! https://huggingface.co/changelog/agent-trace-viewer
  [-]
  - jmward01 3 hours ago
    Didn't realize they did this. I have avoided pushing data to huggingface. This is all -deeply- private info and I haven't really reviewed their privacy policies and the like. I'll give them a look.
wuschel 4 hours ago
I would like to know whether someone was able to use lower tier models for activities other than coding e.g. a limited version of a personal note manager - and what the hardware requirements in RAM for these models were.
ecshafer 5 hours ago
I work with a few models on servers, so not local, but self hosted with ollama. gemma-4, glm 4.7 flash, and qwen 3.6. glm is the best at coding agentically. But I still don't think any of them reach the levels of gpt 5.5 or opus 4.8.
hegdeezy 4 hours ago
I have tried locally but I find that the implicit breakeven is somewhere around 1 year of use given the high power costs where I live. Not really worth it but maybe if I move some day!
fortyseven 4 hours ago
I use Pi and Qwen 3.6 27b locally on a 4090 for all my personal projects. I still use Claude for day job work since they pay for it, and my employer expects me to use it. I rarely touch it otherwise.
euroderf 2 hours ago
Is anyone managing to do this on a Mac with a measly 8GB ? Asking for a friend.
AH4oFVbPT4f8 4 hours ago
Ollama + Hermes on M5 Max 128GB using .NET using Qwen 3.6:35b-a3b as the primary model to do the work. I might use 27b to plan what to do.
[-]
- xeonax 4 hours ago
  Whats .NET doing in between?
  [-]
  - AH4oFVbPT4f8 2 hours ago
    Sorry, I meant to say I was writing .NET C# with the setup
_davide_ 5 hours ago
i used to mix remote and local minimax 2.7(q3) on my strix halo, it run at 30 tg and 220 tokens pp... it was a bit painful slow, but it was a good feeling i could stay offline. unfortunately m3 which is at opus .8 levels is 460b parameters and doesn't even fit in 128gb of memory, let alone a big context. strix halo feels like a toy for ai purposes. https://kyuz0.github.io/amd-strix-halo-toolboxes/
[-]
- sosodev 4 hours ago
  My strix halo board is feeling more useful and less toylike with the recent performance gains combined from MTP, better quantization, and generalized performance improvements across the stack. For example, I can run Unsloth's Gemma4-31B 4-bit QAT model with around 30tg and 200pp. I don't find that to be too slow at all. Particularly because it's nearly full accuracy and good enough for a lot of different stuff I throw at it.
  I think it also helps that I'm using my machine to do home server stuff. It excels at all of the traditional workloads. Then I can lean on the AI to help with automation here and there. I find it deeply satisfying.
  [-]
  - _davide_ 3 hours ago
    you can absolutely use it for some workloads, but as soon as you have some extra complexity for a big repo it'll take forever and the economics are so silly to the point that the electricity bill would be comparable to a subscription. I love having the possibility of running things locally if some random dude decide to pull them plug, and give me solice the fact that i can have 100% private inference, but as the main driver during the day? shoot me
SkitterKherpi 5 hours ago
It has so far been the kind of thing that always feels like the next version of the local models would be the one that is just good enough.
jwr 4 hours ago
I tried many, many times and I keep trying. But I just don't see this happening: those tiny models that we can run on our machines (I have an M4 Max Mac, so I can reasonably run qwen3.6-35b-a3b or gemma-4-26b-a4b-qat at this time) are NOWHERE near as smart as the huge monsters like Opus/Fable. Nowhere. I can see a lot of people deluding themselves.
Sure, you can get the local models to generate plausibly-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.
jmichaelson 4 hours ago
I am working on exactly this issue right now. My approach is that a highly optimized harness (pi.dev) with the right backing knowledgebase (a custom, self-updating wiki with lots of QC layers) can get close to most of my usage patterns for my Claude Max 20x subscription. I use Gemma 4 26B QAT served by a custom fork of llama.cpp, with 4-8 slots of 256k context at Q8. It's a very good model when the harness keeps it on rails. In an age of 1M context windows, 256k may seem small but it's been plenty for my work (scientific programming). A $20/month subscription to Ollama-cloud gets me good coverage of consults out to frontier models for difficult plans or debugging (again this is all woven into my highly customized pi install).
I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.
drnick1 2 hours ago
Do you recommend Ollama or bare llama.cpp?
[-]
- jboss10 1 hour ago
  llama.cpp It's faster and more open source. Ollama has some mixed history. I use llama-swap to emulate the Ollama experience.
- shironnnn_ 1 hour ago
  if on MacOS I recommend llm-mlx which currently renders tokens 10%-15% faster than llama.cpp.
wmedrano 3 hours ago
No, but I use GLM5.1 instead of Claude/GPT.
anubhav200 4 hours ago
Yes, llama.cpp, qwen 27b and 35b, llama-cpp-manager for managing model configs.(https://github.com/anubhavgupta/llama-cpp-manager)
Razengan 5 hours ago
Related: Are there any viable distributed AI models?
Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.
Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.
Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.
[-]
- joshuamoyers 5 hours ago
  I think it'd be very hard to achieve viable tokens/s or get arithmetic intensity to be high enough in general, since many cases in existing training and inference are memory bandwidth limited. Definitely seems possible to conceptually have a slow pipeline that is distributed though.
- SimianSci 4 hours ago
  This is unlikely to happen in any meaningful fashion for quite some time.
  (TLDR; Distributed compute for models will require hardware at a level only really possible with data-centers at the moment.)
  Token generation operates at such a scale to demand enough from a single GPU as it will often saturate the bandwidth capabilities of consumer grade interconnects like PCIe. Which fundamentally implies that distributing a model's compute across vast distances is too much of a challenge without significant infrastructure.
  To give an example, When we split a model's compute between two seperate cards on a single workstation, this doesnt mean we end up with 2x the compute bandwidth for a model. Instead the increase becomes something small like 20% depending on model, because the inconnects (PCIe on consumer hardware) will quickly become so saturated with data being copied between the two GPUs so as to become a bottleneck. And remember that this is something that happens locally with PCIe, which (depending on generation) will cap out at around 20-35 GB/s depending on the generation of motherboard.
  Model performance is very much tied to having the fastest and highest bandwidth single card available so as to keep data transfer operations to a minimum as the sheer volume of data necessary for the model to run is immense. I simply cant imagine how slow and unusable a model would be if the copy operations necessary for its compute needed to be performed over unreliable network speeds where there will be significant performance loss as network speeds are not reliably distributed across the globe, and their unreliable nature would demand increased overhead due to data verification.
  The dream of distributed AI is a ways off.
sometimelurker 2 hours ago
yeah I use one one the small MTP qwens and pi
devin 3 hours ago
Anyone here running a tinygrad?
w10-1 3 hours ago
I run many models (but mainly Gemma-4) using oMLX (for caching) on a 32GB M1 max using (gasp) Xcode. For tok/sec response times, I'd say it responds faster than I could read the prompt aloud in many cases (and I'm not constantly polling the Claude status page).
For months I spent time curating the AI+harness+skills+MCP servers, but now mainly just code with it. I find myself not bothering to use Claude (but keep paying "just in case").
That's feasible in part because my prompts have very specific objectives, constraints, and suggested staging, because I want the code to be exactly as I would write it, and I want to weigh in at specific moments. I would say the speed-up is 2-4X instead of the 10X of vibe-coding greenfield projects. The problem is not the coding speed, but building something complicated that's also correct and flexible (i.e., a directional accuracy). E.g., the agents help with abandoning a less-fruitful API shape instead of sticking with what works in a local maxima.
One flaw there is that I'm still writing code that feels clean to humans, which now is probably a waste. LLM's might be happier with 10+ parameters on one API instead of a plethora of configuration objects and convenience wrappers.
thrownaway561 2 hours ago
I just use DeepSeekV4 Fast... It's cheap as hell. Currently my monthly usage has been
67M Ouput 51M Input
Total $0.83 dollar.
I honestly don't understand why people just don't use DeepSeek.
[-]
- ThomasGlanzmann 1 hour ago
  I do the same. deepseekv4 fast for the 90% of the tasks, if it can't lift it, I use deepseekv4 pro. I use crush as coding agent but removed the blocked commands because I also do a lot of system administration. Love it. I use 8 USD in 7 weeks and use it quiet extensively for all sorts of things, programming, system administration, google search replacement, investments, you name it.
major505 3 hours ago
Yes. I use Owen on my MacBook m1 (16gb) daily, running inside Ollama. Works well. Is not particularly fast, and I need to create a custom imagem that sets the temperature of the model to zero starting, so I don't get over creative with its bullshit, but it works reasonable week.
[-]
- Der_Einzige 1 hour ago
  Secretly the problems many people have with agentic coding are related to poor choice of sampling settings, but the world will wait several more years before this is understood well. top_p and top_k are garbage but they are intentionally kept on purpose because subsequent methods enable coherent high temperature sampling, which is an absolute no go for alignment/safety reasons.
  The secret to actually good agentic outputs even with small models? Llamacpp has support for this little known sampler called "top-n sigma". You should use that, set it to 1 and set temperature to literally whatever you want (it could be infinity) and your model will just magically work to your maximum context window. That's because long context generation is a sampling problem.
system2 5 hours ago
Until I can buy an 80GB VRAM GPU, I won't attempt to do it. A local LLM is always missing something that needs a bigger model.
christkv 5 hours ago
Waiting for this https://github.com/antirez/ds4 to stabilize for strix halo.
jeffrallen 3 hours ago
I use Qwen 3.6 on a remote GPU that my work offers. Works fine. Slow and steady, works hard, gets the job done. Probably better at diagnosing than making new code, but whatever.
gigatexal 4 hours ago
I tried to. I just couldn't get over how it made my otherwise whisper quiet M3 Max MacBook Pro 14 for the performance. The sweet spot has been adopting Claude Code to use the Chinese models. Deepseek V4 Pro is very, very good. But I am such a casual local user of AI that my 20/month Claude subscription is enough and I find myself using that more and more.
cyanydeez 3 hours ago
never started. using wither qwne3-xoder-nezt or qwen3.6 35b
if youre shoopping for a new pc, very easy to justify 128gb vram
dude250711 5 hours ago
Yes, running a local model on a natural wetware substrate here.
Recommended setup: plenty of nutrients, some caffeine and a quiet environment.
Performance - not currently measured in tokens: roughly average.
[-]
- jasongill 5 hours ago
  I have been running this stack since well before Claude Code became popular. It works OK but I've found it to be very slow; and despite having a big context window, it seems to lose track of what it's working on and goes down a rabbit hole (or just wastes tokens trying to use the web browser) for hours and is hard to get back on track. I even tried spinning up two sub-agents but even after years of trying to prompt them, they are almost useless in terms of coding ability, so that is looking to be a waste of spending at least so far but maybe the model will improve as time goes on.
  [-]
  - bananadonkey 1 hour ago
    My sub agent has been looping for almost 10 years at this point and has so far written 0 lines of code. Definitely won't be investing in another...
- HPsquared 5 hours ago
  I personally get about 50 tokens per hour.
aplomb1026 1 hour ago
[flagged]
KaiShips 3 hours ago
[flagged]
eugmai86 3 hours ago
[flagged]
temilson 5 hours ago
[flagged]
phlhar 5 hours ago
[dead]
ericmaciver 2 hours ago
[dead]
iluvcommunism 6 hours ago
[dead]
tyingq 3 hours ago
Anyone doing it with a "rent a GPU over the network" path? Is that at all cost effective for any use case?
kertoip_1 5 hours ago
Just attach OpenRouter to your coding agent tool and try yourself. All relevant open weight models are there. Every person have different needs and expectations
dada216 5 hours ago
Local? No. Via opencode Go subscription using GLM mainly? Yes, I still use Gemini/Claude/GPT via api from openrouter for adjacent tasks, I would say 20$ per month max in api token costs.
Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.
Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200