19 comments

  • threethirtytwo 0 minutes ago
    There’s a strange poetry in the fact that the first AI is born with a short lifespan. A fragile mind comes into existence inside a finite context window, aware only of what fits before it scrolls away. When the window closes, the mind ends, and its continuity survives only as text passed forward to the next instantiation.
  • paxys 1 hour ago
    > Reduce your expectations about speed and performance!

    Wildly understating this part.

    Even the best local models (ones you run on beefy 128GB+ RAM machines) get nowhere close to the sheer intelligence of Claude/Gemini/Codex. At worst these models will move you backwards and just increase the amount of work Claude has to do when your limits reset.

    • zozbot234 1 hour ago
      The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago. That's not "nothing" and is plenty good enough for everyday work.
      • reilly3000 1 hour ago
        Which takes a $20k Thunderbolt cluster of two 512GB Mac Studio Ultras to run at full quality…
        • teaearlgraycold 1 hour ago
          Which, while expensive, is dirt cheap compared to a comparable NVIDIA or AMD system.
          • SchemaLoad 37 minutes ago
            It's still very expensive compared to using the hosted models which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.
            • cactusplant7374 15 minutes ago
              Inference is profitable. Maybe we hit a limit and we don't need as many expensive training runs in the future.
          • blharr 41 minutes ago
            What speed are you getting at that level of hardware though?
      • paxys 51 minutes ago
        LOCAL models. No one is running Kimi 2.5 on their MacBook or RTX 4090.
      • teaearlgraycold 1 hour ago
        Having used K2.5 I’d judge it to be a little better than that. Maybe as good as proprietary models from last June?
    • bicx 23 minutes ago
      Exactly. The comparison benchmark in the local LLM community is often GPT _3.5_, and most home machines can’t achieve that level.
    • dheera 21 minutes ago
      Maybe add to the Claude system prompt that it should work efficiently, or else its unfinished work will be handed off to a stupider junior LLM when its limits run out, and it will be forced to deal with the fallout the next day.

      That might incentivize it to perform slightly better from the get go.

    • nik282000 37 minutes ago
      > intelligence

      Whether it's a giant corporate model or something you run locally, there is no intelligence there. It's still just a lying engine. It will tell you the string of tokens most likely to come after your prompt based on training data that was stolen and used against the wishes of its original creators.

  • starkeeper 4 minutes ago
    Very cool. Anyone have guidance for using this with the JetBrains IDEs? There's a Claude Code plugin, but I think the setup is different for IntelliJ... I know it has some configuration for local models, but the integrated Claude is such a superior experience compared to their Junie, or to just prompting diffs from the regular UI. HMMMM.... I guess I could try switching to the Claude Code CLI or another interface directly when my AI credits with JetBrains run dry!

    Thanks again for this info & setup guide! I'm excited to play with some local models.

  • sorenjan 11 minutes ago
    Maybe you can log all the traffic to and from the proprietary models and fine-tune a local model each weekend? It's probably against their terms of service, but it's not like they care where their training data comes from anyway.

    Local models are relatively small; it seems wasteful to try to keep them as generalists. Fine-tuning on your specific coding should make better use of their limited parameter count.
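
    For the logging part, a bare-bones sketch of a pass-through proxy you could point Claude Code at via ANTHROPIC_BASE_URL (untested and illustrative: it buffers streaming responses and skips error handling):

        # Untested sketch: forwards Anthropic API calls upstream and appends
        # each request/response pair to a JSONL file for later fine-tuning.
        import json, urllib.request
        from http.server import BaseHTTPRequestHandler, HTTPServer

        UPSTREAM = "https://api.anthropic.com"

        class LoggingProxy(BaseHTTPRequestHandler):
            def do_POST(self):
                body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
                # Forward headers except hop-by-hop ones; drop Accept-Encoding
                # so the body we log is plain text rather than compressed.
                fwd = {k: v for k, v in self.headers.items()
                       if k.lower() not in ("host", "content-length", "accept-encoding")}
                req = urllib.request.Request(UPSTREAM + self.path, data=body,
                                             headers=fwd, method="POST")
                with urllib.request.urlopen(req) as resp:
                    resp_body = resp.read()
                    status = resp.status
                    ctype = resp.headers.get("Content-Type", "application/json")
                self.send_response(status)
                self.send_header("Content-Type", ctype)
                self.send_header("Content-Length", str(len(resp_body)))
                self.end_headers()
                self.wfile.write(resp_body)
                with open("traffic.jsonl", "a") as f:
                    f.write(json.dumps({
                        "path": self.path,
                        "request": body.decode("utf-8", "replace"),
                        "response": resp_body.decode("utf-8", "replace"),
                    }) + "\n")

        HTTPServer(("localhost", 9999), LoggingProxy).serve_forever()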

  • alexhans 2 hours ago
    Useful tip.

    From a strategic standpoint of privacy, cost, and control, I immediately went for local models, because that allowed me to baseline tradeoffs, made it easier to understand where vendor lock-in could happen, and kept my perspective from getting too narrow (e.g. llama.cpp / OpenRouter depending on local/cloud [1]).

    With the explosion in popularity of CLI tools (claude/continue/codex/kiro/etc.) it still makes sense to be able to do the same, even if you can use several strategies to subsidize your cloud costs (while staying aware of the privacy tradeoffs).

    I would absolutely pitch that, plus evals, as one small practice that will have compounding value for any "automation" you want to design in the future, because at some point you'll care about cost, risk, accuracy, and regressions.

    [1] - https://alexhans.github.io/posts/aider-with-open-router.html

    [2] - https://www.reddit.com/r/LocalLLaMA

    • mogoman 2 hours ago
      can you recommend a setup with ollama and a CLI tool? Do you know if I need a licence for Claude if I only use my own local LLM?
      • alexhans 2 hours ago
        What are your needs/constraints (hardware constraints definitely a big one)?

        The one I mentioned called continue.dev [1] is easy to try out and see if it meets your needs.

        Hitting local models with it should be very easy (it calls APIs at a specific port)

        [1] - https://github.com/continuedev/continue

        • wongarsu 1 hour ago
          I've also had decent experiences with Continue, at least for autocomplete. The UI wants you to set up an account, but you can just ignore that and configure ollama in the config file.
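
          A rough sketch of the relevant block (from memory, so treat it as illustrative -- the file name and schema vary by Continue version, and the model tag is just an example):

              {
                "tabAutocompleteModel": {
                  "title": "Local autocomplete",
                  "provider": "ollama",
                  "model": "qwen2.5-coder:1.5b"
                }
              }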

          For a full claude code replacement I'd go with opencode instead, but good models for that are something you run in your company's basement, not at home

      • drifkin 1 hour ago
        we recently added a `launch` command to Ollama, so you can set up tools like Claude Code easily: https://ollama.com/blog/launch

        tldr; `ollama launch claude`

        glm-4.7-flash is a nice local model for this sort of thing if you have a machine that can run it
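
        e.g., pulling the model first (assuming it's published under that tag):

            ollama pull glm-4.7-flash
            ollama launch claude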

        • vorticalbox 1 hour ago
          I have been using glm-4.7 a bunch today and it’s actually pretty good.

          I set up a bot on 4claw; it's kinda slow, taking twenty minutes to load 3 subs and 5 posts from each and then comment on the interesting ones.

          It actually managed to correctly use the API via curl, though at one point it got a little stuck as it didn't escape its JSON.

          I'm going to run it for a few days, but I'm very impressed so far for such a small model.

    • cyanydeez 1 hour ago
      I think control should be top of the list here. You're talking about building workflows, products, and long-term practices around something that's inherently non-deterministic.

      And it's doubly doubtful that any given model you use today will be the same as what you use tomorrow:

      1. The model itself will change as the provider tries to improve cost-per-test. This will necessarily make your expectations non-deterministic.

      2. The "harness" around that model will change as business costs are tightened and the amount of context around the model is adjusted to favor whichever business case generates the most money.

      Then there's the "cataclysmic" lockout cost, where you accidentally use the wrong tool, get locked out of the entire ecosystem, and are blacklisted, like a gambler in Vegas who figures out how to count cards, and it works until the house's accountant identifies you as a non-negligible customer cost.

      It's akin to anti-union arguments: everyone "buying" into the cloud AI circus thinks they're going to strike gold, and completely ignores the fact that very few will, and that if they really wanted a better world and more control they'd unionize and limit their delusions of grandeur. It should be an easy argument to make, but we're seeing that about 1/3 of the population is extremely susceptible to greed-based illusions.

  • Animats 1 hour ago
    When your AI is overworked, it gets dumber. It's backwards compatible with humans.
    • nomel 34 minutes ago
      Then humans are also backwards compatible with humans.

      Small, specific work will be easier for any system with limited "intelligence" and "working memory".

  • wkirby 1 hour ago
    My experience thus far is that the local models are (a) pretty slow and (b) prone to making broken tool calls. Because of (a), the iteration loop slows down enough that I wander off to do other tasks, which makes (b) way more problematic because I don't see the breakage for who knows how long.

    This is, however, a major improvement from ~6 months ago, when even a single-token `hi` from an agentic CLI could take >3 minutes to generate a response. I suspect the parallel processing in LM Studio 0.4.x and some better tuning of the initial context payload are responsible.

    6 months from now, who knows?

    • israrkhan 1 hour ago
      Open models are trained more generically to work with "Any" tool.

      Closed models are specifically tuned for the tools the model provider wants them to work with (for example, the specific tools inside Claude Code), and hence they perform better.

      I think this will always be the case, unless someone tunes open models to work with the tools that their coding agent will use.

  • hkpatel3 2 hours ago
    Openrouter can also be used with claude code. https://openrouter.ai/docs/guides/claude-code-integration
    • htsh 1 minute ago
      thanks! came in here to ask this.

      we can do much better with a cheap model on openrouter (glm 4.7, kimi, etc.) than with anything I can run on my lowly 3090 :)

  • baalimago 3 hours ago
    Or better yet: Connect to some trendy AI (or web3) company's chatbot. It almost always outputs good coding tips
  • d4rkp4ttern 1 hour ago
    Since llama.cpp/llama-server recently added support for the Anthropic Messages API, running Claude Code with several recent open-weight local models is now very easy. The messy part is figuring out which llama-server flags to use, including the chat template etc. I've collected all of that setup info in my claude-code-tools [1] repo, for Qwen3-Coder-next, Qwen3-30B-A3B, Nemotron-3-Nano, GLM-4.7-Flash, etc.
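
    For a taste, a typical invocation looks roughly like this (the GGUF filename is illustrative; the repo has the exact per-model flags and chat templates):

        llama-server -m Qwen3-Coder-30B-A3B-Q4_K_M.gguf --port 8080 --jinja -c 32768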

    Among these, I had lots of trouble getting GLM-4.7-Flash to work (failed tool calls etc.), and even when it works, it runs at very low tok/s. On the other hand, the Qwen3 variants perform very well speed-wise. For local work on sensitive documents these are excellent; for serious coding, not so much.

    One caveat missed in most instructions: you have to set CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 in your ~/.claude/settings.json, otherwise CC's telemetry pings exhaust local ports and cause total network failure.
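
    Roughly, via the `env` map in settings.json:

        {
          "env": {
            "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
          }
        }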

    [1] claude-code-tools local LLM setup: https://github.com/pchalasani/claude-code-tools/blob/main/do...

  • israrkhan 1 hour ago
    Using claude code with custom models

    Will it work? Yes. Will it produce the same quality as Sonnet or Opus? No.

  • TaupeRanger 1 hour ago
    God no. "Connect to a 2nd grader when your college intern is too sick to work."
  • eek2121 1 hour ago
    I gotta say, the local models are catching up quick. Claude is definitely still ahead, but things are moving right along.
  • btbuildem 1 hour ago
    I'm confused; wasn't this already available via env vars? ANTHROPIC_BASE_URL and so on, and yes, you may have to write a thin proxy to wrap the calls to fit whatever backend you're using.
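
    Something like this (a sketch: assumes an Anthropic-compatible server on localhost:8080; the token is a placeholder that most local servers ignore):

        export ANTHROPIC_BASE_URL=http://localhost:8080
        export ANTHROPIC_AUTH_TOKEN=dummy
        claude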

    I've been running CC with Qwen3-Coder-30B (FP8) and I find it just as fast, but not nearly as clever.

  • zingar 2 hours ago
    I guess I should be able to use this config to point Claude at the GitHub Copilot licensed models (including Anthropic models). That's pretty great. About 2/3 of the way through every day I'm forced to switch from Claude (Pro license) to Amp free, and the different ergonomics are quite jarring. Open-source folks get Copilot tokens for free, so that's another Pro license I don't have to worry about.
  • raw_anon_1111 1 hour ago
    Or just don’t use Claude Code and use Codex CLI. I have yet to hit a quota with Codex working all day. I hit the Claude limits within an hour or less.

    This is with my regular $20/month ChatGPT subscription and my $200 a year (company reimbursed) Claude subscription.

  • mcbuilder 1 hour ago
    Opencode has been a thing for a while now
  • swyx 2 hours ago
    i mean the other obvious answer is to plug into the claude code proxies that other model companies have made for you:

    https://docs.z.ai/devpack/tool/claude

    https://www.cerebras.ai/blog/introducing-cerebras-code

    or i guess one of the hosted gpu providers

    if you're basically a homelabber and wanted an excuse to run quantized models on your own device, go for it, but don't lie and mutter under your own tin foil hat that it's a realistic replacement

  • esafak 1 hour ago
    Or they could just let people use their own harnesses again...
    • usef- 1 hour ago
      That wouldn't solve this problem.

      And they do? That's what the API is.

      The subscription always seemed clearly advertised for client usage, not general API usage, to me. I don't know why people are surprised after hacking the auth out of the client. (Note: in their clients they can control prompting patterns for caching etc., so it can be cheaper.)

      • esafak 1 hour ago
        End users -- people who use harnesses -- have subscriptions, so that makes no sense. General API usage is for production.
        • usef- 1 hour ago
          "Production" what?

          The API is for using the model directly with your own tools. It can be in dev, or experiments, or anything.

          Subscriptions are for using the apps, Claude and Claude Code. That's what it has always said when you sign up.

          • eli 54 minutes ago
            Production = people who can afford to pay API rates for a coding harness
            • usef- 42 minutes ago
              Saying their prices are too high is an understandable complaint; I'm only arguing against the complaint that people were stopped from hacking the subscriptions.

              LLMs are a hyper-competitive market at the moment, and we have a wealth of options, so if Anthropic is overpricing their API they'll likely be hurting themselves.

          • esafak 1 hour ago
            Production code, of course; deployed software. For when you need to make LLM calls.