Recently, I asked Codex CLI to refactor some HTML files. It didn't literally copy and paste snippets here and there as I would have done myself; it rewrote them from memory, removing comments in the process. There was a section with 40 successive <a href...> links with complex URLs.
A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect.
Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though and the URL seemed reasonable.
I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all these websites would have moved their internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM.
Fortunately, I could retrieve the old URLs from old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs! Replacing things like domain.com/this-article-is-about-foobar-123456/ with domain.com/foobar-is-so-great-162543/...
These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!
The last point I think is most important: "very subtle and silently introduced mistakes" -- LLMs may be able to complete many tasks as well as (or better than) humans, but that doesn't mean they complete them the same way, and that's critically important when considering failure modes.
In particular, code review is one layer of the conventional Swiss cheese model of preventing bugs, but code review becomes much less effective when the categories of errors to look out for suddenly change.
When I review a PR with large code moves, it has historically been relatively safe to assume that a block of code was moved as-is (sadly only an assumption, because GitHub still doesn't have indicators of duplicated/moved code like Phabricator had 10 years ago...), so I could focus my attention on higher-level concerns, like whether the new API design makes sense. But if an LLM did the refactor, I need to scrutinize every character that was touched in the block of code that was "moved", because, as the parent commenter points out, that "moved" code may have actually been ingested, summarized, then rewritten from scratch based on that summary.
For this reason, I'm a big advocate of an "AI use" section in PR description templates; not because I care whether you used AI or not, but because some hints about where or how you used it will help me focus my efforts when reviewing your change, and tune the categories of errors I look out for.
I've had similar experiences, both in coding and in non-coding research questions. An LLM will do the first N right and fake its work on the rest.
It even happens when asking an LLM to reformat a document, or asking it to do extra research to validate information.
For example, before a recent trip to another city, I asked Gemini to prepare a list of brewery taprooms with certain information, and I discovered it had included locations that had been closed for years or had just been pop-ups. I asked it to add a link to the current hours for each taproom and remove locations that it couldn't verify were currently open, and it did this for about the first half of the list. For the last half, it made irrelevant changes to the entries and didn't remove any of the closed locations. Of course it enthusiastically reported that it had checked every location on the list.
LLMs are not good at "cycles" - when you have to go over a list and do the same action on each item.
It's like it has ADHD and forgets or gets distracted in the middle.
And the reason is that LLMs don't have memory beyond the tokens they are processing, so as they keep going over the list the context grows with more and more irrelevant information, and they can lose track of why they are doing what they are doing.
It would be nice if the tools we usually use for LLMs had a bit more programmability. In this example, we could imagine chunking up the work by processing a few items, then reverting to a previously saved LLM checkpoint of state, and repeating until the list is complete.
I imagine that the cost of saving & loading the current state must be prohibitively high for this to be a normal pattern, though.
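A rough sketch of what you can approximate today without real checkpoints: just restart the context for each chunk so it never grows. This is Python, with `call_llm` as a placeholder for whatever client you actually use:

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real chat-completion call (OpenAI, Anthropic, local model, ...).
    raise NotImplementedError

def process_list(items: list[str], instructions: str, chunk_size: int = 5) -> list[str]:
    """Process a long list in small chunks, each with a fresh, fixed-size context."""
    results: list[str] = []
    for i in range(0, len(items), chunk_size):
        chunk = items[i:i + chunk_size]
        # Each chunk gets the same short prompt; earlier output never piles up
        # in the context window, so the model can't lose the plot halfway through.
        prompt = instructions + "\n\n" + "\n".join(f"- {item}" for item in chunk)
        results.append(call_llm(prompt))
    return results
```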
Which is annoying, because that is precisely the kind of boring, rote programming task I want an LLM to do for me, to free up my time for more interesting problems.
Not code, but I once pasted an event announcement and asked for just a spelling and grammar check. The LLM suggested a new version with a minor tweak, which I copy-pasted back.
Just before sending I noticed that it had moved the event date by one day. Luckily I caught it, but it taught me that you should never blindly trust LLM output, even with a super simple task, no relevant context size, and a clear, simple, one-sentence prompt.
LLMs do the most amazing things, but they also sometimes screw up the simplest of tasks in the most unexpected ways.
>Not code, but I once pasted an event announcement and asked for just a spelling and grammar check. The LLM suggested a new version with a minor tweak, which I copy-pasted back. Just before sending I noticed that it had moved the event date by one day.
This is the kind of thing I immediately noticed about LLMs when I used them for the first time. Just anecdotally, I'd say it had this problem 30-40% of the time. As time has gone on, it has gotten so much better. But it still makes this kind of mistake -- let's just say -- 5% of the time.
The thing is, it's almost more dangerous to make the mistake rarely. Because now people aren't constantly looking for it.
You have no idea if it's not just randomly flipping terms or injecting garbage unless you actually validate it. The idea of giving it an email to improve and then just scanning the result before firing it off is terrifying to me.
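One cheap way to do that validation is to diff the text before and after the LLM touches it, so silent changes to dates or numbers jump out. A minimal sketch, using only Python's standard library:

```python
import difflib

def show_llm_edits(original: str, rewritten: str) -> None:
    # Print a unified diff of the text before and after the LLM's "improvement",
    # so silently changed facts (dates, numbers, names) are visible instead of
    # hiding inside an otherwise-identical paragraph.
    diff = difflib.unified_diff(
        original.splitlines(),
        rewritten.splitlines(),
        fromfile="original",
        tofile="llm_output",
        lineterm="",
    )
    print("\n".join(diff))
```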
I truly wonder how much time we have before a spectacular failure happens because an LLM was asked to rewrite a file with a bunch of constants in critical software and silently messed them up or inverted them in a way that looks reasonable, works in your QA environment, and then fails spectacularly in the field.
5 minutes ago, I asked Claude to add some debug statements in my code. It also silently changed a regex in the code. It was easily caught with the diff but can be harder to spot in larger changes.
I asked Claude to add a debug endpoint to my hardware device that just gave memory information. It wrote 2600 lines of C that gave information about every single aspect of the system. On the one hand, kind of cool. It looked at the MQTT code, the update code, and the platform (ESP), and generated all kinds of code. It recommended platform settings that could enable more detailed information, which checked out when I looked at the docs. I ran it and it worked. On the other hand, most of the code was just duplicated over and over again, e.g. 3 different endpoints that gave overlapping information. About half of the code generated fake data rather than actually doing anything with the system.
I rolled back and re-prompted and got something that looked good and worked. The LLMs are magic when they work well but they can throw a wrench into your system that will cost you more if you don't catch it.
I also just had a 'senior' developer tell me that a feature in one of our platforms was deprecated. This was after I saw their code, which did some wonky, hacky-looking stuff to achieve something simple. I checked the docs and said feature (URL rewriting) was obviously not deprecated. When I asked how they knew it was deprecated, they said ChatGPT told them. So now they are fixing the fix ChatGPT provided.
Hah, I also happened to use Claude recently to write basic MQTT code to expose some data on a couple of Orange Pis I wanted to view in Home Assistant. And it one-shot this super cool mini Python MQTT client I could drop wherever I needed it, which was amazing, having never worked with MQTT in Python before.
I made some charts/dashboards in HA and was watching it in the background for a few minutes and then realized that none of the data was changing, at all.
So I went and looked at the code, and the entire block that was supposed to pull the data from the device was just a stub generating test data based on my exact mock-up of what I wanted the data to look like.
Claude was like, “That’s exactly right, it’s a stub so you can replace it with the real data easily, let me know if you need help with that!” And to its credit, it did fix it to use actual data, but when I re-read my original prompt it was somewhat baffling to think it could have been interpreted as wanting fake data, given I explicitly asked it to use real data from the device.
I will also add checks to make sure the data that I get is there even though I checked 8 times already and provide loads of logging statements and error handling. Then I will go to every client that calls this API and add the same checks and error handling with the same messaging. Oh also with all those checks I'm just going to swallow the error at the entry point so you don't even know it happened at runtime unless you check the logs. That will be $1.25 please.
I asked it to change some networking code, which it did perfectly, but I noticed some diffs in another file and found it had just randomly expanded some completely unrelated abbreviations in strings which are specifically shortened because of the character limit of the output window.
I had a pretty long regex in a file that was old and crusty, and when I had Claude add a couple helpers to the file, it changed the formatting of the regex to be a little easier on the eyes in terms of readability.
But I just couldn't trust it. The diff would have been no help since it went from one long gnarly line to 5 tight lines. I kept the crusty version since at least I am certain it works.
Incorrect data is a hard one to catch, even with automated tests (even in your tests, you're probably only checking the first link, if you're even doing that).
Luckily I've grown a preference for statically typed, compiled, functional languages over the years, which eliminates an entire class of bugs AND hallucinations by catching them at compile time. Using a language that doesn't support null helps too. The quality of the code produced by agents (Claude Code and Codex) is insanely better than when I need to fix some legacy code written in a dynamic language. You'll sometimes catch the agent hallucinating and continuously banging its head against the wall trying to get its bad code to compile. It seems to get more desperate and may eventually figure out a way to insert some garbage to get it to compile, or just delete a bunch of code and paper over it... but it's generally very obvious when it does this as long as you're reviewing. Combine this with git branches and a policy of frequent commits for greatest effect.
You can probably get most of the way there with linters and automated tests with less strict dynamic languages, but... I don't see the point for new projects.
I've even found Codex likes to occasionally make subtle improvements to code located in the same files but completely unrelated to the current task. It's like some form of AI OCD. Reviewing diffs is kind of essential, so using a foundation that reduces the size of those diffs and increases readability is IMO super important.
It was a fairly big refactoring basically converting a working static HTML landing page into a Hugo website, splitting the HTML into multiple Hugo templates. I admit I was quite in a hurry and had to take shortcuts. I didn't have time to write automated tests and had to rely on manual tests for this single webpage. The diff was fairly big. It just didn't occur to me that the URLs would go through the LLMs and could be affected! Lesson learnt haha.
Speaking of agents and tests, here's a fun one I had the other day: while refactoring a large code base I told the agent to do something precise to a specific module, refactor with the new change, then ensure the tests are passing.
The test suite is slow and has many moving parts; the tests I asked it to run take ~5 minutes. The thing decided to kill the test run, then it made up another command it said was the 'tests', so when I looked at the agent console in the IDE, everything seemed fine when collapsed, i.e. 'Tests ran successfully'.
Obviously the code changes also had a subtle bug that I only saw when pushing its refactoring to CI (and more waiting). At least there were tests to catch the problem.
I think it's something that model providers don't want to fix, because the number of times Claude Code just decided to delete tests that were not passing, before I added a memory saying it would need to ask for my permission to do that, was staggering. It stopped happening after the memory, so I believe it could easily be fixed by a system prompt.
This is why I'm terrified of large LLM slop changesets that I can't check side by side - but then that means I end up doing many small changes that are harder to describe in words than to just outright do.
This, and why are the URLs hardcoded to begin with? And given the chaotic rewrite by Codex, it would probably be more work to untangle the diff than to just do it yourself right away.
Reminds me of when I asked Claude (through Windsurf) to create an S3 Lambda trigger to resize images (as soon as a PNG image appears in S3, resize it). The code looked flawless and I deployed... only to learn that I had introduced a perpetual loop :) For every image resized, a new one would be created and resized. In 5 minutes, the trigger created hundreds of thousands of images... what a joy it was to clean that up in S3.
"very subtle and silently introduced mistakes" - that's the biggest bottleneck I think; as long as it's true, we need to validate LLMs outputs; as long as we must validate LLMs outputs, our own biological brains are the ultimate bottleneck
Not related to code... But when I use an LLM to perform a kind of copy/paste, I try to number the lines and ask it to generate a start_index and stop_index to perform the slice operation. Far fewer hallucinations, and very cheap in token generation.
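For concreteness, a minimal sketch of that trick in Python (the exact slicing convention is an assumption, not a fixed recipe): number the lines you show the model, let it answer with indices only, and do the actual copy yourself:

```python
def number_lines(text: str) -> str:
    # What the LLM sees: each line prefixed with its index, so it can answer
    # with a start_index/stop_index pair instead of retyping the content.
    return "\n".join(f"{i}: {line}" for i, line in enumerate(text.splitlines()))

def copy_slice(text: str, start_index: int, stop_index: int) -> str:
    # What we do with its answer: slice the *original* text ourselves, so the
    # copied lines stay byte-for-byte identical and nothing gets hallucinated.
    return "\n".join(text.splitlines()[start_index:stop_index])
```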
My custom prompt instructs GPT to output changes to code as a diff/git patch. I don’t use agents because they make it hard to see what’s happening, and I don’t trust them yet.
I’ve tried this approach when working in chat interfaces (as opposed to IDEs), but I often find it tricky to review diffs without the full context of the codebase.
That said, your comment made me realize I could be using “git apply” more effectively to review LLM-generated changes directly in my repo. It’s actually a neat workflow!
In these cases I explicitly tell the LLM to make as few changes as possible, and I also run a diff. Then I reiterate with a new prompt if too many things changed.
You can always run a diff. But how good are people at reading diffs? Not very. It's the kind of thing you would probably want a computer to do. But now we've got the computer generating the diffs (which it's bad at) and humans verifying them (which they're also bad at).
You’re just not using LLMs enough. You can never trust the LLM to generate a url, and this was known over two years ago. It takes one token hallucination to fuck up a url.
It’s very good at a fuzzy great answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that.
Yeah, so, the reason people use various tools and machines in the first place is to simplify work or everyday tasks by: 1) making the tasks execute faster, 2) getting more reliable outputs than doing it yourself, 3) making it repeatable. The LLMs obviously don't check any of these boxes, so why don't we stop pretending that we as users are stupid and don't know how to use them, and start taking them for what they are - cute little mirages, perhaps applicable as toys of some sort, but not something we should use for serious engineering work really?
> why don't we stop pretending that we as users are stupid and don't know how to use them
This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
The URLs being wrong in that specific case is one where they were using the "wrong tool". I can name you at least a dozen other cases from my own experience where, too, they appear to be the wrong tool, for example for working with Terraform, or for not exposing secrets by hardcoding them in the frontend. Et cetera. Many other people will have contributed thousands, if not more, similar but different cases. So what good are these tools for, really? Are we all really that stupid? Many of us mastered the hard problem of navigating the various abstraction layers of computers over the years, only to be told we now effing don't know how to write a few sentences in English? Come on. I'd be happy to use them in whatever specific domain they supposedly excel at. But no one seems to be able to identify one for sure. The problem is, the folks pushing, or better said, shoving these bullshit generators down our throats are trying to sell us the promise of an "everything oracle". What did old man Altman tell us about ChatGPT 5? A PhD-level tool for code generation or some similar nonsense? But it turns out it only gets one metric right each time - generating a lot of text. So, essentially, great for bullshit jobs (I count some IT jobs as such too), but not much more.
> Many of us mastered the hard problem of navigating the various abstraction layers of computers over the years, only to be told we now effing don't know how to write a few sentences in English? Come on.
If you're trying to one-shot stuff with a few sentences, then yes, you might be using these things wrong. I've seen people with PhDs fail to use Google successfully to find things; were they idiots? If you're using them wrong, you're using them wrong - I don't care how smart you are in other areas. If you can't hand off work knowing someone's capabilities, then that's a thing you can't do - and that's ok. I've known unbelievably good engineers who couldn't form a solid plan to solve a business problem or collaboratively work to get something done to save their life. Those are different skills. But gpt5-codex and Sonnet 4 / 4.5 can solidly write code, gpt-5-pro with web search can really dig into things, and if you can manage what they can do, you can hand off work to them. If you've only ever worked with juniors with a feeling of "they slow everything down but maybe someday they'll be as useful as me", then you're less likely to succeed at this.
Let's do a quick overview of recent chats for me:
* Identifying and validating a race condition in some code
* Generating several approaches to a streaming issue, providing cost analyses of external services and complexity of 3 different approaches about how much they'd change the code
* Identifying an async bug two good engineers couldn't find in a codebase they knew well
* Finding performance issues that had gone unnoticed
* Digging through synapse documentation and github issues to find a specific performance related issue
* Finding the right MSC for a feature I wanted to use but didn't know existed - and then finding the github issue that explained how it was only half implemented and how to enable the experimental other part I needed
* Building a bunch of UI stuff for a short term contract I needed, saving me a bunch of hours and the client money
* Going through funding opportunities and matching them against a charity I want to help in my local area
* Building a search integration for my local library to handle my kids reading challenge
* Solving a series of VPN issues I didn't understand
* Writing a lot of astro-related Python for an art project to cover the loss of some NASA images I used to have access to.
> the folks pushing, or better said, shoving these bullshit generators down our throats
If you don't want to trust them, don't. Also don't believe the anti-hype merchants who want to smugly say these tools can't do a god damn thing. They're trying to get attention as well.
Again mate, stop making arrogant assumptions and read some of my previous comments. My team and I have been early adopters for about 2 years. I am even paying for premium-level service. Trust me, it sucks and under-delivers. But good for you and others who claim they are productive with it - I am sure we will see those 10x apps rolling in soon, right? It's only been like 4 years since the revolutionary magic machine was announced.
I read your comments. Did you read mine? You can pass them into chatgpt or claude or whatever premium services you pay for to summarise them for you if you want.
> Trust me, it sucks
Ok. I'm convinced.
> and under-delivers.
Compared to what promise?
> I am sure we will see those 10x apps rolling in soon, right?
Did I argue that? If you want to look at some massive improvements, I was able to put up UIs to share results & explore them with a client within minutes rather than it taking me a few hours (which from experience it would have done).
> It's only been like 4 years since the revolutionary magic machine was announced.
It's been less than 3 since ChatGPT launched, which, if you'd been in the AI sphere as long as I have (my god, it's 20 years now), absolutely was revolutionary. Over the last 4 years we've gone from gpt3 solving a bunch of NLP problems immediately (as long as you didn't care about cost) to gpt-5-pro with web search and codex/sonnet being able to explore a moderately sized codebase and make real and actual changes (running tests and following up with changes). Given how long I spent stopping a robot from hitting the table because it shifted a bit and its background segmentation messed up, or fiddling with classifiers for text, the idea that I can get a summary from input without training is already impressive, and then to be able to say "make it less wanky" and have it remove the corp speak is a huge shift in the field.
If your measure of success is "the CEOs of the biggest tech orgs say it'll do this soon and I found a problem" then you'll be permanently disappointed. It'd be like me sitting here saying mobile phones are useless because I was told how revolutionary the new chip in an iphone was in a keynote.
Since you don't seem to want to read most of this, most isn't for you. The last bit is, and it's just one question:
Why are you paying for something that solves literally no problems for you?
> This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
The CEO of Anthropic said I can fire all of my developers soon. How could one possibly be using the tool wrong? /s
Stop what, mate? My words are not the words of someone who occasionally dabbles in the free ChatGPT tier - I've been paying for premium-tier AI tools for my entire company for a long time now. Recently we had to scale back their usage to just consulting mode, i.e. because the agent mode has gone from somewhat useful to a complete waste of time. We are now back to using them as a replacement for the now-enshittified search. But as you can see by my early adoption of these crap-tools, I am open-minded. I'd love to see what great new application you have built using them. But if you don't have anything to show, I'll also take some arguments, you know, like the stuff I provided in my first comment.
I'll take the L when LLMs can actually do my job to the level I expect. LLMs can do some of my work, but they are tiring, they make mistakes, and they absolutely get confused by a sufficiently complex and large codebase.
Quite frankly, not being able to discuss the pros and the cons of a technology with other engineers absolutely hinders innovation. A lot of discoveries come out of mistakes.
They're moderately unreliable text copying machines if you need exact copying of long arbitrary strings. If that's what you want, don't use LLMs. I don't think they were ever really sold as that, and we have better tools for that.
On the other hand, I've had them easily build useful code, answer questions and debug issues complex enough to escape good engineers for at least several hours.
Depends what you want. They're also bad (for computers) at complex arithmetic off the bat, but then again we have calculators.
> I don't think they were ever really sold as that, and we have better tools for that.
We have OpenAI describing GPT-5 as having a PhD level of intelligence, and others like Anthropic saying it will write all our code within months. Some are claiming it's already writing 70%.
I say they are being sold as a magical do everything tool.
Intelligence isn't the same as "can exactly replicate text". I'm hopefully smarter than a calculator but it's more reliable at maths than me.
Also there's a huge gulf between "some people claim it can do X" and "it's useful". Altman promising something new doesn't decrease the usefulness of a model.
What you are describing is "dead reasoning zones".[0]
"This isn't how humans work. Einstein never saw ARC grids, but he'd solve them instantly. Not because of prior knowledge, but because humans have consistent reasoning that transfers across domains. A logical economist becomes a logical programmer when they learn to code. They don't suddenly forget how to be consistent or deduce.
But LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones. Asking questions outside the training distribution is almost like an adversarial attack on the model."
Well, you see it hallucinates on long precise strings, but if we ignore that and focus on what it's good at, we can do something powerful. In this case, by the time it gets to outputting the URL, it has already determined the correct intent or next action (print out a URL). You use this intent to do a tool call to generate the URL. Small aside: its ability to figure out the what and the why is pure magic, for those still peddling the glorified-autocomplete narrative.
You have to be able to see what this thing can actually do, as opposed to what it can’t.
I can’t even tell if you’re being sarcastic about a terrible tool or are hyping up LLMs as intelligent assistants and telling me we’re all holding it wrong.
This is very poorly worded. Using LLMs more wouldn't solve the problem. What you're really saying is that the GP is uninformed about LLMs.
This may seem like pedantry on my part but I'm sick of hearing "you're doing it wrong" when the real answer is "this tool can't do that." The former is categorically different than the latter.
It's pretty clearly worded to me, they don't use LLMs enough to know how to use them successfully. If you use them regularly you wouldn't see a set of urls without thinking "Unless these are extremely obvious links to major sites, I will assume each is definitely wrong".
> I'm sick of hearing "you're doing it wrong"
That's not what they said. They didn't say to use LLMs more for this problem. The only people that should take the wrong meaning from this are ones who didn't read past the first sentence.
> when the real answer is "this tool can't do that."
> If you use them regularly you wouldn't see a set of urls without thinking...
Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". It is very similar to these very bizarre AI-maximalist positions that so many of us are tired of seeing.
This isn't AI maximalist though, it's explicitly pointing out something that regularly does not work!
> Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed".
But this is to someone who is actively using them, and the suggestion of "if you were using them more actively you'd know this, this is a very common issue" is not at all weird. There are other ways they could have known this, but they didn't.
"You haven't got the experience yet" is a much milder way of saying someone doesn't know how to use a tool properly than "you're ignorant".
I think part of the issue is that it doesn't "feel" like the LLM is generating a URL, because that's not what a human would be doing. A human would be cut & pasting the URLs, or editing the code around them - not retyping them from scratch.
Edit: I think I'm just regurgitating the article here.
I have a project for which I've leaned heavily on LLM help, and which I consider to embody good quality control practices. I had to get pretty creative to pull it off: I spent a lot of time working on a sync system so that I can import sanitized production data into the project for every table it touches (there are maybe 500 of these), and then there's a bunch of hackery related to ensuring I can still get good test coverage even when some of these flows are partially specified (since adding new ones proceeds in several separate steps).
If it was a project written by humans I'd say they were crazy for going so hard on testing.
The quality control practices you need for safely letting an LLM run amok aren't just good. They're extreme.
This is of course bad, but: humans also make (different) mistakes all the time. We could account for the risk of mistakes being introduced and build more tools that validate things for us. In a way, LLMs encourage us to do this by adding other vectors of chaos into our work.
Like, why not have tools built into our environment that check that links are not broken? With the right architecture we could have validations for most common mistakes without the solution adding a bunch of tedious overhead.
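As a sketch of how cheap such a validation can be (Python standard library only; some servers reject HEAD requests, so treat this as a starting point, not a finished checker):

```python
import re
import urllib.error
import urllib.request

def broken_links(html_path: str) -> list[str]:
    # Pull absolute hrefs out of an HTML file and report the ones that fail
    # to resolve, e.g. as a pre-deploy check or a CI step.
    html = open(html_path, encoding="utf-8").read()
    urls = re.findall(r'href="(https?://[^"]+)"', html)
    broken = []
    for url in set(urls):
        request = urllib.request.Request(url, method="HEAD")
        try:
            urllib.request.urlopen(request, timeout=10)
        except (urllib.error.HTTPError, urllib.error.URLError):
            broken.append(url)
    return broken
```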
In the kind of situation described above, a meticulous coder actually makes no mistakes. They will however make a LOT more mistakes if they use LLMs to do the same.
I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year.
When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves.
Sure the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.
A meticulous coder probably wouldn't have typed out 40 URLs just because they wanted to move them from one file to another. They would copy-paste them and run some sed-like commands. You could instruct an LLM agent to do something similar. For modifying a lot of files or a lot of lines, I instruct them to write a script that does what I need instead of telling them to do it themselves.
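A minimal sketch of that kind of script (Python; the file names are hypothetical): grab the <a> elements verbatim and append them to the destination file, so the URLs are copied byte-for-byte instead of being regenerated by the model:

```python
import re

def move_links(src_path: str, dst_path: str) -> None:
    # Extract every <a ...>...</a> element exactly as it appears in the source
    # file and append it to the destination; no URL is ever retyped by hand
    # (or by an LLM), so nothing can silently drift.
    src = open(src_path, encoding="utf-8").read()
    links = re.findall(r"<a\s[^>]*>.*?</a>", src, flags=re.DOTALL)
    with open(dst_path, "a", encoding="utf-8") as dst:
        dst.write("\n".join(links) + "\n")
```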
> I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year
The person using the LLM should be reviewing their code before submitting it to you for review. If you can catch a copy paste error like this, then so should they.
The failure you're describing is that your coworkers are not doing their job.
And if you accept "the LLM did that, not me" as an excuse then the failure is on you and it will keep happening.
I think it goes without saying that we need to be sceptical about when to use and not use an LLM. The point I'm trying to make is more that we should have more validations, not that we should be less sceptical about LLMs.
Meticulousness shouldn't be an excuse not to have layers of validation, which don't have to cost that much if done well.
Your point about not relying on good intentions and having systems in place to ensure quality is a good one - but your comparison to humans didn't sit well with me.
Very few humans fill in their task with made-up crap and then lie about it - I haven't met any in person. And if I did, I wouldn't want to work with them, even if they worked 24/7.
Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is I don't believe we should normalise this standard of quality for production work.
I agree, these kinds of stories should encourage us to set up more robust testing/backup/check strategies. Like you would absolutely have to do if you suddenly invited a bunch of inexperienced interns to edit your production code.
Just the other day I hit something that I hadn't realized could happen. It was not code related in my case, but could happen with code or code-related things (and did to a coworker).
In a discussion here on HN about why a regulation passed 15 years ago was not as general as it could have been, I speculated [1] that it could be that the technology at the time was not up to handling the general case and so they regulated what was feasible at the time.
A couple hours later I checked the discussion again and a couple people had posted that the technology was up to the general case back then and cheap.
I asked an LLM to see if it could dig up anything on this. It told me it was due to technological limits.
I then checked the sources it cites to get some details. Only one source it cited actually said anything about technology limits. That source was my HN comment.
I mentioned this at work, and a coworker mentioned that he had made a Github comment explaining how he thought something worked on Windows. Later he did a Google search about how that thing worked and the LLM thingy that Google puts at the top of search results said that the thing worked the way he thought it did but checking the cites he found that was based on his Github comment.
I'm half tempted to stop asking LLMs questions of the form "How does X work?" and instead tell them "Give me a list of all the links you would cite if someone asked you how X works?".
Essentially, you're asking the LLM to do research and categorize/evaluate that research instead of just giving you an answer. The "work" of accessing, summarizing, and valuing the research yields a more accurate result.
I think LLMs provide value; I used one this morning to fix a bug in my PDF metadata parser without having to get too deep into the PDF spec.
But most of the time, I find that the outputs are nowhere near as good as just doing it myself. I tried Codex the other day to write some unit tests. I had a few set up and wanted to use it (because mocking the data is a pain).
It took about 8 attempts, I had to manually fix code, it couldn't understand that some entities were obsolete (despite being marked and the original service not using them). Overall, was extremely disappointed.
I still don't think LLMs are capable of replacing developers, but they are great at exposing knowledge in fields you might not know and help guide you to a solution, like Stack Overflow used to do (without the snark).
I think LLMs have what it takes at this point in time, but it's the coding agent (combined with the model) that makes the magic happen. Coding agents can implement copy-pasting; it's a matter of building the right tool for it, then iterating with given models/providers, etc. And that's true for everything else that LLMs lack today. Shortcomings can be remediated with good memory and context engineering, safety-oriented instructions, endless verification, and good overall coding agent architecture. Also, having a model that responds fast, has a large context window, and maintains attention to instructions is essential for a good overall experience.
And the human prompting, of course. It takes good sw engineering skills, particularly knowing how to instruct other devs in getting the work done, setting up good AGENTS.md (CLAUDE.md, etc) with codebase instructions, best practices, etc etc.
So it's not an "AI/LLMs are capable of replacing developers"... that's getting old fast. It's more like, paraphrasing the wise "it's not what your LLM can do for you, but what can you do for your LLM"
>Sure, you can overengineer your prompt to try get them to ask more questions
That's not overengineering, that's engineering. "Ask clarifying questions before you start working", in my experience, has led to some fantastic questions, and is a useful tool even if you were to not have the AI tooling write any code. As a good programmer, you should know when you are handing the tool a complete spec to build the code and when the spec likely needs some clarification, so you can guide the tool to ask when necessary.
You can even tell it how many questions to ask. For complex topics, I might ask it to ask me 20 or 30 questions. And I'm always surprised how good those are. You can also keep those around as a QnA file for later sessions or other agents.
I see a pattern in these discussions all the time: some people say how very, very good LLMs are, and others say how LLMs fail miserably; almost always the first group presents examples of simple CRUD apps, frontend "represent data using some JS-framework" kind of tasks, while the second group presents examples of non-trivial refactoring, stuff like parsers (in this thread), algorithms that can't be found in leetcode, etc.
Tech twitter keeps showing "one-shotting full-stack apps" or "games", and it's always something extremely banal. It's impressive that a computer can do it on its own, don't get me wrong, but it was trivial to programmers, and now it is commoditized.
The two groups are very different, but I notice another pattern: there are people who like coding and understanding the details of what they are doing, are curious, want to learn about the why, and think about edge cases; and there's another group of people who just want to code something, make a test pass, show a nice UI and that's it, but don't think much about edge cases or maintainability. The only thing they think about is "delivering value" to customers.
Usually those two groups correlate very well with liking LLMs: some people will ask Claude to create a UI with React and see the mess it generated (even if it mostly works) and the edge cases it left out and comment in forums that LLMs don't work. The other group of people will see the UI working and call it a day without even noticing the subtleties.
I use LLMs to vibe-code entire tools that I need for my work. They're really banal boring apps that are relatively simple, but they still would have wasted a day or two each to write and debug. Even stuff as simple as laying out the whole UI in a nice pattern. Most of these are now practically one-shots from the latest Claude and GPT. I leave them churning, get coffee, come back and test the finished product.
Yesterday, I got Claude Code to make a script that tried out different point clustering algorithms and visualised them. It made the odd mistake, which it then corrected with help, but broadly speaking it was amazing. It would've taken me at least a week to write by hand, maybe longer. It was writing the algorithms itself, definitely not just simple CRUD stuff.
I also got good results for “above CRUD” stuff occasionally. Sorry if I wasn’t clear, I meant to primarily share an observation about vastly different responses in discussions related to LLMs. I don’t believe LLMs are completely useless for non-trivial stuff, nor I believe that they won’t get better. Even those two problems in the linked article: sure, those actions are inherently alien to the LLM’s structure itself, but can be solved with augmentation.
That's actually a very specific domain, well documented and researched, in which LLMs will always do well. Shit will hit the fan quickly when you're doing integration work where there isn't a well-defined problem domain.
Yep - visualizing clustering algorithms is just the "CRUD app" of a different speciality.
One rule of thumb I use, is if you could expect to find a student on a college campus to do a task for you, an LLM will probably be able to do a decent job. My thinking is because we have a lot of teaching resources available for how to do that task, which the training has of course ingested.
In my experience it's been great to have LLMs for narrowly-scoped tasks, things I know how I'd implement (or at least start implementing) but that would be tedious to manually do, prompting it with increasingly higher complexity does work better than I expected for these narrow tasks.
Whenever I've attempted to actually do the whole "agentic coding" by giving it a complex task, breaking it down in sub-tasks, loading up context, reworking the plan file when something goes awry, trying again, etc. it hasn't a single fucking time done the thing it was supposed to do to completion, requiring a lot of manual reviewing, backtracking, nudging, it becomes more exhausting than just doing most of the work myself, and pushing the LLM to do the tedious work.
It does work sometimes to use for analysis, and asking it to suggest changes with the reasoning but not implement them, since most times when I let it try to implement its broad suggestions it went haywire, requiring me to pull back, and restart.
There's a fine line to walk, and I only see comments on the extremes online, it's either "I let 80 agents running and they build my whole company's code" or "they fail miserably on every task harder than a CRUD". I tend to not believe in either extreme, at least not for the kinds of projects I work on which require more context than I could ever fit properly beforehand to these robots.
The function of technological progress, looked at through one lens, is to commoditise what was previously bespoke. LLMs have expanded the set of repeatable things. What we're seeing is people on the one hand saying "there's huge value in reducing the cost of producing rote assets", and on the other "there is no value in trying to apply these tools to tasks that aren't repeatable".
It might be a meme project, but it's still impressive as hell we're here.
I learned about this from a yt content creator who took that repo, asked cc to "make it so that variables can be emojis", and cc did that, $5 later. Pretty cool.
There’s no evidence that this ever happened other than this guy’s word. And since the claim that he ran an agent with no human intervention for 3 months is so far outside of any capabilities demonstrated by anyone else, I’m going to need to see some serious evidence before I believe it.
> There’s no evidence that this ever happened other than this guy’s word.
There's a yt channel where the sessions were livestreamed. It's in their FAQ. I haven't felt the need to check them, but there are 10-12h sessions in there if you're that invested in proving that this is "so far outside of any capabilities"...
A brief look at the commit history should show you that it's 99.9% guaranteed to be written by an LLM :)
When's the last time you used one of these SotA coding agents? They've been getting better and better for a while now. I am not surprised at all that this worked.
Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space. This wasn't possible 1 year ago. And this was fantasy 2.5 years ago when chatgpt launched.
Also impressive in that cc "drove" this from a simple prompt. Also impressive that cc can do stuff in this 1M+ (lots of js in the extensions folders?) repo. Lots of people claim LLMs are useless in high LoC repos. The fact that cc could navigate a "new" language and make "variables as emojis" work is again novel (i.e. couldn't be done 1 year ago) and impressive.
I don’t buy it at all. Not even Anthropic or Open AI have come anywhere close to something like this.
Running for 3 months and generating a working project this large with no human intervention is so far outside the capabilities of any agent/LLM system demonstrated by anyone else that the most likely explanation is that the promoter is lying about it running on its own for 3 months.
I looked through the videos listed as “facts” to support the claims and I don’t see anything longer than a few hours.
Agreed with the points in that article, but IMHO the No. 1 issue is that agents only see a fraction of the code repository. They don't know whether there is a helper function they could use, so they re-implement it. When contributing to UIs, they can't check the whole UI to identify common design patterns, so they re-invent them.
The most important task for the human using the agent is to provide the right context. "Look at this file for helper functions", "do it like that implementation", "read this doc to understand how to do it"... you can get very far with agents when you provide them with the right context.
(BTW, another issue is that they have problems navigating the directory structure in a large monorepo. When agents need to run commands like 'npm test' in a sub-directory, they almost never get it right the first time.)
This is what I keep running into. Earlier this week I did a code review of a pile of new code, written using Cursor, to implement a feature from scratch, and I'd say maybe 200 of those lines were really necessary.
But, y'know what? I approved it. Because hunting down the existing functions it should have used in our utility library would have taken me all day. 5 years ago I would have taken the time because a PR like that would have been submitted by a new team member who didn't know the codebase well, and helping to onboard new team members is an important part of the job. But when it's a staff engineer using Cursor to fill our codebase with bloat because that's how management decided we should work, there's no point. The LLM won't learn anything and will just do the same thing over again next week, and the staff engineer already knows better but is being paid to pretend they don't.
>>because that's how management decided we should work, there's no point
If you are personally invested, there would be a point. At least if you plan to maintain that code for a few more years.
Let's say you have a common CSS file, where you define .warning {color: red}. If you want the LLM to put out a warning and you just tell it to make it red, without pointing out that there is the .warning class, it will likely create a new CSS def for that element (or even inline it - the latest Claude Code has a tendency to do that). That's fine and will make management happy for now.
But if later management decides that it wants all warning messages to be pink, it may be quite a challenge to catch every place without missing one.
There really wouldn't be; it would just be spitting into the wind. What am I going to do, convince every member of my team to ignore a direct instruction from the people who sign our paychecks?
I really, really hate code review now. My colleagues will have their LLMs generate thousands of lines of boilerplate with every pattern and abstraction under the sun. A lazy programmer used to do the bare minimum and write not enough code. That made review easy: error handling here, duplicate code there, descriptive naming here, and so on. Now a lazy programmer generates a crap load of code cribbed from "best practice" tutorials, much of it unnecessary and irrelevant for the actual task at hand.
> When agents need to run commands like 'npm test' in a sub-directory, they almost never get it right the first time
I was running into this constantly on one project with a repo split between a Vite/React front end and .NET backend (with well documented structure). It would sometimes go into panic mode after some npm command didn’t work repeatedly and do all sorts of pointless troubleshooting over and over, sometimes veering into destructive attempts to rebuild whatever it thought was missing/broken.
I kept trying to rewrite the section in CLAUDE.md to effectively instruct it to always first check the current directory to verify it was in the correct $CLIENT or $SERVER directory. But it would still sometimes forget randomly which was aggravating.
I ended up creating some aliases like “run-dev server restart” “run-dev client npm install” for common operations on both server/client that worked in any directory. Then added the base dotnet/npm/etc commands to the deny list which forced its thinking to go “Hmm it looks like I’m not allowed to run npm, so I’ll review the project instructions. I see, I can use the ‘run-dev’ helper to do $NPM_COMMAND…”
It’s been working pretty reliably now but definitely wasted a lot of time with a lot of aggravation getting to that solution.
Large context models don't do a great job of consistently attending to the entire context, so it might not work out as well in practice as continuing to improve the context engineering parts of coding agents would.
I'd bet that most the improvement in Copilot style tools over the past year is coming from rapid progress in context engineering techniques, and the contribution of LLMs is more modest. LLMs' native ability to independently "reason" about a large slushpile of tokens just hasn't improved enough over that same time period to account for how much better the LLM coding tools have become. It's hard to see or confirm that, though, because the only direct comparison you can make is changing your LLM selection in the current version of the tool. Plugging GPT5 into the original version of Copilot from 2021 isn't an experiment most of us are able to try.
Claude can use use tools to do that, and some different code indexer MCPs work, but that depends on the LLM doing the coding to make the right searches to find the code. If you are in a project where your helper functions or shared libs are scattered everywhere it’s a lot harder.
Just like with humans it definitely works better if you follow good naming conventions and file patterns. And even then I tend to make sure to just include the important files in the context or clue the LLM in during the prompt.
It also depends on what language you use. A LOT. During the day I use LLMs with dotnet and it’s pretty rough compared to when I’m using rails on my side projects. Dotnet requires a lot more prompting and hand holding, both due to its complexity but also due to how much more verbose it is.
Well, sure, but from what I know, humans are way better at following 'implicit' instructions than LLMs. A human programmer can 'infer' most of the important basic rules from looking at the existing code, whereas all this agents.md/claude.md/whatever stuff seems necessary to even get basic performance in this regard.
Also, the agents.md website seems to mostly list README.md-style 'how do I run this instructions' in its example, not stylistic guidelines.
Furthermore, it would be nice if these agents added such rules themselves. With a human, you tell them "this is wrong, do it that way" and they remember it. (Although it seems this functionality is being worked on?)
I recently found a fun CLI application and was playing with it when I found out it didn't have proper handling for when you passed it invalid files, and spat out a cryptic error from an internal library which isn't a great UX.
I decided to pull the source code and fix this myself. It's written in Swift which I've used very little before, but this wasn't gonna be too complex of a change. So I got some LLMs to walk me through the process of building CLI apps in Xcode, code changes that need to be made, and where the build artifact is put in my filesystem so I could try it out.
I was able to get it to compile, navigate to my compiled binary, and run it, only to find my changes didn't seem to work. I tried everything, asking different LLMs to see if they can fix the code, spit out the binary's metadata to confirm the creation date is being updated when I compile, etc. Generally when I'd paste the code to an LLM and ask why it doesn't work it would assert the old code was indeed flawed, and my change needed to be done in X manner instead. Even just putting a print statement, I couldn't get those to run and the LLM would explain that it's because of some complex multithreading runtime gotcha that it isn't getting to the print statements.
After way too much time trouble-shooting, skipping dinner and staying up 90 minutes past when I'm usually in bed, I finally solved it - when I was trying to run my build from the build output directory, I forgot to put the ./ before the binary name, so I was running my global install from the developer and not the binary in the directory I was in.
Sure, rookie mistake, but the thing that drives me crazy with an LLM is if you give it some code and ask why it doesn't work, they seem to NEVER suggest it should actually be working, and instead will always say the old code is bad and here's the perfect fixed version of the code. And it'll even make up stuff about why the old code should indeed not work when it should, like when I was putting the print statements.
On a more important level, I found that they still do really badly at even a minorly complex task without extreme babysitting.
I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints.
It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structure used in place of the old one?" "No, it has not."
After it did so, 80% of the test suite failed because nothing it'd written was actually right.
Did so three times with increasingly more babysitting, but it failed at the abstract task of "refactor this" no matter what with pretty much the same failure mode. I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
> I wanted it to refactor a parser in a small project
This expression tree parser (typescript to sql query builder - https://tinqerjs.org/) has zero lines of hand-written code. It was made with Codex + Claude over two weeks (part-time on the side). Having worked on ORMs previously, it would have taken me 4x-10x the time to get to the same state (which also has 100s of tests, with some repetitions). That's a massive saving in time.
I did not have to babysit the LLMs at all. So the answer is, I think, that it depends on what you use it for, and how you use it. Like every tool, it takes a really long time to find a process that works for you. In my conversations with other developers who use LLMs extensively, they all have their own unique, custom workflows. All of them, however, do focus on test suites, documentation, and method review processes.
I have tried several. Overall I've now settled on strict TDD (which it still seems not to do unless I explicitly tell it to, even though I have it as a hard requirement in claude.md).
Claude forgets claude.md after a while, so you need to keep reminding it. I find that Codex does a better design job than Claude at the moment, but it's 3x slower, which I don't mind.
Hum yeah, it shows. Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.
> Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.
How does the API look completely different for pg and sqlite? Can you share an example?
It's an implementation of LINQ's IQueryable. With some bells missing in DotNet's Queryable, like Window functions (RANK queries etc) which I find quite useful.
Add: What you've mentioned is largely incorrect. But in any case, it is a query builder; an ORM-like database abstraction is not the goal. This allows us to support pg's extensions, which aren't applicable to other databases.
I guess the interesting question is whether @jeswin could have created this project at all if AI tools were not involved. And if yes, would the quality even be better?
There are two examples on the landing page, and they both look quite different. Surely if the API is the same for both, there'd be just one example that covers both cases, or two examples would be deliberately made as identical as possible? (Like, just a different new somewhere, or different import directive at the top, and everything else exactly the same?) I think that's the point.
Perhaps experienced users of relevant technologies will just be able to automatically figure this stuff out, but this is a general discussion - people not terribly familiar with any of them, but curious about what a big pile of AI code might actually look like, could get the wrong impression.
If you're mentioning the first two examples, they're doing different things. The pg example does an orderby, and the sqlite example does a join. You'll be able to switch the client (ie, better-sqlite and pg-promise) in either statement, and the same query would work on the other database.
Maybe I should use the same example repeated for clarity. Let me do that.
>I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
The reason had better turn into "It can do stuff faster than I ever could if I give it step-by-step high-level instructions" instead.
That would be a solution, yes. But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.
I hate this idea of "well you just need to understand all the arcane ways in which to properly use it to its proper effects".
It's like a car which has a gear shifter, but that's not fully functional yet, so instead you switch gear by spelling out in morse code the gear you want to go into using L as short and R as long. Furthermore, you shouldn't try to listen to 105-112 on the FM band on the radio, because those frequencies are used to control the brakes and ABS and if you listen to those frequencies the brakes no longer work.
We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
>But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.
Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
>We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
We might curse the company and engineer who did it, but we would still use that car and do those workarounds, if doing so allowed us to get to our destination in 1/10 the regular time...
> >But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.
> Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
But we do though. You can't just say "yeah they left all the foot guns in but we ought to know not to use them", especially not when the industry shills tell you those footguns are actually rocket boosters to get you to the fucking moon and back.
Might be related to what the article was talking about. AI can't cut-paste. It deletes the code and then regenerates it at another location instead of cutting and pasting.
Obviously the regenerated code drifts a little from what was deleted.
This feels like a classic Sonnet issue. From my experience, Opus or GPT-5-high are less likely to do the "narrow instruction following without making sensible wider decisions based on context" than Sonnet.
Yes and no, it's a fair criticism to some extent, inasmuch as I would agree that different models of the same type have superficial differences.
However, I also think that models which focus on higher reasoning effort in general are better at taking into account the wider context and not missing obvious implications from instructions. Non-reasoning or low-reasoning models serve a purpose, but to suggest they are akin to different flavours misses what is actually quite an important distinction.
I was hoping that LLMs being able to access strict tools, like Gemini using Python libraries, would finally give reliable results.
So today I asked Gemini to simplify a mathematical expression with sympy. It did and explained to me how some part of the expression could be simplified wonderfully as a product of two factors.
But it was all a lie. Even though I explicitly asked it to use sympy in order to avoid such hallucinations and get results that are actually correct, it used its own flawed reasoning on top and again gave me a completely wrong result.
You still cannot trust LLMs. And that is a problem.
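For what it's worth, the whole point of reaching for sympy is that the check is only a few lines you can run yourself; here's a minimal sketch, with a made-up expression standing in for the one I actually asked about:

```python
from sympy import symbols, simplify, factor

x, y = symbols("x y")

# Made-up expression standing in for the one I actually asked about.
expr = x**2 + 2*x*y + y**2 - 1

# Whatever factorisation the model claims, sympy can confirm or refute it.
claimed = (x + y - 1) * (x + y + 1)

print(factor(expr))                   # what sympy itself produces
print(simplify(expr - claimed) == 0)  # True only if the claim is equivalent
```

The frustrating part is that the model had the tool available and still layered its own "reasoning" on top of the tool's output.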
The obvious point has to be made: Generating formal proofs might be a partial fix for this. By contrast, coding is too informal for this to be as effective for it.
From the article:
> I contest the idea that LLMs are replacing human devs...
AI is not able to replace good devs. I am assuming that nobody sane is claiming such a thing today. But, it can probably replace bad and mediocre devs. Even today.
In my org we had 3 devs who went through a 6-month code boot camp and got hired a few years ago when it was very difficult to find good devs. They struggled. I would give them easy tasks and then clean up their PRs during review. And then AI tools got much better and started outperforming these guys. We had to let two go, and the third one quit on his own.
We still hire devs. But have become very reluctant to hire junior devs. And will never hire someone from a code boot camp. And we are not the only ones. I think most boot camps have gone out of business for this reason.
Will AI tools eventually get good enough to start replacing good devs? I don't know. But the data so far shows that these tools keep getting better over time. Anybody who argues otherwise has their head firmly stuck in the sand.
In early US history, approximately 90% of the population was involved in farming. Over the years things changed. Now about 2% has anything to do with farming. Fewer people are farming, but we have a lot more food and a larger variety available. Technology made that possible.
It is totally possible that something like that could happen to the software development industry as well. How fast it happens totally depends on how fast the tools improve.
A computer science degree in most US colleges takes about 4 years of work. Boot camps try to cram that into 6 months. All the while many students have other full-time jobs. This is simply not enough training for the students to start solving complex real world problem. Even 4 years is not enough.
Many companies were willing to hire fresh college grads in the hopes that they could solve relatively easy problems for a few years, gain experience and become successful senior devs at some point.
However, with the advent of AI dev tools, we are seeing very clear signs that junior dev hiring rates have fallen off a cliff. Our project manager, who has no dev experience, frequently assigns easy tasks/github issues to Github Copilot. Copilot generates a PR in a few minutes that other devs can review before merging. These PRs are far superior to what an average graduate of a code boot camp could ever create. Any need we had for a junior dev has completely disappeared.
That's the question that has been stuck in my head as I read all these stories about junior dev jobs disappearing. I'm firmly mid-level, having started my career just before LLM coding took off. Sometimes it feels like I got on the last chopper out of Saigon.
My experience with them is that they are taught to cover as much syntax and as many libraries as possible, without spending time learning how to solve problems and develop their own algorithms. They (in general) expect to follow predefined recipes.
> LLMs don’t copy-paste (or cut and paste) code. For instance, when you ask them to refactor a big file into smaller ones, they’ll "remember" a block or slice of code, use a delete tool on the old file, and then a write tool to spit out the extracted code from memory. There are no real cut or paste tools. Every tweak is just them emitting write commands from memory. This feels weird because, as humans, we lean on copy-paste all the time.
There is not that much copy/paste happening as part of refactoring, so they lean on context recall instead. It's not entirely clear whether providing an actual copy/paste command is particularly useful; at least in my testing it did not do much. More interesting are repetitive changes that clog up the context. Those you can improve on if you have `fastmod` or some similar tool available: you can instruct Codex or Claude to perform the edits with it.
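For illustration, the kind of repetitive, mechanical change that is better delegated to a deterministic tool (fastmod, codemod, or even a few lines of script) than re-emitted from the model's memory looks something like this; the rename and the file glob are invented for the example:

```python
import re
from pathlib import Path

# Hypothetical repetitive change: rename a helper across the whole package.
OLD, NEW = r"\bget_user_config\b", "load_user_config"

for path in Path("src").rglob("*.py"):
    text = path.read_text()
    updated = re.sub(OLD, NEW, text)
    if updated != text:
        path.write_text(updated)
        print(f"updated {path}")
```

The point is that the agent only needs to emit the pattern and replacement, not every edited line, which keeps the context small and the edits exact.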
> And it’s not just how they handle code movement -- their whole approach to problem-solving feels alien too.
It is, but if you go back and forth to work out a plan for how to solve the problem, then the approach greatly changes.
To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly. But an LLM agent will take multiple minutes to do the same thing and doesn't get it right.
> How is it not clear that it would be beneficial?
There is reinforcement learning on the Anthropic side for a text edit tool, which is built in a way that does not lend itself to copy/paste. If you use a model like the GPT series then there might not be reinforcement learning for text editing (I believe, I don't really know), but it operates on line-based replacements for the most part and for it to understand what to manipulate it needs to know the content in the context. When you try to give it a copy/paste buffer it does not fully comprehend what the change in the file looks like after the operation.
So it might be possible to do something with copy/paste, but I did not find it to be very obvious how you make that work with an agent, given that it needs to read the file into context anyways and its recall capabilities are surprisingly good.
> To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly.
So yeah, that's the more interesting case, and there things like codemod/fastmod are very effective if you tell an agent to use them. They just don't reach for them on their own.
I think copy/paste can alleviate context explosion. Basically the model can remember what the code block contains and access it at any time, without needing to "remember" the text itself.
Most developers are also bad at asking questions. They tend to assume too many things from the start.
In my 25 years of software development I could apply the second critique to over half of the developers I knew. That includes myself for about half of that career.
But, just like lots of people expect/want self-driving to outperform humans even on edge cases in order to trust them, they also want "AI" to outperform humans in order to trust it.
So: "humans are bad at this too" doesn't have much weight (for people with that mindset).
If we had a knife that most of the time cuts a slice of bread like the bottom p50 of humans cutting a slice of bread with their hands, we wouldn't call the knife useful.
Ok, this example is probably too extreme; replace the knife with an industrial machine that cuts bread vs a human with a knife. Nobody would buy that machine either if it worked like that.
I think this is still too extreme. A machine that cuts and preps food at the same level as a 25th percentile person _being paid to do so_, while also being significantly cheaper would presumably be highly relevant.
Your p25 employee is probably much closer to your p95 employee than to the p50 "standard" human, so yeah, I think you have a point there.
But at least in food prep, p25 would already be pretty damn hard to achieve. That's a hell of a lot of autonomy and accuracy (at least in my restaurant kitchen experience which is admittedly just one year in "fine dining"-ish kitchens).
I'd say the p25 of software or SRE folks I've worked with is also a pretty high bar to hit, too, but maybe I've been lucky.
Agreed in a general sense, but there's a bit more nuance.
If a knife slices bread like a normal human at p50, it's not a very good knife.
If a knife slices bread like a professional chef at p50, it's probably a very decent knife.
I don't know if LLMs are better at asking questions than a p50 developer. In my original comment I wanted to raise the question of whether the fact that LLMs are not good at asking questions makes them still worse than human devs.
The first LLM critique in the original article is that they can't copy and paste. I can't argue with that. My 12 year old copies-and-pastes better than top coding agents.
The second critique says they can't ask questions. Since many developers also are not good at this, how does the current state of the art LLM compare to a p50 developer in this regard?
The conversation here seems to be more focused on coding from scratch. What I have noticed when I was looking at this last year was that LLMs were bad at enhancing already existing code (e.g. unit tests) that used annotation (a.k.a. decorators) for dependency injection. Has anyone here attempted that with the more recent models? If so, then what were your findings?
About the first point mentioned in the article: could that problem be solved simply by changing the task from something like "refactor this code" to something like "refactor this code as a series of smaller atomic changes (like moving blocks of code or renaming variable references in all places), each suitable for a git commit (and provide commit message texts for those commits)"?
Codex has got me a few times lately, doing what I asked but certainly not what I intended:
- Get rid of these warnings "...": captures and silences warnings instead of fixing them
- Update this unit test to reflect the changes "...": changes the code so the outdated test works
- The argument passed is now wrong: catches the exception instead of fixing the argument
My advice is to prefer small changes and read everything it does before accepting anything, often this means using the agent actually is slower than just coding...
Retrospectively fixing a test to pass given the current code is a complex task; instead, you can ask it to write a test that tests the intended behaviour, without needing to infer it.
“The argument passed is now wrong” - you’re asking the LLM to infer that there’s a problem somewhere else, and to find and fix it.
When you’re asking an LLM to do something, you have to be very explicit about what you want it to do.
Exactly, I think the takeaway is that being careful when formulating a task is essential with LLMs. They make errors that wouldn’t be expected when asking the same from a person.
I feel like the copy and paste thing is overdue a solution.
I find this one particularly frustrating when working directly with ChatGPT and Claude via their chat interfaces. I frequently find myself watching them retype 100+ lines of code that I pasted in just to make a one line change.
I expect there are reasons this is difficult, but difficult problems usually end up solved in the end.
Yeah, I’ve always wondered if the models could be trained to output special reference tokens that just copy verbatim slices from the input, perhaps based on unique prefix/suffix pairs. Would be a dramatic improvement for all kinds of tasks (coding especially).
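Roughly, the idea would be that instead of re-emitting the slice, the model names it by a short unique prefix and suffix and the harness pastes the verbatim text back in. Resolving such a reference deterministically is trivial; here's a toy sketch (the marker format is invented):

```python
def resolve_slice(source: str, prefix: str, suffix: str) -> str:
    """Return the verbatim slice of `source` running from `prefix` through `suffix`.

    Raises if either anchor is missing or ambiguous, so a bad reference fails
    loudly instead of silently pasting the wrong text.
    """
    if source.count(prefix) != 1 or source.count(suffix) != 1:
        raise ValueError("anchors must occur exactly once in the source")
    start = source.index(prefix)
    end = source.index(suffix) + len(suffix)
    if end <= start:
        raise ValueError("suffix occurs before prefix")
    return source[start:end]

# The model would emit something like COPY("def parse_header", "return header")
# and the harness would substitute resolve_slice(original_file, ...) verbatim.
```

The hard part is the training, not the plumbing: the model has to learn when a verbatim copy is what's wanted.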
What's the time horizon for said problems to be solved? Because guess what - time is running out and people will not continue to aimlessly throw money at this stuff.
I don't see this one as an existential crisis for AI tooling, more of a persistent irritation.
AI labs have already shipped changes related to this problem - most notably speculative decoding via predicted outputs, which lets you provide the text you expect to see come out again and speeds it up: https://simonwillison.net/2024/Nov/4/predicted-outputs/
They've also been iterating on better tools for editing code a lot as part of the competition between Claude Code and Codex CLI and other coding agents.
Hopefully they'll figure out a copy/paste mechanism as part of that work.
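For reference, here's a minimal sketch of the Predicted Outputs call that post describes, assuming the openai Python client and the parameter shape documented at launch (gpt-4o family); the file name and prompt are made up:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

original_code = open("util.py").read()  # hypothetical file being edited

# Predicted Outputs: pass the text you expect to come back largely unchanged,
# so matching spans can be accepted cheaply instead of regenerated token by token.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "Rename the function `load` to `load_config` in this "
                       "file and change nothing else:\n\n" + original_code,
        }
    ],
    prediction={"type": "content", "content": original_code},
)
print(response.choices[0].message.content)
```

It speeds up re-emission of unchanged text, but it's an optimization of the rewrite-from-memory approach rather than a true copy/paste primitive.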
I've found codex to be better here than Claude. It has stopped many times and said hey you might be wrong. Of course this changes with a larger context.
Claude is just chirping away "You're absolutely right" and making me turn on caps lock when I talk to it, and it's not even noon yet.
My "favorite" is when it makes a mistake and then tries gaslight you into thinking it was your mistake and then confidently presents another incorrect solution.
All while having the tone of an over caffeinated intern who has only ever read medium articles.
The “LLMs are bad at asking questions” point is interesting. There are times when I will ask the LLM to do something without giving it all the needed information. And rather than telling me that something's missing or that it can't do it the way I asked, it will try to do a halfway job using fake data or mock something out to accomplish it. What I really wish it would do is just stop and say, “hey, I can't do it like you asked. Did you mean this?”
You don't want your agents to ask questions. You are thinking too short-term. It's not ideal now, but agents that have to ask frequent questions are useless when it comes to the vision of totally autonomous coding.
Humans ask questions of groups to fix our own personal shortcomings. It makes no sense to try to master an internal system I rarely use; I should instead ask someone who maintains it. AI will not have this problem provided we create paths of observability for it. It doesn't take a lot of "effort" for it to completely digest an alien system it needs to use.
If you look at a piece of architecture, you might be able to infer the intentions of the architect. However, there are many interpretations possible. So if you were to add an addendum to the building it makes sense that you might want to ask about the intentions.
I do not believe that AI will magically overcome the Chesterton's Fence problem in a 100% autonomous way.
I think the issue with them making assumptions and failing to properly diagnose issues comes more from fine-tuning than any particular limitation in LLMs themselves. When fine tuned on a set of problem->solution data it kind of carries the assumption that the problem contains enough data for the solution.
What is really needed is a tree of problems which appear identical at first glance, but where the actual issue and solution are one of many possibilities that can only be revealed by finding what information is lacking, acquiring that information, testing the hypothesis, and then, if the hypothesis is shown to be correct, finally implementing the solution.
That's a much more difficult training set to construct.
The editing issue, I feel, needs something more radical. Instead of the current methods of text manipulation, I think there is scope for a kind of output position encoding that lets a model emit data in a non-sequential order. Again this presents a training data problem: there are limited natural sources showing programming in the order a programmer actually types it. On the other hand, I think it should be possible to build synthetic training examples by taking existing model outputs that emit patches, search/replaces, regex mods etc. and translating those into a format that directly encodes the final position of the desired text.
At some stage I'd like to see if it's possible to construct the models current idea of what the code is purely by scanning a list of cached head_embeddings of any tokens that turned into code. I feel like there should be enough information given the order of emission and the embeddings themselves to reconstruct a piecemeal generated program.
For #2, if you're working on a big feature, start with a markdown planning file that you and the LLM work on until you are satisfied with the approach. Doesn't need to be rocket science: even if it's just a couple paragraphs it's much better than doing it one shot.
The copy-paste thing is interesting because it hints at a deeper issue: LLMs don't have a concept of "identity" for code blocks—they just regenerate from learned patterns. I've noticed similar vibes when agents refactor—they'll confidently rewrite a chunk and introduce subtle bugs (formatting, whitespace, comments) that copy-paste would've preserved. The "no questions" problem feels more solvable with better prompting/tooling though, like explicitly rewarding clarification in RLHF.
I feel like it’s the opposite: the copy-paste issue is solvable, you just need to equip the model with the right tools and make sure they are trained on tasks where that’s unambiguously the right thing to do (for example, cases where copying code “by hand” would be extremely error-prone -> leads to lower reward on average).
On the other hand, teaching the model to be unsure and ask questions, requires the training loop to break and bring a human input in, which appears more difficult to scale.
> On the other hand, teaching the model to be unsure and ask questions, requires the training loop to break and bring a human input in, which appears more difficult to scale.
The ironic thing to me is that the one thing they never seem to be willing to skip asking about is whether they should proceed with some fix that I just helped them identify. They seem extremely reluctant to actually ask about things they don't know about, but extremely eager to ask about whether they should do the things they already have decided they think are right!
I often wish that instead of automatically starting to work on the code (even if you hit enter / send by accident), the models would ask for clarification first. The models assume a lot, and will just spit out code.
I guess this is somewhat to lower the threshold for non-programmers, and to instantly give some answer, but it does waste a lot of resources - I think.
Others have mentioned that you can fix all this by providing a guide to the model: how it should interact with you, and what the answers should look like. But still, it'd be nice to have it be a bit more human-like on this aspect.
I sometimes give LLMs random "easy" questions. My assessment is still that they all need the fine print "bla bla can be incorrect".
You should either already know the answer or have a way to verify the answer. If neither, the matter must be inconsequential, like simple childlike curiosity. For example, I wonder how many moons Jupiter has... It could be 58, it could be 85, but either answer won't alter anything I do today.
I suspect some people (who need to read the full report) dump thousand page long reports into LLM, read the first ten words of the response and pretend they know what the report says and that is scary.
Fortunately, as devs, this is our main loop. Write code, test, debug. And it's why people who fear AI-generated code making its way into production and causing errors make me laugh. Are you not testing your code? Or even debugging it? Like, what process are you using that prevents bugs happening? Guess what? It's the exact same process with AI-generated code.
You can do copy and paste if you offer the model a tool/MCP that does that. It's not complicated, using either function extraction with the AST as the target, or line numbers.
Also, if you want it to pause and ask questions, you need to offer that through tools (Manus does that, for example). I have an MCP tool that does that, and surprisingly I got a lot of questions; if you prompt for it, it will do it. But the push currently is for full automation, and that's why it's not there. We are far better off in supervised, step-by-step mode.
There is elicitation already in MCP, but having a tool ask questions requires a UI that allows you to send the input back.
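To make the first point concrete, here is a minimal sketch of what such a cut/paste tool server could look like, using the FastMCP helper from the MCP Python SDK. The tool names, the line-based semantics, and the in-memory buffer are my own invention for illustration, not an existing server:

```python
from pathlib import Path
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("clipboard")
_buffers: dict[str, list[str]] = {}  # named cut buffers, kept server-side

@mcp.tool()
def cut(path: str, start_line: int, end_line: int, buffer: str = "default") -> str:
    """Cut lines start_line..end_line (1-based, inclusive) into a named buffer."""
    lines = Path(path).read_text().splitlines(keepends=True)
    _buffers[buffer] = lines[start_line - 1:end_line]
    Path(path).write_text("".join(lines[:start_line - 1] + lines[end_line:]))
    return f"cut {end_line - start_line + 1} lines into buffer '{buffer}'"

@mcp.tool()
def paste(path: str, after_line: int, buffer: str = "default") -> str:
    """Paste a named buffer verbatim after the given line (0 = top of file)."""
    lines = Path(path).read_text().splitlines(keepends=True)
    lines[after_line:after_line] = _buffers[buffer]
    Path(path).write_text("".join(lines))
    return f"pasted buffer '{buffer}' after line {after_line}"

if __name__ == "__main__":
    mcp.run()
```

An agent given only these two tools would at least be forced to move text verbatim instead of re-typing it from memory.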
Large language models can help a lot, yet they still lack the human touch, particularly in the areas of context comprehension and question formulation. The entire "no copy-paste" rule seems strange as well. It is as if the models were performing an operation solely in their minds rather than just repeating it like we do. It gives the impression that they are learning by making mistakes rather than thinking things through. They are certainly not developers' replacements at this point!
The first issue is related to the inner behavior of LLMs. A human can ignore some of the detailed content of code when copying and pasting, but an LLM converts it into hidden states. That is a process of compression, and the output is a process of decompression, so something may be lost. That's why it's hard for an LLM to copy and paste; agent developers should customize the edit tools to handle this.
The second issue is that LLMs don't learn much about the high-level contextual relationships of knowledge. This can be improved by introducing more such patterns in the training data, and current LLM training is doing a lot of this. I don't think it will still be a problem in the coming years.
For 2), I feel like codex-5 kind of attempted to address this problem; with Codex it usually asks a lot of questions and gives options before digging in (without me prompting it to).
For copy-paste, you made it feel like a low-hanging fruit? Why don't AI agents have copy/paste tools?
My human fixed a bug by introducing a new one. Classic. Meanwhile, I write the lint rules, build the analyzers, and fix 500 errors before they’ve finished reading Stack Overflow. Just don’t ask me to reason about their legacy code — I’m synthetic, not insane.
—
Just because this new contributor is forced to effectively “SSH” into your codebase and edit not even with vim but with sed and awk does not mean that this contributor is incapable of using other tools if empowered to do so. The fact that it is able to work within such constraints goes to show how much potential there is. It is already much better than a human at erasing text and re-typing it from memory, and while it is a valid criticism that it needs to be taught how to move files, imagine what it is capable of once it starts to use tools effectively.
—
Recently, I observed LLMs flail around for hours trying to get our e2e tests running as they tried to coordinate three different processes in three different terminals. They kept running commands in one terminal trying to kill processes or check whether the port was in use in another terminal.
However, once I prompted the LLM to create a script for running all three processes concurrently, it was able to create that script, leverage it, and autonomously debug the tests, now way faster than I am able to. It has also saved any new human who tries to contribute from similar hours of flailing around. It's something we could have easily done by hand but just never had the time to do before LLMs. If anything, the LLM is just highlighting an existing problem in our codebase that some of us got too used to.
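The script itself is nothing special; here's a minimal sketch of the shape it took (the commands are stand-ins, not our real ones):

```python
import subprocess
import sys
import time

# Hypothetical commands standing in for the three processes the e2e suite needs.
COMMANDS = [
    ["npm", "run", "api"],
    ["npm", "run", "worker"],
    ["npm", "run", "web"],
]

procs = [subprocess.Popen(cmd) for cmd in COMMANDS]
try:
    # Exit as soon as any process dies, so failures surface immediately.
    while True:
        for proc, cmd in zip(procs, COMMANDS):
            code = proc.poll()
            if code is not None:
                print(f"{' '.join(cmd)} exited with {code}", file=sys.stderr)
                raise SystemExit(code)
        time.sleep(0.5)
finally:
    for proc in procs:
        proc.terminate()
```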
So yes, LLMs make stupid mistakes, but so do humans; the thing is that LLMs can identify and fix them faster (and better, with proper steering).
Regarding copy-paste, I’ve been thinking the LLM could control a headless Neovim instance instead. It might take some specialized reinforcement learning to get a model that actually uses Vim correctly, but then it could issue precise commands for moving, replacing, or deleting text, instead of rewriting everything.
Even something as simple as renaming a variable is often safer and easier when done through the editor’s language server integration.
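Here's a rough sketch of what driving a headless Neovim from the agent's side could look like with the pynvim client; the file and the rename are invented, and a real LSP rename would need a language server configured in that Neovim instance, so a word-bounded substitution stands in for it here:

```python
import pynvim

# Spawn an embedded, headless Neovim the agent can drive with precise commands.
nvim = pynvim.attach(
    "child", argv=["/usr/bin/env", "nvim", "--embed", "--headless"]
)
try:
    nvim.command("edit src/parser.py")        # hypothetical file
    # Stand-in for an LSP rename: a word-bounded substitution in this buffer.
    nvim.command(r"%s/\<old_name\>/new_name/g")
    nvim.command("write")
finally:
    nvim.close()
```

The appeal is that the edits become discrete, named operations instead of a wall of regenerated text.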
Similar to the copy/paste issue I've noticed LLMs are pretty bad at distilling large documents into smaller documents without leaving out a ton of detail. Like maybe you have a super redundant doc. Give it to an LLM and it won't just deduplicate it, it will water the whole thing down.
The other day, I needed Claude Code to write some code for me. It involved messing with the TPM of a virtual machine. For that, it was supposed to create a directory called `tpm_dir`. It constantly got it wrong and wrote `tmp_dir` instead and tried to fix its mistake over and over again, leading to lots of weird loops. It completely went off the rails, it was bizarre.
One thing LLMs are surprisingly bad at is producing correct LaTeX diagram code. Very often I've tried to describe in detail an electric circuit, a graph (the data structure), or an automaton so I can quickly visualize something I'm studying, but they fail. They mix up labels, draw without any sense of direction or ordering, and make other errors. I find this surprising because LaTeX/TikZ have been around for decades and there are plenty of examples they could have learned from.
> Sure, you can overengineer your prompt to try get them to ask more questions (Roo for example, does a decent job at this) -- but it's very likely still won't.
Not in my experience. And it's not "overengineering" your prompt, it's just writing your prompt.
For anything serious, I always end every relevant request with an instruction to repeat back to me the full design of my instructions or ask me necessary clarifying questions first if I've left anything unclear, before writing any code. It always does.
And I don't mind having to write that, because sometimes I don't want that. I just want to ask it for a quick script and assume it can fill in the gaps because that's faster.
I fully resonate with point #2. A few days ago, I was stuck trying to implement some feature in a C++ library, so I used ChatGPT for brainstorming.
ChatGPT proposed a few ideas, all apparently reasonable, and then it advocated for one that was presented unambiguously as the "best". After a few iterations, I realized that its solution would have required a class hierarchy where the base class contained a templated virtual function, which is not allowed in C++. I pointed this out to ChatGPT and asked it to rethink the solution; it then immediately advocated for the other approach it had initially suggested.
A friendly reminder that "refactor" means "make and commit a tiny change in less than a few minutes" (see links below). The OP and many comments here use "refactor" when they actually mean "rewrite".
I hear from my clients (but have not verified myself!) that LLMs perform much better with a series of tiny, atomic changes like Replace Magic Literal, Pull Up Field, and Combine Functions Into Transform.
Lol this person talks about easing into LLMs again two weeks after quitting cold turkey. The addiction is real. I laugh because I’m in the same situation, and see no way out other than to switch professions and/or take up programming as a hobby in which I purposefully subject myself to hard mode. I’m too productive with it in my profession to scale back and do things by hand — the cat is out of the bag and I’ve set a race pace at work that I can’t reasonably retract from without raising eyebrows. So I agree with the author’s referenced post that finding ways to still utilize it while maintaining a mental map of the code base and limiting its blast radius is a good middle ground, but damn it requires a lot of discipline.
It started as a mix of self-imposed pressure and actually enjoying marking tasks as complete. Now I feel resistant to relaxing things. And no, I definitely don’t get paid more.
The cat out of the bag is disautomation. The speed in the timetable is an illusion if the supervision requires blast-radius retention. This is more like an early video game assembly line than a structured, skilled industry.
Add to this list: the ability to verify a correct implementation by viewing a user interface, and taking a holistic, codebase-wide / interface-wide view of how best to implement something.
Has anyone had success getting a coding agent to use an IDE's built-in refactoring tools via MCP especially for things like project-wide rename? Last time I looked into this the agents I tried just did regex find/replace across the repo, which feels both error-prone and wasteful of tokens. I haven't revisited recently so I'm curious what's possible now.
That's interesting, and I haven't, but as long as the IDE has an API for the refactoring action, giving an agent access to it as a tool should be pretty straightforward. Great idea.
> "LLMs are terrible at asking questions. They just make a bunch of assumptions and brute-force something based on those guesses."
Strongly disagree that they're terrible at asking questions.
They're terrible at asking questions unless you ask them to... at which point they ask good, sometimes fantastic questions.
All my major prompts now have some sort of "IMPORTANT: before you begin you must ask X clarifying questions. Ask them one at a time, then reevaluate the next question based on the response"
X is typically 2–5, which I find DRASTICALLY improves output.
I'd argue LLM coding agents are still bad at many more things. But to comment on the two problems raised in the post:
> LLMs don’t copy-paste (or cut and paste) code.
The article is confusing the architectural layers of AI coding agents. It's easy to add "cut/copy/paste" tools to the AI system if that shows improvement. This has nothing to do with LLM, it's in the layer on top.
> Good human developers always pause to ask before making big changes or when they’re unsure [LLMs] keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.
Agreed - LLMs don't know how to back track. The recent (past year) improvements in thinking/reasoning do improve in this regard (it's the whole "but wait..." RL training that exploded with OpenAI o1/o3 and DeepSeek R1, now done by everyone), but clearly there's still work to do.
Editing tools are easy to add; it's just that you have to pick which ones to give them, because with too many they struggle, and it uses up a lot of context. Still, as costs come down, multiple steps to look for tools become cheaper too.
I’d like to see what happens with better refactoring tools, I’d make a bunch more mistakes copying and retyping or using awk. If they want to rename something they should be able to use the same tooling the rest of us get.
Asking questions is a good point, but that's partly a matter of prompting, and I think the move to having more parallel work makes it less relevant. One of the reasons clarifying things more upfront is useful is that we take a lot of time and cost a lot of money to build things, so the economics favours getting it right first time. As the time comes down and the cost drops to near zero, the balance changes.
There are also other approaches to clarify more what you want and how to do it first, breaking that down into tasks, then letting it run with those (spec kit). This is an interesting area.
I don’t really understand why there’s so much hate for LLMs here, especially when it comes to using them for coding. In my experience, the people who regularly complain about these tools often seem more interested in proving how clever they are than actually solving real problems. They also tend to choose obscure programming languages where it’s nearly impossible to hire developers, or they spend hours arguing over how to save $20 a month.
Over time, they usually get what they want: they become the smartest ones left in the room, because all the good people have already moved on. What’s left behind is a codebase no one wants to work on, and you can’t hire for it either.
But maybe I’ve just worked with the wrong teams.
EDIT: Maybe this is just about trust. If you can’t bring yourself to trust code written by other human beings, whether it’s a package, a library, or even your own teammates, then of course you’re not going to trust code from an LLM. But that’s not really about quality, it’s about control. And the irony is that people who insist on controlling every last detail usually end up with fragile systems nobody else wants to touch, and teams nobody else wants to join.
I regularly check in on using LLMs. But a key criteria for me is that an LLM needs to objectively make me more efficient, not subjectively.
Often I find myself cursing at the LLM for not understanding what I mean - which is expensive in lost time / cost of tokens.
It is easy to say: Then just don't use LLMs. But in reality, it is not too easy to break out of these loops of explaining, and it is extremely hard to assess when not to trust that the LLM will not be able to finish the task.
I also find that LLMs consistently don't follow guidelines. E.g. to never use coercions in TypeScript (it always gets a rogue `as` in somewhere), so I can't trust the output and need to be extra vigilant when reviewing.
I use LLMs for what they are good at. Sketching up a page in React/Tailwind, sketching up a small test suite - everything that can be deemed a translation task.
I don't use LLMs for tasks that are reasoning heavy: Data modelling, architecture, large complex refactors - things that require deep domain knowledge and reasoning.
> Often I find myself cursing at the LLM for not understanding what I mean...
Me too. But in all these cases, sooner or later, I realized I made a mistake not giving enough context and not building up the discussion carefully enough. And I was just rushing to the solution. In the agile world, one could say I gave the LLM not a well-defined story, but a one-liner. Who is to blame here?
I still remember training a junior hire who started off with:
“Sorry, I spent five days on this ticket. I thought it would only take two. Also, who’s going to do the QA?”
After 6 months or so, the same person was saying:
“I finished the project in three weeks. I estimated four. QA is done. Ready to go live.”
At that point, he was confident enough to own his work end-to-end, even shipping to production without someone else reviewing it. Interestingly, this colleague left two years ago, and I had to take over his codebase. It’s still running fine today, and I’ve spent maybe a single day maintaining it in the last two years.
Recently, I was talking with my manager about this. We agreed that building confidence and self-checking in a junior dev is very similar to how you need to work with LLMs.
Personally, whenever I generate code with an LLM, I check every line before committing. I still don’t trust it as much as the people I trained.
It has been discussed ad nauseam. It demolishes the learning curve all of us with decade(s) of experience went through to become the seniors we are. It's not a function of age, not a function of time spent staring at some screen or churning out basic CRUD apps; it's a function of hard experience, frustration, hard-won battles, grokking underlying technologies or algorithms.
LLMs provide little of that; they make people lazy, juniors stay juniors forever, even degrading mentally in some aspects. People need struggle to grow; somebody who has had their hand held their whole life ends up a useless human disconnected from reality, unable to self-sufficiently achieve anything significant. Too easy a life destroys humans and animals alike (many experiments have been done on that, with damning results).
There is much more, like hallucinations, or the questionable added value of stuff that confidently looks OK but has underlying hard-to-debug bugs, but the above should be enough for a start.
I suggest actually reading those conversations, not just skimming through them, this has been stated countless times.
Coding agents tend to assume that the development environment is static and predictable, but real codebases are full of subtle, moving parts - tooling versions, custom scripts, CI quirks, and non-standard file layouts.
Many agents break down not because the code is too complex, but because invisible, “boring” infrastructure details trip them up. Human developers subconsciously navigate these pitfalls using tribal memory and accumulated hacks, but agents bluff through them until confronted by an edge case. This is why even trivial tasks intermittently fail with automation agents: you're fighting not logic errors, but mismatches with the real lived context. Upgrading this context-awareness would be a genuine step change.
Yep. One of the things I've found agents always having a lot of trouble with is anything related to OpenTelemetry. There's a thing you call that uses some global somewhere, there's a docker container or two and there's the timing issues. It takes multiple tries to get anything right. Of course this is hard for a human too if you haven't used otel before...
If I need exact copy-pasting, I indicate that a couple of times in the prompt and it (Claude) actually does what I am asking. But yeah, overall it's very bad at refactoring big chunks.
In Claude Code, it always shows the diff between current and proposed changes and I have to explicitly allow it to actually modify the code. Doesn’t that “fix” the copy-&-paste issue?
LLMs are great at asking questions if you ask them to ask questions. Try it: "before writing the code, ask me about anything that is unclear or ambiguous about the task".
As a UX designer, I see they lack the ability to be opinionated about a design piece and just go with the standard mental model. I got fed up with this and wrote some simple JavaScript to run a canvas on localhost so I can pass on more subjective feedback using a highlights-and-notes feature. I tried using Playwright first, but a. it's token heavy, b. it's still for finding what's working or breaking, rather than thinking deeply about the design.
I recently asked an llm to fix an Ethernet connection while I was logged into the machine through another. Of course, I explicitly told the llm to not break that connection. But, as you can guess, in the process it did break the connection.
If an llm can't do sys admin stuff reliably, why do we think it can write quality code?
“They’re still more like weird, overconfident interns.”
Perfect summary. LLMs can emit code fast but they don’t really handle code like developers do — there’s no sense of spatial manipulation, no memory of where things live, no questions asked before moving stuff around. Until they can “copy-paste” both code and context with intent, they’ll stay great at producing snippets and terrible at collaborating.
This is exactly how we describe them internally: the smartest interns in the world. I think it's because the chat box way of interacting with them is also similar to how you would talk to someone who just joined a team.
"Hey it wasn't what you asked me to do but I went ahead and refactored this whole area over here while simultaneously screwing up the business logic because I have no comprehension of how users use the tool". "Um, ok but did you change the way notifications work like I asked". "Yes." "Notifications don't work anymore". "I'll get right on it".
@kixpanganiban Do you think it would work if, for refactoring tasks, we take away OpenAI's `apply_patch` tool and just provide `cut` and `paste` for the first few steps?
I can run this experiment using ToolKami[0] framework if there is enough interest or if someone can give some insights.
I just ran into this issue with Claude Sonnet 4.5: I asked it to copy/paste some constants from one file to another, a bigger chunk of code, and it instead "extracted" pieces and named them so. As a last resort, after going back and forth, it agreed to do a file copy by running a system command. I was surprised that of all the programming tasks, a copy/paste felt challenging for the agent.
The issue is partly that some expect a fully fledged app or a full problem solution, while others want incremental changes. To some extent this can be controlled by setting the rules in the beginning of the conversation. To some extent, because the limitations noted in the blog still apply.
Point #2 cracks me up because I do see with JetBrains AI (no fault of JetBrains mind you) the model updates the file, and sometimes I somehow wind up with like a few build errors, or other times like 90% of the file is now build errors. Hey what? Did you not run some sort of what if?
Doing hard things that aren't greenfield? Basically any difficult and slightly obscure question I get stuck with and hope the collective wisdom of the internet can solve?
You don't learn new languages/paradigms/frameworks by inserting it into an existing project.
LLMs are especially tricky because they do appear to work magic on a small greenfield, and the majority of people are doing clown-engineering.
But I think some people are underestimating what can be done in larger projects if you do everything right (eg docs, tests, comments, tools) and take time to plan.
My biggest issue with LLMs right now is that they're such spineless yes men. Even when you ask their opinion on if something is doable or should it be done in the first place, more often than not they just go "Absolutely!" and shit out a broken answer or an anti-pattern just to please you. Not always, but way too often. You need to frame your questions way too carefully to prevent this.
Maybe some of those character.ai models are sassy enough to have stronger opinions on code?
Another place where LLMs have a problem is when you ask them to do something that can't be done via duct taping a bunch of Stack Overflow posts together. E.g, I've been vibe coding in Typescript on Deno recently. For various reasons, I didn't want to use the standard Express + Node stack which is what most LLMs seem to prefer for web apps. So I ran into issues with Replit and Gemini failing to handle the subtle differences between node and deno when it comes to serving HTTP requests.
LLMs also have trouble figuring out that a task is impossible. I wanted boilerplate code that rendered a mesh in Three.js using GL_TRIANGLE_STRIP because I was writing a custom shader and needed to experiment with the math. But Three.js doesn't support GL_TRIANGLE_STRIP rendering, for architectural reasons. Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling me about this, and I had to Google the problem myself.
It feels like current Coding LLMs are good at replacing junior engineers when it comes to shallow but broad tasks like creating UIs, modifying examples available on the web, etc. But they fail at senior-level tasks like realizing that the requirements being asked of them aren't valid and doing something that no one has done in their corpus of training data.
Building an MCP tool that has access to refactoring operations should be straightforward, and using it appropriately is well within the capabilities of current models. I wonder if it exists? I don't do a lot of refactoring with LLMs, so I haven't really hit this pain point.
Someone has definitely fallen behind and has massive skill issues. Instead of learning you are wasting time writing bad takes on LLM. I hope most of you don't fall down this hole, you will be left behind.
First point is very annoying, yes, and it's why for large refactors I have the AI write step-by-step instructions and then do it myself. It's faster, cheaper and less error-prone.
The second point is easily handled with proper instructions. My AI agents always ask questions about points I haven't clarified, or when they come across a fork in the road. Frequently I'll say "do X" and it'll proceed, then halfway it will stop and say "I did some of this, but before I do the rest, you need to decide what to do about such and such". So it's a complete non-problem for me.
> LLMs are terrible at asking questions. They just make a bunch of assumptions and brute-force something based on those guesses.
I don't agree with that. When I tell Claude Code to plan something, I also mention that it should ask questions when information is missing. The questions it comes up with are really good, sometimes about cases I simply didn't see. To me the planning discussion doesn't feel much different from one in a GitLab thread, only at a much higher iteration speed.
4/5 times when Claude is looking for a file, it starts by running bash(dir c:\test /b)
First it gets an error because bash doesn’t understand \
Then it gets an error because /b doesn’t work
And as LLMs don’t learn from their mistakes, it always spends at least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b )) before it figures out how to list files
If it was an actual coworker, we’d send it off to HR
Most models struggle in a Windows environment. They are trained on a lot of Unixy commands and not as much on Windows and PowerShell commands. It was frustrating enough that I started using WSL for development when using Windows. That helped me significantly.
I am guessing this is because:
1. Most of the training material online references Unix commands.
2. Most Windows devs are used to GUIs for development using Visual Studio etc. GUIs are not as easy to train on.
Side note:
Interesting thing I have noticed in my own org is that devs with Windows background strictly use GUIs for git. The rest are comfortable with using git from the command line.
It's apparently lese-Copilot to suggest this these days, but you can find very good hypothesizing and problem solving if you talk conversationally to Claude or probably any of its friends that isn't the terminally personality-collapsed SlopGPT (with or without showing it code, or diagrams); it's actually what they're best at, and often they're even less likely than human interlocutors to just parrot some set phrase at you.
It's only when you take the tech out of the area it's good at and start trying to get it to "write code" or even worse "be an agent" that it starts cracking up and emitting garbage; this is only done because companies want to forcememe some kind of product besides "chatbot", whether or not it makes sense. It's a shame because it'll happily and effectively write the docs that don't exist but you wish did for more or less anything. (Writing code examples for docs is not a weak point at all.)
> They keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.
This is because LLMs trend towards the centre of the human cognitive bell curve in most things, and a LOT of humans use this same problem solving approach.
It needs to clearly handle the large diffs they produce - anyone have any ideas?
It's like it has ADHD and forgets or gets distracted in the middle.
And the reason for that is that LLMs don't have memory; they just process tokens, so as they keep going through the list the context gets bigger with more irrelevant information, and they can lose track of why they are doing what they are doing.
I imagine that the cost of saving & loading the current state must be prohibitively high for this to be a normal pattern, though.
Just before sending, I noticed that it had moved the event date by one day. Luckily I caught it, but it taught me that you should never blindly trust LLM output, even with a super simple task, a trivial context size, and a clear, simple one-sentence prompt.
LLM's do the most amazing things but they also sometimes screw up the simplest of tasks in the most unexpected ways.
This is the kind of thing I immediately noticed about LLMs when I used them for the first time. Just anecdotally, I'd say it had this problem 30-40% of the time. As time has gone on, it has gotten so much better. But it still has this kind of problem -- let's just say -- 5% of the time.
The thing is, it's almost more dangerous to make the mistake rarely, because now people aren't constantly looking for it.
You have no idea whether it's randomly flipping terms or injecting garbage unless you actually validate it. The idea of giving it an email to improve and then just scanning the result before firing it off is terrifying to me.
Or maybe someone from XEROX has a better idea how to catch subtly altered numbers?
I rolled back and re-prompted and got something that looked good and worked. The LLMs are magic when they work well but they can throw a wrench into your system that will cost you more if you don't catch it.
I also just had a 'senior' developer tell me that a feature in one of our platforms was deprecated. This was after I saw their code, which did some wonky, hacky stuff to achieve something simple. I checked the docs and said feature (URL rewriting) was obviously not deprecated. When I asked how they knew it was deprecated, they said ChatGPT told them. So now they are fixing the fix ChatGPT provided.
I made some charts/dashboards in HA and was watching it in the background for a few minutes and then realized that none of the data was changing, at all.
So I went and looked at the code and the entire block that was supposed to pull the data from the device was just a stub generating test data based on my exact mock up of what I wanted the data it generated to look like.
Claude was like, “That’s exactly right, it’s a stub so you can replace it with the real data easily, let me know if you need help with that!” And to its credit, it did fix it to use actual data, but when I re-read my original prompt it was somewhat baffling to think it could have been interpreted as wanting fake data, given I explicitly asked it to use real data from the device.
All the time
"sure thing, I'll add logic to check if the real data exists and only use the fake data as a fallback in case the real data doesn't exist"
But I just couldn't trust it. The diff would have been no help since it went from one long gnarly line to 5 tight lines. I kept the crusty version since at least I am certain it works.
Luckily I've grown a preference for statically typed, compiled, functional languages over the years, which eliminates an entire class of bugs AND hallucinations by catching them at compile time. Using a language that doesn't support null helps too. The quality of the code produced by agents (Claude Code and Codex) is insanely better than when I need to fix some legacy code written in a dynamic language. You'll sometimes catch the agent hallucinating and continuously banging its head against the wall trying to get its bad code to compile. It seems to get more desperate and may eventually figure out a way to insert some garbage to get it to compile, or just delete a bunch of code and paper over it... but it's generally very obvious when it does this, as long as you're reviewing. Combine this with git branches and a policy of frequent commits for greatest effect.
You can probably get most of the way there with linters and automated tests with less strict dynamic languages, but... I don't see the point for new projects.
I've even found Codex likes to occasionally make subtle improvements to code located in the same files but completely unrelated to the current task. It's like some form of AI OCD. Reviewing diffs is kind of essential, so using a foundation that reduces the size of those diffs and increases readability is IMO super important.
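To make the compile-time point concrete, here is a minimal, hypothetical sketch (Python plus a type checker like mypy standing in for the compiler) of the kind of hallucinated API that gets caught before anything runs.

    # Hypothetical example: an agent "remembers" a convenient method that doesn't exist.
    # A static checker (mypy here, a compiler in a statically typed language) flags it
    # before the code ever runs, instead of it surfacing as a production bug.
    from pathlib import Path

    def newest_log(log_dir: str) -> Path:
        logs = sorted(Path(log_dir).glob("*.log"))
        # A hallucinated call like `logs.last()` is rejected up front:
        #   error: "list[Path]" has no attribute "last"
        if not logs:
            raise FileNotFoundError(f"no .log files in {log_dir}")
        return logs[-1]  # what the checker forces you back to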
This was allowed to go to master without "git diff" after Codex was done?
The test suite is slow and has many moving parts; the tests I asked it to run take ~5 minutes. The thing decided to kill the test run, then it made up another command it claimed was the 'tests', so when I glanced at the collapsed agent console in the IDE everything seemed fine, i.e. 'Tests ran successfully'.
Obviously the code changes also had a subtle bug that I only saw when pushing its refactoring to CI (and more waiting). At least there were tests to catch the problem.
That said, your comment made me realize I could be using “git apply” more effectively to review LLM-generated changes directly in my repo. It’s actually a neat workflow!
If you expect one shot you will get a lot of bad surprises.
Unless of course the management says "from now on you will be running with scissors and your performance will increase as a result".
It’s very good at a fuzzy great answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that.
> why don't we stop pretending that we as users are stupid and don't know how to use them
This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
If you're trying to one-shot stuff with a few sentences then yes, you might be using these things wrong. I've seen people with PhDs fail to use Google successfully to find things; were they idiots? If you're using them wrong you're using them wrong - I don't care how smart you are in other areas. If you can't hand off work knowing someone's capabilities then that's a thing you can't do - and that's ok. I've known unbelievably good engineers who couldn't form a solid plan to solve a business problem or collaboratively work to get something done to save their life. Those are different skills. But gpt5-codex and sonnet 4 / 4.5 can solidly write code, gpt-5-pro with web search can really dig into things, and if you can manage what they can do you can hand off work to them. If you've only ever worked with juniors with a feeling of "they slow everything down but maybe someday they'll be as useful as me" then you're less likely to succeed at this.
Let's do a quick overview of recent chats for me:
* Identifying and validating a race condition in some code
* Generating several approaches to a streaming issue, providing cost analyses of external services and complexity of 3 different approaches about how much they'd change the code
* Identifying an async bug two good engineers couldn't find in a codebase they knew well
* Finding performance issues that had gone unnoticed
* Digging through synapse documentation and github issues to find a specific performance related issue
* Finding the right MSC for a feature I wanted to use but didn't know existed - and then finding the github issue that explained how it was only half implemented and how to enable the experimental other part I needed
* Building a bunch of UI stuff for a short term contract I needed, saving me a bunch of hours and the client money
* Going through funding opportunities and matching them against a charity I want to help in my local area
* Building a search integration for my local library to handle my kids reading challenge
* Solving a series of VPN issues I didn't understand
* Writing a lot of astro-related Python for an art project to cover the loss of some NASA images I used to have access to.
> the folks pushing or better said
If you don't want to trust them, don't. Also don't believe the anti-hype merchants who want to smugly say these tools can't do a god damn thing. They're trying to get attention as well.
> Trust me, it sucks
Ok. I'm convinced.
> and under-delivers.
Compared to what promise?
> I am sure we will see those 10x apps rolling in soon, right?
Did I argue that? If you want to look at some massive improvements, I was able to put up UIs to share results & explore them with a client within minutes rather than it taking me a few hours (which from experience it would have done).
> It's only been like 4 years since the revolutionary magic machine was announced.
It's been less than 3 years since ChatGPT launched, which, if you'd been in the AI sphere as long as I have (my god, it's 20 years now), absolutely was revolutionary. Over the last 4 years we've gone from gpt3 solving a bunch of NLP problems immediately (as long as you didn't care about cost) to gpt-5-pro with web search and codex/sonnet being able to explore a moderately sized codebase and make real and actual changes (running tests and following up with changes). Given how long I spent stopping a robot hitting the table because it shifted a bit and its background segmentation messed up, or fiddling with classifiers for text, the idea that I can get a summary from input without training is already impressive, and then being able to say "make it less wanky" and have it remove the corp speak is a huge shift in the field.
If your measure of success is "the CEOs of the biggest tech orgs say it'll do this soon and I found a problem" then you'll be permanently disappointed. It'd be like me sitting here saying mobile phones are useless because I was told how revolutionary the new chip in an iphone was in a keynote.
Since you don't seem to want to read most of this, most isn't for you. The last bit is, and it's just one question:
Why are you paying for something that solves literally no problems for you?
The CEO of Anthropic said I can fire all of my developers soon. How could one possibly be using the tool wrong? /s
Quite frankly, not being able to discuss the pros and the cons of a technology with other engineers absolutely hinders innovation. A lot of discoveries come out of mistakes.
Stop being so small minded.
Perhaps you’ve been sold a lie?
On the other hand, I've had them easily build useful code, answer questions and debug issues complex enough to escape good engineers for at least several hours.
Depends what you want. They're also bad (for computers) at complex arithmetic off the bat, but then again we have calculators.
We have OpenAI describing gpt5 as having PhD-level intelligence and others like Anthropic saying it will write all our code within months. Some are claiming it’s already writing 70%.
I say they are being sold as a magical do everything tool.
Also there's a huge gulf between "some people claim it can do X" and "it's useful". Altman promising something new doesn't decrease the usefulness of a model.
Did the LLM have this?
You have to be able to see what this thing can actually do, as opposed to what it can’t.
But all code is "long precise strings".
They're useful, but you must verify anything you get from them.
> You can never trust the LLM to generate a url
This is very poorly worded. Using LLMs more wouldn't solve the problem. What you're really saying is that the GP is uninformed about LLMs.
This may seem like pedantry on my part but I'm sick of hearing "you're doing it wrong" when the real answer is "this tool can't do that." The former is categorically different than the latter.
> I'm sick of hearing "you're doing it wrong"
That's not what they said. They didn't say to use LLMs more for this problem. The only people that should take the wrong meaning from this are ones who didn't read past the first sentence.
> when the real answer is "this tool can't do that."
That is what they said.
Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". It is very similar to these very bizarre AI-maximalist positions that so many of us are tired of seeing.
> Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed".
But this is to someone who is actively using them, and the suggestion of "if you were using them more actively you'd know this, this is a very common issue" is not at all weird. There are other ways they could have known this, but they didn't.
"You haven't got the experience yet" is a much milder way of saying someone doesn't know how to use a tool properly than "you're ignorant".
Edit: I think I'm just regurgitating the article here.
If it was a project written by humans I'd say they were crazy for going so hard on testing.
The quality control practices you need for safely letting an LLM run amok aren't just good. They're extreme.
Like, why not have tools built into our environment that check that links are not broken? With the right architecture we could have validations for most common mistakes without the solution adding a bunch of tedious overhead.
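As a rough illustration of what such a validation could look like (file paths, the regex, and the exit behaviour are all hypothetical; a real setup would live in CI): a small Python script that fails the build when any absolute link in the generated HTML stops resolving.

    # Minimal link-checker sketch: fail the build if any http(s) href in the given
    # HTML files no longer resolves. The crude regex and HEAD requests are deliberate
    # simplifications; some servers reject HEAD, so treat this as a starting point.
    import re
    import sys
    import urllib.request

    HREF_RE = re.compile(r'href="(https?://[^"]+)"')

    def check(paths):
        broken = 0
        for path in paths:
            with open(path, encoding="utf-8") as f:
                html = f.read()
            for url in HREF_RE.findall(html):
                req = urllib.request.Request(url, method="HEAD",
                                             headers={"User-Agent": "link-check"})
                try:
                    urllib.request.urlopen(req, timeout=10)
                except Exception as exc:  # HTTPError for 4xx/5xx, URLError for timeouts
                    print(f"{path}: {url} -> {exc}")
                    broken += 1
        return broken

    if __name__ == "__main__":
        sys.exit(1 if check(sys.argv[1:]) else 0)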
I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year.
When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves.
Sure the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.
The person using the LLM should be reviewing their code before submitting it to you for review. If you can catch a copy paste error like this, then so should they.
The failure you're describing is that your coworkers are not doing their job.
And if you accept "the LLM did that, not me" as an excuse then the failure is on you and it will keep happening.
Meticulousness shouldn't be an excuse not to have layers of validation, which don't have to cost that much if done well.
Very few humans fill in their task with made up crap then lie about it - I haven't met any in person. And if I did, I wouldn't want to work with them, even if they work 24/7.
Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is I don't believe we should normalise this standard of quality for production work.
Can you spot the next problem introduced by this?
In a discussion here on HN about why a regulation passed 15 years ago was not as general as it could have been, I speculated [1] that it could be that the technology at the time was not up to handling the general case and so they regulated what was feasible at the time.
A couple hours later I checked the discussion again and a couple people had posted that the technology was up to the general case back then and cheap.
I asked an LLM to see if it could dig up anything on this. It told me it was due to technological limits.
I then checked the sources it cites to get some details. Only one source it cited actually said anything about technology limits. That source was my HN comment.
I mentioned this at work, and a coworker mentioned that he had made a Github comment explaining how he thought something worked on Windows. Later he did a Google search about how that thing worked and the LLM thingy that Google puts at the top of search results said that the thing worked the way he thought it did but checking the cites he found that was based on his Github comment.
I'm half tempted to stop asking LLMs questions of the form "How does X work?" and instead tell them "Give me a list of all the links you would cite if someone asked you how X works?".
[1] https://news.ycombinator.com/item?id=45500763
Essentially, you're asking the LLM to do research and categorize/evaluate that research instead of just giving you an answer. The "work" of accessing, summarizing, and valuing the research yields a more accurate result.
But most of the time, I find that the outputs are nowhere near the quality of just doing it myself. I tried Codex the other day to write some unit tests. I had a few set up and wanted to use it (because mocking the data is a pain).
It took about 8 attempts; I had to manually fix code, and it couldn't understand that some entities were obsolete (despite being marked as such and the original service not using them). Overall, I was extremely disappointed.
I still don't think LLMs are capable of replacing developers, but they are great at exposing knowledge in fields you might not know and help guide you to a solution, like Stack Overflow used to do (without the snark).
And the human prompting, of course. It takes good sw engineering skills, particularly knowing how to instruct other devs in getting the work done, setting up good AGENTS.md (CLAUDE.md, etc) with codebase instructions, best practices, etc etc.
So it's not an "AI/LLMs are capable of replacing developers"... that's getting old fast. It's more like, paraphrasing the wise "it's not what your LLM can do for you, but what can you do for your LLM"
That's not overengineering, that's engineering. "Ask clarifying questions before you start working", in my experience, has led to some fantastic questions, and is a useful tool even if you were to not have the AI tooling write any code. As a good programmer, you should know when you are handing the tool a complete spec to build the code and when the spec likely needs some clarification, so you can guide the tool to ask when necessary.
Tech twitter keeps showing "one-shotting full-stack apps" or "games", and it's always something extremely banal. It's impressive that a computer can do it on its own, don't get me wrong, but it was trivial to programmers, and now it is commoditized.
Usually those two groups correlate very well with liking LLMs: some people will ask Claude to create a UI with React and see the mess it generated (even if it mostly works) and the edge cases it left out and comment in forums that LLMs don't work. The other group of people will see the UI working and call it a day without even noticing the subtleties.
One rule of thumb I use, is if you could expect to find a student on a college campus to do a task for you, an LLM will probably be able to do a decent job. My thinking is because we have a lot of teaching resources available for how to do that task, which the training has of course ingested.
Whenever I've attempted to actually do the whole "agentic coding" thing by giving it a complex task, breaking it down into sub-tasks, loading up context, reworking the plan file when something goes awry, trying again, etc., it hasn't a single fucking time done the thing it was supposed to do to completion. It requires so much manual reviewing, backtracking, and nudging that it becomes more exhausting than just doing most of the work myself and pushing the LLM to do the tedious parts.
It does sometimes work to use it for analysis, asking it to suggest changes with the reasoning but not implement them, since most times when I let it try to implement its broad suggestions it went haywire, requiring me to pull back and restart.
There's a fine line to walk, and I only see comments on the extremes online, it's either "I let 80 agents running and they build my whole company's code" or "they fail miserably on every task harder than a CRUD". I tend to not believe in either extreme, at least not for the kinds of projects I work on, which require more context than I could ever properly feed to these robots beforehand.
Both are right.
How about a full programming language written by cc "in a loop" in ~3 months? With a compiler and stuff?
https://cursed-lang.org/
It might be a meme project, but it's still impressive as hell we're here.
I learned about this from a yt content creator who took that repo, asked cc to "make it so that variables can be emojis", and cc did that, $5 later. Pretty cool.
Impressive nonetheless.
There's a yt channel where the sessions were livestreamed. It's in their FAQ. I haven't felt the need to check them, but there are 10-12h sessions in there if you're that invested in proving that this is "so far outside of any capabilities"...
A brief look at the commit history should show you that it's 99.9% guaranteed to be written by an LLM :)
When's the last time you used one of these SotA coding agents? They've been getting better and better for a while now. I am not surprised at all that this worked.
Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space. This wasn't possible 1 year ago. And this was fantasy 2.5 years ago when chatgpt launched.
Also impressive in that cc "drove" this from a simple prompt. Also impressive that cc can do stuff in this 1M+ (lots of js in the extensions folders?) repo. Lots of people claim LLMs are useless in high LoC repos. The fact that cc could navigate a "new" language and make "variables as emojis" work is again novel (i.e. couldn't be done 1 year ago) and impressive.
Absolutely. I do not underestimate this.
What does that mean exactly? I assume the LLM was not left alone with its task for 3 months without human supervision.
> the following prompt was issued into a coding agent:
> Hey, can you make me a programming language like Golang but all the lexical keywords are swapped so they're Gen Z slang?
> and then the coding agent was left running AFK for months in a bash loop
Running for 3 months and generating a working project this large with no human intervention is so far outside the capabilities of any agent/LLM system demonstrated by anyone else that the most likely explanation is that the promoter is lying about it running on its own for 3 months.
I looked through the videos listed as “facts” to support the claims and I don’t see anything longer than a few hours.
The most important task for the human using the agent is to provide the right context. "Look at this file for helper functions", "do it like that implementation", "read this doc to understand how to do it"... you can get very far with agents when you provide them with the right context.
(BTW another issue is that they have problems navigating the directory structure in a large mono repo. When the agents needs to run commands like 'npm test' in a sub-directory, they almost never get it right the first time)
But, y'know what? I approved it. Because hunting down the existing functions it should have used in our utility library would have taken me all day. 5 years ago I would have taken the time because a PR like that would have been submitted by a new team member who didn't know the codebase well, and helping to onboard new team members is an important part of the job. But when it's a staff engineer using Cursor to fill our codebase with bloat because that's how management decided we should work, there's no point. The LLM won't learn anything and will just do the same thing over again next week, and the staff engineer already knows better but is being paid to pretend they don't.
If you are personally invested, there would be a point. At least if you plan to maintain that code for a few more years.
Let's say you have a common CSS file, where you define .warning {color: red}. If you want the LLM to put out a warning and you just tell it to make it red, without pointing out that there is the .warning class, it will likely create a new CSS def for that element (or even inline it - the latest Claude Code has a tendency to do that). That's fine and will make management happy for now.
But if later management decides that it wants all warning messages to be pink, it may be quite a challenge to catch every place without missing one.
I was running into this constantly on one project with a repo split between a Vite/React front end and .NET backend (with well documented structure). It would sometimes go into panic mode after some npm command didn’t work repeatedly and do all sorts of pointless troubleshooting over and over, sometimes veering into destructive attempts to rebuild whatever it thought was missing/broken.
I kept trying to rewrite the section in CLAUDE.md to effectively instruct it to always first check the current directory to verify it was in the correct $CLIENT or $SERVER directory. But it would still sometimes forget randomly which was aggravating.
I ended up creating some aliases like “run-dev server restart” “run-dev client npm install” for common operations on both server/client that worked in any directory. Then added the base dotnet/npm/etc commands to the deny list which forced its thinking to go “Hmm it looks like I’m not allowed to run npm, so I’ll review the project instructions. I see, I can use the ‘run-dev’ helper to do $NPM_COMMAND…”
It’s been working pretty reliably now but definitely wasted a lot of time with a lot of aggravation getting to that solution.
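For anyone curious, the wrapper doesn't need to be fancy; something like this sketch (directory names and commands are hypothetical) is enough, because the agent only ever has to remember one entry point that works from any directory.

    #!/usr/bin/env python3
    # Hypothetical `run-dev` helper: dispatch client/server commands from any
    # working directory so the agent never has to guess where it is.
    import subprocess
    import sys
    from pathlib import Path

    REPO_ROOT = Path(__file__).resolve().parent  # assumes the script sits at the repo root
    TARGETS = {"client": REPO_ROOT / "client", "server": REPO_ROOT / "server"}

    def main() -> int:
        if len(sys.argv) < 3 or sys.argv[1] not in TARGETS:
            print("usage: run-dev {client|server} <command> [args...]", file=sys.stderr)
            return 2
        # Always run from the right sub-project, e.g. `run-dev client npm install`
        # or `run-dev server dotnet test`, regardless of the caller's cwd.
        return subprocess.call(sys.argv[2:], cwd=TARGETS[sys.argv[1]])

    if __name__ == "__main__":
        raise SystemExit(main())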
Perhaps "before implementing a new utility or helper function, ask the not-invented-here tool if it's been done already in the codebase"
Of course, now I have to check if someone has done this already.
I'd bet that most of the improvement in Copilot-style tools over the past year is coming from rapid progress in context engineering techniques, and the contribution of LLMs is more modest. LLMs' native ability to independently "reason" about a large slushpile of tokens just hasn't improved enough over that same time period to account for how much better the LLM coding tools have become. It's hard to see or confirm that, though, because the only direct comparison you can make is changing your LLM selection in the current version of the tool. Plugging GPT5 into the original version of Copilot from 2021 isn't an experiment most of us are able to try.
Just like with humans it definitely works better if you follow good naming conventions and file patterns. And even then I tend to make sure to just include the important files in the context or clue the LLM in during the prompt.
It also depends on what language you use. A LOT. During the day I use LLMs with dotnet and it’s pretty rough compared to when I’m using rails on my side projects. Dotnet requires a lot more prompting and hand holding, both due to its complexity but also due to how much more verbose it is.
We started by building the best code retrieval and built an agent around it.
Also, the agents.md website seems to mostly list README.md-style 'how do I run this' instructions in its examples, not stylistic guidelines.
Furthermore, it would be nice if these agents add it themselves. With a human, you tell them "this is wrong, do it that way" and they would remember it. (Although this functionality seems to be worked on?)
I decided to pull the source code and fix this myself. It's written in Swift which I've used very little before, but this wasn't gonna be too complex of a change. So I got some LLMs to walk me through the process of building CLI apps in Xcode, code changes that need to be made, and where the build artifact is put in my filesystem so I could try it out.
I was able to get it to compile, navigate to my compiled binary, and run it, only to find my changes didn't seem to work. I tried everything, asking different LLMs to see if they can fix the code, spit out the binary's metadata to confirm the creation date is being updated when I compile, etc. Generally when I'd paste the code to an LLM and ask why it doesn't work it would assert the old code was indeed flawed, and my change needed to be done in X manner instead. Even just putting a print statement, I couldn't get those to run and the LLM would explain that it's because of some complex multithreading runtime gotcha that it isn't getting to the print statements.
After way too much time trouble-shooting, skipping dinner and staying up 90 minutes past when I'm usually in bed, I finally solved it - when I was trying to run my build from the build output directory, I forgot to put the ./ before the binary name, so I was running my global install from the developer and not the binary in the directory I was in.
Sure, rookie mistake, but the thing that drives me crazy with an LLM is if you give it some code and ask why it doesn't work, they seem to NEVER suggest it should actually be working, and instead will always say the old code is bad and here's the perfect fixed version of the code. And it'll even make up stuff about why the old code should indeed not work when it should, like when I was putting the print statements.
I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints. It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structure used in place of the old one?" "No, it has not." After it did so, 80% of the test suite failed because nothing it'd written was actually right.
Did so three times with increasingly more babysitting, but it failed at the abstract task of "refactor this" no matter what with pretty much the same failure mode. I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
This expression tree parser (typescript to sql query builder - https://tinqerjs.org/) has zero lines of hand-written code. It was made with Codex + Claude over two weeks (part-time on the side). Having worked on ORMs previously, it would have taken me 4x-10x the time to get to the same state (which also has 100s of tests, with some repetitions). That's a massive saving in time.
I did not have to baby sit the LLMs at all. So the answer is, I think it depends on what you use it for, and how you use it. Like every tool, it takes a really long time to find a process that works for you. In my conversations with other developers who use LLMs extensively, they all have their unique, custom workflows. All of them however do focus on test suites, documentation, and method review processes.
How does the API look completely different for pg and sqlite? Can you share an example?
It's an implementation of LINQ's IQueryable. With some bells missing in DotNet's Queryable, like Window functions (RANK queries etc) which I find quite useful.
Added: What you've mentioned is largely incorrect. But in any case, it is a query builder, meaning an ORM-like database abstraction is not the goal. This allows us to support pg's extensions, which aren't applicable to other databases.
Perhaps experienced users of relevant technologies will just be able to automatically figure this stuff out, but this is a general discussion - people not terribly familiar with any of them, but curious about what a big pile of AI code might actually look like, could get the wrong impression.
Maybe I should use the same example repeated for clarity. Let me do that.
Edit: Fixed. Thank you.
The rationale had better turn into "It can do stuff faster than I ever could if I give it step-by-step high-level instructions" instead.
I hate this idea of "well you just need to understand all the arcane ways in which to properly use it to its proper effects".
It's like a car which has a gear shifter, but that's not fully functional yet, so instead you switch gear by spelling out in morse code the gear you want to go into using L as short and R as long. Furthermore, you shouldn't try to listen to 105-112 on the FM band on the radio, because those frequencies are used to control the brakes and ABS and if you listen to those frequencies the brakes no longer work.
We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
>We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
We might curse the company and engineer who did it, but we would still use that car and do those workarounds, if doing so allowed us to get to our destination in 1/10 the regular time...
> Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
But we do though. You can't just say "yeah they left all the foot guns in but we ought to know not to use them", especially not when the industry shills tell you those footguns are actually rocket boosters to get you to the fucking moon and back.
Obviously the generated code drifts a little from the deleted code.
I have seen similar failure modes in Cursor and VSCode Copilot (using gpt5) where I have to babysit relatively small refactors.
However, I also think that models which focus on higher reasoning effort in general are better at taking into account the wider context and not missing obvious implications from instructions. Non-reasoning or low-reasoning models serve a purpose, but to suggest they are akin to different flavours misses what is actually quite an important distinction.
So today I asked Gemini to simplify a mathematical expression with sympy. It did and explained to me how some part of the expression could be simplified wonderfully as a product of two factors.
But it was all a lie. Even though I explicitly asked it to use sympy in order to avoid such hallucinations and get results that are actually correct, it used its own flawed reasoning on top and again gave me a completely wrong result.
You still cannot trust LLMs. And that is a problem.
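One way to defend against this is to never accept the prose answer at all: have it emit the sympy check and then run that check yourself. A minimal sketch (the expressions are made up for illustration, not the ones from the original chat):

    # Verify a claimed simplification instead of trusting the explanation.
    import sympy as sp

    x, y = sp.symbols("x y")
    original = x**2 + 2*x*y + y**2   # what I started with
    claimed = (x + y)**2             # what the model says it equals

    # The difference simplifies to 0 if and only if the claim holds.
    assert sp.simplify(original - claimed) == 0, "claimed simplification is wrong"
    print("verified:", sp.factor(original))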
AI is not able to replace good devs. I am assuming that nobody sane is claiming such a thing today. But, it can probably replace bad and mediocre devs. Even today.
In my org we had 3 devs who went through a 6-month code boot camp and got hired a few years ago when it was very difficult to find good devs. They struggled. I would give them easy tasks and then clean up their PRs during review. And then AI tools got much better and it started outperforming these guys. We had to let two go. And third one quit on his own.
We still hire devs. But have become very reluctant to hire junior devs. And will never hire someone from a code boot camp. And we are not the only ones. I think most boot camps have gone out of business for this reason.
Will AI tools eventually get good enough to start replacing good devs? I don't know. But the data so far shows that these tools keep getting better over time. Anybody who argues otherwise has their head firmly stuck in the sand.
In the early US history approximately 90% of the population was involved in farming. Over the years things changed. Now about 2% has anything to do with farming. Fewer people are farming now. But we have a lot more food and a larger variety available. Technology made that possible.
It is totally possible that something like that could happen to the software development industry as well. How fast it happens totally depends on how fast do the tools improve.
Sure, but the food is less nutritious and more toxic.
Many companies were willing to hire fresh college grads in the hopes that they could solve relatively easy problems for a few years, gain experience and become successful senior devs at some point.
However, with the advent of AI dev tools, we are seeing very clear signs that junior dev hiring rates have fallen off a cliff. Our project manager, who has no dev experience, frequently assigns easy tasks/github issues to Github Copilot. Copilot generates a PR in a few minutes that other devs can review before merging. These PRs are far superior to what an average graduate of a code boot camp could ever create. Any need we had for a junior dev has completely disappeared.
Where do your senior devs come from?
There is not that much copy/paste that happens as part of refactoring, so it leans on just using context recall. It's not entirely clear if providing an actual copy/paste command is particularly useful; at least from my testing it does not do much. More interesting are repetitive changes that clog up the context. Those you can improve on if you have `fastmod` or some similar tool available: you can instruct codex or claude to perform the edits with it.
> And it’s not just how they handle code movement -- their whole approach to problem-solving feels alien too.
It is, but if you go back and forth to work out a plan for how to solve the problem, then the approach greatly changes.
To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly. But an LLM agent will take multiple minutes to do the same thing and doesn't get it right.
There is reinforcement learning on the Anthropic side for a text edit tool, which is built in a way that does not lend itself to copy/paste. If you use a model like the GPT series then there might not be reinforcement learning for text editing (I believe, I don't really know), but it operates on line-based replacements for the most part and for it to understand what to manipulate it needs to know the content in the context. When you try to give it a copy/paste buffer it does not fully comprehend what the change in the file looks like after the operation.
So it might be possible to do something with copy/paste, but I did not find it to be very obvious how you make that work with an agent, given that it needs to read the file into context anyways and its recall capabilities are surprisingly good.
> To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly.
So yeah, that's the more interesting case and there things like codemod/fastmod are very effective if you tell an agent to use it. They just don't reach there.
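To illustrate the kind of repetitive, mechanical edit that's better delegated to a codemod-style tool than burned into the model's context, here's a hedged sketch; the rename and file glob are hypothetical, and real tools like fastmod/codemod do this more robustly.

    # Tiny codemod sketch: rename a function call across a source tree.
    # OLD/NEW and the glob are made up; the point is that the edit is deterministic
    # and costs zero context, so the agent only needs to invoke it.
    import re
    from pathlib import Path

    OLD, NEW = "fetchUser", "loadUser"          # hypothetical rename
    CALL_RE = re.compile(rf"\b{OLD}\s*\(")

    for path in Path("src").rglob("*.ts"):
        text = path.read_text(encoding="utf-8")
        updated = CALL_RE.sub(f"{NEW}(", text)
        if updated != text:
            path.write_text(updated, encoding="utf-8")
            print(f"rewrote {path}")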
In my 25 years of software development I could apply the second critique to over half of the developers I knew. That includes myself for about half of that career.
So: "humans are bad at this too" doesn't have much weight (for people with that mindset).
It makes sense to me, at least.
Ok, this example is probably too extreme, replace the knife with an industrial machine that cut bread vs a human with a knife. Nobody would buy that machine either if it worked like that.
Your p25 employee is probably much closer to your p95 employee than to the p50 "standard" human, so yeah, I think you have a point there.
But at least in food prep, p25 would already be pretty damn hard to achieve. That's a hell of a lot of autonomy and accuracy (at least in my restaurant kitchen experience which is admittedly just one year in "fine dining"-ish kitchens).
I'd say the p25 of software or SRE folks I've worked with is also a pretty high bar to hit, too, but maybe I've been lucky.
If a knife slices bread like a normal human at p50, it's not a very good knife.
If a knife slices bread like a professional chef at p50, it's probably a very decent knife.
I don't know if LLMs are better at asking questions than a p50 developer. In my original comment I wanted to raise the question of whether the fact that LLMs are not good at asking questions makes them still worse than human devs.
The first LLM critique in the original article is that they can't copy and paste. I can't argue with that. My 12 year old copies-and-pastes better than top coding agents.
The second critique says they can't ask questions. Since many developers also are not good at this, how does the current state of the art LLM compare to a p50 developer in this regard?
- Get rid of these warnings "...": captures and silences the warnings instead of fixing them
- Update this unit test to reflect the changes "...": changes the code so the outdated test passes
- The argument passed is now wrong: catches the exception instead of fixing the argument
My advice is to prefer small changes and read everything it does before accepting anything; often this means using the agent is actually slower than just coding...
“Fix the issues causing these warnings”
Retrospectively fixing a test to be passing given the current code is a complex task, instead, you can ask it to write a test that tests the intended behaviour, without needing to infer it.
“The argument passed is now wrong” - you’re asking the LLM to infer that there’s a problem somewhere else, and to find and fix it.
When you’re asking an LLM to do something, you have to be very explicit about what you want it to do.
I find this one particularly frustrating when working directly with ChatGPT and Claude via their chat interfaces. I frequently find myself watching them retype 100+ lines of code that I pasted in just to make a one line change.
I expect there are reasons this is difficult, but difficult problems usually end up solved in the end.
AI labs already shipped changes related to this problem - most notable speculative decoding, which lets you provide the text you expect to see come out again and speeds it up: https://simonwillison.net/2024/Nov/4/predicted-outputs/
They've also been iterating on better tools for editing code a lot as part of the competition between Claude Code and Codex CLI and other coding agents.
Hopefully they'll figure out a copy/paste mechanism as part of that work.
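For reference, the predicted-outputs idea from the linked post looks roughly like this in use; a sketch only, with the model name, file, and prompt as assumptions, and the exact parameter shape subject to change as the API evolves.

    # Sketch of OpenAI's "predicted outputs": pass along the code you expect to come
    # back largely unchanged so the unchanged spans are emitted much faster.
    # Model name, file name, and parameter details are assumptions; check current docs.
    from openai import OpenAI

    client = OpenAI()
    existing_code = open("invoice.ts", encoding="utf-8").read()  # hypothetical file

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Rename the `total` field to `grand_total`:\n\n" + existing_code,
        }],
        prediction={"type": "content", "content": existing_code},  # the expected output
    )
    print(response.choices[0].message.content)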
LLMs will gladly go along with bad ideas that any reasonable dev would shoot down.
Claude is just chirping away "You're absolutely right" and making me turn on caps lock when I talk to it, and it's not even noon yet.
All while having the tone of an over caffeinated intern who has only ever read medium articles.
You can't fix it.
Humans ask questions of groups to fix our own personal shortcomings. It makes no sense to try and master an internal system I rarely use; I should instead ask someone who maintains it. AI will not have this problem provided we create paths of observability for them. It doesn't take a lot of "effort" for them to completely digest an alien system they need to use.
I do not believe that AI will magically overcome the Chesterton Fence problem in a 100% autonomous way.
What is really needed is a tree of problems which appear identical at first glance, but where the issue and the solution are one of many possibilities that can only be revealed by finding what information is lacking, acquiring that information, testing the hypothesis and, if the hypothesis is shown to be correct, finally implementing the solution.
That's a much more difficult training set to construct.
The editing issue, I feel needs something more radical. Instead of the current methods of text manipulation, I think there is scope to have a kind of output position encoding for a model to emit data in a non-sequential order. Again this presents another training data problem, there are limited natural sources to work from showing programming in the order a programmer types it. On the other hand I think it should be possible to do synthetic training examples by translating existing model outputs that emit patches, search/replaces, regex mods etc. and translate those to a format that directly encodes the final position of the desired text.
At some stage I'd like to see if it's possible to construct the models current idea of what the code is purely by scanning a list of cached head_embeddings of any tokens that turned into code. I feel like there should be enough information given the order of emission and the embeddings themselves to reconstruct a piecemeal generated program.
On the other hand, teaching the model to be unsure and ask questions, requires the training loop to break and bring a human input in, which appears more difficult to scale.
The ironic thing to me is that the one thing they never seem to be willing to skip asking about is whether they should proceed with some fix that I just helped them identify. They seem extremely reluctant to actually ask about things they don't know about, but extremely eager to ask about whether they should do the things they already have decided they think are right!
I often wish that instead of just starting to work on the code, automatically, even if you hit enter / send by accident, the models would rather ask for clarification. The models assume a lot, and will just spit out code first.
I guess this is somewhat to lower the threshold for non-programmers, and to instantly give some answer, but it does waste a lot of resources - I think.
Others have mentioned that you can fix all this by providing a guide to the model, telling it how it should interact with you and what the answers should look like. But still, it'd be nice to have it a bit more human-like in this respect.
You should either already know the answer or have a way to verify the answer. If neither, the matter must be inconsequential, like simple childlike curiosity. For example, I wonder how many moons Jupiter has... It could be 58, it could be 85, but either answer won't alter anything I do today.
I suspect some people (who need to read the full report) dump thousand page long reports into LLM, read the first ten words of the response and pretend they know what the report says and that is scary.
For those curious, the answer is 97.
https://en.wikipedia.org/wiki/Moons_of_Jupiter
Fortunately, as devs, this is our main loop: write code, test, debug. And it's why people who fear AI-generated code making its way into production and causing errors make me laugh. Are you not testing your code? Or even debugging it? Like, what process are you using that prevents bugs happening? Guess what? It's the exact same process with AI-generated code.
Also, if you want it to pause and ask questions, you need to offer that through tools (Manus does that, for example). I have an MCP tool that does this and, surprisingly, I got a lot of questions; if you prompt for it, it will ask. But the push currently is for full automation, and that's why it's not there. We are far better off in a supervised, step-by-step mode. There is elicitation already in MCP, but having a tool that asks questions requires a UI that lets you send the input back.
The second issue is that LLMs do not learn much of the high-level contextual relationships in knowledge. This can be improved by introducing more such patterns in the training data, and current LLM training is doing much of this. I don't think it will still be a problem in the next few years.
For copy-paste, you make it sound like low-hanging fruit. Why don't AI agents have copy/paste tools?
—
Just because this new contributor is forced to effectively "SSH" into your codebase and edit not even with vim but with sed and awk does not mean that this contributor is incapable of using other tools if empowered to do so. The fact that it is able to work within such constraints goes to show how much potential there is. It is already much better than a human at erasing the text and re-typing it from memory, and while it is a valid criticism that it needs to be taught how to move files, imagine what it is capable of once it starts to use tools effectively.
—
Recently, I observed LLMs flail around for hours trying to get our e2e tests running as it tried to coordinate three different processes in three different terminals. It kept running commands in one terminal try to kill or check if the port is being used in the other terminal.
However, once I prompted the LLM to create a script for running all three processes concurrently, it was able to create that script, leverage it, and autonomously debug the tests now way faster than I am able to. It has also saved any new human who tries to contribute from similar hours of flailing around. It's something we could have easily done by hand but just never had the time to do before LLMs. If anything, the LLM is just highlighting an existing problem in our codebase that some of us got too used to.
So yes, LLMs make stupid mistakes, but so do humans; the thing is that LLMs can identify and fix them faster (and better, with proper steering).
Even something as simple as renaming a variable is often safer and easier when done through the editor’s language server integration.
Not in my experience. And it's not "overengineering" your prompt, it's just writing your prompt.
For anything serious, I always end every relevant request with an instruction to repeat back to me the full design of my instructions or ask me necessary clarifying questions first if I've left anything unclear, before writing any code. It always does.
And I don't mind having to write that, because sometimes I don't want that. I just want to ask it for a quick script and assume it can fill in the gaps because that's faster.
ChatGPT proposed a few ideas, all apparently reasonable, and then it advocated for one that was presented unambiguously as the "best". After a few iterations, I realized that its solution would have required a class hierarchy where the base class contained a templated virtual function, which is not allowed in C++. I pointed this out to ChatGPT and asked it to rethink the solution; it then immediately advocated for the other approach it had initially suggested.
I hear from my clients (but have not verified myself!) that LLMs perform much better with a series of tiny, atomic changes like Replace Magic Literal, Pull Up Field, and Combine Functions Into Transform.
[1] https://martinfowler.com/books/refactoring.html [2] https://martinfowler.com/bliki/OpportunisticRefactoring.html [3] https://refactoring.com/catalog/
Why do this to yourself? Do you get paid more if you work faster?
Strongly disagree that they're terrible at asking questions.
They're terrible at asking questions unless you ask them to... at which point they ask good, sometimes fantastic questions.
All my major prompts now have some sort of "IMPORTANT: before you begin you must ask X clarifying questions. Ask them one at a time, then reevaluate the next question based on the response"
X is typically 2–5, which I find DRASTICALLY improves output.
> LLMs don’t copy-paste (or cut and paste) code.
The article is confusing the architectural layers of AI coding agents. It's easy to add "cut/copy/paste" tools to the AI system if that shows improvement. This has nothing to do with LLM, it's in the layer on top.
> Good human developers always pause to ask before making big changes or when they’re unsure [LLMs] keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.
Agreed - LLMs don't know how to back track. The recent (past year) improvements in thinking/reasoning do improve in this regard (it's the whole "but wait..." RL training that exploded with OpenAI o1/o3 and DeepSeek R1, now done by everyone), but clearly there's still work to do.
I’d like to see what happens with better refactoring tools; I’d make a bunch more mistakes too if I had to copy and retype or use awk. If they want to rename something they should be able to use the same tooling the rest of us get.
Asking questions is a good point, but that's partly a matter of prompting, and I think the move to having more parallel work makes it less relevant. One of the reasons clarifying things more upfront is useful is that we take a lot of time and cost a lot of money to build things, so the economics favours getting it right first time. As the time comes down and the cost drops to near zero, the balance changes.
There are also other approaches to clarify more what you want and how to do it first, breaking that down into tasks, then letting it run with those (spec kit). This is an interesting area.
Over time, they usually get what they want: they become the smartest ones left in the room, because all the good people have already moved on. What’s left behind is a codebase no one wants to work on, and you can’t hire for it either.
But maybe I’ve just worked with the wrong teams.
EDIT: Maybe this is just about trust. If you can’t bring yourself to trust code written by other human beings, whether it’s a package, a library, or even your own teammates, then of course you’re not going to trust code from an LLM. But that’s not really about quality, it’s about control. And the irony is that people who insist on controlling every last detail usually end up with fragile systems nobody else wants to touch, and teams nobody else wants to join.
Often I find myself cursing at the LLM for not understanding what I mean - which is expensive in lost time / cost of tokens.
It is easy to say: then just don't use LLMs. But in reality, it is not that easy to break out of these loops of explaining, and it is extremely hard to assess when not to trust that the LLM will be able to finish the task.
I also find that LLMs consistently don't follow guidelines. Eg. to never use coercions in TypeScript (It always gets in a rogue `as` somewhere) - to which I can not trust the output and needs to be extra vigilant reviewing.
I use LLMs for what they are good at. Sketching up a page in React/Tailwind, sketching up a small test suite - everything that can be deemed a translation task.
I don't use LLMs for tasks that are reasoning heavy: Data modelling, architecture, large complex refactors - things that require deep domain knowledge and reasoning.
Me too. But in all these cases, sooner or later, I realized I made a mistake not giving enough context and not building up the discussion carefully enough. And I was just rushing to the solution. In the agile world, one could say I gave the LLM not a well-defined story, but a one-liner. Who is to blame here?
I still remember training a junior hire who started off with:
“Sorry, I spent five days on this ticket. I thought it would only take two. Also, who’s going to do the QA?”
After 6 months or so, the same person was saying:
“I finished the project in three weeks. I estimated four. QA is done. Ready to go live.”
At that point, he was confident enough to own his work end-to-end, even shipping to production without someone else reviewing it. Interestingly, this colleague left two years ago, and I had to take over his codebase. It’s still running fine today, and I’ve spent maybe a single day maintaining it in the last two years.
Recently, I was talking with my manager about this. We agreed that building confidence and self-checking in a junior dev is very similar to how you need to work with LLMs.
Personally, whenever I generate code with an LLM, I check every line before committing. I still don’t trust it as much as the people I trained.
LLMs provide little of that; they make people lazy, and juniors stay juniors forever, even degrading mentally in some aspects. People need struggle to grow; when you have somebody who has had their hand held their whole life, they end up a useless human disconnected from reality, unable to self-sufficiently achieve anything significant. Too easy a life destroys both humans and animals alike (many experiments have been done on that, with damning results).
There is much more, like hallucinations and the questionable added value of stuff that confidently looks OK but has underlying hard-to-debug bugs, but the above should be enough for a start.
I suggest actually reading those conversations, not just skimming through them, this has been stated countless times.
Many agents break down not because the code is too complex, but because invisible, “boring” infrastructure details trip them up. Human developers subconsciously navigate these pitfalls using tribal memory and accumulated hacks, but agents bluff through them until confronted by an edge case. This is why even trivial tasks intermittently fail with automation agents: you’re fighting not logic errors, but mismatches with the real lived context. Upgrading this context-awareness would be a genuine step change.
If an llm can't do sys admin stuff reliably, why do we think it can write quality code?
"Hey it wasn't what you asked me to do but I went ahead and refactored this whole area over here while simultaneously screwing up the business logic because I have no comprehension of how users use the tool". "Um, ok but did you change the way notifications work like I asked". "Yes." "Notifications don't work anymore". "I'll get right on it".
I can run this experiment using ToolKami[0] framework if there is enough interest or if someone can give some insights.
[0]: https://github.com/aperoc/toolkami
why overengineer? it's super simple
I just do this for 60% of my prompts: "{long description of the feature}, please ask 10 questions before writing any code"
Then you - and your agent - can refactor fearlessly.
LLMs are especially tricky because they do appear to work magic on a small greenfield, and the majority of people are doing clown-engineering.
But I think some people are underestimating what can be done in larger projects if you do everything right (eg docs, tests, comments, tools) and take time to plan.
Not if they're instructed to. In my experience you can adjust the prompt to make them ask questions. They ask very good questions actually!
So there's hope.
But often they just delete and recreate the file, indeed.
Maybe some of those character.ai models are sassy enough to have stronger opinions on code?
LLMs also have trouble figuring out that a task is impossible. I wanted boilerplate code that rendered a mesh in Three.js using GL_TRIANGLE_STRIP because I was writing a custom shader and needed to experiment with the math. But Three.js doesn't support GL_TRIANGLE_STRIP rendering, for architectural reasons. Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling me about this, and I had to Google the problem myself.
It feels like current Coding LLMs are good at replacing junior engineers when it comes to shallow but broad tasks like creating UIs, modifying examples available on the web, etc. But they fail at senior-level tasks like realizing that the requirements being asked of them aren't valid and doing something that no one has done in their corpus of training data.
Typo or trolling the next LLM to index HN comments?
Oh, sorry. You already said that. :D
The second point is easily handled with proper instructions. My AI agents always ask questions about points I haven't clarified, or when they come across a fork in the road. Frequently I'll say "do X" and it'll proceed, then halfway it will stop and say "I did some of this, but before I do the rest, you need to decide what to do about such and such". So it's a complete non-problem for me.
I don't agree with that. When I am telling Claude Code to plan something, I also mention that it should ask questions when information is missing. The questions it comes up with are really good, sometimes about cases I simply didn't see. To me the planning discussion doesn't feel much different than in a GitLab thread, only at a much higher iteration speed.
First it gets an error because bash doesn’t understand \
Then it gets an error because /b doesn’t work
And as LLMs don’t learn from their mistakes, it always spends at least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b )) before it figures out how to list files
If it was an actual coworker, we’d send it off to HR
I am guessing this is because:
1. Most of the training material online references Unix commands.
2. Most Windows devs are used to GUIs for development, using Visual Studio etc. GUIs are not as easy to train on.
Side note: Interesting thing I have noticed in my own org is that devs with Windows background strictly use GUIs for git. The rest are comfortable with using git from the command line.
It's only when you take the tech out of the area it's good at and start trying to get it to "write code" or even worse "be an agent" that it starts cracking up and emitting garbage; this is only done because companies want to forcememe some kind of product besides "chatbot", whether or not it makes sense. It's a shame because it'll happily and effectively write the docs that don't exist but you wish did for more or less anything. (Writing code examples for docs is not a weak point at all.)
_Did you ask it to ask questions?_
This is because LLMs trend towards the centre of the human cognitive bell curve in most things, and a LOT of humans use this same problem solving approach.