Do you think we will recognize any walls? Or is there a point where the output might look different with respect to different paradigms / modalities we throw at it, but we won't be able to understand the quantitative differences as good/bad/scalable?
> > The optimizer is usually smarter than you think.
> Except for when it isn't, and moves heavy calculation inside a nested loop inside a nested loop to avoid an index scan. Nothing is perfect.
Yeah, that's also been my experience. It's true that Postgres is usually smarter than I think: when I try to figure out why it's not using a better query plan, I eventually find out that it wouldn't be better at all. But from time to time it genuinely is making a bad decision, and having no power over that at all is a problem.
> projects I simply could not have ever approached alone.
I think that's part of the divide between enthusiasts and naysayers. If you use GenAI on things that you couldn't approach alone, it's an incredible tool. If you use it on stuff that you're pretty good at, it's not a gamechanger (and if you're an expert, it's a minor boost at best). Many people's jobs are about doing what they're an expert at.
I'm about to complete a new non-trivial functionality in a project for a customer of mine. I spent an hour writing the spec. Then I asked Claude (Sonnet 4.6) to check if I missed something. I had: the sort of minor issues one notices after starting to write code, edge cases, etc. That made me think about more issues, and after a few iterations we settled on a spec. I asked Claude to make an implementation plan and we ended up with 9 steps. It wrote the code for a step with new automated tests and I performed some manual QA, which found further issues we hadn't thought about. We are at step 8 of 9 after about 12 hours of work. I would have needed a week to get there alone, with time spent researching and fixing bugs I created along the way, an inevitable part of our job but not exactly the most pleasant one.
This speedup is great. It improves the overall quality of the product (as perceived by the users) because I can ask Claude to add features that my customers and I would have dismissed because they take too long to implement. We would have settled for a more basic UX.
So is it a game changer? It is, in the same way that HTML / CSS frameworks like Bootstrap were game changers: suddenly every developer could create a decent and consistent UI in a fraction of the time, with a few bells and whistles that we wouldn't have bothered coding. As a side effect a lot of web apps felt like look-alike mass-produced products and web designers had to reinvent themselves, but the economics led inevitably in that direction. Would I again spend one or two weeks doing alone what I could write in a day or two with an LLM? Not anymore, not at this cost ($20 per month).
I'd love to read a full transcript of someone going through this kind of collaborative programming. I see this kind of process mentioned a lot but can't quite figure out the details in my head.
If anyone has a link to a blog post or similar showing this process in depth, I'd love to give it a read :)
I've been using superpowers [1] for this purpose, and have really appreciated how it guides the model to use careful, methodical approaches to answering my prompts. It's great for multi-step planning, design, and implementation, but also has guidance for debugging, accepting a code review, etc.
I think it will click once you actually sit down with the AI agent, toggle Plan mode, and just tell it what you want to do in a couple of sentences. It will immediately start building up a plan, presenting what it thinks is the right approach, with the steps to take and open questions for you to look at and answer. Then send your answers back to the AI. Repeat. That process alone will get you much further than trying to do it all by yourself.
Then you can tell it to start implementing step 1, and you pick it up from there. It's very natural, like how you would approach an expert for help, but you can always audit.
I can't provide a transcript because it's work I did for a customer and I'm bound by a confidentiality clause.
What I did is what I usually do when starting to work on a major feature: make a list of changes and of new and modified functionality, think about which code and db tables I will touch and how, set constraints on the edits (e.g. this API must not change, that one must be backward compatible), etc. I was a bit more pedantic than usual because this time I had to explain it to someone else. I wrote it into an md file and asked Claude to check the code and find out if my plan was consistent with the code we were starting from. It made a list of things that I needed to detail more, added some questions, and we iterated on it. Basically it's what I do myself, but it happened faster.
You describe almost exactly how I work, except I always use Opus with effort locked at max. Lots of detailed multi-level planning, then coding the different planned steps, which at that point it just one-shots, with a plan review and adaptation after each step.
> If you use it on stuff that you’re pretty good at, it’s not a gamechanger (and if you’re an expert, it’s a minor boost at best).
This was probably true last year, and it’s a common talking point, but I’ve seen too many examples now of deep experts using Claude & Codex in the last year to solve very big problems, and write or rewrite large systems. The experts do complain that the LLMs can sometimes get stuck or go off the rails and they need to pay attention and actively steer. But nobody I know who’s using it is still claiming the LLMs aren’t a game changer, even quite a few people who were staunch holdouts for a long time. I was skeptical myself, for a long time, but had my oh shit moment late last year.
One caveat - to get expert results, you do need to have some experience using LLMs, you need to use it to write plans and design docs, know how to use ‘skills’ and MCPs, use it to review code, and (for now) you need to understand context compaction and when/why to use sub-agents. If you’re a domain expert but an AI noob, it’s less effective than an expert who knows how to use AI and has experience.
One of the biggest problems with humans is we’re wired to spot patterns and draw conclusions and then we have a really hard time seeing and accepting change and updating our mental rules. The LLMs are getting better. They have already gotten better, and they’re going to continue getting better. It’s too early to draw conclusions, and many conclusions people have already declared are out of date and no longer true.
I think part of it is we often notice bad AI usage. The LLM-generated "art" by someone with bad taste, or the terrible patches to open source projects by people who can't program at all.
If the use is half decent people just don't notice it.
Anti-AI zealots (those arguing from practical usability, not necessarily the moral ones) are like the people who looked at The Daily WTF and decided no humans are capable of programming. They have plenty of examples to point at, but refuse to look at decent to great programmers. The stories of "The AI deleted my database!" are prevalent and boosted by these folks because they confirm their biases. It literally doesn't matter if the LLM wrote strong warnings about the action about to be taken. They don't see that aspect of it. Just the fact that someone claims "The AI deleted my database!" is enough for them.
Despite all the liars telling me gaming is easier on Linux than Windows, most new games have some sort of issues launching with default settings. CC is able to dive into both the exact error logs and the recent community feedback on what tweaks / configurations are needed to make it work. I rarely have to go beyond two prompts before a game is playable. CC and Proton are enabling the Linux gaming experience far more than Linus ever has or ever was interested in.
> Despite all the liars telling me gaming is easier on Linux than Windows, most new games have some sort of issues launching with default settings. CC is able to dive into both the exact error logs and the recent community feedback on what tweaks / configurations are needed to make it work. I rarely have to go beyond two prompts before a game is playable. CC and Proton are enabling the Linux gaming experience far more than Linus ever has or ever was interested in.
Heh - I've just gone through a similar journey transitioning from Windows to Bazzite to play Steam games on Linux. I wouldn't have bothered pre-LLMs because my day job is Linux/Software and the thought of trying to fix issues here just to play games put me off.
Imo it's still great for areas you have expertise in, because it's a tool for automating the boring, repetitive, or time-consuming bits that you can then expert-verify.
I'd rather review & tweak generated test cases than write a load of boilerplate, test setup, etc. myself.
If you work on architecture and Claude docs, then you can essentially just have it fill in the gaps. Work then mostly becomes a matter of defining what the next piece of functionality is (which you can also use Claude to help with).
The stuff that used to take days now takes hours. It's not perfect, but if you get your codebase into a good shape then the payoff is huge.
> If you use GenAI on things that you couldn't approach alone, it's an incredible tool.
I think this isn't true in all cases.
> If you use it on stuff that you're pretty good at, it's not a gamechanger (and if you're an expert, it's a minor boost at best).
I think even then there's a divide.
I mostly work on greenfield projects (and love it!). For these, AI has been a literal game changer. Our projects are built faster, with one or two orders of magnitude more automated tests, and all quality metrics are up.
Meanwhile, nearly all of my friends complain that AI doesn't help them. But they mostly work in very large existing codebases.
Still, even in large projects I think AI (the expensive variant) has been a complete gamechanger for me. Sure, I spend a lot on tokens, but I just feel happier and enjoy what I do more. The refrain people repeat about "thinking at a higher abstraction level" is what I feel. I really am thinking about architecture and larger patterns, instead of the boring nitty-gritty (which wasn't boring at all when I was a kid learning to code!...)
I think a key factor in all of this, to me, has been dictation. Most of the time, I don't write -- I use voice-to-text. I don't even read what comes out of it -- the LLMs get it (it is mostly unintelligible to anyone else).
This means when I'm planning a big feature, I give a gigantic brain dump to the LLM in a perfect stream-of-consciousness way, going through ideas, pros and cons, edge cases, what exists, what doesn't exist, where I'm sure of something, where I'm not sure and want the LLM to browse the state of the art. Sometimes I spend 20 minutes just talking to the microphone before I send the first prompt. When I pair that with Opus, I find that I am able to build much faster and to go through alternative designs much more frequently as well.
I keep trying to tell all my friends: use voice to text and braindump to the computer. But they refuse... I couldn't imagine having to type everything nowadays. Even though I'm a fast typist, it's still much slower than the speed of my thought, which, granted, is still faster than the speed of my voice.
In effect, I filter much less, but I've come to think that's positive for the good LLMs: I throw all the edge cases and what ifs I'm thinking about -- all those years of experience dealing with similar systems.
If I wanted to go back to work in-office, that would be my major problem: I need to be able to talk with my computer all the time, loudly, and pacing through my room.
Yay for dictation! It's so nice to just think aloud and then have an easily editable record of your thoughts, even when you aren't feeding the outputs to LLMs.
How do you use voice-to-text? You mean, in the browser?
I am only familiar with Claude Code, which I have installed on a remote server, and there, obviously, voice-to-text does not work. I have to type, which is tiring.
There are many tools for this, and I use the one that I tried first, so there are probably better-suited alternatives out there.
I run MacWhisper, and I paired it with BetterTouchTool so it triggers on any input when I double tap the fn/globe icon.
Obviously all of my transcriptions through it are entirely local. I usually use the Large V3 Turbo model, though in the beginning I used Parakeet v3, which was slightly faster but produced more mistakes (and kept a lot of filler words -- 'ahhm', 'hummm').
However, if I'm interacting with the Claude or ChatGPT/Codex apps, I often use their voice recognition instead, because it tends to be more accurate, especially with punctuation, albeit significantly slower. OpenAI's is noticeably better than Anthropic's, but I feel like that gap has closed a bit recently (might be all in my head, though).
Like I said I don't really care about mistakes in the transcription. If you try to read it, it feels like a fever dream, but the LLMs get it.
If I say "taken" it may have "take and"
If I say "all the while calling the method" it might have "although a while. while. call in the met of". This is a rather extreme example but I've seen it happen. The repetition of words happens because I'm talking with "hums and ahs" and do repeat words or just the ends of words. It's very rare for the models, especially Opus, to have any issue with this transcription. When they do, they tend to signal to me they didn't get it, or I catch them in the act. But, like I said, it really is very very rare.
As an example, I've got quite a significant feature to work on, which would have probably taken me weeks to design and implement, and I've used this exact method today to ink out the plan:
- I have spent the last couple of days researching the feature in my off-time and just "thinking about it in the background" (think: I fall asleep thinking of it -- a habit I've always had)
- I spent ~25 minutes brainstorming out loud. The transcript ended up at ~17,000 characters and ~3,000 words.
- I sent that transcript, in Cursor, to Opus 4.6-High with instructions on how to iterate on it and how I want to work while planning
- I then spent about 1.5 hours with it iterating and building the actual plan (and supporting technical decision document, which points at the FULL transcript of the whole interaction). Many of my original ideas made it to the final plan, others got scrapped or simplified, and others still got added. It contains a mixture of my ideas, Opus' ideas and our push-back on "each other".
- Now I have a multi-step plan, with at least 8 distinct stages to implement this massive feature which I know for a fact would have taken me weeks to implement, and I expect to implement it in at most 3 days, but very likely it will be a day and a half.
Final context (with regard to your Claude Code question): My main development environment is Cursor, though for personal projects I also use Codex and Claude Code. For the initial "researching of the feature in my off-time" I often have interactions with ChatGPT and Claude where they have no access to the codebase, and I have them go find out what the state of the art on specific topics is. All of these interactions also involve me using my voice to talk to them (though nowadays I don't typically use their voice mode, I just let them reply in text). Then I brood over that.
This is exactly my workflow and it’s just incredible. I use aqua and wispr flow depending on which one seems to be returning the best results that day.
I'm an expert at datalakes. I manage them for my company. I also am proficient at backend web. Even still I use Gen AI frequently to manage it all. When my company downsized I kept the lights on. Not enough bandwidth to do more. I've since materially improved the system and am doing things we never did even with a team of 2 or 3.
Outside my expertise I've begun writing static recompilers for old retro game systems and have gotten some games off the ground. I understand WHAT they're doing but I never had the expertise to do such things myself. Even if I did I could never operate at the velocity I am now.
Using GenAI on things that you couldn't approach is also extremely scary and dangerous in my opinion.
For example, I would never in a million years use generated code I don't understand (fully) to interface and possibly interact with a physical object that can fail, catch fire, break etc. in case of a bug or misuse, like OP mentioned.
The dangerous thing is when you’re a novice and can’t identify the BS. That’s why for people with “good” and “expert” skill, it’s not a huge boost. They can identify the BS, and what’s left is modestly helpful.
The highest danger in using AI falls precisely on the people who stand to gain the most from it.
Exactly that. Novices don’t notice the BS. But they see the output and it looks magical. The UI is working! Hardly any time to code that in.
Then they send that PR for a review by a more senior person. And that senior person doesn’t even know where to start on how to explain why it’s all wrong and likely to collapse in prod.
Tons of good use of AI. But tons of bad use of it. And when it’s bad and people don’t notice it, that gets dangerous. So because of that, now we spend a lot more time doing reviews. Essentially creating a new bottleneck.
I don't know if that'll make you feel any better but yeah, you're indeed asking for the impossible! You need consensus between your nodes that store state _somehow_, either these nodes are Redis and it does that for you, or these nodes are your pods and you need to do consensus yourself (zookeeper might help, but you're definitely in "complicated stuff" territory).
Spinning up an in-memory (no persistence) Redis cluster in your k8s should be easy enough, hopefully?
And yes, adding a Redis cluster is fine, it is just another moving part to manage. But given that the alternative is made out of unobtainium, I guess that is just the way of it :-)
> If I flag every line in your PR as a potential security bug then I have 100% recall.
No. A code review isn't about "flagging a line of code", it's about identifying an issue or a risk. If a 10-line PR has one issue and you leave a comment on every single character, if you still miss the issue you have 0% recall.
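To make the arithmetic concrete: recall is the share of real issues that the review actually identified, so comments that point at nothing real do not raise it. A minimal sketch in Python, where the issue names and the `recall` helper are made up for illustration:

```python
def recall(real_issues: set[str], identified: set[str]) -> float:
    """Fraction of the real issues that the review actually identified."""
    if not real_issues:
        raise ValueError("recall is undefined when there are no real issues")
    return len(real_issues & identified) / len(real_issues)


# Hypothetical 10-line PR with exactly one real issue.
real = {"sql-injection-in-query-builder"}

# A reviewer who comments on every line but never names the actual problem
# has identified zero of the real issues.
noisy_review = {f"nitpick-on-line-{n}" for n in range(1, 11)}

# A reviewer who leaves a single comment that names the problem.
focused_review = {"sql-injection-in-query-builder"}

print(recall(real, noisy_review))    # 0.0
print(recall(real, focused_review))  # 1.0
```

Ten comments or ten thousand, the noisy review still scores zero because the numerator only counts real issues that were found.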
> It’s pretty rare that I run into a bug in production that a type system would have caught.
Wow, how different our experiences are. In Javascript/Typescript land, so so many bugs are null/undefined-related and really should have been caught at type level.
In fact, I'd say (without actually measuring it) that _most_ bugs I've run into in Typescript are due to someone having bypassed the typing (casting, ts-ignore...), or a type mismatch at an IO boundary.
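The IO-boundary case is not specific to Typescript. Here is a rough sketch of the same failure mode in Python (the `User` shape and the payload are invented for illustration): the annotation claims a `str`, but nothing checks the parsed JSON unless it is validated explicitly at the boundary.

```python
import json
from dataclasses import dataclass


@dataclass
class User:
    name: str
    email: str  # the annotation promises a string, but nothing enforces it at runtime


def parse_user_unchecked(raw: str) -> User:
    # Comparable to a cast: we ask the type checker to trust the payload.
    data = json.loads(raw)
    return User(name=data["name"], email=data.get("email"))


def parse_user_checked(raw: str) -> User:
    # Validate at the boundary so bad data fails here, not deep in the program.
    data = json.loads(raw)
    name, email = data.get("name"), data.get("email")
    if not isinstance(name, str) or not isinstance(email, str):
        raise ValueError(f"invalid user payload: {data!r}")
    return User(name=name, email=email)


payload = '{"name": "Ada", "email": null}'  # hypothetical response from some API

user = parse_user_unchecked(payload)
print(user.email is None)  # True: a None is now hiding behind a `str` annotation

try:
    parse_user_checked(payload)
except ValueError as err:
    print("rejected at the boundary:", err)
```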
Anecdotally, it is very much different in Elixir land. I occasionally see bugs related to something being unexpectedly `nil` but it's pretty rare IME.
I'd love to evidence what I'm saying with specific numbers since this kind of discussion would benefit from being as objective as possible. Sadly I don't have them. But I still believe what I'm saying and I have a few guesses about some of the causes:
1. Immutable data - so, so many bugs are caused by data mutating out from under you in subtle ways. If you write `x = 1` in your Elixir function, nothing can change the value of `x` except an explicit rebinding. You can then write e.g. `y = f(x)` and know `x` remains unchanged after. Note: this is also true even if the variable is a composite type. `my_struct = blah()` will remain the same in its entirety no matter what you do with `my_struct`. This is different than in JS where e.g. you can change the contents of an object even if it's declared `const`.
2. Assertive style - the Elixir community favors writing things in an "assertive" fashion [1]. Briefly, this is a way of writing code that will fail the moment an assumption is broken rather than letting the issue propagate.
3. Pattern matching (somewhat like destructuring in JS) - Elixir code actually ends up feeling "typed" with pattern matching. E.g. `%Time{} = today = Date.utc_today()` will attempt to bind `today` to the result of `Date.utc_today()` and will raise a `MatchError` when the result, a `%Date{}` struct, fails to be a `%Time{}` struct. Or `[a, b] = [1, 2, 3]` will raise a `MatchError` because `[1, 2, 3]` isn't a list of length exactly 2. You can use pattern matching to write very assertive code quite tersely.
These reasons are all local properties of code. But when all its parts are written in this way, a program as a whole gains a level of correctness that's hard to achieve in a dynamically typed language without them.
Also these reasons aren't exhaustive, but they're top of mind when thinking about this topic.
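For readers who don't write Elixir, here is a loose Python analogue of the assertive style and of the `[a, b] = [1, 2, 3]` example. It is only an approximation: Python's unpacking and `isinstance` checks are far less expressive than Elixir pattern matching, and the `require_time` helper is hypothetical.

```python
from datetime import date, time


def require_time(value: object) -> time:
    # Assertive style: fail the moment the assumption is broken,
    # instead of letting a wrong value travel further.
    if not isinstance(value, time):
        raise TypeError(f"expected datetime.time, got {type(value).__name__}")
    return value


# Unpacking asserts the shape: exactly two elements, otherwise ValueError.
try:
    a, b = [1, 2, 3]
except ValueError as err:
    print("shape mismatch:", err)

# Roughly like `%Time{} = today = Date.utc_today()`: a date is not a time.
try:
    today = require_time(date.today())
except TypeError as err:
    print("type mismatch:", err)
```

In both cases the failure happens on the line where the assumption is stated, which is the property the list above is describing.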
Not all dynamically typed systems are equal. Just like not all statically typed systems are equal. Python and Javascript are a mess. But languages like Elixir aren't just Java without types.
javascript is like… unusually messy and weird, so maybe that colours most people's perspective. you don't have to worry about type coercion and weird kinds of equality and so on in python and ruby to anywhere near the same extent.
> people used to write server software in compiled languages feel the need for them because any runtime bug means downtime
I keep hearing that but I don't think it's been true in many years? Whether it's Go, Java, C#, Rust... a runtime bug will only fail the request, not the whole server.
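A toy illustration of why that is: servers typically wrap each request handler so that an unhandled error becomes an error response for that one request. The sketch below is plain Python with made-up handler names, not any particular framework's API.

```python
from typing import Callable

Handler = Callable[[str], str]


def buggy_handler(path: str) -> str:
    # A runtime bug: dividing by zero on every call.
    return str(1 // 0)


def healthy_handler(path: str) -> str:
    return f"ok: {path}"


def dispatch(handler: Handler, path: str) -> tuple[int, str]:
    """Run one request; a crash in the handler only fails this request."""
    try:
        return 200, handler(path)
    except Exception as err:  # broad on purpose: isolate any handler bug
        return 500, f"internal error: {type(err).__name__}"


# The "server" keeps serving after the buggy request.
print(dispatch(buggy_handler, "/report"))    # (500, 'internal error: ZeroDivisionError')
print(dispatch(healthy_handler, "/health"))  # (200, 'ok: /health')
```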
FWIW, the main reason I like types isn't for the compile-time guarantees (although they're certainly nice). It's for documenting what are the data types I'm working with rather than having to guess them from the code, it's for knowing that something is a square hole therefore I should put a square piece in.
That’s my top issue with Clojure: I see what the function does, but is it expecting a list, a string, either, or a map? The function may apply correctly, but what was it supposed to do? Java may be boring, but it’s surprise-free.
In Elixir this is less of an issue because of pattern matching and very clear errors showing the actual arguments passed, which are unbeatable for debugging - you look at the log and can “see” the issue.
yeah, your overall point stands. Sometimes you can get a bit mixed up on "wait, does this take a File object or a string with the filename?". I guess my point was that because you program to interfaces this happens a bit less often than one would expect. If it can take a vector it can usually also take a list
Same. Alternatively (or in addition), I sometimes present my preferred idea as being a "bad/naive/stupid option" (or a suggestion from someone who can't be trusted) to see how it holds up when sycophancy pushes the LLM to agree that it's bad. As expected the LLM will usually say "yeah it's bad!" and give plausible-sounding reasons for it, but if these reasons are nonsensical it's a good sign that I'm not missing anything.