I guess I'll play devil's advocate here, don't shoot me.
Over the course of my career I've had to deal with multiple hacks, DDOSes, and even situations working with the FBI. It's a mess, and extremely frustrating and unfair to those of us who are just trying to do a good job and make a living. Those of you who are throwing stones at Microsoft's coding, how confident are you that your code is safe from this new AI age?
Obviously MS handled this poorly, even after reading this article it's not clear how MS handles bug bounties. But that doesn’t mean this “researcher” deserves a pass.
Releasing 0-days, especially working exploit code for unpatched vulnerabilities, is extremely unethical. It has real potential to cause a lot of harm to regular engineers, and users who had nothing to do with the dispute.
I don't think it's their fault for not making code without exploits. I do think they should try and close them in a timely fashion when the exploit is pointed out though - the longer they wait the more chance bad actors find it in addition to the security researchers. Ultimately they need to cooperate here for users to be safe.
> I do think they should try and close them in a timely fashion when the exploit is pointed out though - the longer they wait the more chance bad actors find it in addition to the security researchers.
You are assuming it is not already being actively exploited and there will be a timely response to fix it, which is why we have these ticking clocks.
They should also be fully transparent, rather than silently patching and only issuing a CVE weeks later after being called out, as they did with RedSun, from this same researcher.
That Microsoft releases vulnerable software isn't the issue (that's a known quantity at this point), it's their lack of transparency and refusal to hold themselves accountable.
Taking it from "only a small group of people/companies exploit it" to public knowledge is how you get it fixed. In this case it seems that was the only way left after Microsoft refused to cooperate. What counts are the results: these are fixed now.
the new bottleneck for development at work is code reviews. devs are creating whole features that would take months in only a couple weeks, but code reviewing that is a slow, painful process
This is why I'm not that excited about vibe coding. The bottleneck has always been understanding what the heck is going on.
In my view you should 1) use AI as a tool to help you learn and 2) write boilerplate you could have easily written yourself. Getting it to think for you is counterproductive (at least until it replaces us entirely).
The low latency is more of a pain point than a good thing, the way they have it implemented. When you try to have a casual conversation with it, you naturally pause, as humans do, and GPT takes this as you being "done" and starts blabbing away.
I also struggle to find the appropriate word I want as I've gotten older and slower, and this fast-voice-gpt just ends up frustrating me more than helping. I have to sit there and think out the whole sentence in my head before I say anything -- not very natural.
I think these are 2 different layers of "latency". The latency in the article is referring to the transport of the audio stream itself while the latency in your scenario is about how quickly to start responding inside the audio stream.
Suppose you have 100ms audio latency and no wait time. Then, natural pause will trigger response immediately but you won't notice it has started until after ~200ms (round-trip time). Twice as annoying.
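To make that arithmetic concrete, here is a rough Python sketch. The function name and the simplification (model processing time is ignored, and latency is assumed symmetric in both directions) are mine, not anything from the article.

```python
def ms_until_reply_is_heard(one_way_latency_ms: float, silence_wait_ms: float) -> float:
    """Time from the moment the speaker pauses until they hear the reply begin.

    Simplified model: symmetric network latency, zero model processing time.
    """
    uplink = one_way_latency_ms    # the pause has to reach the server
    downlink = one_way_latency_ms  # the reply audio has to travel back
    return uplink + silence_wait_ms + downlink


# 100 ms each way with no wait time: the reply is audible one round trip later.
print(ms_until_reply_is_heard(100, 0))
# The same link with a 500 ms silence wait on the server side.
print(ms_until_reply_is_heard(100, 500))
```

The two knobs are independent: the silence wait decides whether a pause counts as "done", and the network latency only decides how late you find out.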
I think he’s saying they are doing an insane level of complexity to shave ~100ms off response times in a scenario where that isn’t important and might even be a negative
When GP mentioned reducing conversational latency as a negative that made sense (and should probably be done IMO), it just wasn't the same category of latency the article talks about reducing. I.e. increasing "network latency" just makes the conversation feel more and more out of sync, it doesn't change the rate at which the AI will interrupt ("turn latency") because the latter is based on the duration of the pause in the audio stream, not the duration it took to deliver that audio stream.
If you meant there is a case where reducing the network latency at the same delivery reliability for a given audio stream is actually a negative then I'd love to hear more about it as I'm a network guy always in search of an excuse for latency :D.
> If you meant there is a case where reducing the network latency […] is actually a negative then I'd love to hear more about it
That is exactly what parent did:
> they are doing an insane level of complexity to shave off ~100 ms
The downside is everything they had to do to achieve it, and maintaining that capability going forward, when the product can tolerate much more. It is the definition of premature optimization.
It just maybe isn’t at a level where it is relevant in your argument/decision space in IT.
But you want to be able to interject “hold on…” and have it immediately stop talking, when it goes off the rails.
And GP is correctly pointing out that the only negative here (silence waiting latency maybe being too low) is tunable separately from the network latency number.
I want to be able to click the "Stop" button on my earphones remote. I want to be able to interject "woah" or "stop!" or "wait!" or that it would detect that I've inhaled a breath, or that my eyes glazed over. I want the LLM to figure out that every speed setting for its voice output is in "auctioneer" territory rather than "lecturing university professor with tenure and a pension" pacing.
But we won't get any of that, because the prime directive of LLMs is to burn tokens like there's no tomorrow. Burn tokens on a naïve answer without asking clarifying questions. Burn tokens on writing, debugging, and running a Python script or accessing and parsing 10 websites without asking for consent. Burn tokens on half-baked images with misspellings and 31 fingers. Burn tokens arguing "how many 'r's in strawberry?". Burn tokens asking a followup question at the end of every single answer, begging the user to re-engage and burn more tokens.
There is a little red "Stop" control when text output is being produced, at least, but does "Stop" halt everything and throw away the context? Re-prompt from the beginning?
The "maximize tokens burnt" prime directive is not to be found in any system prompt or user documentation. It is seemingly a common feature of the training for any consumer model.
Currently, if I'm using voice for an LLM, I use the voice dictation in the keyboard feature, because then the response is in text. There is no way to prevent "responding in kind" if I query the thing with audio. Or in Swahili.
you actually don’t want it to immediately stop because people say things like “hm” “yeh” during machine output. Maybe you say “no” to someone next to you and don’t want to interrupt output.
To confidently interrupt I would want to assert that the user has been speaking for > N time. You could do other things like parse a streaming transcription for keywords but generally it feels like bad UX to me to cut output the second input is detected. Letting the user talk for 1-2s gives a much stronger signal and it isn’t too weird for someone to keep talking for 1.5s after you start.
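Something like the following is what I have in mind, as a minimal Python sketch. `BargeInDetector`, the 1500 ms threshold, and the 20 ms frame size are all made-up illustration values, and it assumes some upstream VAD already gives you a speech/no-speech flag per frame.

```python
from dataclasses import dataclass, field


@dataclass
class BargeInDetector:
    """Interrupt playback only after the user has spoken continuously for a while."""

    min_speech_ms: int = 1500  # hypothetical "N"; tune per product
    frame_ms: int = 20         # assumed duration of one VAD frame
    _speech_ms: int = field(default=0, init=False)

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD frame; return True once the interrupt threshold is reached."""
        if is_speech:
            self._speech_ms += self.frame_ms
        else:
            # Simplification: any silent frame resets the run of speech.
            self._speech_ms = 0
        return self._speech_ms >= self.min_speech_ms


detector = BargeInDetector()
# A 300 ms "hm" (15 frames) should not cut the output off.
short_ack = [detector.push(True) for _ in range(15)]
detector.push(False)
# 1.5 s of sustained speech (75 frames) should.
sustained = [detector.push(True) for _ in range(75)]
print(any(short_ack), sustained[-1])
```

A real version would probably tolerate a few silent frames inside a run of speech instead of resetting on the first one.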
I’ve also experienced this and it’s really annoying. There is this pressure to keep talking if I’m not done with my thought that feels pretty unnatural at least for me. If I’m searching for the right word, I want the opportunity to find it.
I think the solution is to handle pauses more intelligently rather than having a higher latency protocol. With low latency you can interrupt and the bot can immediately stop rambling.
100%. I have to hold the floor by filling the space with "ummmmmmmm.... uhhhh...." which inevitably distracts me from my point altogether. Poor user experience.
Seems like there's a big risk of having that habit leak into human conversation. A lot of people try really hard to train themselves not to add those fillers.
I find this is a problem even with human conversations. Some people just aren’t very good at telegraphing when they’ve finished ‘their turn’ talking. Or worse yet, aren’t willing to take turns in the first place.
I know it's not the perfect solution for you, but I use a voice recorder and send the LLM the transcript. And my god is it working great.
Usually I just explain the things I want it to do. The longest was 30 minutes of rambling, explaining the methods section of a paper in non-chronological order. It worked unbelievably well for me.
If you're setting this up yourself instead of using a lab's built-in speech functionality, you can run a small LLM in parallel (a local model or a small hosted model like Haiku) that acts as a gate for either doing TTS on the response or not. Its only job is to decide if the transcription it receives is of someone being done talking or if that person is likely to still be mid-thought or mid-sentence.
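A rough Python sketch of the gate's shape, not a real integration: `naive_turn_classifier` is a toy heuristic standing in for the small model, and `maybe_speak` and `speak` are hypothetical names. In a real setup the classifier would be a call to whatever small LLM you run.

```python
from typing import Callable

TRAILING_FILLERS = ("um", "uh", "and", "but", "so", "because", "like")


def naive_turn_classifier(transcript: str) -> bool:
    """Toy stand-in for the gate model: True if the speaker sounds finished."""
    text = transcript.strip().lower()
    words = text.rstrip(".?!,").split()
    if not words:
        return False
    if words[-1] in TRAILING_FILLERS:
        return False  # trailing filler or conjunction: probably mid-thought
    return text.endswith((".", "?", "!"))


def maybe_speak(
    transcript: str,
    response: str,
    is_done: Callable[[str], bool],
    speak: Callable[[str], None],
) -> bool:
    """Run TTS on the response only if the gate says the user's turn is over."""
    if is_done(transcript):
        speak(response)
        return True
    return False


spoken = []
maybe_speak("I was thinking that we could, um", "Sure!", naive_turn_classifier, spoken.append)
maybe_speak("What is the capital of France?", "Paris.", naive_turn_classifier, spoken.append)
print(spoken)
```

Keeping the gate behind a plain callable means you can swap the heuristic for a model call without touching the TTS path.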
With higher latency this would be even more of an issue. When you pause and start talking again, the model wouldn't catch that until it has already interrupted you.
The actual implementation is at fault. I had some luck with instructing the model to only respond with "Mhm" until I've explicitly finished my thought and asked it a question. Makes this much less of an issue.
But I've decided that their voice mode is completely unusable for a different reason: the model feels incredibly dumb to interact with, keeps repeating and re-phrasing what I said, ends every single answer with a "hook" making the entire interaction idiotically robotic, completely ignores instructions when you ask it to stop that, and - most importantly - doesn't feel helpful for brainstorming. I was completely surprised how bad it is in practice; this should be their killer app but the model feels incredibly badly tuned.
I am excited for VAD to go away. PersonaPlex totally seems like the future.
However, for things like a call-center helpline, turn-based actually seems better! You don't want to be interrupted when giving information back and forth (I think?)
There's a really interesting project in Japanese natural language processing called J-Moshi that had a novel approach and in my opinion good results.
They tried to make it mimic the way Japanese is full of really quick acknowledgement sounds and it seems to allow it to handle those pauses and interruptions really well.
I must admit it's a bit weird when LLMs laugh, I don't really know how I feel about that but it seems to laugh at the right times. Very tangential, but cockatoos have been known to mimic the right time to laugh, presumably based on tonal cues that a joke was just made (I have experienced this first hand with rescue birds who live amongst humans).
Reducing the network latency helps with this exactly. OpenAI can make better timed decisions when to begin responding so it'll feel less like an interruption. I've also seen some research on full duplex voice models that handle interruption more like an organic conversation and low latency will help there as well
Hard problem. I find myself adding in filler to stop the thing from jabbering.
I also think it spends most of its iq on sounding good rather than thinking about the problem. “Yeah absolutely I can see why you’d like to…” etc. This is likely because it’s on a timer and maybe voice is more expensive to process? Text responses spend more time on the task.
This is more of a VAD/turn detection issue. It's gotten a lot better over the last few years, but it's a hard problem. The extra ~100ms of latency makes a huge difference otherwise, especially when you have use cases that require tool calling that can easily add 500ms+ of latency.
If you have tool calling complex enough that it necessitates a higher reasoning level, and you would otherwise have reasoning set to "none", this can easily come out to 500ms.
Agreed. It’s stressful. I think they need to have an option to adopt a suffix, so they don’t start babbling until there is an “over” followed by a pause like in the old army walkie talkie days.
Strongly agree, some of us like to choose our words more carefully when interacting with an LLM.
I've tried to convey this to OpenAI through various available channels (dev forums, app feedback, etc.).
Grok solves this by having an optional push-to-talk mode, but this is not hands-free and thus more cumbersome than just having a user-configurable variable like seconds_delay_before_sending_voice_input.
yeh exactly, you cannot get a strong signal that a user is done speaking without some amount of “wait for 500ms of silence”. You could kick off processing and abandon it if they continued talking, but that seems over-optimized.
1-2s replies feel natural and like you pointed out pausing for 2-3s mid sentence is super normal.
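For what it's worth, the "kick off processing and abandon" variant is not much code. Here is a minimal Python sketch under made-up assumptions: `Endpointer`, the event names, and the 200 ms / 500 ms / 20 ms values are all illustrative, and the input is a per-frame speech flag from some VAD.

```python
from typing import Optional


class Endpointer:
    """Silence-based end-of-turn detection with speculative early processing."""

    def __init__(self, commit_silence_ms: int = 500,
                 speculate_silence_ms: int = 200, frame_ms: int = 20) -> None:
        self.commit_silence_ms = commit_silence_ms        # user-configurable wait
        self.speculate_silence_ms = speculate_silence_ms  # when to start early work
        self.frame_ms = frame_ms
        self.silence_ms = 0
        self.heard_speech = False
        self.speculating = False

    def push(self, is_speech: bool) -> Optional[str]:
        """Feed one VAD frame; return 'speculate', 'cancel', 'commit', or None."""
        if is_speech:
            self.heard_speech = True
            self.silence_ms = 0
            if self.speculating:
                self.speculating = False
                return "cancel"  # user kept talking: throw the early work away
            return None
        if not self.heard_speech:
            return None  # silence before any speech is not a turn
        self.silence_ms += self.frame_ms
        if self.silence_ms >= self.commit_silence_ms:
            self.heard_speech = False
            self.speculating = False
            self.silence_ms = 0
            return "commit"  # long enough pause: send the turn
        if not self.speculating and self.silence_ms >= self.speculate_silence_ms:
            self.speculating = True
            return "speculate"  # short pause: start processing, may be cancelled
        return None


ep = Endpointer()
frames = [True] * 10 + [False] * 10 + [True] + [False] * 25
events = [e for e in (ep.push(f) for f in frames) if e is not None]
print(events)
```

Raising `commit_silence_ms` to 2000 or so is the "let me think" setting; the speculative path just hides some of that wait when the user really was done.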
> waiting on the govt to do something is a path of failure
Keeping the government accountable is the duty of every citizen and the only way to have a functioning society. The failure is to let the government be arbitrary and cater to the powerful instead of following the rule of law and applying it equally at all levels.