Maybe with very fast models you could request animation frames, e.g., frame 1) right foot at 12, left foot at 6; frame 2) right foot at 3, left foot at 9, etc.?
And instead of reporting tps, you would - of course! - report pfps (pelican frames per second).
As someone who has spent quite a lot of time on inference, I would add a small note:
Deployment looks very different for MoE models than for dense models, so I would say that it is more nuanced than "inference memory reqs remain the same". Memory can be very different for MoE-style models.
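To make that concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that an MoE model has to keep every expert's weights resident while each token only touches the active subset, and it ignores KV cache, activations, and any offloading or expert parallelism. The function names and the parameter counts (13B dense, 47B total / 13B active MoE) are made up for illustration, not taken from any specific model.

```python
def weight_memory_gib(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GiB (default 2 bytes/param, i.e. fp16/bf16)."""
    return params_billion * 1e9 * bytes_per_param / 2**30


def moe_footprint(total_params_b: float, active_params_b: float,
                  bytes_per_param: float = 2.0) -> dict:
    """Split an MoE model's weights into what must be resident vs. what one token touches."""
    return {
        # All experts must be loaded, so residency scales with TOTAL parameters.
        "resident_gib": weight_memory_gib(total_params_b, bytes_per_param),
        # Per-token reads scale with ACTIVE parameters only.
        "touched_per_token_gib": weight_memory_gib(active_params_b, bytes_per_param),
    }


# Hypothetical numbers, for illustration only.
dense_gib = weight_memory_gib(13)
moe = moe_footprint(total_params_b=47, active_params_b=13)

print(f"dense 13B resident:       {dense_gib:.1f} GiB")
print(f"MoE resident (47B total): {moe['resident_gib']:.1f} GiB")
print(f"MoE touched per token:    {moe['touched_per_token_gib']:.1f} GiB")
```

Under these assumptions the two models do similar per-token work, but the MoE one needs several times the resident weight memory, which is why the deployment picture differs.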
I think it depends on what a memory system includes. Those that automatically inject relevant information into context are in my experience better than just md docs because agents often ignore, forget to read, or don't read md files in full.
It isn't ideal, but I am starting to write code (with AI tab completions) while waiting for LLMs. The tab completions are sometimes overeager and I wish I had more control over them, but at least I am not staring at "Thinking" all day. Having said that, sometimes you have to monitor the AI, because some tools (e.g., AGY CLI) often go off the rails completely, including writing code outside of the "sandbox."
Thank you for explaining. Do you think there are still opportunities for stack optimizations to meaningfully speed up inference on single consumer-grade GPUs?
I'm sure there are, and I really hope we can work on consumer-grade GPUs at some point.
It should be possible to apply the same methodology (digging deep into the hardware details to understand all its little characteristics, and rethinking the inference stack around that).
Or do you want it to speak to you too? I think this would have to be TTS on your phone. You can have ChatGPT speak to you but I don't see that feature in Codex.
Sure, I speak to ChatGPT all the time and I've used the feature you linked, but it can't do the things I described. It won't say, "hey, let me go look into that" and then come back in 3 minutes with an answer. It's essentially a dictation feature.