Show HN: I open-sourced my AI toy company that runs on ESP32 and OpenAI realtime (github.com)
Sean-Der 3 hours ago [-]
This is wonderful, really great job on this! For me, physical devices are when it really starts to feel magical. My pre-schooler never engaged with the Speech-to-Speech examples I showed her on a screen. However, when I showed her a reindeer toy[1] on my desk that tells jokes, that is when it became real. It is the same joy/wonder I felt playing Myst for the first time.

----

If anyone is trying to build physical devices with Realtime API I would love to help. I work at OpenAI on Realtime API and worked on [0] (was upstreamed) and I really believe in this space. I want to see this all built with Open/Interoperable standards so we don't have vendor lock-in and developers can build the best thing possible :)

[0] https://github.com/openai/openai-realtime-embedded

[1] https://youtu.be/14leJ1fg4Pw?t=804

drakenot 3 hours ago [-]
Something that really kills the 'effect' of most of the Voice > AI demos that I see is the cold start / latency.

The OpenAI "Voice Mode" is closer, but when we can have a near-instantaneous and natural back-and-forth voice mode, that will be a big deal in terms of it feeling magical. Today, it is: say something, awkwardly wait N seconds, then listen to the reply and sometimes awkwardly interrupt it.

Even if the models were no smarter than they are today, if we could crack that "conversational" piece and performance piece, it would be a big difference in my opinion.

akadeb 1 hours ago [-]
Yeah, the way I am handling this is turn detection, which feels unnatural. I like how LiveKit handles turn detection with a small model[0][1].

[0] https://www.youtube.com/watch?v=EYDrSSEP0h0

[1] https://docs.livekit.io/agents/build/turns/turn-detector/

```
turn_detection: {
  type: "server_vad",
  threshold: 0.4,
  prefix_padding_ms: 400,
  silence_duration_ms: 1000,
},
```
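
For anyone wondering where a block like this goes: OpenAI's Realtime API accepts `turn_detection` settings inside a `session.update` client event. Below is a minimal sketch under that assumption (the project itself may set it at a different point, such as session creation). `buildSessionUpdate` and `sessionUpdateEvent` are hypothetical names used only for illustration, and no connection is opened here.

```typescript
// Shape of the server-side VAD settings quoted above.
interface ServerVadConfig {
  type: "server_vad";
  threshold: number;           // 0..1, higher needs louder audio to count as speech
  prefix_padding_ms: number;   // audio kept from just before speech was detected
  silence_duration_ms: number; // trailing silence that ends the user's turn
}

// Hypothetical helper: wraps the settings in a `session.update` client event
// and serializes it to the JSON string a client would send.
function buildSessionUpdate(turnDetection: ServerVadConfig): string {
  return JSON.stringify({
    type: "session.update",
    session: { turn_detection: turnDetection },
  });
}

const sessionUpdateEvent = buildSessionUpdate({
  type: "server_vad",
  threshold: 0.4,
  prefix_padding_ms: 400,
  silence_duration_ms: 1000,
});

// A real client would send this string over its already-open WebSocket
// or WebRTC data channel; here it is only printed.
console.log(sessionUpdateEvent);
```

With these values, a full second of silence is needed before the model starts replying, which is one source of the "awkward wait" mentioned upthread. Lowering `silence_duration_ms` shortens the wait but makes it more likely to cut the speaker off.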

Sean-Der 2 hours ago [-]
I think it will always feel unnatural as long as 'AI Speech' is turn-based. Right now developers use Voice Activity Detection to detect when the user has stopped talking.
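
To make the turn-based limitation concrete, here is a deliberately simplified sketch of that idea: an energy-based detector that declares the turn over after a fixed stretch of silence. This is not how OpenAI's `server_vad` or any production VAD is implemented; the class name, option names, and the RMS-threshold approach are all assumptions chosen for illustration.

```typescript
// Options for the illustrative detector (all names are hypothetical).
interface VadOptions {
  threshold: number;         // RMS level (0..1) at or above which a frame counts as speech
  silenceDurationMs: number; // trailing silence required to end the turn
  frameMs: number;           // duration of each audio frame in milliseconds
}

// Root-mean-square energy of one frame of samples in the range -1..1.
function frameRms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    sumSquares += frame[i] * frame[i];
  }
  return Math.sqrt(sumSquares / frame.length);
}

class EndOfTurnDetector {
  private opts: VadOptions;
  private speaking = false;
  private silentMs = 0;

  constructor(opts: VadOptions) {
    this.opts = opts;
  }

  // Feed one frame; returns true only on the frame where the turn ends.
  push(frame: Float32Array): boolean {
    if (frameRms(frame) >= this.opts.threshold) {
      // Speech resets the silence counter.
      this.speaking = true;
      this.silentMs = 0;
      return false;
    }
    if (!this.speaking) return false; // leading silence, no turn started yet
    this.silentMs += this.opts.frameMs;
    if (this.silentMs >= this.opts.silenceDurationMs) {
      this.speaking = false;
      this.silentMs = 0;
      return true;
    }
    return false;
  }
}
```

The detector only ever answers "has the user stopped?", so the reply cannot begin until the silence window has fully elapsed, and a thoughtful pause looks identical to the end of a sentence.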

What would be REALLY cool is if we had something that would interrupt you during conversation like talking with a real human.

conductr 2 hours ago [-]
I can see how interruptions would prove even more unnatural and annoying pretty quick. There's a lot of nuance in knowing how to interrupt properly and often, people that interrupt only do so quickly, then yield, allow person to finish then resume - very situational and tons of nuance. Otherwise, with current level of sophistication, you'd just have the AI talking over you the entire time, not allowing you to complete your thoughts/questions/commands/etc and people would quickly be more frustrated and just turn it off.
hoppp 2 hours ago [-]
It's great, lovely. But in the long run, do these toys rely on subscription payments?

Both the Supabase API and OpenAI are billed per API call.

So the lovely talking toys can die if the company stops being profitable.

I would love to see a version with decent hardware that runs a local model, that could have a long lifespan and work offline.

xp84 2 hours ago [-]
> lovely talking toys can die if the company stops being profitable.

This is a good point to me as a parent -- in a world where this becomes a precious toy, it would be a serious risk of emotional pain if the child experienced this scenario like the death of a pet or friend.

> version with decent hardware that runs a local model

I feel like something small and efficient enough to meet that (today) would be dumb as a post. Like Siri-level dumb.

Personally, I'd prefer a toy which was tethered to a home device. Without a cloud (and thus commercial) dependency, the toy wouldn't be 'smart' outside of Wi-fi range, but I'd design it so that it got 'sleepy' when away from Wi-fi, able to be "woken up" and, in that state, to respond to a few phrases with canned, Siri-like answers. Perhaps new content could be made up for it daily and downloaded to local storage while at home, so that it could still "tell me a story" offline etc.

supermatt 2 hours ago [-]
This looks like so much fun! I have recently gotten into working with electronics, so it seems like a nice little project to undertake.

I noticed that it is dependent on OpenAI's realtime API, so it got me wondering what open alternatives there are, as I would love a more realtime Alexa-like device in my home that doesn't contact the cloud. I have only played with software, but the existing solutions have never felt realtime to me.

I could only find <https://github.com/fixie-ai/ultravox> that would seem to really work as realtime. It seems to be some model that wires up llama and whisper somehow, rather than treating them as separate steps which is common with other projects.

What other options are available for this kind of real-time behaviour?

Sean-Der 1 hours ago [-]
My plan is that Espressif’s WebRTC code[0] will hook up to Pipecat[1], which gets you the freedom to do whatever you want.

The design of OpenAI + WebRTC was to lean on WebRTC as much as possible to make it easier for users.

[0] https://github.com/espressif/esp-webrtc-solution

[1] https://github.com/pipecat-ai/pipecat

supermatt 51 minutes ago [-]
Fantastic! This will save a ton of work
_neil 2 hours ago [-]
Not on-device, but for the local network I’ve been looking at Speaches[0]. Haven’t tried it yet, but I have been running kokoro-web[1] and the quality and speed are really good.

[0] https://speaches.ai/

[1] https://huggingface.co/spaces/Xenova/kokoro-web

3D30497420 2 hours ago [-]
Maybe inspiration from how Home Assistant can do local speech-to-text and vice versa? https://www.home-assistant.io/voice_control/voice_remote_loc...

Pretty sure you'd need to host this on something more robust than an ESP32 though.

supermatt 1 hours ago [-]
Yeah, I was looking at Home Assistant as well, but it doesn't feel real-time, likely due to it having the transcription stage separate from the inference.
dayvid 50 minutes ago [-]
Really interesting. Also more powerful if integrated with animatronic movement. Reminds me of Furby. Doesn't even have to be full AI, just augmented with slightly smarter and more flexible capabilities
behnamoh 2 hours ago [-]
am I the only one who finds the unnecessarily positive vibes of OpenAI realtime voices unrealistic, too much, and borderline creepy?
mickael-kerjean 2 hours ago [-]
Yep and having it in a child toy is way beyond the border of creepy
bethekidyouwant 1 hours ago [-]
Yeah, I find Miss Rachel and most child educators to be a bit creepy but I’m not a toddler
ianbicking 2 hours ago [-]
What's been your experience with the Realtime API? I've been doing LLM with voice, but haven't really given it a try – the price is so high, and it feels like it's much harder to control. Specifically that you just get one system prompt and then the model takes over entirely. (Though looking at the API, I see you can inject text and do some other things to play around with the session.)
justanotheratom 2 hours ago [-]
This is quite cool. Two questions:

- why do you need nextjs frontend for what looks like a headless use case?
- how much would be the OpenAI bill if there is 15 minutes of usage per day?

irq-1 2 hours ago [-]
> This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output.

https://openai.com/index/introducing-the-realtime-api/
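
Those per-minute figures make the 15-minutes-a-day question easy to estimate. A rough sketch, assuming the quoted rates apply and assuming a hypothetical split of 10 minutes of child speech (input) and 5 minutes of toy speech (output). Actual billing is per token and depends on the model, so the real number may differ a lot. Amounts are kept in whole cents to avoid floating-point rounding.

```typescript
// Approximate rates from the quote above, in cents per minute of audio.
const INPUT_CENTS_PER_MIN = 6;
const OUTPUT_CENTS_PER_MIN = 24;

// Estimated cost in cents for the given minutes of input and output audio.
function estimateDailyCents(inputMin: number, outputMin: number): number {
  return inputMin * INPUT_CENTS_PER_MIN + outputMin * OUTPUT_CENTS_PER_MIN;
}

// Hypothetical split of a 15-minute session: 10 min listening, 5 min speaking.
const dailyCents = estimateDailyCents(10, 5);
const monthlyCents = dailyCents * 30;

console.log(`per day: $${(dailyCents / 100).toFixed(2)}`);       // per day: $1.80
console.log(`per 30 days: $${(monthlyCents / 100).toFixed(2)}`); // per 30 days: $54.00
```

Because output audio is four times the price of input at these rates, the estimate is dominated by how much the toy talks rather than how much it listens.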

About the nextjs site, I was thinking maybe it's difficult to have Supabase hold long connections, or route the response? I'm curious too.

akadeb 1 hours ago [-]
The long connections are ultimately handled by Deno Edge so the site isn't used there. The NextJS frontend (which also could be an iOS/Android app) helps provide an interface to select character, create AI characters, set ESP32 volume, and view conversation history.
akadeb 2 hours ago [-]
thank you! The nextjs frontend is to set things like device volume, selecting which character you are interacting with, viewing conversation history etc. I just tried it and for a 15 minute chat, it's roughly 20c. Roughly 570 input tokens
JKCalhoun 2 hours ago [-]
And I am wondering, why use an ESP32 if you don't need the WiFi? (And, please, no WiFi in a toy!)
akadeb 2 hours ago [-]
Currently we connect to a Wifi network to reach the Deno edge server. Some popular toys doing it: Yoto, Toniebox
vunderba 3 hours ago [-]
I remember when LLMs started getting mass traction and the first thing everyone wanted to build was AG Talking Bear + ChatGPT.

https://en.wikipedia.org/wiki/AG_Bear

With regard to this project, using an ESP32 makes a lot of sense. I used an Espressif ESP32-S3 Box to build a smart speaker along with the Willow inference server and it worked very well. The ESP speech recognition framework helps with wake word / far field audio processing.

hakaneskici 3 hours ago [-]
Amazing, thank you for sharing. I'm interested in learning about your experience while building this :)

What kind of interesting challenges have you run into, and how has your work influenced OpenAI's realtime API?

PS: Your github readme is quite well crafted, nowadays hard to come across.

reolbox 2 hours ago [-]
This is an AI reply.
hakaneskici 2 hours ago [-]
What made you think that?
johnisgood 2 hours ago [-]
The README seems like what GPT would spit out, with all the emojis, diagrams, etc.

Not the first time I ran into it, but I did not bother commenting.

I can recognize it from far away. Thankfully I am not the only one.

hakaneskici 2 hours ago [-]
I misunderstood the parent comment as if it was saying my post was AI ;)

I think the readme is still well crafted, AI couldn't do this without the author.

johnisgood 2 hours ago [-]
A combination of LLM and author. That is not to say it is bad or negative, to be honest, so yeah you are right.

If he meant your reply, I do not see any reasons as to why. :D

tantalor 2 hours ago [-]
I'm surprised by the overwhelming positive vibes in the comments here.

Maybe I'm alone? To me, this comes across as extremely creepy, the exact opposite of what we should desire from AI in products aimed at children.

adregan 1 hours ago [-]
Totally get the creepy part, but my criticism of devices like this is that they seem to be made by people with limited exposure to the creative power of children.

Children don’t need this; they are so much more creative than an AI (and the adults that trained the AI), and their creativity is fueled by boredom.

dayvid 51 minutes ago [-]
I mean when I was a kid I had action figures and played out scenarios. Would be pretty nuts if you could make your own TV shows with AIs assisting the play. Or set up your own battles, etc. Especially if it had more animatronic entry points
supermatt 1 hours ago [-]
I commented that I like the project, in that it is a project that helps you to create a realtime assistant. I would love to replace Alexa/Siri/whatever with something actually useful.

That said, I totally agree that I wouldn't want this in a kids toy. The whole idea is super creepy in that respect, with so much scope for abuse.

bethekidyouwant 1 hours ago [-]
Why is the idea of a child talking to an LLM creepy? Do you think a child is gonna figure out how to jailbreak the “keep it kid-friendly” prompt and start talking about I don’t even know what … kids don’t know about adult things. That’s just not how kids be.
spencerflem 28 minutes ago [-]
I genuinely can't fathom how it wouldn't be creepy.

Bots are for doing tasks. I don't want to socialize with them and find the idea of kids being socialized by bots supremely weird. At least the AI girlfriend people are (probably unwell) adults.

Sean-Der 1 hours ago [-]
I hope these toys could be a joy/comfort for kids that don’t have a parent that cares.

I poured hours into games/programming because it was a happy place away from school etc… These toys could be the same.

This technology is neutral, but I see so much potential for projects that do good.

akadeb 1 hours ago [-]
For parents we added a `Story mode` option (similar to Yoto toy / Toniebox). The idea is: the AI crafts a story and invites the child to craft the story together in a more engaging way. The story prompt keeps the story focused and in scope.
behnamoh 2 hours ago [-]
Exactly my thoughts when I first saw the comments!
empath75 3 hours ago [-]
When someone figures this out, it's going to be a multi billion dollar company, but the safety concerns for actually putting something like this into the hands of children are unbelievable.
mithr 2 hours ago [-]
This. The idea is super cool in theory! But given how these sort of things work today, having a toy that can have an independent conversation with a kid and that, despite the best intentions of the prompt writer, isn't guaranteed to stay within its "sandbox", is terrifying enough to probably not be worth the risk.

IMO this is only exacerbated by how little children (who are presumably the target audience for stuffed animals that talk) often don't follow "normal" patterns of conversation or topics, so it feels like it'd be hard to accurately simulate/test ways in which unexpected & undesirable responses could come out.

conductr 2 hours ago [-]
I'm trying to use my imagination, but what exactly is the fear? Perhaps the AI will explain where babies come from in graphic detail before the parent is ready to have that conversation or something similar? Or, for us in the US, maybe it tells your kid they should wear a bulletproof vest to pre-K instead of bringing a stuffy for naptime?

Essentially, telling kids the truth before they're ready and without typical parental censorship? Or is there some other fear, like the AI will get compromised by a pedo and he'll talk your kid into who knows what? Or similar for "fill in state actor" using mind control on your kid (which, honestly, I feel like is normalized even for adults; eg. Fox News, etc., again US-centric)

xp84 1 hours ago [-]
> Perhaps the AI will explain where babies come from in graphic detail before the parent is ready to have that conversation or something similar?

I mean, that's not a silly fear. But perhaps you don't have any children? "Typical parental censorship" doesn't mean prudish pearl-clutching.

I have an autistic child who already struggles to be appropriate with things like personal space and boundaries -- giving him an early "birds and bees talk" could at minimum result in him doing and saying things that could cause severe trauma to his peers. And while he uses less self-control than a typical kid, even "completely normal" kids shouldn't be robbed of their innocence and forced to confront every adult subject until they're mature enough to handle it. There's a reason why content ratings exist.

Explaining difficult subjects to children, such as the Holocaust, sexual assault, etc. is very difficult to do in a way that doesn't leave them scarred, fearful, or worse, end up warping their own moral development so that they identify with the bad actors.

hoppp 2 hours ago [-]
Babies often have iPads now. I think they should make an offline toy with decent hardware inside. That would be something.
georgemcbay 3 hours ago [-]
Reminds me of Conan O'Brien's old WikiBear skits

https://youtu.be/0SfSx9ts46A

mcdow 3 hours ago [-]
Dude this is super cool! What made you decide to open source it?

I had a similar idea that I never followed through with(even down to using an ESP).

Basically you could make a Harry Potter talking painting with basically your device + an e-ink display that displays some 3D modeled character.

For others, here’s a direct link to a demo video:

https://m.youtube.com/watch?v=o1eIAwVll5I

magixx 2 hours ago [-]
I also thought about this but wanted to look into an ESP32 CAM to get vision working. For better or worse I didn't pursue the idea as I thought in the end repurposing a cell phone would be better overall.

I do wonder if the cellphone/app argument is why we didn't see that many hardware LLM API wrappers up until now. The rabbit R1 was basically just that.

I've seen more products in this space recently such as Ropet[1], LOOI[2], and others but for now it's going to be costly for companies to sell such a product at a fixed cost as I think a subscription model would be a hard sell [3] for consumers.

[1] https://www.kickstarter.com/projects/1067657324/ropet-your-n...

[2] https://looirobot.com/products/looi-robot?variant=4909200762...

[3] https://tech.yahoo.com/ai/articles/tragic-robot-shutdown-sho...

Sean-Der 3 hours ago [-]
I get a `Request has expired` could you upload somewhere else?
mcdow 3 hours ago [-]
My bad! Updated the link.
wormlord 1 hours ago [-]
What could go wrong?
ForHackernews 3 hours ago [-]
This is a cool demo but I would not let my child play with anything that talks to a cloud AI like this. Furby fever dreams made real.
deepcurryshit 3 hours ago [-]
[flagged]