Creating a Home Assistant speech-to-text engine
I bought a Home Assistant Voice to control my IoT devices with voice commands. Home Assistant offers two out-of-the-box options for speech-to-text: a Home Assistant Cloud subscription, or a local setup.
I didn’t need the extra features from the Cloud option, so I went with the local setup, but it performed extremely poorly. On my limited hardware, each voice command took 6-8 seconds to process. That was with the small-int8 faster-whisper model, so the transcriptions were both slow and, because of the small model, inaccurate.
Thus, I decided to use an online API for the transcription instead, which meant building a custom speech-to-text engine.
Part 1: Receiving speech from Home Assistant
We will use a Wyoming server to receive audio data from Home Assistant and process it.
Setting up the boilerplate
After installing the wyoming Python package, we can begin with a simple event handler: a subclass of AsyncEventHandler that overrides the handle_event method. The handler doesn’t do anything with the events yet. It just returns False, which in Wyoming tells the server to stop reading events and disconnect the client.
from wyoming.event import Event
from wyoming.server import AsyncEventHandler

class MyEventHandler(AsyncEventHandler):
    async def handle_event(self, event: Event) -> bool:
        return False
With the event handler implemented, we create a server to receive incoming events.
import asyncio
from wyoming.server import AsyncTcpServer

async def main():
    server = AsyncTcpServer(host="0.0.0.0", port=10300)
    print(f"Starting Wyoming server with {server.host=} {server.port=}")
    await server.run(MyEventHandler)

if __name__ == "__main__":
    asyncio.run(main())
Handling incoming events
Now we handle the Describe event. This event lets Home Assistant connect to the server and identify it as a valid speech-to-text provider.
from wyoming.event import Event
from wyoming.info import AsrProgram, Attribution, Describe, Info
from wyoming.server import AsyncEventHandler, AsyncTcpServer

... # rest of the file here

class MyEventHandler(AsyncEventHandler):
    async def handle_event(self, event: Event) -> bool:
        if Describe.is_type(event.type):
            await self._send_info()
            return True
        return False

    async def _send_info(self):
        info = Info() # TODO: the actual application will require proper values
        await self.write_event(info.event())
Next, we handle other speech-to-text events outlined in the Wyoming documentation: transcribe, audio-start, audio-chunk, and audio-stop.
The final event handler class records the incoming audio chunks, saves them as a WAV file, and uses the OpenAI API to transcribe the file.
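The recording step can be sketched with the standard library alone. This is a minimal sketch, not the code from the final handler: the AudioRecorder class and its method names are hypothetical, and the default format (16 kHz, 16-bit, mono) is an assumption that the audio-start event's own values should override.

```python
import wave
from io import BytesIO


class AudioRecorder:
    """Collects raw PCM chunks and packs them into an in-memory WAV file."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []
        # Assumed defaults; start() replaces them with the real format.
        self._rate = 16000
        self._width = 2
        self._channels = 1

    def start(self, rate: int, width: int, channels: int) -> None:
        """For audio-start: remember the format and drop any old audio."""
        self._rate, self._width, self._channels = rate, width, channels
        self._chunks.clear()

    def add_chunk(self, audio: bytes) -> None:
        """For audio-chunk: append the raw PCM bytes."""
        self._chunks.append(audio)

    def stop(self) -> BytesIO:
        """For audio-stop: return the recording as a WAV file in memory."""
        buffer = BytesIO()
        with wave.open(buffer, "wb") as wav_file:
            wav_file.setnchannels(self._channels)
            wav_file.setsampwidth(self._width)  # bytes per sample
            wav_file.setframerate(self._rate)
            wav_file.writeframes(b"".join(self._chunks))
        buffer.seek(0)  # rewind so the next reader starts at the header
        return buffer
```

In the event handler, these three methods would be called from the audio-start, audio-chunk, and audio-stop branches respectively, and the buffer returned by stop() is what gets sent off for transcription.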
Part 2: Calling the OpenAI API
Instead of saving the WAV file to disk, we write it to a BytesIO object, which keeps the data in memory and simplifies file handling. We pass the BytesIO object to the OpenAI SDK like this:
from io import BytesIO
from openai import OpenAI

def transcribe_audio(file: BytesIO) -> str:
    client = OpenAI(api_key="sk-proj-foo_bar")  # placeholder key
    return client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=file,
        language="en",
        response_format="text",
    )
However, we get the following error:
openai.BadRequestError: Error code: 400 - {'error': {'message': 'Unsupported file format', 'type': 'invalid_request_error', 'param': 'file', 'code': 'unsupported_value'}}
This error occurs because the file type is inferred from the file name. A BytesIO object has no file name or extension (e.g. .wav), so the type can't be determined.
To fix this, we add a dummy file name when sending the file, like this:
from io import BytesIO
from openai import OpenAI

def transcribe_audio(file: BytesIO) -> str:
    client = OpenAI(api_key="sk-proj-foo_bar")  # placeholder key
    return client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=("dummy.wav", file), # <<<
        language="en",
        response_format="text",
    )
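A possible alternative is to give the buffer itself a name, since HTTP upload code commonly falls back to a file object's name attribute when no explicit file name is passed. The sketch below only shows how to attach the name. The named_wav_buffer helper is hypothetical, it does not call the OpenAI API, and whether the SDK picks up the attribute is an assumption worth checking against the version you use.

```python
from io import BytesIO


def named_wav_buffer(data: bytes, name: str = "audio.wav") -> BytesIO:
    """Wrap WAV bytes in a BytesIO that carries a file name."""
    buffer = BytesIO(data)
    # BytesIO has no name by default; this sets a plain instance attribute.
    buffer.name = name
    return buffer
```

The tuple form shown above is the more explicit of the two, because the file name travels with the call instead of being hidden on the object.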
Summary
I now have a Wyoming server that transcribes voice commands using the OpenAI API. The costs are fairly low - I’ve made 40 requests in a couple of weeks, totaling only $0.01.
Visit the add-on repository to set it up on your own Home Assistant instance, or check out the code itself on my GitHub repo!