Blog

Creating a Home Assistant speech-to-text engine

I bought a Home Assistant Voice to control my IoT devices with voice commands. Home Assistant offers two out-of-the-box options for speech-to-text: a Home Assistant Cloud subscription, or a local setup.

I didn’t need the extra features from the Cloud option, so I went with the local setup, but it performed extremely poorly. On my limited hardware, each voice command took 6-8 seconds to process. Transcriptions stayed slow even after switching to a small-int8 faster-whisper model, and the smaller model made them inaccurate as well.

Thus, I decided to use an online API for the transcription instead, which meant building a custom speech-to-text engine.

Part 1: Receiving speech from Home Assistant

We will use a Wyoming server to receive audio data from Home Assistant and process it.

Setting up the boilerplate

After installing the wyoming Python package, we can begin with a simple event handler: a subclass of AsyncEventHandler that overrides the handle_event method. The handler doesn’t do anything with the events yet - it just returns False to ignore them.

from wyoming.event import Event
from wyoming.server import AsyncEventHandler

class MyEventHandler(AsyncEventHandler):
    async def handle_event(self, event: Event) -> bool:
        return False

With the event handler implemented, we create a server to receive incoming events.

import asyncio

from wyoming.server import AsyncTcpServer

async def main():
    server = AsyncTcpServer(host="0.0.0.0", port=10300)
    print(f"Starting Wyoming server with {server.host=} {server.port=}")
    await server.run(MyEventHandler)

if __name__ == "__main__":
    asyncio.run(main())

Handling incoming events

Now we handle the Describe event. This event lets Home Assistant connect to the server and identify it as a valid speech-to-text provider.

from wyoming.event import Event
from wyoming.info import AsrProgram, Attribution, Describe, Info
from wyoming.server import AsyncEventHandler, AsyncTcpServer

...  # rest of the file here

class MyEventHandler(AsyncEventHandler):
    async def handle_event(self, event: Event) -> bool:
        if Describe.is_type(event.type):
            await self._send_info()
            return True
        return False

    async def _send_info(self):
        info = Info()  # TODO: the actual application will require proper values
        await self.write_event(info.event())

Next, we handle other speech-to-text events outlined in the Wyoming documentation: transcribe, audio-start, audio-chunk, and audio-stop.

The final event handler class records the incoming audio chunks, saves them as a WAV file, and uses the OpenAI API to transcribe the file.

Part 2: Calling the OpenAI API

We write the WAV file to a BytesIO object instead of saving it to disk, to simplify file handling by keeping the data in memory. We pass the BytesIO object to the OpenAI SDK like this:

from io import BytesIO

from openai import OpenAI

def transcribe_audio(file: BytesIO) -> str:
    client = OpenAI(api_key="sk-proj-foo_bar")
    return client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=file,
        language="en",
        response_format="text",
    )

However, we get the following error:

openai.BadRequestError: Error code: 400 - {'error': {'message': 'Unsupported file format', 'type': 'invalid_request_error', 'param': 'file', 'code': 'unsupported_value'}}

This error occurs because the OpenAI SDK infers the file type from the file name, but a BytesIO object has no name, so there is no extension (e.g. .wav) for the SDK to read.

To fix this, we add a dummy file name when sending the file, like this:

from io import BytesIO

from openai import OpenAI

def transcribe_audio(file: BytesIO) -> str:
    client = OpenAI(api_key="sk-proj-foo_bar")
    return client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=("dummy.wav", file),  # <<<
        language="en",
        response_format="text",
    )

Summary

I now have a Wyoming server that transcribes voice commands using the OpenAI API. The costs are fairly low - I’ve made 40 requests in a couple of weeks totaling only $0.01.

Visit the add-on repository to set it up on your own Home Assistant instance, or check out the code itself on my GitHub repo!

#homeassistant #iot