Speech recognition is everywhere these days, yet some languages, such as Shakhizat Nurgaliyev and Askat Kuzdeuov’s native Kazakh, lack public datasets large enough for training keyword spotting models. To close this gap, the duo explored generating synthetic datasets using a neural text-to-speech system called Piper, and then extracting the speech commands from the resulting audio with the Vosk Speech Recognition Toolkit.
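The write-up doesn’t detail the extraction step, but Vosk can report per-word timestamps, which makes it possible to cut individual keywords out of synthesized utterances. The sketch below is a minimal illustration of that idea, assuming Vosk’s word-level JSON output format (the `"result"` list produced when `SetWords(True)` is enabled on a recognizer); the `keyword_slices` helper and its sample data are hypothetical, not the authors’ actual code.

```python
# Illustrative sketch: map Vosk-style word timestamps to audio sample ranges
# so each keyword can be sliced out of a synthesized clip. Vosk reports
# "start"/"end" times in seconds when word-level output is enabled; the
# helper below is an assumption about how such slicing might be done.

SAMPLE_RATE = 16000  # Piper and Vosk commonly work with 16 kHz mono audio


def keyword_slices(words, sample_rate=SAMPLE_RATE, pad_s=0.05):
    """Convert word-level timestamps to (label, start_sample, end_sample).

    `words` follows the shape of the "result" list in Vosk's JSON output,
    e.g. [{"word": "backward", "start": 0.30, "end": 0.85, "conf": 1.0}].
    A small padding keeps onset/offset transients inside each clip.
    """
    slices = []
    for w in words:
        start = max(0.0, w["start"] - pad_s)
        end = w["end"] + pad_s
        # Round to the nearest sample index at the given rate.
        slices.append((w["word"], round(start * sample_rate), round(end * sample_rate)))
    return slices


# Hypothetical recognizer output for a two-keyword utterance:
words = [
    {"word": "backward", "start": 0.30, "end": 0.85, "conf": 1.0},
    {"word": "forward", "start": 1.10, "end": 1.60, "conf": 0.98},
]
for label, s0, s1 in keyword_slices(words):
    print(label, s0, s1)
# backward 4000 14400
# forward 16800 26400
```

Each returned range could then be used to slice the raw waveform array, yielding one labeled clip per keyword for the training set.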
Nurgaliyev and Kuzdeuov’s goal went beyond simply building a model to recognize keywords in audio samples; they also wanted to deploy it onto an embedded target, such as a single-board computer or microcontroller. They ultimately chose the Arduino Nicla Voice development board because, in addition to an nRF52832 SoC, a microphone, and an IMU, it includes a Syntiant NDP120. This specialized Neural Decision Processor greatly speeds up inference thanks to dedicated hardware accelerators while simultaneously reducing power consumption.
With the hardware selected, the team trained their model on a total of 20.25 hours of generated speech data spanning 28 distinct output classes. After 100 training epochs, it achieved an accuracy of 95.5% while consuming only about 540KB of memory on the NDP120, making it quite efficient.
To read more about Nurgaliyev and Kuzdeuov’s project and how they deployed an embedded ML model that was trained solely on generated speech data, check out their write-up here on Hackster.io.
The post Small-footprint keyword spotting for low-resource languages with the Nicla Voice appeared first on Arduino Blog.