Start

In this post I will show you how to build your own voice assistant that works fully offline on your Windows or Linux operating system. The assistant answers your questions using a previously downloaded AI model such as GPT4All Falcon.

So let’s start building!

Install GPT4All

First of all, we need to download GPT4All. You can find downloads for Windows, macOS and Linux through the following link.

https://gpt4all.io/index.html

Download and install the software. As I am using Windows for this, I will simply double-click the executable file (.exe) and follow the installation instructions.

After the installation has completed, we can opt in to sharing our chats and/or usage data to help improve the software. Think about how you plan to use the assistant and then decide whether you want to opt in or not. It’s your choice.

I decided to opt in for this example. The next step is to choose our model. In this tutorial I will use the GPT4All Falcon model, as it is reasonably well trained and free to use, without the need to register for an API key.

So the next step is to download this model through the GPT4All desktop application. (At the time of writing, this model requires around 4 GB of disk space and 8 GB of RAM.)

Install Visual Studio (incl. Python3 & Libraries)

To build the voice assistant and combine speech recognition and text-to-speech with GPT4All, we need to write a short Python 3 script that makes use of some great, already available libraries.

As I am working on Windows today, I will install Visual Studio together with Python 3. This is just my favourite setup, but you can use whatever you like.

https://visualstudio.microsoft.com/de/downloads/

Now that Python 3 and Visual Studio are installed on our computer, we create a new Python application project. After opening the project in Visual Studio, we open the Python environment and install the gpt4all library through pip.

In the same way, we install the openai-whisper, SpeechRecognition, PyAudio, soundfile, ffmpeg, pyttsx3 and playsound libraries.

If you are using a different IDE and want to install these libraries directly through the Python 3 terminal and pip, you can issue the following commands.

python3 -m pip install gpt4all
python3 -m pip install openai-whisper
python3 -m pip install SpeechRecognition
python3 -m pip install playsound
python3 -m pip install PyAudio
python3 -m pip install soundfile
python3 -m pip install ffmpeg
python3 -m pip install pyttsx3

A lot of libraries have now been installed, which means it is time to write the code that uses them. If you use Visual Studio, like I do, you should already have a Python file where you can write your code. If you use a different IDE, create a main.py file in the root directory of your project.
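If you want to make sure everything was installed correctly before writing the assistant, you could run a small, optional sanity check like the following sketch (the list of module names is my own assumption of how the packages import; the ffmpeg executable itself is installed separately later in this post):

# optional sanity check: try to import every library we just installed
import importlib

for name in ["gpt4all", "whisper", "speech_recognition",
             "playsound", "pyaudio", "soundfile", "pyttsx3"]:
    try:
        importlib.import_module(name)
        print(name, "OK")
    except ImportError as error:
        print(name, "MISSING:", error)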

The Coding

In our main.py code file (or the one Visual Studio created by default), we first import all the libraries we just added to our project.

import speech_recognition as sr    # microphone access and background listening
from playsound import playsound    # optional: play back audio files
from gpt4all import GPT4All        # local LLM
import whisper                     # offline speech-to-text
import warnings
import time
import os
import pyttsx3                     # offline text-to-speech

As the next step we define the word we want to use to wake up our assistant, so that it starts listening to our question, and we load the GPT4All model which we downloaded to our computer through the GPT4All desktop application.

wake_word = 'michael'
model = GPT4All("C:/Users/<YourUsername>/AppData/Local/nomic.ai/GPT4All/gpt4all-falcon-newbpe-q4_0.gguf", allow_download=False)

Make sure that the file path to the model on your computer is correct by looking it up in Windows Explorer.
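If you want to verify the model path before wiring up the audio parts, a minimal standalone test (assuming the same placeholder path as above) could look like this:

# minimal test: load the local model and generate a short answer
from gpt4all import GPT4All

model = GPT4All("C:/Users/<YourUsername>/AppData/Local/nomic.ai/GPT4All/gpt4all-falcon-newbpe-q4_0.gguf",
                allow_download=False)
print(model.generate("Hello, who are you?", max_tokens=50))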

After this we initialize our voice recognizer and our microphone, as well as the Whisper models we need to transcribe what we say.

This is necessary so that we can send our spoken question in text form to our GPT4All model.

r = sr.Recognizer()
source = sr.Microphone()
# local Whisper models: tiny for wake word detection, base for the actual prompts
tiny_model_path = os.path.expanduser("C:/Users/<YourUsername>/.cache/whisper/tiny.pt")
base_model_path = os.path.expanduser("C:/Users/<YourUsername>/.cache/whisper/base.pt")
tiny_model = whisper.load_model(tiny_model_path)
base_model = whisper.load_model(base_model_path)
listening_for_wake_word = True
warnings.filterwarnings("ignore", category=UserWarning, module='whisper.transcribe', lineno=115)

The paths we specified for our Whisper models do not exist yet. We will need to create these files later, but don’t worry, I will show you how this works.

I added two further lines of code in the snippet above. One is the “listening_for_wake_word” variable, a global flag that tells us whether we are currently listening for the defined wake word or for a question we should react to.

You might also wonder about the warnings.filterwarnings call I added. Once the code is finished, it keeps the output of our program clean: Whisper’s transcribe function warns when FP16 is not available and it falls back to FP32. This warning has no impact on the functionality of our assistant, so we can safely filter it out.

As a next step, we will need to define a lot of functions in our code to make the assistant work.

Speak function

def speak(text):
    engine = pyttsx3.init()
    # pick a British English voice if one is installed
    for voice in engine.getProperty('voices'):
        if "English (Great Britain)" in str(voice.name):
            engine.setProperty('voice', voice.id)
    engine.say(text)
    engine.runAndWait()

This function will be used to let our computer talk to us. Whenever we have English text, we can send it to this function and the computer will read it out loud for us.

For this to work, make sure that your machine has an “English (Great Britain)” TTS voice installed. If you are using a different language, you will need to change this string.
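If you are not sure which voices are available on your system, you can list them with pyttsx3 first and adjust the string in the speak function accordingly (a small helper script, not part of the assistant itself):

# list all TTS voices pyttsx3 can find on this machine
import pyttsx3

engine = pyttsx3.init()
for voice in engine.getProperty('voices'):
    print(voice.id, '-', voice.name)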

Listen for wake word function

def listen_for_wake_word(audio):
    global listening_for_wake_word
    with open("wake_detect.wav", "wb") as f:
        f.write(audio.get_wav_data())
    result = tiny_model.transcribe("wake_detect.wav")
    text_input = result['text']
    if wake_word in text_input.lower().strip():
        print("Awaiting your instruction.")
        speak('Awaiting your instruction.')
        listening_for_wake_word = False

This function is used to detect our wake word “michael”, which we defined in the ‘wake_word’ variable earlier.

First, we declare ‘listening_for_wake_word’ as global, so that Python knows we want to change its value for the whole program. Second, we open the wake_detect.wav file and write the recorded audio data to it. Third, we use the openai-whisper tiny model to transcribe the sentence we spoke, now stored in the .wav file, into text, so that we can react to what was said. From the result we extract the text into the ‘text_input’ variable and then check whether it contains our wake word (“michael”).

If the wake word is detected in the spoken sentence, we print a message to the console and say it out loud, so that we know the ‘listening_for_wake_word’ variable is now set to False and we can ask our question in the next step.

Act on instructions function

def actOnInstructions(userInput, model):
    if "shut down yourself" in userInput.lower().strip():
        quit()
    else:
        output = model.generate(userInput, max_tokens=200)
        print('Michael: ', output)        
        speak(output)

This function is highly customizable to your needs. I added two examples of what we can do with our assistant: if we tell it to “shut down yourself”, the assistant exits and the program closes. In every other case we pass the transcribed user input to our GPT4All Falcon model, print the model’s output to the console and have it read out loud, so that we do not need to watch the console.
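As an example of how you could extend this function, here is a hypothetical extra command that answers with the current time before falling back to the model (the phrase “what time is it” is my own choice and not part of the original script):

from datetime import datetime

def actOnInstructions(userInput, model):
    cleaned = userInput.lower().strip()
    if "shut down yourself" in cleaned:
        quit()
    elif "what time is it" in cleaned:
        # answer locally, without asking the model
        now = datetime.now().strftime("%H:%M")
        print('Michael: It is ' + now)
        speak('It is ' + now)
    else:
        output = model.generate(userInput, max_tokens=200)
        print('Michael: ', output)
        speak(output)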

Prompt gpt function

def prompt_gpt(audio):
    global listening_for_wake_word
    try:
        with open("prompt.wav", "wb") as f:
            f.write(audio.get_wav_data())
        result = base_model.transcribe('prompt.wav')
        prompt_text = result['text']
        if len(prompt_text.strip()) == 0:
            print("Did not understand anything. Please ask again.")
            speak("Sorry, I did not understand what you said!")
            listening_for_wake_word = True
        else:
            print('User: ' + prompt_text)
            actOnInstructions(prompt_text, model)
            print('\nSay', wake_word, 'to wake me up. \n')
            listening_for_wake_word = True
    except Exception as e:
        print("Prompt error: ", e)

This is more or less a wrapper function that handles possible exceptions, as well as the case where the transcribe function of our Whisper base model produces no output. If everything can be processed, the function hands the transcribed text of what the user said, together with the GPT4All Falcon model, to the actOnInstructions function, where the actual action is taken.

Callback function

def callback(recognizer, audio):
    global listening_for_wake_word
    if listening_for_wake_word:
        listen_for_wake_word(audio)
    else:
        prompt_gpt(audio)

This function is one of our main pieces, even though it is very short. It is the callback that runs whenever the recognizer has captured something the user said. Here we decide whether to check the audio against the wake word or to act on what was said. For this, the function uses the global ‘listening_for_wake_word’ variable: depending on its value we go into the previously described listen_for_wake_word function or into the prompt_gpt function.

Start listening function

def start_listening():
    with source as s:
        r.adjust_for_ambient_noise(s, duration=2)
    print('\nSay', wake_word, 'to wake me up. \n')
    stop_listening = r.listen_in_background(source, callback)
    while True:
        time.sleep(1)

This is our main loop function, which lets our assistant run and listen forever. The adjust_for_ambient_noise function runs for 2 seconds before we start listening to the user, to reduce false detections caused by microphone or background noise.
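If you would also like to stop the assistant with Ctrl+C instead of only via the “shut down yourself” voice command, a slightly extended variant of the loop could look like this (just a sketch; listen_in_background returns a stopper function we can call on the way out):

def start_listening():
    with source as s:
        r.adjust_for_ambient_noise(s, duration=2)
    print('\nSay', wake_word, 'to wake me up. \n')
    stop_listening = r.listen_in_background(source, callback)
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        # stop the background listener before exiting
        stop_listening(wait_for_stop=False)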

After that, the only piece of code missing is our main entry point, which starts listening to the user when the program is executed.

This will be done by the following code.

if __name__ == '__main__':
    start_listening()

Now the Python 3 script is ready, but before we run it, we need to download our Whisper models, make some adjustments to the whisper library so that everything runs offline, and adjust our Windows system so that all the executables we need to process the audio files are available.

Download whisper files and adjust the library to run offline

As you know, we specified paths in our script for the tiny.pt and base.pt models.

Let’s create the folder and download our models.

Go to C:/Users/<YourUsername>
Create a folder called ".cache"
Go into the .cache folder
Create a folder called "whisper"
Go into the whisper folder
Execute the following PowerShell commands in this folder:
Invoke-WebRequest -Uri "https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt" -OutFile "base.pt"
Invoke-WebRequest -Uri "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt" -OutFile "tiny.pt"
Invoke-WebRequest -Uri "https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe" -OutFile "vocab.bpe"
Invoke-WebRequest -Uri "https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json" -OutFile "encoder.json"

If you want to download the medium or large models instead of tiny and base, you can find the download links in the following GitHub repo: https://github.com/openai/whisper/blob/main/whisper/__init__.py
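Before continuing, you can optionally check that the downloaded and renamed files load correctly (assuming the placeholder paths used throughout this post):

# quick check that the downloaded .pt files can be loaded by whisper
import os
import whisper

tiny_model = whisper.load_model(os.path.expanduser("C:/Users/<YourUsername>/.cache/whisper/tiny.pt"))
base_model = whisper.load_model(os.path.expanduser("C:/Users/<YourUsername>/.cache/whisper/base.pt"))
print("Whisper models loaded successfully.")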

Now search for the location of the openai_public.py file and change the file paths in it so that they point to the recently downloaded vocab.bpe and encoder.json files.

On my Windows system, I found this file in the following location.

C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\Lib\site-packages\tiktoken_ext\openai_public.py

Open the file and edit the gpt2 function.

def gpt2():
    mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
        vocab_bpe_file=os.path.expanduser("C:\\Users\\<YourUsername>\\.cache\\whisper\\vocab.bpe"),
        encoder_json_file=os.path.expanduser("C:\\Users\\<YourUsername>\\.cache\\whisper\\encoder.json"),
    )
    return {
        "name": "gpt2",
        "explicit_n_vocab": 50257,
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {ENDOFTEXT: 50256},
    }

If you would like more information about the available models and languages, please visit the whisper GitHub readme page. (https://github.com/openai/whisper?tab=readme-ov-file)
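To confirm that the edit works, you could load the gpt2 encoding through tiktoken, since tiktoken.get_encoding("gpt2") calls the very gpt2() function we just modified (an optional check, not required for the assistant):

# optional check: this should succeed without downloading anything
# if the paths in openai_public.py point to the local files
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.encode("offline tokenizer works"))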

Install ffmpeg and add it to windows path

To be able to transcribe the wav files, we need the ffmpeg executable to be available on our Windows system.

For this you can download an archive from ffmpeg.org - there are different download options.

I specifically downloaded the build from gyan.dev, which worked pretty well. (https://www.gyan.dev/ffmpeg/builds/ffmpeg-git-full.7z)

Unzip the archive, rename the extracted folder and copy it into our “C:\Users\<YourUsername>\.cache” directory. Now add the path to the ffmpeg executable to your system’s PATH environment variable. You will usually find the ffmpeg executable in the bin directory of the extracted ffmpeg folder.

Now when you open a terminal (command prompt), you should be able to type ffmpeg and get the tool’s help output. (If your cmd was open before the environment variable was set, close and reopen it.)
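You can also check from Python whether the executable is reachable, which is what whisper relies on when transcribing (a tiny optional helper):

# check that the ffmpeg executable can be found via the PATH
import shutil

path = shutil.which("ffmpeg")
if path:
    print("ffmpeg found at:", path)
else:
    print("ffmpeg not found - check your PATH environment variable.")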

Now you should be able to run the voice assistant and get your answers from the GPT4All Falcon model.