Synthesizing text into speech

Learn how to turn a text input into spoken audio using AVFoundation in SwiftUI.

The AVFoundation framework provides a wide range of tools for working with audiovisual media in an application. One of its use cases is converting text into spoken audio, known as text-to-speech.

What makes it possible is the combination of two classes:

The AVSpeechUtterance object contains the text to be spoken and all the settings that define how it should be spoken. An AVSpeechSynthesizer object then uses it to produce the spoken audio.

Here is a simple example:

import SwiftUI
import AVFoundation

struct ContentView: View {

    // The utterance holds the text to speak and how to speak it;
    // the synthesizer turns it into audio.
    let utterance = AVSpeechUtterance(string: "Create with Swift")
    let synthesizer = AVSpeechSynthesizer()

    var body: some View {
        VStack {
            Button {
                // Choose a voice for the utterance, then speak it.
                utterance.voice = AVSpeechSynthesisVoice(language: "en-GB")
                synthesizer.speak(utterance)
            } label: {
                HStack {
                    Image(systemName: "microphone.fill")
                        .imageScale(.large)
                        .foregroundStyle(.tint)
                    Text("Synthesize")
                }
            }
        }
    }
}

The first step is creating an utterance object that stores the text to be turned into speech. Then, choose the language the text should be spoken in by creating an AVSpeechSynthesisVoice object and assigning it to the voice property of the utterance.

Finally, call the speak(_:) method of an AVSpeechSynthesizer object, passing the configured utterance as a parameter.
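
To see which languages and voices are available on the device, you can query AVSpeechSynthesisVoice.speechVoices(), which returns every installed voice. A minimal sketch that prints them:

import AVFoundation

// Print every voice installed on the device together with its
// language code, so you know which values the voice property accepts.
for voice in AVSpeechSynthesisVoice.speechVoices() {
    print("\(voice.name) (\(voice.language))")
}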

Configuring the utterance

You can configure the utterance object to ensure the text is spoken in a manner that fits your application and user needs.

utterance.rate = 0.8
utterance.pitchMultiplier = 0.8
utterance.postUtteranceDelay = 0.2
utterance.volume = 0.8
  • rate: controls the speed of the speech. It ranges from 0.0 to 1.0, and the default, AVSpeechUtteranceDefaultSpeechRate, is 0.5. Values below the default slow the speech down, while higher values speed it up.
  • pitchMultiplier: changes the pitch of the speech. Values less than 1.0 make the voice deeper, while values greater than 1.0 make it higher.
  • postUtteranceDelay: the time, in seconds, the synthesizer waits after speaking this utterance before it starts the next one, producing a pause between consecutive speeches.
  • volume: controls the volume of the speech, ranging from silence (0.0) to maximum volume (1.0).
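
Putting these settings together, you can build a fully configured utterance before handing it to the synthesizer. A minimal sketch, where makeUtterance(for:) is a hypothetical helper name:

import AVFoundation

// Hypothetical helper: returns an utterance with voice and delivery
// settings applied, ready to pass to AVSpeechSynthesizer.speak(_:).
func makeUtterance(for text: String) -> AVSpeechUtterance {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-GB")
    utterance.rate = 0.8
    utterance.pitchMultiplier = 0.8
    utterance.postUtteranceDelay = 0.2
    utterance.volume = 0.8
    return utterance
}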

The following code is a simple example of a view where the user types into a TextField, and the text is converted to speech. Note that the utterance is fully configured before speak(_:) is called, since its settings need to be in place when the synthesizer picks the utterance up.

import SwiftUI
import AVFoundation

struct ContentView: View {

    // The text typed by the user.
    @State private var text: String = ""

    let synthesizer = AVSpeechSynthesizer()

    var body: some View {
        VStack {
            TextField("Type here", text: $text)
                .multilineTextAlignment(.center)

            Divider()

            Button {
                // Create and configure the utterance before speaking it.
                let utterance = AVSpeechUtterance(string: text)
                utterance.voice = AVSpeechSynthesisVoice(language: "en-GB")
                utterance.rate = 0.8
                utterance.pitchMultiplier = 0.8
                utterance.postUtteranceDelay = 0.2
                utterance.volume = 0.8

                synthesizer.speak(utterance)
            } label: {
                HStack {
                    Image(systemName: "microphone.fill")
                        .imageScale(.large)
                        .foregroundStyle(.tint)
                    Text("Synthesize")
                }
            }
        }
    }
}