
Synthesizing text into speech
Learn how to turn a text input into spoken audio using AVFoundation in SwiftUI.
The AVFoundation framework provides a set of tools for handling media features in an application. One of its use cases is converting text into spoken audio, known as text-to-speech.
Text-to-speech is made possible by the combination of two classes: an AVSpeechUtterance object contains the text to be spoken and the settings that define how it should be spoken, and an AVSpeechSynthesizer object then uses that utterance to produce the spoken audio.
Here is a simple example:
import SwiftUI
import AVFoundation

struct ContentView: View {
    let utterance = AVSpeechUtterance(string: "Create with Swift")
    let synthesizer = AVSpeechSynthesizer()

    var body: some View {
        VStack {
            Button {
                utterance.voice = AVSpeechSynthesisVoice(language: "en-GB")
                synthesizer.speak(utterance)
            } label: {
                HStack {
                    Image(systemName: "microphone.fill")
                        .imageScale(.large)
                        .foregroundStyle(.tint)
                    Text("Synthesize")
                }
            }
            Divider()
        }
    }
}
The first step is creating an utterance object that stores the text to be turned into speech. Then assign an AVSpeechSynthesisVoice object, initialized with the language the text should be spoken in, to the voice property of the utterance. Finally, call the speak(_:) method of an AVSpeechSynthesizer object, passing the configured utterance as a parameter.
Configure the utterance
You can configure the utterance object to ensure the text is spoken in a manner that fits your application and user needs.
utterance.rate = 0.8
utterance.pitchMultiplier = 0.8
utterance.postUtteranceDelay = 0.2
utterance.volume = 0.8
rate: controls the speed of the speech. The value ranges from 0.0 to 1.0; values below the default of 0.5 slow the speech down, while higher values speed it up.
pitchMultiplier: changes the pitch of the speech. Values less than 1.0 make the voice deeper, while values greater than 1.0 make it higher.
postUtteranceDelay: the time, in seconds, the synthesizer waits after finishing this utterance before speaking the next one, creating a pause between consecutive utterances.
volume: controls the volume of the speech, ranging from silence (0.0) to maximum volume (1.0).
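Rather than hard-coding raw numbers for the rate, AVFoundation exposes named constants that bound its range, which makes the intent of a value clearer. A short sketch:

import AVFoundation

let utterance = AVSpeechUtterance(string: "Create with Swift")

// The rate is bounded by named constants:
// AVSpeechUtteranceMinimumSpeechRate (0.0), AVSpeechUtteranceDefaultSpeechRate (0.5),
// and AVSpeechUtteranceMaximumSpeechRate (1.0).
utterance.rate = AVSpeechUtteranceDefaultSpeechRate   // normal speaking speed
utterance.pitchMultiplier = 0.8
utterance.postUtteranceDelay = 0.2
utterance.volume = 0.8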
The following code is a simple example of a view where the user types in a TextField
, and the text is converted to speech.
import SwiftUI
import AVFoundation

struct ContentView: View {
    @State private var text: String = ""
    let synthesizer = AVSpeechSynthesizer()

    var body: some View {
        VStack {
            TextField("Type here", text: $text)
                .multilineTextAlignment(.center)
            Divider()
            Button {
                // Configure the utterance fully before handing it to the synthesizer;
                // an utterance should not be modified after speak(_:) is called.
                let utterance = AVSpeechUtterance(string: text)
                utterance.voice = AVSpeechSynthesisVoice(language: "en-GB")
                utterance.rate = 0.8
                utterance.pitchMultiplier = 0.8
                utterance.postUtteranceDelay = 0.2
                utterance.volume = 0.8
                synthesizer.speak(utterance)
            } label: {
                HStack {
                    Image(systemName: "microphone.fill")
                        .imageScale(.large)
                        .foregroundStyle(.tint)
                    Text("Synthesize")
                }
            }
        }
    }
}