
Implementing advanced speech-to-text in your SwiftUI app
Learn how to integrate real-time voice transcription into your application using SpeechAnalyzer.
Apple has recently introduced speech-to-text functionality across many of its apps, including Notes and Voice Memos, reflecting a broader shift toward voice as a primary input method. In line with this, Apple has released a new API called SpeechAnalyzer, which leverages a faster, more efficient model specifically fine-tuned for processing longer audio recordings and handling speech from distant speakers.
By the end of this tutorial, you will understand how to capture audio buffers from the microphone and make them available to the new SpeechAnalyzer class to process and convert into text.
Step 1 - Getting the audio from the microphone
In this step, we create an AudioManager class that manages microphone access and real-time audio streaming using AVFoundation. This class handles audio session configuration and microphone permission requests, and provides methods to start and stop capturing audio buffers from the device’s microphone.
import Foundation
import AVFoundation
class AudioManager {
// 1.
private let audioEngine = AVAudioEngine()
private var audioTapInstalled = false
// 2.
func setupAudioSession() throws {
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
}
// 3.
func requestMicrophonePermission() async -> Bool {
await withCheckedContinuation { continuation in
AVAudioApplication.requestRecordPermission { granted in
continuation.resume(returning: granted)
}
}
}
// 4.
func startAudioStream(onBuffer: @escaping (AVAudioPCMBuffer) -> Void) throws {
guard !audioTapInstalled else { return }
audioEngine.inputNode.installTap(
onBus: 0,
bufferSize: 4096,
format: audioEngine.inputNode.outputFormat(forBus: 0)
) { buffer, _ in
onBuffer(buffer)
}
audioEngine.prepare()
try audioEngine.start()
audioTapInstalled = true
}
// 5.
func stopAudioStream() {
guard audioTapInstalled else { return }
audioEngine.stop()
audioEngine.inputNode.removeTap(onBus: 0)
audioTapInstalled = false
}
}
1. The class initializes an AVAudioEngine instance, which is the core component for audio processing. It also tracks whether an audio tap (used for capturing audio data) is currently installed with the audioTapInstalled property.
2. The setupAudioSession() method configures the shared audio session for recording in .measurement mode, which optimizes the input for accurate capture, such as speech recognition. Additionally, the .duckOthers option lowers the volume of other audio sources during recording.
3. The requestMicrophonePermission() method asynchronously requests permission from the user to access the microphone.
4. The startAudioStream(onBuffer:) method installs a tap on the audio engine’s input node, capturing audio buffers in real time. The provided onBuffer closure is called each time a new buffer of audio data is available. After that, we call prepare() to get the audioEngine ready and then start() to begin capturing audio.
5. The stopAudioStream() method stops the audio engine and removes the tap from the input node, cleaning up resources and stopping audio capture. You can see a brief usage sketch right after this list.
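The sketch is illustrative only and not part of the final app: it simply logs the size of each captured buffer and stops after an arbitrary two-second delay, assuming it is called from an async context.
import AVFoundation

// Illustrative sketch: exercising AudioManager on its own.
func recordBriefly(with audioManager: AudioManager) async {
    // Ask for microphone access before touching the audio engine.
    guard await audioManager.requestMicrophonePermission() else {
        print("Microphone permission denied")
        return
    }
    do {
        // Configure the shared audio session for recording.
        try audioManager.setupAudioSession()
        // Start streaming buffers; here we only log their size.
        try audioManager.startAudioStream { buffer in
            print("Received \(buffer.frameLength) frames")
        }
        // Stop after an arbitrary two seconds, just for the example.
        try? await Task.sleep(for: .seconds(2))
        audioManager.stopAudioStream()
    } catch {
        print("Audio setup failed: \(error)")
    }
}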
Step 2 - Buffer converter
In this step, we’ll introduce a new class named BufferConverter, which will handle converting audio buffers (AVAudioPCMBuffer) from one audio format to another using the AVFoundation framework.
import Foundation
import Speech
import AVFoundation
class BufferConverter {
// 1.
enum Error: Swift.Error {
case failedToCreateConverter
case failedToCreateConversionBuffer
case conversionFailed(NSError?)
}
// 2.
private var converter: AVAudioConverter?
// 3.
func convertBuffer(_ buffer: AVAudioPCMBuffer, to format: AVAudioFormat) throws -> AVAudioPCMBuffer {
let inputFormat = buffer.format
guard inputFormat != format else {
return buffer
}
if converter == nil || converter?.outputFormat != format {
converter = AVAudioConverter(from: inputFormat, to: format)
converter?.primeMethod = .none
}
guard let converter = converter else {
throw Error.failedToCreateConverter
}
let sampleRateRatio = converter.outputFormat.sampleRate / converter.inputFormat.sampleRate
let scaledInputFrameLength = Double(buffer.frameLength) * sampleRateRatio
let frameCapacity = AVAudioFrameCount(scaledInputFrameLength.rounded(.up))
guard let conversionBuffer = AVAudioPCMBuffer(pcmFormat: converter.outputFormat, frameCapacity: frameCapacity) else {
throw Error.failedToCreateConversionBuffer
}
var nsError: NSError?
var bufferProcessed = false
let status = converter.convert(to: conversionBuffer, error: &nsError) { packetCount, inputStatusPointer in
defer { bufferProcessed = true }
inputStatusPointer.pointee = bufferProcessed ? .noDataNow : .haveData
return bufferProcessed ? nil : buffer
}
guard status != .error else {
throw Error.conversionFailed(nsError)
}
return conversionBuffer
}
}
1. The Error enum defines the possible errors that might occur during the conversion process, such as failing to create the converter or the output buffer, or a generic conversion failure.
2. Define an AVAudioConverter property, which will be used to perform the actual audio format conversion.
3. Define the convertBuffer(_:to:) method to convert an audio buffer to a different format using AVAudioConverter. It first checks whether conversion is necessary by comparing the input and target formats; if they match, it returns the buffer unchanged. Otherwise, it creates or reuses a converter and calculates the appropriate size for the output buffer based on the sample rate ratio. Finally, it performs the conversion, feeding the input buffer into the converter and returning the newly formatted buffer. If the process fails, it throws an error. A quick standalone example follows this list.
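The sketch below converts a captured buffer to a 16 kHz mono Float32 format. That target format is an arbitrary choice for illustration; in the actual app the target format comes from SpeechAnalyzer, as shown in Step 3.
import AVFoundation

// Illustrative only: the 16 kHz mono Float32 target is an arbitrary choice for this example.
func downsample(_ buffer: AVAudioPCMBuffer, using converter: BufferConverter) -> AVAudioPCMBuffer? {
    guard let targetFormat = AVAudioFormat(
        commonFormat: .pcmFormatFloat32,
        sampleRate: 16_000,
        channels: 1,
        interleaved: false
    ) else { return nil }
    // Reuse the converter defined above; errors are flattened to nil for brevity.
    return try? converter.convertBuffer(buffer, to: targetFormat)
}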
Step 3 - Transcription Manager
In this step, we’ll define the TranscriptionManager class, which brings together all the components needed for real-time speech transcription using Apple’s Speech framework.
import Foundation
import Speech
class TranscriptionManager {
// 1.
private var inputBuilder: AsyncStream<AnalyzerInput>.Continuation?
private var transcriber: SpeechTranscriber?
private var analyzer: SpeechAnalyzer?
private var recognizerTask: Task<(), Error>?
private var analyzerFormat: AVAudioFormat?
private var converter = BufferConverter()
// 2.
func requestSpeechPermission() async -> Bool {
let status = await withCheckedContinuation { continuation in
SFSpeechRecognizer.requestAuthorization { status in
continuation.resume(returning: status)
}
}
return status == .authorized
}
// 3.
func startTranscription(onResult: @escaping (String, Bool) -> Void) async throws {
transcriber = SpeechTranscriber(
locale: Locale.current,
transcriptionOptions: [],
reportingOptions: [.volatileResults],
attributeOptions: []
)
analyzer = SpeechAnalyzer(modules: [transcriber!])
analyzerFormat = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber!])
let (inputSequence, inputBuilder) = AsyncStream<AnalyzerInput>.makeStream()
self.inputBuilder = inputBuilder
recognizerTask = Task {
for try await result in transcriber!.results {
let text = String(result.text.characters)
onResult(text, result.isFinal)
}
}
try await analyzer?.start(inputSequence: inputSequence)
}
// 4.
func processAudioBuffer(_ buffer: AVAudioPCMBuffer) throws {
guard let inputBuilder, let analyzerFormat else { return }
let converted = try converter.convertBuffer(buffer, to: analyzerFormat)
inputBuilder.yield(AnalyzerInput(buffer: converted))
}
// 5.
func stopTranscription() async {
inputBuilder?.finish()
try? await analyzer?.finalizeAndFinishThroughEndOfInput()
recognizerTask?.cancel()
recognizerTask = nil
}
}
1. Defines the key properties: an async stream input continuation for feeding audio data, a SpeechTranscriber for generating text from audio, a SpeechAnalyzer for processing the input, and a BufferConverter to ensure audio format compatibility.
2. Define the requestSpeechPermission() method, which asynchronously requests user authorization for speech recognition and returns true if permission is granted.
3. The startTranscription(onResult:) method sets up the transcriber and analyzer, prepares the audio format, and starts listening for transcription results. Each result is passed to the provided callback as soon as it’s available, together with a flag indicating whether it is final.
4. The processAudioBuffer(_:) method converts incoming audio buffers to the analyzer’s required format and feeds them into the transcription pipeline.
5. The stopTranscription() method ends the transcription session by finishing the input stream, finalizing the analyzer, and cancelling the recognition task. The sketch after this list shows how the class pairs with the AudioManager from Step 1.
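It is purely illustrative: printing stands in for the UI updates that the view model in Step 5 will perform with the same wiring.
import AVFoundation

// Illustrative wiring of AudioManager and TranscriptionManager; printing stands in for UI updates.
func runLiveTranscription(audio: AudioManager, transcription: TranscriptionManager) async throws {
    // Both permissions are required before any audio reaches the analyzer.
    guard await transcription.requestSpeechPermission(),
          await audio.requestMicrophonePermission() else { return }

    try audio.setupAudioSession()

    // Print volatile results with an ellipsis and final results with a check mark.
    try await transcription.startTranscription { text, isFinal in
        print(isFinal ? "✅ \(text)" : "… \(text)")
    }

    // Forward every captured buffer into the transcription pipeline.
    try audio.startAudioStream { buffer in
        try? transcription.processAudioBuffer(buffer)
    }
}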
Step 4 - Model for transcription
In this step, we need to define a simple data model, TranscriptionModel, which is used to manage and display the state of a live speech transcription session.
import Foundation
struct TranscriptionModel {
var finalizedText: String = ""
var currentText: String = ""
var isRecording: Bool = false
var displayText: String {
return finalizedText + currentText
}
}
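To make the role of displayText concrete, here is a tiny example with hypothetical values: the finalized text stays in place while the volatile portion keeps changing until it is finalized.
// Hypothetical values, only to show how displayText merges the two pieces.
var model = TranscriptionModel()
model.finalizedText = "Hello world. "
model.currentText = "this sentence is still being recognized"
print(model.displayText) // "Hello world. this sentence is still being recognized"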
Step 5 - Define the ViewModel
In this step, we’ll define the SpeechToTextViewModel, which combines the classes we defined earlier: it handles permission requests, starts and stops recording sessions, and updates the transcript in real time as speech is recognized.
import Foundation
import Speech
import AVFoundation
@MainActor
@Observable
class SpeechToTextViewModel {
// 1.
private(set) var model = TranscriptionModel()
private(set) var errorMessage: String?
private let audioManager = AudioManager()
private let transcriptionManager = TranscriptionManager()
// 2.
private func requestPermissions() async -> Bool {
let speechPermission = await transcriptionManager.requestSpeechPermission()
let micPermission = await audioManager.requestMicrophonePermission()
return speechPermission && micPermission
}
// 3.
func toggleRecording() {
if model.isRecording {
Task { await stopRecording() }
} else {
Task { await startRecording() }
}
}
// 4.
func clearTranscript() {
model.finalizedText = ""
model.currentText = ""
errorMessage = nil
}
// 5.
private func startRecording() async {
guard await requestPermissions() else {
errorMessage = "Permissions not granted"
return
}
do {
try audioManager.setupAudioSession()
try await transcriptionManager.startTranscription { [weak self] text, isFinal in
Task { @MainActor in
guard let self = self else { return }
if isFinal {
self.model.finalizedText += text + " "
self.model.currentText = ""
} else {
self.model.currentText = text
}
}
}
try audioManager.startAudioStream { [weak self] buffer in
try? self?.transcriptionManager.processAudioBuffer(buffer)
}
model.isRecording = true
errorMessage = nil
} catch {
errorMessage = error.localizedDescription
}
}
// 6.
private func stopRecording() async {
audioManager.stopAudioStream()
await transcriptionManager.stopTranscription()
model.isRecording = false
}
}
1. Create an instance of the AudioManager class, the TranscriptionManager class, and the TranscriptionModel struct. Additionally, create a string property to store any potential error message.
2. Define the requestPermissions() method to ensure both speech recognition and microphone permissions are granted before recording begins.
3. Define the toggleRecording() method to start or stop the recording based on the current state.
4. Define the clearTranscript() method to reset both the finalized and current transcription text and clear any error message.
5. Define the startRecording() method to request permissions, set up the audio session, initiate transcription, and stream audio buffers for processing. As speech is recognized, it updates the transcript in real time: final results are appended to finalizedText, while volatile results replace currentText.
6. Define the stopRecording() method to stop both the audio stream and the transcription process, and update the recording state.
Step 6 - View
In this step, we’ll build the primary user interface for our speech-to-text app using SwiftUI. The interface will allow users to start and stop audio recording with a single tap, view live transcription results as they speak, see a visual indicator when recording is active, and clear the transcript when needed.
import SwiftUI
struct ContentView: View {
@State private var viewModel = SpeechToTextViewModel()
var body: some View {
NavigationView {
VStack(spacing: 20) {
Button(action: {
viewModel.toggleRecording()
}) {
VStack {
Image(systemName: viewModel.model.isRecording ? "stop.circle.fill" : "mic.circle.fill")
.font(.system(size: 60))
.foregroundColor(viewModel.model.isRecording ? .red : .blue)
Text(viewModel.model.isRecording ? "Stop Recording" : "Start Recording")
.font(.headline)
.foregroundColor(viewModel.model.isRecording ? .red : .blue)
}
}
.padding()
if viewModel.model.isRecording {
HStack {
Circle()
.fill(Color.red)
.frame(width: 10, height: 10)
.scaleEffect(viewModel.model.isRecording ? 1.0 : 0.5)
.animation(.easeInOut(duration: 0.5).repeatForever(), value: viewModel.model.isRecording)
Text("Recording...")
.font(.caption)
.foregroundColor(.secondary)
}
}
ScrollView {
VStack(alignment: .leading, spacing: 10) {
if !viewModel.model.displayText.isEmpty {
Text(viewModel.model.displayText)
.font(.body)
.padding()
.frame(maxWidth: .infinity, alignment: .leading)
.background(Color.gray.opacity(0.1))
.cornerRadius(10)
} else {
Text("Tap the microphone to start recording...")
.font(.body)
.foregroundColor(.secondary)
.padding()
}
}
}
.frame(maxWidth: .infinity, maxHeight: .infinity)
if let errorMessage = viewModel.errorMessage {
Text(errorMessage)
.font(.caption)
.foregroundColor(.red)
.padding(.horizontal)
}
Spacer()
}
.padding()
.navigationTitle("Speech to Text")
.toolbar {
ToolbarItem(placement: .navigationBarTrailing) {
Button("Clear") {
viewModel.clearTranscript()
}
.disabled(viewModel.model.displayText.isEmpty)
}
}
}
}
}
The view consists of a large button that toggles recording, dynamically updating its icon and label based on whether recording is active. When recording, a pulsating red indicator and "Recording..." text appear to provide visual feedback. Below, a scrollable text area displays the transcribed speech or a prompt to start recording. If an error occurs, it’s shown in red at the bottom.
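If you are starting from a fresh project, the app entry point is assumed to look like the sketch below; the SpeechToTextApp name is just a placeholder for your own target.
import SwiftUI

// Assumed entry point: the struct name is a placeholder for your project's own.
@main
struct SpeechToTextApp: App {
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}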
Step 7 - Privacy Authorization
To ensure that the app can access the microphone and start the transcription process, we need to declare the privacy usage descriptions in our Xcode project. We can do that by selecting the target, opening the Info tab (or editing the Info.plist file directly), and adding two keys, each with a short message explaining why the app needs the access:
- Privacy - Microphone Usage Description (NSMicrophoneUsageDescription)
- Privacy - Speech Recognition Usage Description (NSSpeechRecognitionUsageDescription)
Conclusion
In this project, we implemented the new SpeechAnalyzer API in a SwiftUI app, utilizing all the new capabilities included in the transcription model developed by Apple. This is the final result:
You can download the complete project at this link: