Implementing advanced speech-to-text in your SwiftUI app

Learn how to integrate real-time voice transcription into your application using SpeechAnalyzer.

Apple has recently introduced speech-to-text functionality across many of its apps, including Notes and Voice Memos, reflecting a broader shift toward voice as a primary input method. In line with this, Apple has released a new API called SpeechAnalyzer, which leverages a faster, more efficient model specifically fine-tuned for processing longer audio recordings and handling speech from distant speakers.

By the end of this tutorial, you will understand how to capture audio buffers from the microphone and make them available to the new SpeechAnalyzer class, which processes them and converts speech into text.

Step 1 - Getting the audio from the microphone

In this step, we create an AudioManager class that manages microphone access and real-time audio streaming using AVFoundation. This class handles audio session configuration and microphone permission requests, and provides methods to initiate and terminate capturing audio buffers from the device’s microphone.

import Foundation
import AVFoundation

class AudioManager {

    // 1.
    private let audioEngine = AVAudioEngine()
    private var audioTapInstalled = false

    // 2.
    func setupAudioSession() throws {
        let audioSession = AVAudioSession.sharedInstance()
        try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
    }

    // 3.
    func requestMicrophonePermission() async -> Bool {
        await withCheckedContinuation { continuation in
            AVAudioApplication.requestRecordPermission { granted in
                continuation.resume(returning: granted)
            }
        }
    }

    // 4.
    func startAudioStream(onBuffer: @escaping (AVAudioPCMBuffer) -> Void) throws {
        guard !audioTapInstalled else { return }
        
        audioEngine.inputNode.installTap(
            onBus: 0, 
            bufferSize: 4096, 
            format: audioEngine.inputNode.outputFormat(forBus: 0)
        ) { buffer, _ in
            onBuffer(buffer)
        }
        
        audioEngine.prepare()
        try audioEngine.start()
        audioTapInstalled = true
    }

    // 5.
    func stopAudioStream() {
        guard audioTapInstalled else { return }
        
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        audioTapInstalled = false
    }
}
  1. The class initializes an AVAudioEngine instance, which is the core component for audio processing. It also tracks whether an audio tap (used for capturing audio data) is currently installed with the audioTapInstalled property.
  2. The setupAudioSession() method configures the shared audio session for recording in .measurement mode, which optimizes for accurate audio input, such as speech recognition. Additionally, the .duckOthers option lowers the volume of other audio sources during recording.
  3. The requestMicrophonePermission() method asynchronously requests permission from the user to access the microphone.
  4. The startAudioStream(onBuffer:) method installs a tap on the audio engine’s input node, capturing audio buffers in real time. The provided onBuffer closure is called each time a new buffer of audio data is available. After that, we call prepare() to get the audio engine ready and start() to begin capturing audio.
  5. The stopAudioStream() method stops the audio engine and removes the tap from the input node, cleaning up resources and stopping audio capture. A short usage sketch of this class follows the list.
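
To verify that audio is actually flowing before wiring AudioManager into the rest of the app, you can run a quick standalone check. The snippet below is a minimal, hypothetical sketch (the runAudioSmokeTest() function is not part of the final project): it requests microphone permission, configures the session, and prints the frame length of every captured buffer.

import AVFoundation

// Hypothetical smoke test for AudioManager (not part of the final project).
func runAudioSmokeTest() async {
    let audioManager = AudioManager()

    // Ask for microphone access before touching the audio engine.
    guard await audioManager.requestMicrophonePermission() else {
        print("Microphone permission denied")
        return
    }

    do {
        try audioManager.setupAudioSession()
        // Print the size of each captured buffer to verify that audio is flowing.
        try audioManager.startAudioStream { buffer in
            print("Received \(buffer.frameLength) frames")
        }
    } catch {
        print("Audio setup failed: \(error)")
    }
}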

Step 2 - Buffer converter

In this step, we’ll introduce a new class named BufferConverter, which will handle converting audio buffers (AVAudioPCMBuffer) from one audio format to another using the AVFoundation framework.

import Foundation
import Speech
import AVFoundation


class BufferConverter {

    // 1.
    enum Error: Swift.Error {
        case failedToCreateConverter
        case failedToCreateConversionBuffer
        case conversionFailed(NSError?)
    }

    // 2.
    private var converter: AVAudioConverter?

    // 3.
    func convertBuffer(_ buffer: AVAudioPCMBuffer, to format: AVAudioFormat) throws -> AVAudioPCMBuffer {
        let inputFormat = buffer.format
        guard inputFormat != format else {
            return buffer
        }
        
        if converter == nil || converter?.outputFormat != format {
            converter = AVAudioConverter(from: inputFormat, to: format)
            converter?.primeMethod = .none
        }
        
        guard let converter = converter else {
            throw Error.failedToCreateConverter
        }
        
        let sampleRateRatio = converter.outputFormat.sampleRate / converter.inputFormat.sampleRate
        let scaledInputFrameLength = Double(buffer.frameLength) * sampleRateRatio
        let frameCapacity = AVAudioFrameCount(scaledInputFrameLength.rounded(.up))
        guard let conversionBuffer = AVAudioPCMBuffer(pcmFormat: converter.outputFormat, frameCapacity: frameCapacity) else {
            throw Error.failedToCreateConversionBuffer
        }
        
        var nsError: NSError?
        var bufferProcessed = false
        
        let status = converter.convert(to: conversionBuffer, error: &nsError) { packetCount, inputStatusPointer in
            defer { bufferProcessed = true }
            inputStatusPointer.pointee = bufferProcessed ? .noDataNow : .haveData
            return bufferProcessed ? nil : buffer
        }
        
        guard status != .error else {
            throw Error.conversionFailed(nsError)
        }
        
        return conversionBuffer
    }
}
  1. The Error enum defines possible errors that might occur during the conversion process, such as failing to create the converter or the output buffer, or a generic conversion failure.
  2. Define an AVAudioConverter property, which will be used to perform the actual audio format conversion.
  3. Define the convertBuffer(_:to:) method to convert an audio buffer to a different format using AVAudioConverter. It first checks whether conversion is necessary by comparing the input and target formats; if they match, it returns the buffer unchanged. Otherwise, it creates or reuses a converter and calculates the appropriate size for the output buffer based on the ratio between the two sample rates. Finally, it performs the conversion, feeding the input buffer into the converter and returning the newly formatted buffer. If the process fails, it throws an error. A brief conversion example follows the list.
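
In this project, the target format will come from SpeechAnalyzer in Step 3, but the converter itself is generic. As an illustration only, the sketch below converts a captured buffer to a hypothetical 16 kHz mono Float32 format; the convertToSixteenKilohertz(_:) helper is an example and is not used in the final app.

import AVFoundation

// Illustrative only: in the app, the target format is the one returned by
// SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith:) in Step 3.
func convertToSixteenKilohertz(_ buffer: AVAudioPCMBuffer) -> AVAudioPCMBuffer? {
    // A hypothetical 16 kHz, mono, Float32 target format.
    guard let targetFormat = AVAudioFormat(
        commonFormat: .pcmFormatFloat32,
        sampleRate: 16_000,
        channels: 1,
        interleaved: false
    ) else { return nil }

    let converter = BufferConverter()
    return try? converter.convertBuffer(buffer, to: targetFormat)
}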

Step 3 - Transcription Manager

In this step, we’ll define the TranscriptionManager class, which brings together all the components needed for real-time speech transcription using Apple’s Speech framework.

import Foundation
import Speech
import AVFoundation

class TranscriptionManager {

    // 1.
    private var inputBuilder: AsyncStream<AnalyzerInput>.Continuation?
    private var transcriber: SpeechTranscriber?
    private var analyzer: SpeechAnalyzer?
    private var recognizerTask: Task<(), Error>?
    private var analyzerFormat: AVAudioFormat?
    private var converter = BufferConverter()

    // 2.
    func requestSpeechPermission() async -> Bool {
        let status = await withCheckedContinuation { continuation in
            SFSpeechRecognizer.requestAuthorization { status in
                continuation.resume(returning: status)
            }
        }
        return status == .authorized
    }

    // 3.
    func startTranscription(onResult: @escaping (String, Bool) -> Void) async throws {
        transcriber = SpeechTranscriber(
            locale: Locale.current, 
            transcriptionOptions: [], 
            reportingOptions: [.volatileResults], 
            attributeOptions: []
        )
        analyzer = SpeechAnalyzer(modules: [transcriber!])
        analyzerFormat = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber!])
        
        let (inputSequence, inputBuilder) = AsyncStream<AnalyzerInput>.makeStream()
        self.inputBuilder = inputBuilder
        
        recognizerTask = Task {
            for try await result in transcriber!.results {
                let text = String(result.text.characters)
                onResult(text, result.isFinal)
            }
        }
        
        try await analyzer?.start(inputSequence: inputSequence)
    }

    // 4.
    func processAudioBuffer(_ buffer: AVAudioPCMBuffer) throws {
        guard let inputBuilder, let analyzerFormat else { return }
        let converted = try converter.convertBuffer(buffer, to: analyzerFormat)
        inputBuilder.yield(AnalyzerInput(buffer: converted))
    }

    // 5.
    func stopTranscription() async {
        inputBuilder?.finish()
        try? await analyzer?.finalizeAndFinishThroughEndOfInput()
        recognizerTask?.cancel()
        recognizerTask = nil
    }
}
  1. Defines the key properties: an async stream continuation for feeding audio data, a SpeechTranscriber for generating text from audio, a SpeechAnalyzer for processing the input, a task that listens for transcription results, the audio format expected by the analyzer, and a BufferConverter to ensure audio format compatibility.
  2. Define the requestSpeechPermission() method that asynchronously requests user authorization for speech recognition, returning true if permission is granted.
  3. The startTranscription(onResult:) method sets up the transcriber and analyzer, prepares the audio format, and starts listening for transcription results. Each result is passed to the provided callback as soon as it’s available.
  4. The processAudioBuffer(_:) method converts incoming audio buffers to the analyzer’s required format and feeds them into the transcription pipeline.
  5. The stopTranscription() method ends the transcription session by finishing the input stream, finalizing the analyzer, and cancelling the recognition task. A standalone usage sketch follows the list.
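
Step 5 wires TranscriptionManager and AudioManager together inside a view model, but the flow can also be sketched on its own. The startLiveTranscription(audioManager:transcriptionManager:) function below is a hypothetical helper, shown only to make the data flow explicit; it is not part of the final project.

// A minimal, hypothetical wiring of the two managers; Step 5 does this properly in a view model.
func startLiveTranscription(audioManager: AudioManager,
                            transcriptionManager: TranscriptionManager) async {
    // Both speech recognition and microphone access must be granted.
    guard await transcriptionManager.requestSpeechPermission(),
          await audioManager.requestMicrophonePermission() else {
        print("Permissions not granted")
        return
    }

    do {
        try audioManager.setupAudioSession()

        // Print volatile results as they arrive and mark finalized ones.
        try await transcriptionManager.startTranscription { text, isFinal in
            print(isFinal ? "Final: \(text)" : "Partial: \(text)")
        }

        // Forward every captured buffer into the transcription pipeline.
        try audioManager.startAudioStream { buffer in
            try? transcriptionManager.processAudioBuffer(buffer)
        }
    } catch {
        print("Failed to start transcription: \(error)")
    }
}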

Step 4 - Model for transcription

In this step, we need to define a simple data model, TranscriptionModel, which is used to manage and display the state of a live speech transcription session; a small example of how its displayText property works follows the definition.


import Foundation

struct TranscriptionModel {
    var finalizedText: String = ""
    var currentText: String = ""
    var isRecording: Bool = false
    
    var displayText: String {
        return finalizedText + currentText
    }
}
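
The displayText computed property simply concatenates the finalized text with the text of the utterance currently being recognized. For example, with placeholder values:

var model = TranscriptionModel()
model.finalizedText = "Hello world. "
model.currentText = "This is"
print(model.displayText) // "Hello world. This is"

This keeps the already confirmed portion of the transcript stable on screen while the volatile portion keeps updating.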

Step 5 - Define the ViewModel

In this step, we’ll define the SpeechToTextViewModel, which combines the classes we defined earlier to handle permission requests, start and stop recording sessions, and update the transcript in real time as speech is recognized.

import Foundation
import Speech
import AVFoundation

@MainActor
@Observable
class SpeechToTextViewModel {

    // 1.
    private(set) var model = TranscriptionModel()
    private(set) var errorMessage: String? 
    private let audioManager = AudioManager()
    private let transcriptionManager = TranscriptionManager()


    // 2.
    private func requestPermissions() async -> Bool {
        let speechPermission = await transcriptionManager.requestSpeechPermission()
        let micPermission = await audioManager.requestMicrophonePermission()
        return speechPermission && micPermission
    }
    
    // 3.
    func toggleRecording() {
        if model.isRecording {
            Task { await stopRecording() }
        } else {
            Task { await startRecording() }
        }
    }

    // 4.
    func clearTranscript() {
        model.finalizedText = ""
        model.currentText = ""
        errorMessage = nil
    }

    // 5.
    private func startRecording() async {
        guard await requestPermissions() else {
            errorMessage = "Permissions not granted"
            return
        }
        
        do {
            try audioManager.setupAudioSession()
            
            try await transcriptionManager.startTranscription { [weak self] text, isFinal in
                Task { @MainActor in
                    guard let self = self else { return }
                    if isFinal {
                        self.model.finalizedText += text + " "
                        self.model.currentText = ""
                    } else {
                        self.model.currentText = text
                    }
                }
            }
            
            try audioManager.startAudioStream { [weak self] buffer in
                try? self?.transcriptionManager.processAudioBuffer(buffer)
            }
            
            model.isRecording = true
            errorMessage = nil
        } catch {
            errorMessage = error.localizedDescription
        }
    }

    // 6.
    private func stopRecording() async {
        audioManager.stopAudioStream()
        await transcriptionManager.stopTranscription()
        model.isRecording = false
    }

}
  1. Create instances of the AudioManager and TranscriptionManager classes and of the TranscriptionModel struct. Additionally, define an optional string to store any error message.
  2. Define the requestPermissions() method to ensure both speech recognition and microphone permissions are granted before recording begins.
  3. Define the toggleRecording() method to start or stop recording based on the current state.
  4. Define the clearTranscript() method to reset both the finalized and current transcription text and clear any error message.
  5. Define the startRecording() method to request permissions, set up the audio session, initiate transcription, and stream audio buffers for processing. As speech is recognized, it updates the transcript in real time.
  6. Define the stopRecording() method to stop both the audio stream and the transcription process, and update the recording state.

Step 6 - View

In this step, we’ll build the primary user interface for our speech-to-text app using SwiftUI. The interface will allow users to start and stop audio recording with a single tap, view live transcription results as they speak, see a visual indicator when recording is active, and clear the transcript when needed.

import SwiftUI

struct ContentView: View {
    @State private var viewModel = SpeechToTextViewModel()
    
    var body: some View {
        NavigationView {
            VStack(spacing: 20) {
                
                Button(action: {
                    viewModel.toggleRecording()
                }) {
                    VStack {
                        Image(systemName: viewModel.model.isRecording ? "stop.circle.fill" : "mic.circle.fill")
                            .font(.system(size: 60))
                            .foregroundColor(viewModel.model.isRecording ? .red : .blue)
                        
                        Text(viewModel.model.isRecording ? "Stop Recording" : "Start Recording")
                            .font(.headline)
                            .foregroundColor(viewModel.model.isRecording ? .red : .blue)
                    }
                }
                .padding()
                
                if viewModel.model.isRecording {
                    HStack {
                        Circle()
                            .fill(Color.red)
                            .frame(width: 10, height: 10)
                            .scaleEffect(viewModel.model.isRecording ? 1.0 : 0.5)
                            .animation(.easeInOut(duration: 0.5).repeatForever(), value: viewModel.model.isRecording)
                        
                        Text("Recording...")
                            .font(.caption)
                            .foregroundColor(.secondary)
                    }
                }
                
                ScrollView {
                    VStack(alignment: .leading, spacing: 10) {
                        if !viewModel.model.displayText.isEmpty {
                            Text(viewModel.model.displayText)
                                .font(.body)
                                .padding()
                                .frame(maxWidth: .infinity, alignment: .leading)
                                .background(Color.gray.opacity(0.1))
                                .cornerRadius(10)
                        } else {
                            Text("Tap the microphone to start recording...")
                                .font(.body)
                                .foregroundColor(.secondary)
                                .padding()
                        }
                    }
                }
                .frame(maxWidth: .infinity, maxHeight: .infinity)
                
                if let errorMessage = viewModel.errorMessage {
                    Text(errorMessage)
                        .font(.caption)
                        .foregroundColor(.red)
                        .padding(.horizontal)
                }
                
                Spacer()
            }
            .padding()
            .navigationTitle("Speech to Text")
            .toolbar {
                ToolbarItem(placement: .navigationBarTrailing) {
                    Button("Clear") {
                        viewModel.clearTranscript()
                    }
                    .disabled(viewModel.model.displayText.isEmpty)
                }
            }
        }
    }
}

The view consists of a large button that toggles recording, dynamically updating its icon and label based on whether recording is active. When recording, a pulsating red indicator and "Recording..." text appear to provide visual feedback. Below, a scrollable text area displays the transcribed speech or a prompt to start recording. If an error occurs, it’s shown in red at the bottom.

Step 7 - Privacy Authorization

To ensure that the app can access the microphone and start the transcription process, we need to declare the corresponding privacy usage descriptions in our Xcode project. We can do that by selecting the app target, opening the Info tab, and adding two keys, each with a short message explaining why the app needs that access:

  • Privacy - Microphone Usage Description (NSMicrophoneUsageDescription)
  • Privacy - Speech Recognition Usage Description (NSSpeechRecognitionUsageDescription)

Conclusion

In this project, we implemented the new SpeechAnalyzer API in a SwiftUI app, taking advantage of the new, more capable transcription model developed by Apple.


You can download the complete project at this link: