
The re-emergence of sound interfaces with Siri and Artificial Intelligence
Learn how Siri and artificial intelligence are pushing audio to the forefront of technology’s user experience.
Apple announced plans to enhance Siri using advanced AI and machine learning technologies, aiming to create more intuitive, context-aware voice interactions that would make voice a central part of our daily engagement with personal and mobile devices. As Siri’s abilities grow, voice is emerging not just as a convenient interface but as a fundamental sensory experience across our personal devices. This shift reaffirms sound as a primary mode of interaction, a role it held for centuries, especially when it comes to emerging technologies. It also holds significant implications for how we interact with our digital and physical environments.
The evolving role of Siri from everyday devices to Mixed Reality
Siri’s integration with AI and ML leverages advancements in natural language processing (NLP) and voice recognition, allowing the system to interpret increasingly complex, context-sensitive commands. This evolution is evident not only in advanced and innovative mixed reality (MR) devices like Vision Pro but also in standard mobile devices like the iPhone, where Siri is meant to deliver a personalized experience.
In fact, with NLP models trained to interpret human language nuances, Siri can adjust to accents, emotions, and specific phrasings, resulting in a voice assistant that feels almost human—engaging with users in ways that are perceived as friendly, contextual, assistive, and even anticipatory.
Integrated Apple Intelligence features enhance Siri’s ability to perform context-sensitive, personalized, and predictive tasks. Here are some key examples:
Context Awareness and Follow-Up Questions
Siri can understand the context of a conversation, allowing it to ask clarifying or follow-up questions. Here is an example:
User: "Remind me to call Mom tomorrow."
Siri: "What time should I remind you?"
Proactive Suggestions
Siri uses on-device intelligence to predict what users might need based on their habits and routines; third-party apps can contribute to these suggestions by “donating” actions to the system, as sketched after this list. Examples are:
- Suggesting frequently used apps at specific times or locations.
- Notifying the user when to leave for an appointment, based on traffic conditions.
- Surfacing frequently accessed files or documents when opening an app.
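Here is a minimal sketch of such a donation using SiriKit’s INInteraction API; the message content and conversation identifier are illustrative, not tied to any real app:

```swift
import Intents

// A minimal sketch of an intent donation, the mechanism behind many of
// Siri's proactive suggestions. All values below are illustrative.
let intent = INSendMessageIntent(
    recipients: nil,
    outgoingMessageType: .outgoingMessageText,
    content: "On my way!",
    speakableGroupName: nil,
    conversationIdentifier: "family-chat",
    serviceName: nil,
    sender: nil,
    attachments: nil
)

// Donating the interaction tells the system the user performed this
// action, so it can be suggested at similar times or locations later.
let interaction = INInteraction(intent: intent, response: nil)
interaction.donate { error in
    if let error = error {
        print("Donation failed: \(error.localizedDescription)")
    }
}
```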
Personalization
Siri adapts to usage patterns, tailoring suggestions to each user’s preferences. Examples are:
- Offering shortcuts for favorite tasks, like starting a workout or sending a regular message.
- Adjusting music or podcast recommendations based on listening history.
Siri Shortcuts and Automation
Users can create custom Siri Shortcuts, enabling voice-triggered automation for complex tasks; a minimal code sketch follows the list below. For example, saying, “Hey Siri, I’m going home,” could trigger a shortcut that:
- Sends ETA to a family member.
- Starts navigation in Apple Maps.
- Adjusts smart home thermostat to a comfortable temperature.
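To make this concrete, here is a minimal sketch of such a routine exposed to Siri through Apple’s App Intents framework. The intent name and phrase are illustrative, and the routine’s steps are left as comments because the ETA, navigation, and thermostat calls would be app-specific:

```swift
import AppIntents

// Hypothetical "I'm going home" routine exposed to Siri via App Intents.
struct GoingHomeIntent: AppIntent {
    static var title: LocalizedStringResource = "I'm Going Home"

    func perform() async throws -> some IntentResult & ProvidesDialog {
        // An app would run the routine here: send an ETA message,
        // start Apple Maps navigation, and set the thermostat scene.
        return .result(dialog: "Okay, running your going-home routine.")
    }
}

// Registers the spoken phrase so "Hey Siri, I'm going home" can trigger it.
struct GoingHomeShortcuts: AppShortcutsProvider {
    static var appShortcuts: [AppShortcut] {
        AppShortcut(
            intent: GoingHomeIntent(),
            phrases: ["I'm going home with \(.applicationName)"],
            shortTitle: "Going Home",
            systemImageName: "house"
        )
    }
}
```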
A Future with Apple Intelligence
Another use case, announced in 2024 but not yet shipped, is Intelligent Search: Siri would leverage Apple Intelligence to deliver precise search results across the device, for example finding specific emails by saying, “Show me emails from John about the project,” or locating photos by saying, “Show me pictures from my trip to Italy.” Once Siri becomes fully integrated with the system and aware of on-device content and context, use cases like these would elevate how people interact with their devices by voice.
On iPhones, Siri’s capabilities would extend to “on-screen awareness”. This means that Siri understands what users are talking about by accessing the content currently displayed on the device’s screen, allowing them to manage tasks, retrieve information, or adjust settings simply by speaking. This offers a powerful, hands-free way to access various apps on the home screen, from setting reminders to sending messages or controlling media playback. As AI-driven personalization continues to improve, Siri will likely be able to carry out increasingly sophisticated tasks across different apps, making daily interactions smoother and more efficient.
To illustrate this, consider the task manager Things 3, which organizes to-dos, projects, and deadlines and already integrates with Siri for natural-language task creation and management.

Here are some ways that Things 3 could take advantage of on-screen awareness (a hypothetical code sketch follows the list):
- Task creation from anywhere: When browsing a webpage or reading an email, the user can ask Siri, “Remind me to follow up on this,” and Siri can detect the context and create a task in Things 3, attaching the page or email as metadata.
- Contextual task management: While checking a message from a colleague, the user can say, “Hey Siri, add a task to reply to this message in Things,” and Siri would attach the relevant content to the task automatically.
- Hands-free updates: Use Siri to update existing tasks by saying, "Hey Siri, mark my 'Submit report' task as done in Things."
- Proactive suggestions: Siri might suggest opening Things 3 or creating a task based on the user's behavior, such as frequent use of the app at certain times of the day.
- Custom Shortcuts: Create Siri Shortcuts like "Start my workday" to automatically open Things 3 and show the daily task list.
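As a hypothetical sketch, a task manager like Things 3 could expose such task creation to Siri through App Intents. The type and parameter names below are illustrative, not Things’ actual API:

```swift
import AppIntents

// Hypothetical task-creation intent for a Things-like app.
struct AddTaskIntent: AppIntent {
    static var title: LocalizedStringResource = "Add Task"

    @Parameter(title: "Title")
    var taskTitle: String

    // With on-screen awareness, Siri could pre-fill this with a link
    // to the webpage, email, or message the user is currently viewing.
    @Parameter(title: "Notes")
    var notes: String?

    func perform() async throws -> some IntentResult & ProvidesDialog {
        // The app would persist the task to its database here.
        return .result(dialog: "Added \"\(taskTitle)\" to your inbox.")
    }
}
```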
Why do apps like Things 3 fit Siri’s strengths?
- On-screen awareness: Siri knows what the user is interacting with (e.g., a webpage or an email) and lets them create tasks tied to that context.
- Hands-free convenience: Tasks can be managed using only voice commands.
- Personalized workflow: Siri learns the user’s habits and can recommend actions like checking off completed tasks or reviewing overdue ones.
Spatial Computing opens a new dimension for interactions
Siri’s role is evolving from enhancing app interaction on iPhones to acting as a spatial navigator in MR, illustrating the potential of voice technology to improve usability, accessibility, and immersion across devices and environments. By combining AI-driven personalization with spatial awareness, Siri exemplifies the next step in making digital interactions more natural, efficient, and integrated into our daily lives.
In the realm of MR and spatial computing, where Vision Pro and similar devices operate, Siri transcends its traditional role as a voice assistant and becomes a central mode of navigation and interaction. By overlaying digital content on physical spaces, MR demands intuitive, hands-free methods of interaction. Siri’s voice capabilities let users perform complex interactions within digital spaces simply by speaking naturally, leveraging NLP to understand commands in a conversational tone and making it intuitive to control apps, manage tasks, and navigate devices.
This synergy between voice, AI, and immersive environments underscores Siri's role as a fully integrated personal assistant, enabling both practical utility and immersive experiences that are accessible and engaging. Through these advancements, Siri is not only paving the way for a future where digital interactions are fluid, personalized, and seamlessly interwoven with our everyday lives but also defining the need for a brand-new approach to UI and UX design.
Spatial Sound: A key to immersion
The advent of spatial audio in iOS devices marked a significant leap in how we experience sound, particularly through wireless headphones like AirPods Pro and AirPods Max. Apple introduced spatial audio to deliver a more immersive sound experience, using dynamic head-tracking to simulate real-world audio dynamics in music, movies, and video calls. By adjusting the audio output to match the listener’s head movement, spatial audio allows sound to stay anchored to its source in 3D space, enhancing the sense of direction, depth, and distance.

This technology relies on gyroscopes and accelerometers within wireless headphones to track head movements in real-time, creating a stationary soundscape that feels like it exists in the environment rather than merely in the headphones. This approach enables immersive listening experiences, where sound is positioned around the listener as if it were coming from specific locations, similar to how we hear in real life. For example, in a concert recording, spatial audio can recreate the sense of being surrounded by musicians, with each instrument appearing to emanate from a fixed spot on the virtual stage.
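The same sensor stream is available to developers. Here is a minimal sketch using Core Motion’s CMHeadphoneMotionManager (iOS 14 and later), which reports the head orientation that spatial audio relies on; a real app would also need the motion-usage privacy key in its Info.plist:

```swift
import CoreMotion

// Reads AirPods head-tracking data: the raw input behind spatial audio's
// stationary soundscape.
let motionManager = CMHeadphoneMotionManager()

if motionManager.isDeviceMotionAvailable {
    motionManager.startDeviceMotionUpdates(to: .main) { motion, _ in
        guard let attitude = motion?.attitude else { return }
        // Yaw, pitch, and roll of the listener's head, in radians; a spatial
        // audio engine counter-rotates the sound field using values like these.
        print("yaw: \(attitude.yaw), pitch: \(attitude.pitch), roll: \(attitude.roll)")
    }
}
```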
Apple Fitness+ and Spatial Workouts
Apple Fitness+ integrates Siri commands with spatial audio-guided workouts that feel as if the trainer were in the room with the user. Through AirPods Pro or AirPods Max, the experience is fully immersive, and users can ask Siri to select workouts, control playback, and get personalized suggestions based on previous activity.

Siri’s voice integration allows users to start workouts, get tips, or switch exercises hands-free, which is useful for staying focused during intense workouts. The combination of Siri and spatial audio showcases how immersive audio can enhance workout experiences, making it feel as though the trainer is positioned around the user. It highlights how even non-visual MR elements can make experiences more engaging.
Vision Pro users can experience workouts that are spatially anchored, so if you move around, the trainer’s position remains constant in the room. Siri enables hands-free control, allowing users to start a workout, pause, switch exercises, or get feedback without needing to touch any physical device. This setup not only makes workouts more engaging and motivating but also shows how spatial audio and Siri can make virtual fitness feel more natural and immersive. This is particularly useful for fitness routines requiring hands-free interaction.
In MR environments, such as those experienced with Vision Pro, spatial audio takes on an even greater role. The sound design in MR devices is crafted not only for music or film but to integrate fully into the immersive, interactive experiences that mixed reality offers.
Audio cues are strategically designed to simulate real-world dynamics, creating a 3D auditory environment that matches the visual MR content. When using a device like Vision Pro, users can experience sound that "moves" and "shifts" with them, enhancing the sense of realism as they interact with virtual objects in their surroundings.
By combining Siri’s voice recognition with spatial audio, MR devices can create experiences where users not only control devices with voice but receive audio feedback that feels deeply embedded in their immediate surroundings. This immersive soundscape elevates Siri’s role from a mere voice assistant to a responsive, spatially aware entity. Siri’s responses are no longer isolated voices emerging from a static speaker; they feel like part of the user's shared physical space. Imagine asking Siri for the weather and hearing the response as if it’s coming from the sky, or inquiring about nearby restaurants and hearing recommendations from the direction of each location. With spatial audio, Siri can deliver responses that seem to emerge from the environment, aligning auditory and visual cues to create a heightened sense of presence.
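Public APIs already offer a building block for this kind of spatially anchored response. Here is a minimal sketch using AVAudioEngine’s environment node, which renders a mono source at a chosen point in 3D space; the coordinates and the scheduled asset are illustrative:

```swift
import AVFoundation

// Anchors a sound at a point in 3D space relative to the listener.
let engine = AVAudioEngine()
let environment = AVAudioEnvironmentNode()
let player = AVAudioPlayerNode()

engine.attach(environment)
engine.attach(player)

// 3D positioning requires a mono source.
let mono = AVAudioFormat(standardFormatWithSampleRate: 44_100, channels: 1)
engine.connect(player, to: environment, format: mono)
engine.connect(environment, to: engine.mainMixerNode, format: nil)

// Place the response above and in front of the listener ("from the sky").
player.position = AVAudio3DPoint(x: 0, y: 2, z: -1)
environment.listenerPosition = AVAudio3DPoint(x: 0, y: 0, z: 0)

try? engine.start()
// player.scheduleFile(...) and player.play() would then render the audio
// as if it emanated from that point.
```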
The seamless interaction of Siri, spatial audio, and head-tracking technologies also expands the possibilities for MR accessibility. For example, individuals with visual impairments can navigate virtual spaces using audio cues alone, with Siri’s responses serving as both guidance and orientation within a spatial context. This integration of AI-driven voice and spatial sound is not only pushing the boundaries of immersion but is also making advanced digital interactions more natural and inclusive across varied environments.
The foundation laid by spatial audio in wireless headphones has set the stage for a more advanced, multi-sensory approach to technology. As MR devices continue to evolve, the role of spatial sound will only deepen, offering users experiences that blur the line between physical and digital realities and pushing voice technology like Siri to become even more seamlessly woven into our world. Through this evolution, sound, more than sight, is emerging as a central interface, connecting us with digital environments in ways that feel more human and intuitive.
The socio-anthropological perspective of sound as a reaffirmed sensory priority
This approach to sound as a primary interface in technology ties into deeper socio-anthropological roots. Historically, sound has been a powerful medium of connection, trust, and presence—traits that modern technology often lacks due to its visual-heavy interfaces. Humans are naturally wired to perceive space and emotion through sound, which has been essential for communication, navigation, and community bonding.
The prioritization of audio in MR could signal a rebalancing in how we relate to digital worlds. For instance, in a world where smartphones, visors, and MR devices dominate, we risk over-relying on visual senses, potentially leading to visual fatigue or even disconnection from our immediate physical environment. Sound-based interfaces help counterbalance this by re-engaging our auditory senses in ways that feel natural and human-centric.
Moreover, sound’s role in MR could further democratize technology. For individuals with visual impairments, hearing difficulties, or mobility challenges, sound-oriented designs can provide an inclusive pathway to the benefits of spatial computing. The simplicity and accessibility of speaking to a device in natural language could create opportunities for people across diverse linguistic and physical abilities to engage with and benefit from MR technologies.
Technical Elements Behind Sound and Spatial Computing
Several technical components make this possible. Spatial audio is achieved through techniques such as head tracking, which detects a user’s head position and orientation to create a stable sound field. When paired with precise directional speakers or spatial audio algorithms, MR devices can craft highly realistic audio experiences. Apple’s Vision Pro, for example, uses an array of sensors, cameras, and microphones to capture spatial data in real time, allowing the device to place sounds and visual elements accurately within a 3D environment.
The advancements in NLP and voice synthesis are equally important. Siri’s ML-based voice recognition can filter out background noise, recognize the context of commands, and adapt to the user’s unique voice profile. This requires complex language models that continuously learn and update, creating a feedback loop of improvement over time.
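Siri’s internal pipeline is private, but Apple’s Speech framework exposes a comparable recognition loop to developers. Here is a minimal sketch, with the audio file path as a placeholder:

```swift
import Speech

// On-device speech recognition: the same class of voice pipeline
// described above, using public APIs rather than Siri's private stack.
SFSpeechRecognizer.requestAuthorization { status in
    guard status == .authorized else { return }

    let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    let request = SFSpeechURLRecognitionRequest(
        url: URL(fileURLWithPath: "command.m4a")  // placeholder audio file
    )
    request.requiresOnDeviceRecognition = true  // keep processing local where supported

    _ = recognizer?.recognitionTask(with: request) { result, _ in
        if let result = result, result.isFinal {
            print("Heard: \(result.bestTranscription.formattedString)")
        }
    }
}
```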

FaceTime is enhanced on Vision Pro to provide spatial audio and mixed reality features, allowing users to feel as though they are in a shared space with other participants. FaceTime on Vision Pro uses spatial audio to simulate where each person is positioned in a virtual room, making conversations feel more natural. Vision Pro’s 3D avatars also enhance interaction by showing participants' facial expressions and gestures. With Siri, users can join or leave calls, adjust volume, and control other aspects of the call without needing to touch any device. This creates an incredibly lifelike, immersive communication experience. The integration of Siri adds convenience for those managing multiple aspects of the conversation or environment, making virtual meetings feel less constrained by traditional device controls.
Sound as the foundation of immersive technologies’ future
With Siri and other AI voice assistants becoming smarter and more intuitive, sound’s role in technology will only grow. We are at a point where sound is no longer an accessory but a core element of how we navigate and understand digital spaces. In a future of spatial computing and augmented realities, sound could serve as a primary channel of information, creating more natural and less intrusive interactions.
Video from WWDC 2023 - Meet the new Vision Pro
From a broader perspective, this shift reaffirms humanity’s intrinsic connection to sound. By engaging with MR through Siri and spatial audio, technology moves a step closer to blending coherently with our environment. This fusion of sound, voice, and AI not only enhances functionality but also aligns digital interactions with fundamental aspects of human perception and communication. Sound is no longer just a medium of our interactions; it is quickly becoming the foundation of the immersive experiences shaping the future of hardware and software, of a whole new set of digital devices and apps, and perhaps even of a reshaped set of Human Interface Guidelines.
As mixed reality matures, the integration of spatial sound and AI-driven voice interfaces like Siri will need to evolve into a fully programmable, context-aware auditory framework, where sound is not merely a sensory layer but a core architectural component of user interaction, system feedback, and environmental cognition. Standardizing and scaling these audio paradigms, which enable intelligent, adaptive, and non-visual interfaces, will be a defining challenge for the next generation of technologies and is likely to fundamentally disrupt traditional approaches to user interface design.
References
- Apple Inc. (2020). About spatial audio with Dolby Atmos in Apple Music. Apple Support. Retrieved from https://support.apple.com/
Apple’s technical documentation offers an overview of spatial audio technology as applied to iOS devices and AirPods, including the principles behind head tracking and dynamic sound placement.
- Apple Inc. (2022). How Siri works with iOS 16. Apple Developer Documentation. Retrieved from https://developer.apple.com/documentation/siri/
Apple’s developer documentation on Siri in iOS 16 includes insights into Siri’s integration with third-party apps and explains the technical framework that allows Siri to interact with multiple applications on iPhones.
- Apple Inc. (2023). Vision Pro: The Next Step in Spatial Computing. Apple Developer Documentation. Retrieved from https://developer.apple.com/documentation/visionpro
Apple’s Vision Pro documentation offers an overview of spatial computing technologies used in their MR device, including spatial audio’s role in enhancing immersion and user experience.