
Authoring Immersive Environments with glTF for Multi-User Mixed Reality Web Applications

I was invited to give a lightning talk at the W3C Inclusive Design for Immersive Technology workshop about what we’re doing with scene authoring through Spoke for Hubs. As with many of my talks, I think they make better blog posts, and I’m excited to share this one with you today.

About Hubs & Spoke

For those of you who aren’t familiar with the work we’re doing at Mozilla on social mixed reality, Hubs is a social platform that uses spatialized audio, 3D environments, and media composition features to support collaboration in virtual reality, or through a “flat” screen on desktop or mobile devices. Spoke is a web-based compositing tool for building the 3D scenes that are used in Hubs.

In this post, I’m going to briefly explain how we compose scenes for 3D environments in these two applications, as a way to explore declarative scene creation using the glTF file format. I’ll then share some early thoughts on opportunities for improving accessibility in social mixed reality applications.

glTF and Custom Components

In Hubs, 3D environments are based on the glTF format, extended with a number of custom components. Like glTF, the Spoke file format itself is JSON-structured. The file contains custom components that allow applications to apply their own, application-specific interpretations of the declared scene content.

One example of a custom component that is declared in the scene file and interpreted in a specific way by Hubs is the spawn-point element. It provides information that Hubs can use to select appropriate entry points for users, but in the absence of an interpreter with logic to understand that definition, the component is simply ignored.
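
As a rough sketch, a spawn point in the exported glTF might look something like the node below: ordinary transform data, plus an application-specific component tucked under a vendor extension. The extension and component names here (MOZ_hubs_components, spawn-point) are illustrative and may not match the exact schema Spoke writes.

    {
      "nodes": [
        {
          "name": "Spawn Point",
          "translation": [0, 1.6, 4],
          "rotation": [0, 0, 0, 1],
          "scale": [1, 1, 1],
          "extensions": {
            "MOZ_hubs_components": {
              "spawn-point": {}
            }
          }
        }
      ]
    }

Any standards-compliant glTF loader still sees a valid node; only clients that recognize the extension do anything extra with it.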

The Spoke client has encoded an interpretation of the scene that visualizes the spawn-point element with a client-side mesh rendering of a robot.

The Hubs client has encoded an interpretation of the scene that does not visualize the spawn-point element, but knows that it is a safe landing point for users who enter the room.

Blender, which is another 3D editor, visualizes the transform data present in the glTF file (the position, rotation, and scale of the spawn-point element) as a straight line, but does not render any mesh for it, since that mesh is not part of the glTF file itself. The scene itself doesn’t change: it declares what it is, and it’s up to each application to handle those custom components (or not).
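
To make that split in behavior concrete, here is a minimal JavaScript sketch (not the actual Hubs or Spoke loader code) of how a client that understands the component might collect spawn points from a parsed glTF document. A client without this logic simply never runs it and still loads the scene normally.

    // Sketch: walk a parsed glTF JSON document and collect the nodes that
    // declare a spawn-point component under the (illustrative) extension name.
    function findSpawnPoints(gltf) {
      const spawnPoints = [];
      for (const node of gltf.nodes || []) {
        const components = node.extensions && node.extensions["MOZ_hubs_components"];
        if (components && "spawn-point" in components) {
          // There is no mesh to render; only the transform matters here.
          spawnPoints.push({
            name: node.name,
            translation: node.translation || [0, 0, 0],
            rotation: node.rotation || [0, 0, 0, 1],
          });
        }
      }
      return spawnPoints;
    }

Hubs can use this kind of data to place users when they enter a room, Spoke can use the same transform to draw its robot placeholder, and Blender, with no such logic, just shows the bare transform.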

This type of scene definition allows Hubs and other applications to interpret the content and dynamically apply the components appropriate to a given platform. It also enables a higher degree of re-usability and portability of scenes across applications. For Hubs, an instance of a scene that users can join is what we call a ‘room’. While scenes cannot be edited from within a room, new, temporary objects can be added. Rooms expose these newly added objects to the DOM as A-Frame entities, which can then be manipulated by users through mouse events or by JavaScript code.
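
As a small illustration (not code taken from the Hubs client), an object exposed as an A-Frame entity can be moved with ordinary DOM calls; the selector below is hypothetical.

    // Hypothetical selector for a spawned media object in the room's DOM.
    const spawnedObject = document.querySelector("[media-loader]");
    if (spawnedObject) {
      // A-Frame entities expose component data through setAttribute/getAttribute.
      spawnedObject.setAttribute("position", { x: 0, y: 1.5, z: -2 });
      spawnedObject.setAttribute("rotation", { x: 0, y: 45, z: 0 });
    }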

The flexibility provided by components and extensions of the glTF format means that we’re able to add new behaviors to elements within our scenes, and update the client to recognize and handle them accordingly. There is a vast amount of visual content that can be added to a Hubs room, and we’re starting to explore ways that we can make these environments accessible.

Exploring Visual Accessibility for 3D Environments

Image recognition algorithms show some initial promise for identifying scene content, but automatically generated captions for 3D spaces lack the nuance and detail required to fully describe a scene. It is clear that a more robust solution is needed, likely one that combines machine learning techniques with author-supplied information. One of the areas we’re starting to think about is how we can surface descriptive text from a scene into the Hubs client, perhaps specified in new custom extensions in the glTF models that make up a scene.

This would allow a client, or possibly a browser, to surface the information to users in a number of different ways, perhaps as captions or read-aloud text, depending on the user’s preferences or needs.
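
As a sketch of that idea, author-supplied descriptions could travel in the scene file the same way other custom components do. The extension name and fields below (EXAMPLE_alt_text, description) are purely hypothetical; as far as I know, no such glTF extension exists today.

    {
      "nodes": [
        {
          "name": "Fountain",
          "mesh": 12,
          "extensions": {
            "EXAMPLE_alt_text": {
              "description": "A three-tiered stone fountain at the center of the plaza."
            }
          }
        }
      ]
    }

A client that understands the extension could read those descriptions aloud or render them as captions; any other loader would ignore them, just as with the spawn-point component.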

Scenes that are built with the Spoke tool and used in Hubs can also be composed of 2D web media, including videos, images, and PDF documents. In addition to providing options for representing the scenes themselves in a more accessible way, there are also opportunities to apply techniques like optical character recognition to surface the text and image content of that media.

Future Thoughts

Looking ahead, I’m motivated to better understand the work the immersive technology industry can do to incorporate learnings from these communities and projects into our own applications. In particular, I’m excited to learn more and to participate in research and standards conversations around:

  • Encouraging and supporting research into spatial accessibility structures
  • Supporting and experimenting with alternate text components for objects
  • Implementing application-specific ways to surface room information to a user, such as gaze- or mouse-based captioning
  • Captioning and providing audio controls for room audio
  • Improving movement mechanics and controls
  • Continuing to learn and adapt to new information

This post primarily speaks to visual accessibility considerations for 3D content in a scene, but as an industry, we will also have to incorporate solutions for captioning, granting audio control to users, and improving navigation techniques in mixed reality contexts. Above all, we need to ensure that we continue learning and adapting to new information and practices as they evolve, in large part as a result of events like this one.