
Captioning Three.js WebGL Canvas Elements with Machine Learning

I was just skimming back through an earlier blog post and realized that while I posted about leaving High Fidelity, I never actually followed up with where I was headed. I didn’t post it here, though I did announce the news on Twitter: I joined Mozilla back in March to work on their Mixed Reality initiatives, and have finally been able to go full-time into open source, web-based, asymmetric social 3D / VR spaces.

Working at Mozilla has given me the chance to dive back into all things web, and I’m getting extremely excited about the opportunities to make the internet safer, more inclusive, and more accessible. A couple of weeks ago, I gave a talk at the You Gotta Love Frontend conference about building virtual and augmented reality on the web – and as much as I love the chance to ramble on about all things VR, what I was really left with was the takeaway that we (collectively, as creators of XR content) need to be far more proactive about building accessibility and access into our work.

These things take a lot of different forms, and coming up with standards that will make immersive applications accessible for a wide range of users, especially given device/platform fragmentation, will take time. I was very much humbled by Adrian Roselli’s talk on Selfish Accessibility, which got me thinking about two things:

  1. The web right now excludes large numbers of users, even with 2D content
  2. 3D content on the web is a mess of inaccessible drawing

Now, there’s no hate here that the realm of VR and 3D content on the web is not accessible – collectively, there are a lot of standards development changes happening across file formats, the WebXR API spec, WebGPU, etc. A lot of practices are still getting worked out, and I think that makes it an excellent time to start thinking about how we’ll make 3D content more accessible to users.

Despite all of the ongoing changes, and the fact that I am not any kind of accessibility expert, I decided to spend some time prototyping an idea that I’ve had about one particular element of WebGL canvases, and that’s how content drawn in a canvas isn’t made visible to the DOM on a web page. This means that while the page you look at would understand what an <h1> header is, and be able to read that text out to you, anything inside of a <canvas> element is invisible.

You can get a very quick overview of what this feels like by enabling Narrator in Windows, then trying to use Sketchfab without looking at your screen.

The Scene Reader Demo

Scene Reader is an example application that illustrates how the Microsoft Azure Cognitive Services Computer Vision API can be used with Three.js to capture the contents of a WebGL canvas and attempt to estimate what is on the screen.

A screenshot of the application. It is a rendered red chair with a caption that says 'A WebGL canvas that shows: a close up of a red chair.'
A screenshot of the scene reader demo. The red chair is correctly identified.

As you can see if you play around with this, the captions are not particularly accurate, but in cases where the confidence of the recognition model is fairly high, the site provides some degree of assistance in capturing what is displayed in the canvas and attempts to surface this to the page in a readable DOM element. This could easily be adapted to any other WebGL drawing context, since it’s using the built-in HTMLCanvasElement.toDataURL() function on the WebGL canvas to encode whatever is currently rendered on the canvas as a data URL, which can then be turned into a byte array and sent as a blob to Azure for processing.
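
To make the "fairly high confidence" idea concrete, here is a minimal sketch of how a page might decide whether a caption is worth showing. It is not the scene-reader code: it assumes the response has the shape returned by the Computer Vision "describe" operation (a description object holding a list of captions, each with text and a confidence between 0 and 1), and both the function name pickCaption and the 0.5 threshold are hypothetical.

```javascript
// Assumed response shape: { description: { captions: [{ text, confidence }] } }.
// The prefix matches the caption shown in the screenshot above.
const CAPTION_PREFIX = "A WebGL canvas that shows: ";

// Returns a caption string for the DOM, or null when nothing is confident enough.
// `minConfidence` is a hypothetical threshold, not a value taken from scene-reader.
function pickCaption(response, minConfidence = 0.5) {
  const captions = response?.description?.captions ?? [];
  // Keep the highest-confidence caption the service returned.
  const best = captions.reduce(
    (top, c) => (top === null || c.confidence > top.confidence ? c : top),
    null
  );
  if (best === null || best.confidence < minConfidence) return null;
  return CAPTION_PREFIX + best.text;
}

// Example with a made-up confidence value.
const sampleResponse = {
  description: {
    captions: [{ text: "a close up of a red chair", confidence: 0.91 }],
  },
};
console.log(pickCaption(sampleResponse)); // "A WebGL canvas that shows: a close up of a red chair"
console.log(pickCaption(sampleResponse, 0.95)); // null
```

Returning null rather than a low-confidence guess lets the page fall back to a neutral message instead of reading out something misleading.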

This was relatively straightforward to set up, despite Microsoft not having a JavaScript implementation sample on their site for sending locally stored data to the Cognitive Services API. The client uses the canvas’s toDataURL() function to capture the current frame as a data URL, converts it to a byte array, and sends that to the server, which relays it on to the Azure endpoint. The Azure service (you could do this with another cloud provider, or, better yet, train a custom service) tries to identify what’s in the image and, when successful, returns an auto-generated caption which is then displayed in the header of the document.
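
The conversion and relay steps can be sketched roughly as follows. This is a hedged illustration rather than the project's actual code: dataUrlToBytes and buildDescribeRequest are hypothetical helper names, the decoding is shown with Node's Buffer (in the browser the equivalent would use atob and a Uint8Array), and the endpoint URL is left out because it depends on your Azure region and API version. The Ocp-Apim-Subscription-Key header and the application/octet-stream content type are what Cognitive Services expects for raw image bytes.

```javascript
// Split a data URL such as "data:image/png;base64,iVBOR..." into its
// MIME type and the raw bytes it encodes.
function dataUrlToBytes(dataUrl) {
  const match = /^data:([^;,]+);base64,(.+)$/.exec(dataUrl);
  if (match === null) throw new Error("Expected a base64-encoded data URL");
  return { mimeType: match[1], bytes: Buffer.from(match[2], "base64") };
}

// Build fetch() options for posting raw image bytes to the vision service.
function buildDescribeRequest(bytes, apiKey) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/octet-stream",
      "Ocp-Apim-Subscription-Key": apiKey,
    },
    body: bytes,
  };
}

// Illustration only: this data URL holds just the 8-byte PNG signature,
// not a complete image. A real one would come from canvas.toDataURL().
const frame = dataUrlToBytes("data:image/png;base64,iVBORw0KGgo=");
console.log(frame.mimeType, frame.bytes.length);

// On the server, the relay would then look something like this, where
// `endpointUrl` is a placeholder you supply:
//   fetch(endpointUrl, buildDescribeRequest(frame.bytes, process.env.API_KEY));
```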

The scene itself is a glTF model that I downloaded from Sketchfab and loaded in using the three.js GLTFLoader. This took me a little while to get right, because glTF models can range in size so you may have to adjust the camera positioning if you want to try this with a different model. Hopefully in the future, I’ll be able to make some updates to handle this more reasonably.
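
One way the camera adjustment could be handled more generally is to derive the camera distance from the model's bounding sphere. The sketch below is an assumption about how that might look, not something scene-reader does today: fitCameraDistance is a hypothetical helper, the 1.2 padding factor is arbitrary, and it only considers the vertical field of view, so a portrait-shaped canvas would need extra room. In three.js, the radius could come from new THREE.Box3().setFromObject(gltf.scene).getBoundingSphere(new THREE.Sphere()).

```javascript
// Distance at which a sphere of `radius` just fits inside a perspective
// camera's vertical field of view. `fovDegrees` is in degrees, the same
// unit THREE.PerspectiveCamera uses for its `fov` property.
// `padding` (hypothetical default) leaves some margin around the model.
function fitCameraDistance(radius, fovDegrees, padding = 1.2) {
  const halfFovRadians = (fovDegrees * Math.PI) / 180 / 2;
  return (radius / Math.sin(halfFovRadians)) * padding;
}

// A model with a 1-unit bounding radius viewed through a 60 degree camera
// needs the camera roughly 2.4 units from the model's center.
console.log(fitCameraDistance(1, 60));
```

Placing the camera at that distance from the sphere's center, and pointing it at the center, keeps small and large models framed the same way.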

Captioning three.js scenes with Azure Image Recognition

I chose to work with glTF for this project because it is an extensible format, which means that down the road, you could theoretically export accessibility captions directly into the file itself. This might not work well for large scenes, but I’m still thinking about how this could look down the line. In any case, glTF is a cool standard file format for using 3D models on the web, and I wanted a chance to use it in a project, so here you have it.
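
As a rough sketch of what reading such captions might look like: glTF lets any object carry an "extras" property for application-specific data, so a hand-written description could live on individual nodes. The "caption" key used here is a purely hypothetical convention (nothing in the glTF spec or in scene-reader defines it), and collectCaptions is an illustrative name. The function works on the parsed JSON of a .gltf file.

```javascript
// Collect human-written captions from a parsed .gltf JSON document.
// "extras" is glTF's slot for application-specific data; the "caption"
// key inside it is a hypothetical convention, not part of any spec.
function collectCaptions(gltf) {
  const captions = [];
  for (const node of gltf.nodes ?? []) {
    const caption = node.extras?.caption;
    if (typeof caption === "string" && caption.trim() !== "") {
      captions.push({ name: node.name ?? "(unnamed)", caption: caption.trim() });
    }
  }
  return captions;
}

// Made-up example document with one captioned node and one without.
const exampleGltf = JSON.parse(`{
  "asset": { "version": "2.0" },
  "nodes": [
    { "name": "Chair", "extras": { "caption": "A red armchair facing the window" } },
    { "name": "Floor" }
  ]
}`);

for (const { name, caption } of collectCaptions(exampleGltf)) {
  console.log(`${name}: ${caption}`);
}
```

Because the captions would be authored alongside the model, they could be surfaced to the DOM without a round trip to an image recognition service.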

One of the wonderful things about the glTF format is that it’s extensible and built on JSON, so programs can write additional information into files depending on how they need to be used. glTF files that are exported from Sketchfab, for example, have information about the author, the license the content is released under, the original source, and the title.

"extras": {
  "author": "GregX (",
  "license": "CC-BY-4.0 (",
  "source": "",
  "title": "Chevrolet Camaro"
}

You can find the source code for the scene-reader project on GitHub. It’s built using Express, Node.js, and Three.js. You will need to install the dependencies using npm and set up your own .env file with the API_KEY variable set to your own Cognitive Services API key, which requires an Azure account. You can sign up for a free tier (F0) which gets you 10 requests per minute.

Thinking ahead

There are a lot of things that could improve upon this prototype, but I don’t want to rely too heavily on automatic image recognition or stay committed to any particular idea. Some of the things that I’m thinking about, in no particular order:

  • Testing with screen readers. I am in the process of learning about screen readers and low-vision functionality on the web. While I haven’t gotten to it today, I want to explore what it would take to make this read well. I did try it out with the built-in Windows Narrator, and it read out the text, but I suspect there’s a lot here that I could dive into and learn about to make this work more effectively.
  • Training image recognition algorithms on low-poly content. Most image recognition programs are trained on photos, and the Azure algorithm seems to work better on models that are higher-poly or from photogrammetric models. By creating an algorithm that specifically learns about concepts of objects and artistic representations of those objects, instead of relying on one that has a heavy need for “realism”, we might see improvements in how well auto-generated captions work.
  • Adding accessibility attributes directly into the glTF file format. As I mentioned above, the extensible aspect of glTF means that programs can insert additional information into glTF models. While I’m still getting my head around glTF, loading them, and generating them, I can imagine that there’s a potential route forward where glTF files can include more descriptive elements for entire scenes or parts of the scenes, so that more accurate content can be shared with the DOM. A future project that I would like to explore is modifying the GLTFLoader code to support reading in custom-defined caption or description elements from a scene.
  • Detecting significant changes in camera movement, and triggering an automatic re-captioning. Right now with scene-reader, you have to press the ‘P’ button to regenerate a caption and if you try it out with the included apartment scene, you’ll see that different views return different results. It would be interesting to determine a threshold that would automatically re-capture the new view and generate a new caption.
  • Implementing a directional context and creating interactive, responsive content. Right now, this works pretty well with a single, static image – but there are no affordances to surface any kind of interactive elements into the scene or have those trigger new elements. I have to do a lot of digging here to better understand what that could look like, but it’s something on my mind.
  • Thinking about universal solutions. This app works in a very narrow context. To actually be helpful, a more universal library or framework will have to emerge, whether that’s through the web graphics libraries, the models themselves, browsers, plug-ins, or extensions: something that can handle existing 3D content and work well.
  • Adapting scene-reader to work in a VR demo. What it says on the tin – I’d love to adapt this demo to support WebVR and use a text-to-speech solution to read aloud what is being looked at in the scene. This would tie in nicely with detecting camera movement.
  • Creating a browser extension. An extension could be a nice way to provide this service for pages that have WebGL canvases of existing content, but since writing the canvas information to PNG requires that the renderer preserve its drawing buffer, this is a roadblock for the time being unless I can find some other way around it.
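
For the automatic re-captioning idea in the list above, one possible starting point is to compare the camera's current pose with the pose at the time of the last caption. The sketch below is an assumption about how that check might work, not part of scene-reader: needsRecaption is a hypothetical name, and the default thresholds (1 unit of movement, 20 degrees of rotation) are arbitrary values that would need tuning per scene. In three.js, the position would come from camera.position and the view direction from camera.getWorldDirection().

```javascript
// Decide whether the camera has moved or turned enough since the last
// caption to warrant a new one. Poses are { position, direction } objects
// holding plain { x, y, z } values; directions are assumed to be unit vectors.
// Both thresholds are hypothetical defaults.
function needsRecaption(last, current, { maxDistance = 1.0, maxAngleDegrees = 20 } = {}) {
  const moved = Math.hypot(
    current.position.x - last.position.x,
    current.position.y - last.position.y,
    current.position.z - last.position.z
  );
  const dot =
    last.direction.x * current.direction.x +
    last.direction.y * current.direction.y +
    last.direction.z * current.direction.z;
  // Clamp to guard against rounding pushing the value outside acos's domain.
  const clamped = Math.min(1, Math.max(-1, dot));
  const turnedDegrees = (Math.acos(clamped) * 180) / Math.PI;
  return moved > maxDistance || turnedDegrees > maxAngleDegrees;
}

const captionedPose = { position: { x: 0, y: 0, z: 5 }, direction: { x: 0, y: 0, z: -1 } };
const nudgedPose = { position: { x: 0.2, y: 0, z: 5 }, direction: { x: 0, y: 0, z: -1 } };
const turnedPose = { position: { x: 0, y: 0, z: 5 }, direction: { x: 1, y: 0, z: 0 } };

console.log(needsRecaption(captionedPose, nudgedPose)); // false
console.log(needsRecaption(captionedPose, turnedPose)); // true
```

Since each caption costs an API request, a check like this would probably also want a minimum delay between captures so that a slow pan does not exhaust the free tier's rate limit.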

If you have any thoughts, comments, or suggestions about things that could be useful to explore related to accessibility in virtual / augmented reality, please drop me a line – I’d love to learn more about what can be done here.

Resources & Links