Say What You See, Facebook

Posted on Apr 5, 2016 in Voice-Over

When Matt King signed up for Facebook, in 2009, he had already lost so much of his vision that he navigated the Internet using a screen reader, a piece of software that reads a Web page’s architecture and content aloud. At the time, King was an engineer at I.B.M. and an internationally ranked competitor in the sport of tandem track cycling. (He and a sighted companion took fourth place at the 1996 Paralympic Games, in Atlanta.) Nevertheless, for King, the process of creating an account and finding friends—something that might take a sighted person fifteen minutes—consumed an entire Saturday morning. Worse, once he made his way over to his friends’ walls, they were mostly silent. The majority of people’s posts consisted of photographs, which, without an explanatory caption, were invisible to him. “I thought, Great, here’s one more space that is kind of useless to me,” he said.

King is now an accessibility specialist at Facebook. With Jeff Wieland, who leads the company’s accessibility efforts, he is behind Automatic Alt Text, a new technology that relies on artificial intelligence to generate spoken descriptions of photographs. The feature begins rolling out today to Facebook users whose iOS screen readers are set to English, and will gradually make its way into other languages and platforms.

On a recent Monday morning, King performed a demonstration of Automatic Alt Text on his iPad. At the top of his simulated news feed, a colleague, Chelsea Kohler, had posted a picture of a puffy, cheesy pizza, topped with pepperoni and olives. A female robotic voice began to narrate the page’s contents at high speed. (Synthesized voices can cram up to six hundred words into a minute without losing their crispness, which allows the user to skim through pages at a reasonably quick pace.) “Heading level five, Chelsea Kohler, photo,” the voice said. Then it read Kohler’s caption: “Sunday night splurge.” At that point, as far as King was concerned, the image could have depicted anything from a Tesla Model 3 to a packet of Baked Lays. He reloaded the page with Automatic Alt Text enabled, and the synthesized voice intoned, “Image may contain pizza, food”—equipping King with enough context to type “See you at the gym, lol,” or “Yum, looks good,” or any number of appropriate responses. (If you use Facebook on an iOS device, you can try this yourself: enable VoiceOver in Settings, open the Facebook app, and scroll through your news feed. Each time you swipe past a photo, the voice will tell you what it thinks that photo contains.)

[Image: Automatic Alt Text describes a pizza. Courtesy Facebook]

Accessibility has been a persistent problem on the Web, particularly as it has grown more visual. Two years ago, dozens of tech companies, including Facebook, came together for what Wieland described as “a giant handshake,” signing on to a set of recommendations known as the Accessible Rich Internet Applications Suite (ARIA), which aims to make the Web not just technically accessible but also enjoyable to all. (The companies’ support is reflective of more than a high-minded commitment to equal opportunity: the World Health Organization estimates that nearly four per cent of the planet’s population is visually impaired, a substantial base of potential customers.) According to the ARIA standards, any non-text content on a Web page should have a meaningful text equivalent that can be spoken by a screen reader. HTML provides a straightforward solution: the alt attribute, a short description that is embedded in an image’s underlying code. Any designer who wishes to create an accessible site, then, need only specify an alt attribute when she uploads a photo.
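The alt attribute the ARIA recommendations build on is ordinary HTML. A minimal sketch (the file names and descriptions here are illustrative, not from any real site):

```html
<!-- A screen reader speaks the alt text in place of the image. -->
<img src="pizza.jpg" alt="A pepperoni and olive pizza on a wooden table">

<!-- A purely decorative image gets an empty alt, so the reader skips it. -->
<img src="divider.png" alt="">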

For the code to be useful, though, it has to be implemented consistently across an entire site. In the case of Facebook that’s not so easily achieved, because photos are typically uploaded by individual users, who are not required to add tags or a descriptive caption—and don’t usually bother. With more than two billion images shared on Facebook, Instagram, Messenger, and WhatsApp every day, it would take a team of some 1.6 million full-time human taggers to keep up. Fortunately, in the past few years, advances in machine learning have led to the creation of artificial-intelligence agents that are capable of object recognition. Google Plus began automatically tagging its users’ photos in 2013, and Flickr followed suit in 2014. In both cases, the algorithms were implemented primarily to improve searchability. They were remarkably accurate, although there were some appalling failures. (Both programs were initially prone to adding the tag “ape” to photographs of black people.) Facebook is the first to implement the technology for accessibility, and it has taken a more cautious approach. King explained that their A.I. is programmed not to add a tag unless it is at least seventy per cent sure that it is right. If an acne-riddled face comes back labelled “pizza” in a Google image search, no one’s feelings are hurt. But if King reacts to a friend’s photograph of her spotty-faced teen-ager with a comment about pizza, he is likely to feel diminished rather than empowered. “This is definitely a situation where bad data is worse than no data,” King said.
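King's threshold rule is simple to sketch. The function and numbers below are an assumption for illustration, not Facebook's actual code: suppose the recognizer returns (tag, confidence) pairs, and only tags scoring at least 0.7 survive into the spoken description.

```python
def describe_image(predictions, threshold=0.7):
    """Build an 'Image may contain ...' string from (tag, confidence) pairs.

    Tags below the confidence threshold are dropped entirely: in King's
    words, bad data is worse than no data.
    """
    confident = [tag for tag, score in predictions if score >= threshold]
    if not confident:
        return None  # say nothing rather than risk a wrong description
    return "Image may contain " + ", ".join(confident)

# A hypothetical recognizer output for Kohler's photo:
predictions = [("pizza", 0.94), ("food", 0.88), ("table", 0.41)]
print(describe_image(predictions))  # Image may contain pizza, food
```

Note that an uncertain tag like "table" is not softened with a qualifier but omitted outright, which is why the spoken descriptions err on the side of terseness.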

The A.I. currently recognizes about a hundred objects and concepts, including food words and appearance descriptors such as “beard” and “smiling.” This limited vocabulary, combined with the technology’s built-in aversion to making errors, means that its descriptions lack the richness of visual content. “It can be a bit of a tease,” King said. “People will write ‘Wow,’ but the automatic description doesn’t make me say wow.” A photograph of Northern California’s redwoods, soaring skyward like a cathedral, will be translated as “Image may contain trees, sky, outdoors.” An engagement portrait becomes “two people, smiling, jewelry.” Kohler’s splurge was simply “pizza,” rather than “pepperoni and olive pizza”: the system’s object recognition is not yet fine-grained enough to specify toppings. Nonetheless, King said, Automatic Alt Text has made his news feed “a lot more entertaining,” and, although he has hardly become a high-volume commenter, he is now more likely to “like” an image with confidence.

King, Wieland, and their colleagues began implementing the code behind Automatic Alt Text last June, and a few months ago they recruited a group of alpha testers. Among them was Marco Salsiccia, a former animation and visual-effects artist in the film industry, who lost his eyesight completely two years ago, over the course of just forty minutes, because of a retinal occlusion. “I used to use Facebook to share memes,” Salsiccia told me. “Afterward, it became much more personal. It felt like my only source of contact while I was learning how to navigate the world as a blind person—but I couldn’t understand ninety per cent of what my friends were posting.” After playing with Automatic Alt Text, Salsiccia has a few complaints. As an artist, he is particularly annoyed by the fact that the A.I. apparently cannot differentiate between a drawing and a photo. “There’s room for improvement,” he said. Still, he told me that he was looking forward to the technology’s public début.

Wieland was quick to point out that Automatic Alt Text is still in its early days. A group of five thousand beta testers began using it last month, providing feedback and feature requests through a survey. (Their unusually high response rate was evidence, Wieland suggested, of the hunger for better image descriptions among the visually impaired.) The testers were particularly keen on the A.I. learning to recognize text and identify people’s faces. “Almost every single questionnaire, if face recognition wasn’t the first request, it was the second,” Wieland said. “That’s the thing I’m itching for, because I know we can do it,” King said. Unfortunately, Wieland noted, privacy concerns have made automatic face tagging off limits, at least for now. Internally, however, the team is beginning to experiment with both features. At the same time, they are making the A.I. capable of answering questions, so that users can glean more information from talking to it. “This is just the first step,” Wieland said.

