Machine Vision and Augmented Reality

By Elizabeth Losh

The camera opens on a group of musicians preparing to play a string quartet by Debussy. As the music swells from their instruments, rectangles appear and begin to follow their faces.

Guesses from biometric algorithms flicker over the circle of players, trying to classify the musicians by gender, age, and current mood. The system is clearly fallible, shifting between genders as musicians turn their heads and cycling through wildly improbable ages with each changing expression.

Soon frames appear all over the scene. A female violinist is identified incorrectly as a “black chair with a red umbrella.” A stand with sheet music is mis-tagged as a “white wooden table.” Clearly, the computer is struggling to make sense of complex moving images, using its training data to interpret the patterns it thinks it perceives.

As the film by artist Trevor Paglen progresses, the computer vision becomes more abstract, and the images are captured by software progressively more divorced from the domains of human sight. The visitor to Paglen’s installation at the Smithsonian is now watching the performers through algorithms used in self-driving cars, guided missiles, and aerial drones.

Although this footage may seem bizarre and otherworldly, it is important to remember that today, words and images are much more likely to be read or viewed by machines than by human beings, as enormous quantities of data are filtered, correlated, aggregated, and sorted by algorithms designed for digital archives, search engine portals, social network sites, and systems policing intellectual property, national security, public safety, civic propriety, medical normality, and gender conformity. Use of these algorithms as a substitute for human interpreters tends to raise many anxieties, particularly among those who worry either about omnipresent surveillance or about a total abdication of oversight possible in the very near future.

The Trouble with Algorithms

Another of Paglen’s artworks is a black-and-white silver gelatin print of Shoshone Falls, Idaho, an expanse of sublime nature representing the American West that was also captured by the lens of Timothy O’Sullivan in 1874. In Paglen’s version of the waterfall, two different computer vision technologies have been applied: a program to recognize faces and one to demarcate road boundaries. Although the image deploys a traditional aesthetic associated with antiquated technologies, rectangles appear around ghostly faces and perspective lines highlight hallucinated thoroughfares. Thus the placid image hints at two radically opposed nightmares about computer vision: A dystopian environment of constant monitoring with no sanctuaries of privacy, or an equally dysfunctional world of self-driving cars running amok, haphazard medical diagnoses, and other attempts to automate the labor of preserving safety and health. Between the Scylla and Charybdis of surveillance and abdicated oversight, how are humanists and educators supposed to navigate the contemporary machine vision landscape?

Machine Vision Revolution

Critics like Jill Walker Rettberg have noted that the last great technological change in visual culture, during the Early Modern period, connected the sciences and the humanities closely, as humanist thinkers considered the philosophical, aesthetic, and cultural ramifications of techniques for representing linear perspective and anatomical proportions, as well as optical devices like the camera obscura, the microscope, and the telescope. Yet the machine vision revolution has been relatively unexamined in the humanities, even in the digital humanities. As Rettberg points out in one of her Snapchat Research Stories, interacting with biometric grids and machine vision algorithms has become a normal part of day-to-day communication using augmented reality technologies available as filters on smartphones. For many people, augmented reality offers a way to try on new identities or engage in social performance and play.

Ethical Considerations

Given the obvious limitations of current machine vision technologies, it is understandable that those working in academic contexts might fear that their expert analysis could be co-opted by complicated black-boxed machines, whether hobbled by the hubris of immature artificial intelligence technologies not ready to be launched or endowed with an inhumane efficiency that nullifies consent and creates fear.

Companies like SenseTime are already encroaching on the prerogatives of higher education. Using machine learning algorithms and training data derived from the profiles of over a billion Chinese citizens, SenseTime promises that misbehavior could be eliminated, along with anonymity. When I visited one of their labs at the Chinese University of Hong Kong in September, researchers boasted of their university connections, particularly a recent high-profile alliance with MIT. During my tour, they appealed to my identity as a college instructor. Just as SenseTime could automate taking attendance with the same software used to identify shoplifters in a mall, thereby freeing up faculty for higher-order tasks, the company’s products could also identify students who were inattentive or sleepy, using the same algorithms being tested on the faces of drivers in city traffic. Just as the danger of causing accidents could be overcome, bored or drowsy students could be alerted that they were at risk of missing critical material.

The Potential of Machine Vision

Bethany Nowviskie has celebrated the potential of such advanced visual recognition algorithms as a boon to scholars of the environmental humanities, who will be able to mine the millions of images in the Biodiversity Heritage Library that are drawn from centuries of gorgeous notebook sketches and lavish book illustrations. Nowviskie encourages humanists not to hide from machine vision, even if our natural tendency might be to try to make ourselves invisible to its gaze. That camouflage strategy, focused on outsmarting the machine, has been literalized in the work of digital artists like Zach Blas and Hito Steyerl.

From the perspective of my own campus, William & Mary, I get to see an international team of faculty, librarians, and students undertaking the daunting task of interpreting and curating over 300,000 pages of archival materials from the British Royal Archives to produce the Georgian Papers Programme, an ambitious digital humanities project aimed at analyzing a complex era of exploration, colonialism, cultural diffusion, and revolution with primary sources that include essays, letters, reports, inventories, recipe books, menus, and didactic material for the children of the royal family.

Some tasks are made more manageable by using Transkribus, a handwriting recognition tool that can be trained on a sample of an individual’s handwriting. Although many of the manuscript pages in the script of George III show his erratic state of mind during periods of mental disorder, the large samples available of his penmanship make it possible to automatically code large collections of his handwritten documents.

Rather than study the static paper documents that memorialize the Georgian kings, I scrutinize multimedia digital artifacts that compose the record of today’s political leaders. As channels for content multiply, new computational and visualization techniques can foster new forms of humanities scholarship and public access to historical records. Now that the speeches of contemporary political leaders are recorded and archived, humanists have a rich record of public rhetoric to analyze that includes facial expression, bodily gesture, vocal performance, and frequently the use of sets and props.

Conclusion

I am enthusiastic about using machine vision technology in my own research on digital rhetoric and incorporating it into my teaching to help students interpret complex moving images. Visual rhetoric has a long tradition in the humanities that includes analysis of symbolic objects in portraits of world leaders or the choreography of their oratorical performances. This approach to digital humanities can be integrated into more traditional rhetorical analysis, because elected office holders also produce memoirs, letters, editorials, and other forms of written discourse. Personally, I do not feel alienated by machine vision, because it opens up new forms of collaboration, new ways to approach our objects of study, and new evidence for arguments that can enhance our civic understanding.

Resources

Rettberg, Jill W. Seeing Ourselves Through Technology: How We Use Selfies, Blogs and Wearable Devices to See and Shape Ourselves. UK: Palgrave Macmillan, 2014.

Nowviskie, Bethany. Reconstitute the World. http://nowviskie.org/2018/reconstitute-the-world/. 12 June 2018.

ELIZABETH LOSH, PHD (emlosh@wm.edu) is Associate Professor of English and American Studies at William & Mary, specializing in rhetoric, digital publishing, feminism & technology, digital humanities, and electronic literature. She is author of Virtualpolitik: An Electronic History of Government Media-Making in a Time of War, Scandal, Disaster, Miscommunication, and Mistakes and co-author of Understanding Rhetoric: A Graphic Guide to Writing, Second Edition.