Thoughts: Voice Assistant of The Future
Sep 01, 2020

Let’s talk science fiction for a moment.

What is the pinnacle of true accessibility technology? A fully fledged virtual assistant, driven by humankind’s never-ending pursuit to free itself from the mundane, menial tasks of life.

Think anywhere from the power fantasy of Jarvis to the dystopian interpretation of Her. A functioning assistant like this would birth an age of a fourth basic human right: the right to life, liberty, property, and a virtual personal assistant. But philosophizing aside, how practical are these Hollywood-esque aspirations?

Can a virtual assistant handle human sensory information? If so, how well?

Let’s take a single human sense: hearing. First, we’ll break down the components required for a virtual assistant to reach a human level of understanding.

  1. Human ears are always listening. Physically, it’s already difficult to create an independently powered microphone-array device that is listening all the time (and, more importantly, processing all the time), not to mention the social implications of always-listening devices that we would need to overcome. Difficult but possible.

  2. Human brains are always processing. Audio is one of the many sensory sources that the human brain is constantly interpreting: not just spoken words, but all forms of audio context, such as who is speaking and in what tone. Virtual assistants’ current ability to understand broader spoken contexts such as emotion is fuzzy at best. On the other hand, natural language processing has come a long way, and a machine’s ability to turn speech into text is arguably better than that of some human beings (see the sketch after this list). Difficult but possible.

  3. Human brains are always cross-referencing. Arguably one of the most difficult human tasks for a virtual assistant to replicate is the ability to understand [audio] context: context within a sentence, context within a conversation, and context within a whole person’s being. Our ability to cross-reference a vague memory (whether falsified or not) of an experience during a conversation is both powerful and currently not replicable. With the release of GPT-3, a neural-network-powered language model, perhaps we’ll start to see a shift in contextual understanding, but as it stands, virtual assistants barely remember your last sentence. Currently not possible.
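To make point 2 concrete, here is a minimal sketch of machine speech-to-text using the open-source SpeechRecognition library for Python. This is illustrative only: the audio file name is a hypothetical placeholder, and the offline Sphinx engine used below requires the pocketsphinx package to be installed.

```python
# A minimal speech-to-text sketch using the SpeechRecognition library.
# "clip.wav" is a hypothetical placeholder file.
import speech_recognition as sr

recognizer = sr.Recognizer()

# Read the entire audio clip from disk into an AudioData object.
with sr.AudioFile("clip.wav") as source:
    audio = recognizer.record(source)

try:
    # Turn speech into text with the offline CMU Sphinx engine.
    print(recognizer.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Could not understand the audio.")
```

Even this small snippet shows where the gap is: the machine returns a transcript, but nothing about who spoke or how they felt.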

Given Origami Labs’ audio-first (though not audio-only) design mentality, I’ve used voice and audio as a measuring stick to assess the current state of virtual assistants. But this same structure of 1) type of data input, 2) data translation, and 3) data interpretation can be used to assess the other silos of virtual intelligence. I would wager that a virtual assistant’s ability to reach human-level understanding of audio, visuals, and touch/motor input is at a similar stage: still learning to read, write, talk, and walk.
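To make that three-part structure concrete, here is a hypothetical sketch in Python (not Origami Labs code; every name here is illustrative) of an assistant pipeline split along those same lines:

```python
# A hypothetical sketch of the three-part structure above; all names are
# illustrative, not an actual assistant API.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AssistantPipeline:
    capture: Callable[[], Any]       # 1) type of data input (e.g. a mic stream)
    translate: Callable[[Any], str]  # 2) data translation (e.g. speech-to-text)
    interpret: Callable[[str], Any]  # 3) data interpretation (e.g. intent, context)

    def run(self) -> Any:
        raw = self.capture()
        text = self.translate(raw)
        return self.interpret(text)

# Stubbed usage: each stage can be swapped out as the state of the art improves.
pipeline = AssistantPipeline(
    capture=lambda: b"<audio bytes>",
    translate=lambda audio: "turn on the lights",
    interpret=lambda text: {"intent": "lights_on"},
)
print(pipeline.run())  # -> {'intent': 'lights_on'}
```

The point of the split is that stages 1 and 2 are engineering problems we are steadily solving, while stage 3 is where today’s assistants stall.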

But that’s not to say that virtual assistants are not ready today. There are three practical ways in which I believe virtual assistants could make, and are already making, a powerful impact.

Focusing on a domain that the virtual assistant is strong in

Virtual assistants have a sixth sense: the raw data that flows in the electronic ether. A digital assistant’s ability to sift through digital information is vastly superior to ours; they already shine, running in the background, unbeknownst to us. Google’s search engine alone sifts through information in a way we never could. The more defined this data domain is, the more powerful a virtual assistant can be in that space. We will most likely have virtual assistants that control our digital world well before we have virtual assistants that understand our real world.

Focusing on a limited domain

Virtual assistants that stick to a given domain tend to have more success. Train a virtual assistant to recognize only faces, for example, and it can easily surpass the average person’s ability to recognize people (myself included); a minimal sketch of this follows below. In the audio space, understanding language outside of emotional context could very well soon surpass our own abilities, and we’ll also soon live in a world where an autonomous car is a better driver than your average human (a limited domain here meaning the assistant focuses only on information about driving).
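As a sketch of how narrow such a domain can be, the snippet below matches one known face using the open-source face_recognition library; the image file names are hypothetical placeholders.

```python
# A limited-domain sketch: an "assistant" that does nothing but match faces,
# using the open-source face_recognition library. The image files are
# hypothetical placeholders.
import face_recognition

# Learn one known face from a reference photo.
known_image = face_recognition.load_image_file("known_person.jpg")
known_encoding = face_recognition.face_encodings(known_image)[0]

# Check whether that person appears anywhere in a new photo.
unknown_image = face_recognition.load_image_file("new_photo.jpg")
for encoding in face_recognition.face_encodings(unknown_image):
    if face_recognition.compare_faces([known_encoding], encoding)[0]:
        print("Match: the known person is in this photo.")
```

A dozen lines suffice precisely because the domain is so constrained: the system never has to decide what anything else in the photo means.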

Focusing on data input and data translation

Finally, an accessibility project like Aira is a unique proposition: the virtual assistant is relegated to collecting and highlighting information, while the human still plays the important facilitating role. Essentially, Aira lifts the burden of tasks from humans before technology can take it on completely. I’ve been following Aira since 2017, given that they are also a start-up aimed at helping the visually impaired. Aira’s core software streams video of a visually impaired person’s surroundings, through either a phone or a pair of smart glasses, and connects that person with a real, breathing Aira assistant. The human assistant, using the information they receive from the camera feed, can then dynamically handle the wide range of situations the visually impaired person may encounter. Aira’s focus on the data pipeline rather than on data interpretation is what accelerates the potential benefits of their accessibility solution, providing a full virtual assistant through a combination of technology and human interaction.

There’s a common notion that virtual assistants are useless; I understand the sentiment. In this hype cycle we’re still in the phase of over-promising and under-delivering. But the building blocks of a powerful entity are there. As the technology sector as a whole focuses on separate domains and on executable, useful versions of assistants, the day will soon come when these disparate parts consolidate: a virtual assistant that can see and hear and act at an intelligible level.

Origami Labs started with accessibility technology, so I often wonder: what is the best way in which I can help my visually impaired dad?

In most of these cases, the answer is for me or someone else to be by his side: to see what he can’t see, to hear for him, and to speak with him. Perhaps one day, when someone cannot be by his side, there will be an enabling virtual assistant that can. And that is the power of accessibility, because technology that uplifts a single person is technology that will change the world for the rest of us.

By: Kevin Johan Wong / CEO

Find your voice