Virtual Assistant 1.0 technology is weak on five common software-engineering metrics. Virtual Assistant 2.0 technology addresses these metrics with the following key concepts:
Power and effectiveness. We give virtual assistants a powerful natural-language programming interface: a neural semantic parser translates natural-language dialogues into code, and the agent can be programmed to respond in a controlled manner.
Formal language for collaboration. The target (executable) language is ThingTalk, an open high-level language designed specifically for virtual assistants.
Cost efficiency through training-data synthesis. We eliminate the need for massive manual annotation by synthesizing most of the training data (pairs of natural-language utterances and ThingTalk programs) from natural-language grammars, which we refer to as templates. Synthesis teaches the neural network compositionality and coverage from scratch; a small amount of annotated real data is added so the network also generalizes to natural, unseen phrasings.
Abstraction & Tools: supporting reuse and refinement. We have built a programming tool, called Genie, that supports reuse (a domain-independent library of templates and an open Thingpedia library of domain knowledge) and refinement (developers can debug the errors and add to the templates). Our trained network, LUInet, is also made openly available.
Our Genie tool suite accepts knowledge-base schemas and API signatures and generates a multi-turn conversational agent that answers questions and performs transactions. The agent is a rule-based system, so developers can control what the agent does and says in response to user inputs.
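As a concrete illustration, here is a minimal Python sketch of the kind of domain description such a tool might consume. The structure and names (restaurant_schema, make_reservation, and so on) are hypothetical assumptions for illustration, not Genie's actual manifest format.

```python
# Hypothetical sketch of a domain description handed to a Genie-like tool;
# the structure and field names are illustrative, not Genie's real format.
restaurant_schema = {
    "name": "Restaurant",
    # Database fields the agent can query.
    "fields": {
        "name":          {"type": "String",      "phrases": ["name"]},
        "geo":           {"type": "Location",    "phrases": ["location", "address"]},
        "rating":        {"type": "Number(0,5)", "phrases": ["rating", "stars"]},
        "reviewCount":   {"type": "Number",      "phrases": ["number of reviews"]},
        "servesCuisine": {"type": "String",      "phrases": ["cuisine", "type of food"]},
    },
    # API signatures for transactions the agent can perform.
    "actions": {
        "make_reservation": {
            "params": {"restaurant": "Restaurant", "time": "Time", "partySize": "Number"},
        },
    },
}
```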
User inputs are parsed by a neural semantic parser that translates natural language into ThingTalk, a formal virtual-assistant programming language. An open, well-defined formal language is the first step in facilitating collaboration on common datasets, knowledge bases, and toolsets, just as Java, JavaScript, and Python support collaboration.
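To make the parser's contract concrete, here is a hedged Python sketch of its input/output behavior. The hard-coded lookup stands in for the neural network, and the ThingTalk-style output strings are approximate rather than exact ThingTalk syntax.

```python
def semantic_parse(utterance: str) -> str:
    """Stand-in for the neural semantic parser: natural language in, a
    formal program out. The lookup table is hard-coded for illustration;
    the ThingTalk-style strings below are approximate, not exact syntax."""
    examples = {
        "show restaurants rated higher than 4.5":
            "@Restaurant() filter rating >= 4.5",
        "book a table for two at 7pm":
            "@make_reservation(partySize=2, time=7pm)",
    }
    return examples.get(utterance.lower(), "UNPARSED")

print(semantic_parse("Show restaurants rated higher than 4.5"))
# -> @Restaurant() filter rating >= 4.5
```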
We reduce the expensive data-acquisition cost of neural networks by synthesizing most of the training data. The synthesized data covers a wide variety of conversations around the given schemas and signatures; a small number of annotated real conversations are included in training to add diversity to the data set.
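A minimal sketch of this mixing step, assuming a simple upsampling scheme for the scarce real data (the factor is an illustrative assumption, not a Genie constant):

```python
import random

def build_training_set(synthesized, annotated_real, upsample=5):
    """Mix a large synthesized corpus with a small set of annotated real
    conversations. Upsampling (an assumed, tunable factor) keeps the
    scarce real data from being drowned out by the synthesized data."""
    data = list(synthesized) + list(annotated_real) * upsample
    random.shuffle(data)
    return data
```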
The synthesis is grammar-driven: grammar rules, or templates, map fragments of natural-language utterances to their corresponding ThingTalk representations. The grammars can be divided into three levels (a sketch combining all three follows the list):
An abstract, domain-independent dialogue model. Dialogues are modeled as state transitions consisting of pairs of agent and user utterances. The model captures the typical flow of a conversation: the user greets the agent, then queries data or requests actions such as making a reservation, perhaps asks for recommendations, reiterates some request, and finally cancels or executes a transaction. Note that the model is used only to generate training data; the neural network can generalize to transitions not included in the model.
Generic sentence variety. Genie has a library of about 800 generic query templates that cover many different ways of asking who, what, when, and where questions, expressing concepts in different parts of speech, and handling types and measurements.
Domain-specific expressions of database and API fields. Genie uses heuristics and pre-trained models to automatically identify different ways to refer to the given fields; developers are also encouraged to supply domain-specific annotations.
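The Python sketch below combines the three levels at toy scale: one dialogue transition, two generic query templates, and a handful of domain-specific field phrases. All names and the ThingTalk-style strings are illustrative assumptions, not Genie's actual template language.

```python
import itertools

# Level 3: domain-specific phrases for database/API fields (normally
# derived by heuristics and pre-trained models, plus developer
# annotations). Keys are ThingTalk-style filters, written approximately.
FIELD_PHRASES = {
    "rating >= 4": ["rated at least 4 stars", "with a rating of 4 or more"],
    "servesCuisine == 'italian'": ["that serve Italian food", "with Italian cuisine"],
}

# Level 2: generic, domain-independent query templates.
QUERY_TEMPLATES = [
    ("show me restaurants {f}", "@Restaurant() filter {c}"),
    ("which restaurants are there {f}", "@Restaurant() filter {c}"),
]

# Level 1: one state transition (user queries after greeting) out of the
# abstract dialogue model; a real model walks many such transitions.
def synthesize():
    for (nl_tmpl, tt_tmpl), (cond, phrases) in itertools.product(
            QUERY_TEMPLATES, FIELD_PHRASES.items()):
        for phrase in phrases:
            yield (nl_tmpl.format(f=phrase), tt_tmpl.format(c=cond))

for utterance, program in synthesize():
    print(utterance, "->", program)
# e.g. show me restaurants rated at least 4 stars -> @Restaurant() filter rating >= 4
```

Even this toy grammar emits the full cross-product of templates and field phrases, which is how a modest number of rules yields a large, compositional training set.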
Once errors are detected on input data, developers can examine them and add the missing concepts as templates, which are then combined with all other concepts to teach the neural network generality (see the sketch below). This style of refinement is familiar to software engineers and is much more direct than simply annotating many more utterances. Note that any improvement made to the two domain-independent levels applies to all domains, allowing language information to accumulate and be reused as more and more agents are built with Genie.
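Continuing the toy sketch above, fixing such an error amounts to adding one entry, which then composes with every existing template:

```python
# Suppose error analysis shows the parser misses phrasings about review
# counts. Adding one template for the missing concept lets it compose
# with every existing query template in the sketch above.
FIELD_PHRASES["reviewCount >= 100"] = [
    "with at least 100 reviews",
    "with 100 or more reviews",
]
# Re-running synthesize() now also yields, for example:
#   show me restaurants with at least 100 reviews
#       -> @Restaurant() filter reviewCount >= 100
```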
Besides lowering cost, our training data has better coverage of the knowledge base, so our approach handles long-tail questions better than the traditional approach, which is often manually tuned to answer popular questions.
Today, many websites include structured data on their pages, using the standard Schema.org schemas, to facilitate search engines. We convert the Schema.org representation into a relational representation that Genie can accept; the format is publicly available in the Thingpedia repository. We can refine the annotations and update the generic templates for each Schema.org domain. To date we have worked on five Schema.org domains: restaurants, people, movies, books, and music. This information, available on Thingpedia, can be used to create a conversational agent over the structured Schema.org data of any website in these domains. Active steps to improve accuracy include improving the entity recognizers and the generic templates.
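As a hedged sketch of the conversion step, the Python below flattens one Schema.org JSON-LD record into a relational row; the actual Thingpedia format differs, and real Schema.org data is messier (nested types, missing fields).

```python
import json

# Sketch: flatten one Schema.org JSON-LD record, as found on a restaurant
# web page, into a relational row. This handles only the simple case.
jsonld = json.loads("""{
  "@type": "Restaurant",
  "name": "Tasty Thai",
  "servesCuisine": "Thai",
  "aggregateRating": {"@type": "AggregateRating",
                      "ratingValue": 4.6, "reviewCount": 214}
}""")

row = {
    "name": jsonld.get("name"),
    "servesCuisine": jsonld.get("servesCuisine"),
    "rating": jsonld.get("aggregateRating", {}).get("ratingValue"),
    "reviewCount": jsonld.get("aggregateRating", {}).get("reviewCount"),
}
print(row)
# -> {'name': 'Tasty Thai', 'servesCuisine': 'Thai', 'rating': 4.6, 'reviewCount': 214}
```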
Our generated agents achieve an overall accuracy between 64% and 75% on crowd-sourced, long-tail questions based on the available properties for these domains. On such questions in the restaurant domain, Genie achieves 70%, whereas Siri achieves 51%, Google Assistant 42%, and Alexa 41%. Here are some questions answered correctly only by Genie: "Show restaurants near Stanford rated higher than 4.5" and "Show me restaurants rated at least 4 stars with at least 100 reviews."