Silei Xu, Giovanni Campagna, Jian Li, and Monica S. Lam
In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, October 2020 (To Appear).
Building a question-answering agent currently requires large annotated datasets, which are prohibitively expensive. This paper proposes Schema2QA, an open-source toolkit that can generate a Q&A system from a database schema augmented with a few annotations for each field. The key concept is to cover the space of possible compound queries on the database with a large number of in-domain questions synthesized with the help of a corpus of generic query templates. The synthesized data and a small paraphrase set are used to train a novel neural network based on the BERT pretrained model.
We use Schema2QA to generate Q&A systems for five Schema.org domains, restaurants, people, movies, books and music, and obtain an overall accuracy between 64% and 75% on crowdsourced questions for these domains. Once annotations and paraphrases are obtained for a Schema.org schema, no additional manual effort is needed to create a Q&A agent for any website that uses the same schema. Furthermore, we demonstrate that learning can be transferred from the restaurant to the hotel domain, obtaining a 64% accuracy on crowdsourced questions with no manual effort. Schema2QA achieves an accuracy of 60% on popular restaurant questions that can be answered using Schema.org. Its performance is comparable to Google Assistant, 7% lower than Siri, and 15% higher than Alexa. It outperforms all these assistants by at least 18% on more complex, long-tail questions.
Giovanni Campagna, Agata Foryciarz, Mehrad Moradshahi, and Monica S. Lam
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, July 2020.
This paper proposes new zero-short transfer learning technique for dialogue state tracking where the in-domain training data are all synthesized from an abstract dialogue model and the ontology of the domain. We show that data augmentation through synthesized data can improve the accuracy of zero-shot learning for both the TRADE model and the BERT-based SUMBT model on the MultiWOZ 2.1 dataset. We show training with only synthesized in-domain data on the SUMBT model can reach about 2/3 of the accuracy obtained with the full training dataset. We improve the zero-shot learning state of the art on average across domains by 21%.
Giovanni Campagna, Silei Xu, Mehrad Moradshahi, Richard Socher, and Monica S. Lam
In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, Phoenix, AZ, June 2019.
To understand diverse natural language commands, virtual assistants today are trained with numerous labor-intensive, manually annotated sentences. This paper presents a methodology and the Genie toolkit that can handle new compound commands with significantly less manual effort.
We advocate formalizing the capability of virtual assistants with a Virtual Assistant Programming Language (VAPL) and using a neural semantic parser to translate natural language into VAPL code. Genie needs only a small realistic set of input sentences for validating the neural model. Developers write templates to synthesize data; Genie uses crowdsourced paraphrases and data augmentation, along with the synthesized data, to train a semantic parser. We also propose design principles that make VAPL languages amenable to natural language translation.
We apply these principles to revise ThingTalk, the language used by the Almond virtual assistant. We use Genie to build the first semantic parser that can support compound virtual assistants commands with unquoted free-form parameters. Genie achieves a 62% accuracy on realistic user inputs. We demonstrate Genie's generality by showing a 19% and 31% improvement over the previous state of the art on a music skill, aggregate functions, and access control.
Giovanni Campagna, Silei Xu, Rakesh Ramesh, Michael Fischer, and Monica S. Lam
In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 2018.
This paper proposes a novel approach to let consumers share data from their existing web accounts and devices easily, securely, and with fine granularity of control. Our proposal is to have our personal virtual assistant be responsible for sharing our digital assets. The owner can specify fine-grain access control in natural language; the virtual assistant executes access requests on behalf of the requesters and returns the results, if the requests conform to the owner's access control policies.
Specifically, we allow a virtual assistant to share any ThingTalk command--an event-driven task composed of skills drawn from Thingpedia, a crowdsourced repository with over 200 functions currently. Access control in natural language is translated into TACL, a formal language we introduce to let users express for whom, what, when, where, and how ThingTalk commands can be executed. TACL policies are in turn translated into SMT (Satisfiability Modulo Theories) formulas and enforced using a provably correct algorithm. Our Distributed ThingTalk Protocol lets users access their own and others' data through their own virtual assistant, while enabling sharing without disclosing information to a third party.
The proposed ideas have been incorporated and released in the open-source Almond virtual assistant. 18 of the 20 users in a study say that they like the concept proposed, and 14 like the prototype. We show that users are more willing to share their data given the ability to impose TACL constraints, that 90% of enforceable use cases suggested by 60 users are supported by TACL, and that static and dynamic conformance of policies can be enforced efficiently.
Giovanni Campagna, Rakesh Ramesh, Silei Xu, Michael Fischer, and Monica S. Lam
In Proceedings of the 26th International World Wide Web Conference (WWW), Perth, Australia, April 2017.
This paper presents the architecture of Almond, an open, crowdsourced, privacy-preserving and programmable virtual assistant for online services and the Internet of Things (IoT). Included in Almond is Thingpedia, a crowdsourced public knowledge base of open APIs and their natural language interfaces. Our proposal addresses four challenges in virtual assistant technology: generality, interoperability, privacy, and usability. Generality is addressed by crowdsourcing Thingpedia, while interoperability is provided by ThingTalk, a high-level domain-specific language that connects multiple devices or services via open APIs. For privacy, user credentials and user data are managed by our open-source ThingSystem, which can be run on personal phones or home servers. Finally, we create a natural language interface, whose capability can be extended via training with the help of a menu-driven interface.
We have created a fully working prototype, and crowdsourced a set of 187 functions across 45 different kinds of devices. Almond is the first virtual assistant that lets users specify trigger-action tasks in natural language. Despite the lack of real usage data, our experiment suggests that Almond can understand about 40% of the complex tasks when uttered by a user familiar with its capability.
Michael H. Fischer, Giovanni Campagna, Euirim Choi, and Monica S. Lam
While Alexa can perform over 100,000 skills on paper, its capability covers only a fraction of what is possible on the web. To reach the full potential of an assistant, it is desirable that individuals can create skills to automate their personal web browsing routines. Many seemingly simple routines, however, such as monitoring COVID-19 stats for their hometown, detecting changes in their child's grades online, or sending personally-addressed messages to a group, cannot be automated without conventional programming concepts such as conditional and iterative evaluation. This paper presents VASH (Voice Assistant Scripting Helper), a new system that empowers users to create useful web-based virtual assistant skills without learning a formal programming language. With VASH, the user demonstrates their task of interest in the browser and issues a few voice commands, such as naming the skills and adding conditions on the action. VASH turns these multi-modal specifications into skills that can be invoked invoice on a virtual assistant. These skills are represented in a formal programming language we designed called WebTalk, which supports parameterization, function invocation, conditionals, and iterative execution. VASH is a fully working prototype that works on the Chrome browser on real-world websites. Our user study shows that users have many web routines they wish to automate, 81% of which can be expressed using VASH. We found that VASH Is easy to learn, and that a majority of the users in our study want to use our system.
Jackie (Junrui) Yang, Monica S. Lam, James A. Landay
In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST), October 2020.
Many computing tasks, such as comparison shopping, two-factor authentication, and checking movie reviews, require using multiple apps together. On large screens, "windows, icons, menus, pointer" (WIMP) graphical user interfaces (GUIs) support easy sharing of content and context between multiple apps. So, it is easy to see the content from one application and write something relevant in another application, such as looking at the map around a place and typing walking instructions into an email. However, although today's smartphones also use GUIs, they have small screens and limited windowing support, making it hard to switch contexts and exchange data between apps.
We introduce DoThisHere a multimodal interaction technique that streamlines cross-app tasks and reduces the burden these tasks impose on users. Users can use voice to refer to information or app features that are off-screen and touch to specify where the relevant information should be inserted or is displayed. With DoThisHere, users can access information from or carry information to other apps with less context switching.
We conducted a survey to find out what cross-app tasks people are performing or wish to perform on their smartphones. Among the 125 tasks that we collected from 75 participants, we found that 59 of these tasks are not well supported currently. DoThisHere is helpful in completing 95% of these unsupported tasks. A user study, where users are shown the list of supported voice commands when performing a representative sample of such tasks, suggests that DoThisHere may reduce expert users' cognitive load; the Query action, in particular, can help users reduce task completion time.
Jackie Yang, Gaurab Banerjee, Vishesh Gupta, Monica S. Lam, and James A. Landay
In CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, April 2020.
Although state-of-the-art smart speakers can hear a user's speech, unlike a human assistant these devices cannot figure out users' verbal references based on their head location and orientation. Soundr presents a novel interaction technique that leverages the built-in microphone array found in most smart speakers to infer the user's spatial location and head orientation using only their voice. With that extra information, Soundr can figure out users references to objects, people, and locations based on the speakers' gaze, and also provide relative directions.
To provide training data for our neural network, we collected 751 minutes of data (50x that of the best prior work) from human speakers leveraging a virtual reality headset to accurately provide head tracking ground truth. Our results achieve an average positional error of 0.31m and an orientation angle accuracy of 34.3 degrees for each voice command. A user study to evaluate user preferences for controlling IoT appliances by talking at them found this new approach to be fast and easy to use.
Michael H. Fischer, Richard R. Yang, and Monica S. Lam
This paper presents ImagineNet, a tool that uses a novel neural style transfer model to enable end-users and app developers to restyle GUIs using an image of their choice. Former neural style transfer techniques are inadequate for this application because they produce GUIs that are illegible and hence nonfunctional. We propose a neural solution by adding a new loss term to the original formulation, which minimizes the squared error in the uncentered cross-covariance of features from different levels in a CNN between the style and output images. ImagineNet retains the details of GUIs, while transferring the colors and textures of the art. We presented GUIs restyled with ImagineNet as well as other style transfer techniques to 50 evaluators and all preferred those of ImagineNet. We show how ImagineNet can be used to restyle (1) the graphical assets of an app, (2) an app with user-supplied content, and (3) an app with dynamically generated GUIs.
Michael Fischer, Giovanni Campagna, Silei Xu, and Monica S. Lam
In 20th International Conference on Human-Computer Interaction with Mobile Devices and Services. (MobileHCI), 2018.
This paper presents Brassau, a graphical virtual assistant that converts natural language commands into GUIs. A virtual assistant with a GUI has the following benefits compared to text or speech based virtual assistants: users can monitor multiple queries simultaneously, it is easy to re-run complex commands, and user can adjust settings using multiple modes of interaction. Brassau introduces a novel template-based approach that leverages a large corpus of images to make GUIs visually diverse and interesting. Brassau matches a command from the user to an image to create a GUI. This approach decouples the commands from GUIs and allows for reuse of GUIs across multiple commands. In our evaluation, users prefer the widgets produced by Brassau over plain GUIs.