In a previous short entry, I gave an introduction to chatbots: their current high popularity, some platform options and basic design suggestions.
In this post, I will instead illustrate what I believe is a more intriguing scenario: a deep-learning-based solution for building a chatbot's off-topic behavior and “personality”. In other words, when confronted with off-topic questions, the bot will try to automatically generate a possibly relevant answer from scratch, based only on a pre-trained RNN model.
What follows are four self-contained sections, so you should be able to jump around and focus on just the one(s) you are interested in without problems.
- Short intro on chatbots tasks and types.
- Details of the RNN model used: high-level model architecture, info on training data sources and pre-processing steps, as well as a link to the code repository. I'm not going into detail on RNNs, but I'm including what I believe are some of the best related resources and tutorials.
- Architecture of the final solution: the working chatbot involves separate, heterogeneous components. I will illustrate them and their interactions, while describing all the tools and resources involved.
- Showcase of the chatbot in action: this part is free of technical details and pure entertainment, so jump straight here if you are not interested in the rest, or if you need motivation to check out the other sections.
Chatbots (or conversational agents) can be decomposed into two separate but dependent tasks: understanding and answering.
Understanding is about interpreting the user input and assigning it a semantic and pragmatic meaning. Answering is about providing the best-suited response, based on the information obtained during the understanding phase and on the chatbot's tasks/goals.
This post provides a very good overview of two different models for the answering task, and goes into great detail on the application of deep learning in chatbots.
For retrieval-based models, the answering process consists mostly of some kind of lookup (with various degrees of sophistication) from a predefined set of answers. Chatbots currently used in production environments, presented or handed over to clients and customers, will most likely belong to this category.
On the other hand, generative models are expected to, well… generate! They are most often based on basic probabilistic models or on machine learning ones. They don't rely on a fixed set of answers, but they still need to be trained in order to generate new content. Markov chains were originally used for the task of text generation, but lately recurrent neural networks (RNNs) have gained more popularity, after many promising practical examples and showcases (Karpathy's article).
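To give an idea of how simple the Markov-chain baseline is compared to an RNN, here is a toy word-level generator (a standalone illustration, not part of the chatbot's code):

```python
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each sequence of `order` words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain

def generate(chain, seed, length=10, rng=random):
    """Walk the chain from a seed, sampling a successor word at each step."""
    out = list(seed)
    for _ in range(length):
        successors = chain.get(tuple(out[-len(seed):]))
        if not successors:
            break  # dead end: no word ever followed this context
        out.append(rng.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat slept"
chain = build_chain(corpus)
print(generate(chain, ("the",), length=5))
```

The chain has no notion of grammar or meaning; it only replays observed word transitions, which is exactly the limitation that motivates moving to RNNs.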
Generative models for chatbots still belong to the research world, or to the playground of those who simply enjoy building and demoing test applications of their own models.
I believe that, for most business use cases out there, they are still not suited for a production environment. I cannot picture a client who wouldn't bring up Tay when presented with a generative-model option.
A Recurrent Neural Network is a deep learning model dedicated to handling sequences. An internal state is responsible for taking into account and properly handling the dependencies that exist between successive inputs (crash course on RNNs).
Apart from the relative elegance of the model, it's impossible not to be captured and fascinated by it, thanks to the many online demos and examples showcasing its generative capabilities, from handwriting to movie script generation.
Given its properties, this model is really well suited for various NLP tasks, and it is exactly in the text generation context that I started exploring it, playing with basic concepts using Theano and TensorFlow, before moving to Keras for training the final models. Keras is a high-level neural networks library that can run on top of either Theano or TensorFlow, but if you want to learn and play with the more basic mechanisms of RNNs and machine learning models in general, I suggest giving one of the other libraries a try, especially while following the great tutorials by Denny Britz.
For my task I trained a sequence-to-sequence model at word level: I feed the network a list of words and expect a list of words as output. Instead of using a vanilla RNN, I used long short-term memory (LSTM) layers, which guarantee better control over the memory mechanism of the network (understanding LSTMs). The final architecture includes just two LSTM layers, each followed by dropout.
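In Keras terms, the described architecture could be sketched roughly as follows. This is a minimal reconstruction from the description above: the vocabulary size, sequence length and layer width are placeholders, not the values actually used (those live in the repository).

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

VOCAB_SIZE = 1000   # placeholder; the post caps the real vocabulary below 10000
SEQ_LEN = 20        # placeholder sequence length
HIDDEN = 128        # placeholder LSTM width

model = Sequential([
    # input: sequences of one-hot encoded words, shape (SEQ_LEN, VOCAB_SIZE)
    LSTM(HIDDEN, return_sequences=True, input_shape=(SEQ_LEN, VOCAB_SIZE)),
    Dropout(0.2),
    # return_sequences=True again, since the target is a shifted word sequence
    LSTM(HIDDEN, return_sequences=True),
    Dropout(0.2),
    # per-timestep probability distribution over the vocabulary
    Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

With `return_sequences=True` on both layers, the final `Dense` is applied at every timestep, matching the "same sequence shifted by one word" training setup described later in this section.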
For now I still rely on one-hot encoding of each word, often limiting the size of the vocabulary (<10000). A highly advisable next step would be to explore using word embeddings instead.
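A minimal sketch of this encoding step, with a capped vocabulary and an out-of-vocabulary bucket (the helper names and the `<UNK>` token are my own illustration, not necessarily what the repository uses):

```python
import numpy as np
from collections import Counter

MAX_VOCAB = 10000  # the vocabulary is kept below this size

def build_vocab(tokens, max_size=MAX_VOCAB):
    """Keep the most frequent words; everything else maps to an UNK index."""
    most_common = [w for w, _ in Counter(tokens).most_common(max_size - 1)]
    word_to_idx = {w: i for i, w in enumerate(most_common)}
    word_to_idx["<UNK>"] = len(word_to_idx)
    return word_to_idx

def one_hot(tokens, word_to_idx):
    """Encode a token list as a (len(tokens), vocab_size) one-hot matrix."""
    unk = word_to_idx["<UNK>"]
    mat = np.zeros((len(tokens), len(word_to_idx)), dtype=np.float32)
    for row, word in enumerate(tokens):
        mat[row, word_to_idx.get(word, unk)] = 1.0
    return mat

tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
X = one_hot(tokens, vocab)
```

The obvious drawback, and the reason to move to embeddings, is that each word costs a full vocabulary-sized vector and no similarity between words is captured.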
I trained the model on different corpora: personal conversations, books, songs, random datasets and movie subtitles. Initially the main goal was pure text generation: starting from nothing and generating arbitrarily long sequences of words, exactly as in Karpathy's article. With my modest setup I still obtained fairly good results, but you can see how this approach doesn't rest on the same assumptions as text generation for chatbots, which is ultimately a question-answering scenario.
Question-answering is another big NLP research problem, with its own ecosystem of complex, component-heterogeneous pipelines. Even when focusing only on deep learning, different solutions with different levels of complexity exist. What I wanted to do here was first experiment with my baseline approach and see the results for off-topic question handling.
I used the Cornell Movie-Dialogs Corpus and built a training dataset by concatenating pairs of consecutive interactions that resembled a question-answer exchange. Each such q-a pair constitutes one sentence of the final training set. During training, the model gets as input a sentence with the last word removed, while the expected output is the same sentence with the first word removed.
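The input/target shift described above can be sketched in a couple of lines (`make_training_pair` is a hypothetical helper name for illustration):

```python
def make_training_pair(qa_sentence):
    """Input drops the last token, target drops the first token,
    so at every position the model learns to predict the next word."""
    tokens = qa_sentence.split()
    return tokens[:-1], tokens[1:]

x, y = make_training_pair("how are you ? i am fine .")
# x = ['how', 'are', 'you', '?', 'i', 'am', 'fine']
# y = ['are', 'you', '?', 'i', 'am', 'fine', '.']
```

Because the q-a pair is one concatenated sentence, the model ends up learning to continue a question into its answer, which is exactly what the generation step exploits.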
Given these premises, the model is not really learning what an answer or a question is, but it should build an internal representation that can coherently generate text. This happens either by generating a sentence from scratch starting from a random element, or simply by completing a seed sentence (the potential question) one word at a time, until predefined criteria are met (e.g. a punctuation symbol is produced). All the newly generated text is then retained and provided as the candidate answer.
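The completion loop just described can be sketched as follows, with a stub standing in for the trained RNN's next-word prediction (the function names and stop criteria here are illustrative, not the actual implementation):

```python
PUNCTUATION = {".", "!", "?"}

def complete(seed_tokens, predict_next, max_words=30):
    """Extend the seed one word at a time; stop at punctuation or max length.
    Only the newly generated words form the candidate answer."""
    generated = []
    context = list(seed_tokens)
    for _ in range(max_words):
        word = predict_next(context)
        generated.append(word)
        context.append(word)
        if word in PUNCTUATION:
            break
    return generated

# toy stub in place of the model's next-word prediction
canned = iter(["i", "am", "fine", "."])
answer = complete("how are you ?".split(), lambda ctx: next(canned))
# answer == ['i', 'am', 'fine', '.']
```

Note that the seed (the question) is kept in the context fed to the model at every step, but only the generated suffix is returned as the answer.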
You can find additional details and the WIP implementation in my GitHub repository. All critiques and comments are more than welcome.
Interfacing with the chatbot is as simple as sending a message on Facebook Messenger, but the complete solution involves several heterogeneous components. Here is a minimal view of the current architecture.
Data processing and RNN model training were run on a Spark instance hosted on the IBM Data Science platform. I interfaced with it directly via Jupyter Notebooks, which simply rock!
Using the Keras callbacks system, I automatically kept track of model performance during training and backed up the weights when appropriate. At the end of each training run, the best snapshot (model weights) was moved to the persistent Object Storage connected with the Spark instance, together with the Keras model architecture and additional data related to the training corpus (e.g. vocabulary indexing).
The second piece is the model-as-a-service component: a basic Flask RESTful API that exposes the trained models for text generation via REST calls. Each model is a different endpoint and accepts different parameters for the generation task. Examples of parameters are:
- seed: seed text to use for the generation task
- temperature: an index of variance, or how much “liberty” you want to give to the model during prediction
- sentence minimum length: minimum length acceptable for a generated sentence
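The temperature parameter is commonly implemented by rescaling the model's predicted probability distribution before sampling, as in the classic Keras text-generation example. A sketch (the exact implementation in the repository may differ):

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sharpen (T < 1) or flatten (T > 1) the distribution, then sample an index.
    Low temperature makes the model conservative; high temperature gives
    it more 'liberty' and more surprising (and more broken) output."""
    rng = rng or np.random.default_rng()
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    rescaled = np.exp(logits - logits.max())
    rescaled /= rescaled.sum()
    return int(rng.choice(len(rescaled), p=rescaled))

probs = [0.1, 0.7, 0.2]
idx_cold = sample_with_temperature(probs, temperature=0.01)  # near-greedy: picks 1
```

At near-zero temperature sampling collapses to argmax, while at high temperature it approaches a uniform draw over the vocabulary.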
Internally, this application is responsible for retrieving the models from the remote Object Storage and loading them into memory, so that they are ready to generate text when the corresponding endpoints are called.
The final component is a Java Liberty web application which acts as a broker for the Facebook Messenger Platform. It is responsible for handling Facebook webhooks and subscriptions, storing users' chat history and implementing the answering-task logic. On one side, it relies on the system described in my previous article, using IBM Watson services like Language Recognition and Conversation; on the other, when specific requirements are met or when no valid answer has been produced, it can fall back on the text generation part and call the Flask API at the most convenient endpoint.
Both the Java and the Python apps are hosted on Bluemix, and for the former I'm currently working on covering additional messaging platforms like Slack, WhatsApp and Telegram.
Show Me the Bot!
You can interact with the chatbot simply via Facebook Messenger, but making a bot public (usable by whoever reaches its page) requires some work, demonstration videos and official approval from Facebook. For now, I have to manually add people as testers of my app to allow them to use it, so in case you are interested, just drop me a note.
Nevertheless, I have to admit that just watching the interactions of the few current testers was a pretty entertaining experience, a good emotional roller-coaster of amusement, shame, creepiness, pride…
Let’s start with some mixed results (on the right the bot replies, on the left some friends’ inputs, nicely anonymized).
Notice the mixed behavior here: the “marry me” response is based on Natural Language Classification, while the rest are all generated. Some sentences are grammatically wrong, but at the same time I was nicely impressed by, for example, the second answer, being fooled into reading a sense of conscious omnipotence into it.
Sometimes replies seem totally random, but still build a nice interaction, a simulation of a shy, confused and ashamed personality, a bit romantic too maybe.
Given the training data, it also learned proper punctuation, so it is likely to reply with a sentence starting with a punctuation symbol if the previous input did not end with one.
It can also come up with some seemingly deep stuff, only to then fail miserably, especially given the momentarily raised expectations.
Notice that there is no context retention between answers: it is simply not built into the model or the system. It is just the interaction flow that gives this illusion, and sometimes it can coincidentally make a really good impression:
I am aware that there is no breakthrough in any of this, and the results might be “meh”, but after all, so many people get crazily excited about their “baby's first words”, which to my knowledge are way below the bar set here…
Having a decent understanding of the mechanisms behind it, while watching it talk, makes everything even more fascinating. It feels irrationally surprising what it can formulate with no actual semantic knowledge of what it is being told or what it is replying, just statistical signals and patterns… and I wonder, after all, how many people might actually work in pretty much the same way, just with a more powerful borrowed computation instance in their skull, and a longer and richer training run on their shoulders.