Since deep learning's breakthrough in computer vision in 2012, it has spread rapidly to many AI domains. From speech recognition to machine translation, from game playing to robotics, deep learning has proven capable of tackling many different types of AI problems.
The most notable development is in natural language processing (NLP), where BERT, a deep learning model based on the transformer, has achieved state-of-the-art results on many NLP tasks, such as named entity recognition, sentiment analysis and question answering. In the SQuAD (Stanford Question Answering Dataset) competition, the top systems on the leaderboard have surpassed human performance (92% vs. 89% in F1 score).
Despite the rapid progress in many AI fields, one domain has remained elusive for AI researchers: the chatbot. Applying deep learning to chatbot design requires good training data. We also need to know how to evaluate a chatbot's performance. Is it the number of turns a conversation lasts, or how human-like the chatbot sounds? Finally, we need confidence that such an end-to-end system is possible: can we really ditch human design in chatbots, let go of all the rules, and still generate reasonable conversations?
Meena, a system designed by Google researchers, answered the above questions with flying colors. The system is described in the paper “Towards a Human-like Open-domain Chatbot”, posted on arXiv on January 31, 2020. Meena is an end-to-end system based on deep learning and trained purely on human chat data. It has achieved near-human performance and outperformed all existing chatbots. Its performance is shown below:
Meena is a breakthrough in chatbot design. For the first time, we are seeing an end-to-end chatbot that performs close to human level. Today’s open-domain chatbots, such as XiaoIce and Cleverbot, still depend on many hand-written rules, while earlier end-to-end chatbots were limited to small domains with a limited vocabulary.
Meena's achievement is attributed to 3 innovations: (1) the large training dataset it collects; (2) a solid evaluation criterion; and (3) a neural network architecture generated through NAS (neural architecture search).
For the training data, Meena used public domain social media conversations. The data contains 341 GB of text, or about 40 billion words.
For chatbot evaluation, the Meena team proposed a metric called Sensibleness and Specificity Average (SSA), which measures how human-like an answer is.
Evaluation metric: Sensibleness and Specificity Average
For each response by Meena, a human evaluator gives two labels that answer “is it sensible?” and “is it specific?”. For example, for the user utterance “I like swimming”, the reply “wonderful” is sensible but generic, whereas “How often do you swim?” is both sensible and specific. Each response is labeled by five human evaluators, and their labels are combined into a single judgment per response. The SSA score is the average of two rates: the fraction of responses judged sensible and the fraction judged specific.
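As a rough illustration of how such labels could be turned into a single score, here is a minimal sketch. It assumes that the five evaluators' labels are combined by majority vote, that a response judged not sensible is never counted as specific, and that the final score is the mean of the sensible rate and the specific rate. The function names and the sample labels are hypothetical, made up for this example.

```python
def majority(labels):
    """Return True if more than half of the boolean labels are True."""
    return sum(labels) * 2 > len(labels)


def ssa(responses):
    """Return (sensible_rate, specific_rate, ssa_score).

    `responses` is a list with one entry per chatbot response; each entry
    is a list of (sensible, specific) label pairs, one pair per evaluator.
    """
    sensible_count = 0
    specific_count = 0
    for labels in responses:
        is_sensible = majority([s for s, _ in labels])
        # Assumption: a response that is not sensible cannot be specific.
        is_specific = is_sensible and majority([p for _, p in labels])
        sensible_count += is_sensible
        specific_count += is_specific
    n = len(responses)
    sensible_rate = sensible_count / n
    specific_rate = specific_count / n
    return sensible_rate, specific_rate, (sensible_rate + specific_rate) / 2


# Hypothetical labels for four responses, five evaluators each.
example = [
    [(True, True)] * 5,                           # sensible and specific
    [(True, False)] * 4 + [(True, True)],         # sensible but generic
    [(False, False)] * 3 + [(True, True)] * 2,    # not sensible
    [(True, True)] * 3 + [(True, False)] * 2,     # sensible and specific
]
print(ssa(example))  # (0.75, 0.5, 0.625)
```

In this toy sample, three of four responses are sensible and two of four are specific, so the combined score is 0.625.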
What’s exciting about SSA is that it correlates closely with perplexity, an automatic measure of how well a language model predicts text. Perplexity is the inverse of the probability the model assigns to a segment of text, normalized by the number of tokens. The lower the perplexity, the better the model predicts the text.
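To make the definition concrete, perplexity can be computed from the probability a model assigns to each token in a segment. The sketch below uses the standard formulation (the exponential of the average negative log-probability per token). The function name and the probabilities are illustrative values, not numbers from the Meena paper.

```python
import math


def perplexity(token_probs):
    """Perplexity of a text segment, given the model's probability for each token.

    Equivalent to the inverse of the geometric mean of the token probabilities.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among 4 options, so its perplexity is about 4.
confident = perplexity([0.9, 0.8, 0.95])
uncertain = perplexity([0.25, 0.25, 0.25, 0.25])
print(confident < uncertain)  # True
```

A model that is more confident about the text that actually occurs gets a lower perplexity, which is why lower is better.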
The authors of Meena showed that a higher SSA score corresponds to a lower perplexity. Since perplexity is much cheaper to calculate (it requires no human evaluators), we can train many versions of a chatbot and compare them by perplexity.
Meena used Neural Architecture Search (NAS) to find the best neural network. The type of neural network it uses is a transformer.
A quick review of the transformer
A transformer is a feed-forward neural network that applies attention to the elements of a sequential input. The earlier layers form the “encoder”, which encodes the input, and the later layers form the “decoder”, which generates the sequential output (see figure below).
Figure 2. A standard transformer
Each encoder layer has two sub-layers, as shown in Figure 3. The first sub-layer applies attention, and the second is a simple feed-forward layer. In addition, the output of each sub-layer is added to that sub-layer's input. This is called a residual connection, introduced in the ResNet (Residual Network) architecture. After the addition, the result is normalized before being sent on to the next layer. Layer normalization has been shown to speed up training.
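The data flow through one encoder layer can be sketched in plain Python. This is a simplified illustration, not the real implementation: it uses a single attention head with no learned query/key/value projections, omits biases and the learned scale and shift of layer normalization, and uses randomly generated feed-forward weights. All function names and values are made up for the example.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product attention where queries, keys and values are the inputs."""
    d = len(seq[0])
    out = []
    for q in seq:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        )
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out


def feed_forward(x, w1, w2):
    """Position-wise feed-forward sub-layer: linear, ReLU, linear."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]


def encoder_layer(seq, w1, w2):
    # Sub-layer 1: attention, then residual addition, then normalization.
    attended = self_attention(seq)
    normed = [
        layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(seq, attended)
    ]
    # Sub-layer 2: feed-forward, then residual addition, then normalization.
    out = []
    for x in normed:
        ff = feed_forward(x, w1, w2)
        out.append(layer_norm([a + b for a, b in zip(x, ff)]))
    return out


rng = random.Random(0)
d_model, d_hidden = 4, 8
w1 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
tokens = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -1.0, 2.0], [0.5, 0.5, 0.0, 1.0]]
output = encoder_layer(tokens, w1, w2)
print(len(output), len(output[0]))  # 3 4
```

The output has the same shape as the input (one vector per token), which is what allows encoder layers to be stacked.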
Figure 3. Inside each encoder layer of a transformer
The original transformer had 6 encoder layers and 6 decoder layers. Later researchers expanded the transformer to more layers: the BERT base model has 12 encoder layers, and the BERT large model has 24. Larger models have been shown to perform better, but the gains diminish as the model grows. The real question is: how many layers should a transformer have to achieve optimal performance?
The search for the best architecture (number of layers and required layer components) of a neural network is called neural architecture search.
Neural Architecture Search with Evolution
Introduced in late 2016, neural architecture search (NAS) is a systematic way of discovering the best configuration of a neural network for a given dataset. The configuration parameters include the number of layers, the number of nodes (or filters, for a convolutional neural network) at each layer, and the activation function.
There are two major approaches to architecture search. One uses reinforcement learning: a controller explores the configuration space by generating child networks and evaluating their performance on held-out validation data. (The controller's actions are the choices of configuration parameters, and the accuracy of the resulting child network is its reward.) The other approach is evolution (traditionally called a genetic algorithm). The system starts from random networks and evolves the architecture: through mutation and selection (choosing the best candidates), networks produce child networks that continue to evolve.
The basic evolution algorithm for NAS is as follows:
Table 1. The evolution algorithm for neural architecture search
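The mutate-and-select loop can be sketched in a few lines of Python. The version below follows the aging (regularized) evolution scheme of Real et al., listed in the references: keep a fixed-size population, repeatedly pick the best of a random sample as the parent, mutate it into a child, and retire the oldest member. It may differ in detail from the algorithm in Table 1. The search space (two layer counts between 1 and 16) and `toy_fitness` are hypothetical stand-ins; in a real search, the fitness would come from training each candidate network and measuring its quality.

```python
import collections
import random

MIN_LAYERS, MAX_LAYERS = 1, 16


def random_architecture(rng):
    return {
        "encoder_layers": rng.randint(MIN_LAYERS, MAX_LAYERS),
        "decoder_layers": rng.randint(MIN_LAYERS, MAX_LAYERS),
    }


def mutate(arch, rng):
    """Copy the parent and change one randomly chosen setting by +/- 1."""
    child = dict(arch)
    key = rng.choice(sorted(child))
    child[key] = min(MAX_LAYERS, max(MIN_LAYERS, child[key] + rng.choice([-1, 1])))
    return child


def toy_fitness(arch):
    # Placeholder score with an arbitrary optimum; NOT a real quality measure.
    return -abs(arch["encoder_layers"] - 3) - abs(arch["decoder_layers"] - 9)


def evolve(fitness, population_size=20, sample_size=5, cycles=200, seed=0):
    rng = random.Random(seed)
    population = collections.deque()
    history = []
    # Start from random architectures.
    while len(population) < population_size:
        arch = random_architecture(rng)
        individual = (fitness(arch), arch)
        population.append(individual)
        history.append(individual)
    for _ in range(cycles):
        # Selection: the best of a random sample becomes the parent.
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=lambda ind: ind[0])
        # Mutation: the child is a slightly changed copy of the parent.
        child_arch = mutate(parent[1], rng)
        child = (fitness(child_arch), child_arch)
        population.append(child)
        history.append(child)
        # Aging: remove the oldest individual, not the worst one.
        population.popleft()
    return max(history, key=lambda ind: ind[0])


best_score, best_arch = evolve(toy_fitness)
print(best_score, best_arch)
```

Removing the oldest member rather than the worst one keeps the population turning over, so an architecture survives only if its descendants keep scoring well.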
Meena adopted the evolution approach and applied it to find the best transformer. The search space of a transformer includes the number of encoder layers, the number of decoder layers, and the component choices inside each encoder or decoder layer. Meena ends up with 1 encoder layer, 13 decoder layers, and the following structure inside its encoder (shown on the right side):
The final Meena architecture contains 2.6 billion parameters, and its training took 30 days on a TPU v3 Pod (2,048 TPU cores).
Here is a sample dialog with a trained Meena (see below), which is surprisingly smooth and human-like.
This is the first successful attempt at building an end-to-end open-domain chatbot, and the results are very promising. With a solid evaluation measure that correlates with perplexity, future chatbot development could speed up.
The Meena team has not released its dataset or code. Questions remain for the AI community: How can we reproduce Meena’s results? How well would Meena do when adapted to specific domains?
With the rapid development of the AI field (new results are released on an almost weekly basis), we may see a super-human chatbot before long. Such a chatbot would hold sensible and specific conversations, and could fool a human into thinking it is a real person. By then, passing the Turing Test would be within our grasp. This is an exciting event that could happen within our lifetime.
Adiwardana, Daniel, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. “Towards a human-like open-domain chatbot.” arXiv preprint arXiv:2001.09977 (2020).
So, David R., Chen Liang, and Quoc V. Le. “The evolved transformer.” In ICML. 2019.
Real, Esteban, Alok Aggarwal, Yanping Huang, and Quoc V. Le. “Regularized evolution for image classifier architecture search.” In AAAI. 2018.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In Advances in neural information processing systems, pp. 5998–6008. 2017.
Zoph, Barret, and Quoc V. Le. “Neural architecture search with reinforcement learning.” In ICLR. 2017. arXiv preprint arXiv:1611.01578.