The tide is high for deep learning in human language understanding applications. This blog post is part 2 of a series – see part 1 for an overview of natural language processing and seminal work in the field.

The watershed moment

Deep learning – the area of machine learning that builds its models from deep neural networks – has revolutionized the way we think of machine learning problems. While its disruptive force initially hit the field of computer vision with the massive adoption of convolutional neural nets in the early 2010s, NLP got involved slightly later and with a more “cautious” approach: the introduction of embeddings.

When in doubt, embed

Embeddings are compact representations of word meaning, cleverly obtained as a by-product of training neural networks to solve a specific task. This task is typically predicting the missing word given a word sequence with a “blank”, or predicting the most likely surrounding words given a word. The resulting embedding – i.e. the hidden layer of the neural network – is essentially a condensed, optimal representation of the input’s meaning as determined by the context it has “seen”.
If you think of the one-hot vectors mentioned in part 1 of this post, these defeat the intuition of similarity given by context: the vector for president is as far away from the vector for chairman as it is from that for dinosaur.
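To make this concrete, here is a minimal sketch – using hypothetical toy vectors, not real trained embeddings – contrasting one-hot vectors, where every pair of distinct words is equally dissimilar, with dense vectors whose cosine similarity can reflect relatedness:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors: every word is equally far from every other word.
vocab = ["president", "chairman", "dinosaur"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["president"], one_hot["chairman"]))  # 0.0
print(cosine(one_hot["president"], one_hot["dinosaur"]))  # 0.0

# Toy dense "embeddings" (hypothetical values): similarity now reflects meaning.
dense = {
    "president": np.array([0.9, 0.8, 0.1]),
    "chairman":  np.array([0.85, 0.75, 0.15]),
    "dinosaur":  np.array([0.1, 0.0, 0.9]),
}
print(cosine(dense["president"], dense["chairman"]))   # high: related words
print(cosine(dense["president"], dense["dinosaur"]))   # much lower
```

With one-hot vectors the similarity structure is flat by construction; with dense vectors, relatedness can be encoded in the geometry.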
In contrast, embedding vectors are not only much smaller and denser: when visualized in the vector space, they have been found to effectively capture meaningful word associations such as male-female, verb tense, etc. – see Figure 1.
Figure 1 Interesting word associations captured by embeddings (credits)
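As an illustration of the kind of regularity shown in Figure 1, the following sketch performs the well-known king - man + woman ≈ queen analogy on hypothetical 2-dimensional vectors (real embeddings have hundreds of dimensions and are learned from data, not hand-picked):

```python
import numpy as np

# Hypothetical 2-d embeddings: axis 0 roughly encodes "royalty", axis 1 "gender".
E = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([2.0, 0.0]),
    "queen": np.array([2.0, 1.0]),
    "apple": np.array([0.0, 0.5]),
}

def analogy(a, b, c):
    """Return the word closest to vector(b) - vector(a) + vector(c), excluding the inputs."""
    target = E[b] - E[a] + E[c]
    candidates = [w for w in E if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(E[w] - target))

print(analogy("man", "king", "woman"))  # queen
```

The same vector-offset query, run against real pre-trained embeddings, is what produces the associations visualized in Figure 1.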
This perk comes from the fact that embeddings can be trained in an unsupervised fashion on very large datasets, so hidden meanings and associations have a chance to surface in the resulting representation.
Using word embeddings instead of one-hot vectors as the starting point of a machine learning task has proven extremely effective in most NLP problems, de facto replacing the need for complex feature engineering and providing elegant, fast solutions even for languages lacking pre-annotated linguistic resources – see Figure 2 and [1].
Sources like FastText make it possible to download ready-made embeddings trained on large datasets (e.g. Wikipedia, web pages) or to train custom ones, then integrate them conveniently within NLP pipelines.
Figure 2 Embeddings as a convenient starting point to carry out “leaner” NLP tasks based on neural nets (credits)

ELMo and BERT: from Sesame Street favorites to enablers of transfer learning in NLP

Come to think of it, computing embeddings and then training task-specific models is a bit like using deep learning only to identify edges in images instead of leveraging its full power as currently done in computer vision. There, large neural networks are trained on generic tasks such as ImageNet, whose goal is to classify an image into one of 1000 possible categories. As a result, they learn complex image features (shapes, patterns) and then only need to be slightly tuned by transferring the learning to a business-specific task (say, recognizing car models).

So, what is the ImageNet-task equivalent for NLP? It turns out that language modelling is a very fitting candidate, as predicting the next word in a sequence implies a good understanding of syntax and semantics. This intuition, paired with the adoption of specific neural network models borrowed from machine translation, paved the way in 2018 for language model embeddings. In contrast to classic embeddings, these are context-dependent representations that better account for polysemy, i.e. the many senses a word may have depending on its context. Such embeddings, creatively named ELMo and BERT after Sesame Street characters, have recently disrupted NLP practice, achieving excellent results in many core tasks [2,3] and benefitting the business of the organizations creating them.
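To give a feel for the language modelling task itself, here is a deliberately naive count-based bigram model over a toy corpus; ELMo- and BERT-style pre-training replaces such counts with deep neural networks trained on billions of words:

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical); real language models train on books, Wikipedia, the web.
corpus = "the president chaired the meeting . the chairman opened the meeting .".split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Most frequent next word given the previous one."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # "meeting" - the most frequent follower of "the" here
```

Even this crude model must absorb some regularities of the text to guess the next word; doing the same with a deep network forces it to learn syntax and semantics, which is exactly what makes language modelling a good pre-training task.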

Where this leaves us

If we take a look at the progress of word representation as summarized in the timeline of Figure 3, our ability to programmatically exploit a means to represent the meaning of words has come a long way during the last 60 years or so, from Firth’s maxim to the pervasiveness of deep learning.
Figure 3 A timeline of word meaning representation

From a research standpoint, it would seem that there is no way back: hardly any recent NLP paper reports a non-deep learning approach.
The new normal in the field is to carry out semi-supervised training on large amounts of text (from books, Wikipedia, etc.) and then adapt the resulting neural network to do supervised training for a specific task with a labeled dataset.
But what changes for high-street practitioners, many of whom may lack the know-how or infrastructure to train extremely deep neural networks? When is it worthwhile to adopt deep learning approaches in industrial applications?

One clear case is machine translation, as converting a sentence into another is a perfect match for a specific kind of neural network called sequence-to-sequence with attention [4]. Since these were introduced, their adoption has been sensational, as exemplified by Google’s current neural machine translation. Another example comes from the related AI field of automatic speech recognition: since generating the most likely word sequence given an input signal is a specialty of sequence-to-sequence neural networks, there is simply no alternative for achieving state-of-the-art results. For these applications, it would be sensible for a practitioner to subscribe to a solution from a market leader rather than attempt to train custom models.

Conversely, it is perhaps not (yet) necessary to relinquish classic approaches to text categorization in favor of deep learning methods: especially in situations where datasets are relatively small, methods like XGBoost or Support Vector Machines can do a surprisingly good job – in many cases paired with word embeddings as features, hence creating a “middle ground” between the classical NLP of part 1 of this post and the deeper NLP of Figure 2. This “shallow usage of deep learning” strikes a great balance between convenience, efficiency and effectiveness and is now deeply rooted in many industrial applications. The same goes for a number of information extraction tasks such as named entity recognition: here, conditional random fields trained on shallow linguistic representations – perhaps again in combination with embeddings – can still be considered state of the art.

Silvia Quarteroni, Principal Data Scientist – Industry Collaborations, Swiss Data Science Center
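The “middle ground” mentioned above – word embeddings as features for a classic classifier – can be sketched as follows, with hypothetical toy vectors and a nearest-centroid rule standing in for real pre-trained embeddings and a real classifier such as an SVM:

```python
import numpy as np

# Hypothetical word vectors; in practice these would be loaded from a source
# like FastText. Axis 0 loosely encodes "positive", axis 1 "negative" sentiment.
vectors = {
    "great": np.array([0.9, 0.1]), "excellent": np.array([0.8, 0.2]),
    "awful": np.array([0.1, 0.9]), "terrible":  np.array([0.2, 0.8]),
    "movie": np.array([0.5, 0.5]),
}

def doc_vector(text):
    """Represent a document as the average of its known word vectors."""
    words = [w for w in text.split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

# Nearest-centroid stand-in for a real classifier trained on labeled documents.
centroids = {"pos": doc_vector("great excellent"), "neg": doc_vector("awful terrible")}

def classify(text):
    v = doc_vector(text)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

print(classify("excellent movie"))  # pos
```

The averaged-embedding representation is what would be fed, in an industrial setting, to an SVM or XGBoost model – pre-trained deep features, shallow downstream learning.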

References (non-exhaustive)

  1. Mikolov et al., 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of ICLR, 1–12.
  2. Peters et al., 2018. Deep contextualized word representations. In Proceedings of NAACL, 2227–2237.
  3. Devlin et al., 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  4. Vaswani et al., 2017. Attention is All you Need. In Proceedings of NIPS, 6000–6010.