The tide is high for deep learning in human language understanding applications

This blog post is part 1 of a series dedicated to how the use of machine learning is evolving in the field of natural language processing.

Natural language processing, i.e. the automated processing of human language with computers, is certainly not a new discipline. Some date it back to 1950, when Alan Turing proposed his famous test, which a machine would pass by holding a convincingly “human” conversation; others trace it to the theory of distributional semantics: “You shall know a word by the company it keeps” – a 1957 quote by J.R. Firth – posits that to understand the meaning of a word, we must look at its context. However, the way we leverage the ever-increasing amount of text generated by humans (and sometimes bots!) on the internet, in the media and within organizations has radically evolved and accelerated during the last few years.

Let us think of some of today’s most innovative products and services: smart speakers, virtual assistants, automatic translators and contextual ad servers. Why is it only recently that these language-related technologies have taken off? Is it a coincidence that other fields of artificial intelligence such as computer vision have seen a surge of new applications (self-driving cars, face recognition) in the same timespan?

NLP: a primer

Natural language processing (NLP for short) is generally defined as the field of artificial intelligence that aims to understand and generate human – i.e. natural – language. It is no wonder that Turing regarded dialogue with a human as the pinnacle of AI complexity: to hold a natural conversation, a machine must overcome the intrinsic ambiguity of human language. Consider the Guardian headline “Mutilated body washes up on Rio beach to be used for Olympics beach volleyball” – cf. Figure 1. We easily understand its intended meaning because, as humans, we are very good at “guessing” missing context and estimating event probabilities. These abilities are key to understanding why NLP problems are tackled using machine learning, which is also based on probability estimation and inference given a limited context.

Figure 1 An ambiguous headline and its sarcastic reception on social media – source

Seminal work: vector spaces and one-hot vectors

As mentioned before, one foundational idea of NLP is that the meaning of a word is determined by the words around it, since similar words will appear in similar contexts.

To exploit this intuitive notion of similarity in practical (and lucrative) applications such as search engines, the community came up with the vector space model: a representation of words as vectors in a multi-dimensional space, which makes it easy to compute the similarity of two words as the dot product of their vectors. This representation, proposed in 1975 [1], is the basis of NLP applications even today.

Based on it, sentences and documents would be represented as unordered sets – or bags – of words, in turn converted to vectors. This made it possible to compute the similarity between question vectors and candidate answer vectors: given the question “When was Shakespeare born?”, a document containing “Shakespeare was born in 1564” would get a greater score than one containing “Blake was born in 1757” – cf. Figure 2.
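As a rough illustration of this scoring, here is a minimal Python sketch. It assumes raw word counts as vector weights and a vocabulary built from the three sentences themselves; the helper names (tokenize, bag_of_words, dot) are invented for this example and are not taken from [1].

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(text, vocabulary):
    """Map a text to a vector of word counts over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

query = "When was Shakespeare born?"
documents = ["Shakespeare was born in 1564", "Blake was born in 1757"]

# Assumption: the vocabulary is simply every token seen in the query and documents.
vocabulary = sorted({t for text in [query, *documents] for t in tokenize(text)})

query_vec = bag_of_words(query, vocabulary)
scores = {doc: dot(query_vec, bag_of_words(doc, vocabulary)) for doc in documents}

for doc, score in scores.items():
    print(f"{score}  {doc}")
```

With these toy weights, the Shakespeare document shares three words with the question and the Blake document only two, so the former ranks higher. Practical systems would typically down-weight very frequent words such as “was”, which this sketch does not attempt.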

Figure 2 Vectors representing two documents and a query in a simplified 2D vector space. The distance between the query and each document can be used as an approximation of how well each document answers the query – Wikipedia

What has been evolving over time is how to obtain those vectors. Initially, words would be represented as “one-hot” vectors, i.e. vectors defined in a space containing as many dimensions as words in the vocabulary of reference and exhibiting a one at the position corresponding to the word, zeros elsewhere.

“Bags” of words such as sentences and documents would be “averages” of such word vectors, featuring non-zero values at positions where the corresponding words were present. Of course, this representation is anything but efficient, as the resulting vectors are very large (there are thousands of words in a vocabulary) and sparse (i.e. mainly made of zeros). However, it worked extremely well in practice and – with small variations in how vector dimensions and weights were chosen – made it possible to reach very convincing results in NLP applications during the 1990s and 2000s.
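The one-hot and “average” representations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the seven-word vocabulary is made up for the example, and the function names are hypothetical.

```python
def one_hot(word, vocabulary):
    """Return a vector with a 1 at the word's position and 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocabulary]

def average_vector(words, vocabulary):
    """Average the one-hot vectors of a bag of words, dimension by dimension."""
    vectors = [one_hot(w, vocabulary) for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Assumption: a toy vocabulary; a realistic one would hold thousands of words.
vocabulary = ["1564", "1757", "blake", "born", "in", "shakespeare", "was"]
sentence = ["shakespeare", "was", "born", "in", "1564"]

word_vec = one_hot("blake", vocabulary)
sentence_vec = average_vector(sentence, vocabulary)

# Share of dimensions that are zero in the sentence vector.
sparsity = sentence_vec.count(0.0) / len(sentence_vec)
print(word_vec, sentence_vec, sparsity)
```

With only seven dimensions the sentence vector is mostly non-zero; with a vocabulary of thousands of words, the same five-word sentence would leave almost every dimension at zero, which is the sparsity the text refers to.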


Three core tasks, dozens of end-user applications 

Enumerating all NLP applications is not easy, partly because the field overlaps with other domains of artificial intelligence. When analyzed carefully, however, most such applications are combinations of one or more of the following basic tasks:





Classification: determining the type or class of a unit of text. Examples: sentiment analysis, news categorization by topic, spam filtering.

Extraction: analyzing the content of a text to extract entities and relationships of interest. Examples: named entity recognition (i.e. the identification of people, places and organizations), topic extraction.

Generation: generating a sequence of tokens (typically, words) based on a context (e.g. text or speech). Examples: machine translation, document summarization.

Table 1 Basic NLP tasks


Consider an IT helpdesk virtual assistant: its approach is to classify the user’s first message into one of the issues it knows about, then to guide her to a solution by asking questions and extracting useful concepts from the ensuing answers (broken printer, Ubuntu, since yesterday). Other noteworthy NLP applications include:

  1. Spam filters, sentiment analysis, automated document workflows, contact center email and call routers, resulting from classification;
  2. Question answering, search engines and automatic FAQs, resulting from a combination of classification and extraction;
  3. Metadata extraction, news tagging and contextual ad serving, resulting from a combination of extractors (named entity recognizers, topic extractors);
  4. Automatic translators, resulting from generation.
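To make the combination of tasks more concrete, the helpdesk assistant mentioned earlier can be caricatured as a classifier followed by an extractor. The Python sketch below is purely illustrative: the issue classes, keyword lists and the time pattern are all invented for this example, and simple keyword matching stands in for the trained models a real assistant would use.

```python
import re

# Hypothetical issue classes and trigger keywords, invented for illustration.
ISSUE_KEYWORDS = {
    "printer": {"printer", "print", "toner"},
    "network": {"wifi", "network", "vpn"},
    "password": {"password", "login", "locked"},
}

# Hypothetical lexicon and pattern; a real system would use trained extractors.
OPERATING_SYSTEMS = {"ubuntu", "windows", "macos"}
TIME_PATTERN = re.compile(r"\bsince (yesterday|today|this morning|last week)\b")

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def classify(message):
    """Classification: pick the issue whose keywords overlap most with the message."""
    tokens = set(tokenize(message))
    scores = {issue: len(tokens & kw) for issue, kw in ISSUE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def extract(message):
    """Extraction: pull out the operating system and a time expression, if any."""
    lowered = message.lower()
    os_found = [t for t in tokenize(lowered) if t in OPERATING_SYSTEMS]
    time_match = TIME_PATTERN.search(lowered)
    return {
        "os": os_found[0] if os_found else None,
        "since": time_match.group(1) if time_match else None,
    }

issue = classify("My printer is broken")
details = extract("I am on Ubuntu and it has not worked since yesterday")
print(issue, details)
```

The point of the sketch is the division of labour: the first message is routed to a known issue, and later answers are mined for the concepts needed to solve it.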

Feature-rich approaches

To achieve this variety of applications, research and industrial practice until a few years ago focused on “feature-rich” approaches: each task involved the creation of dedicated supervised machine learning models, usually trained on annotated datasets from task-specific challenges.

These datasets and challenges were generally promoted by government or academic entities (TREC-QA for question-answering, CoNLL for named entity recognition [2], etc.). Training such models required intensive pre-processing pipelines that aimed to normalize text and extract its most useful characteristics – or features – for the task at hand (see Figure 3). For instance, a sentiment analysis task would call for pre-processors able to extract named entities, syntax trees and similar constructs from reviews to use them as features for the model to learn from.

Figure 3 Classical approach to NLP: language detection, intense pre-processing, feature extraction and modelling deployed on a task-specific basis (credits)
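A heavily simplified Python sketch of such a pipeline for sentiment analysis is given below. It swaps the named entities and syntax trees mentioned above for much simpler hand-crafted features (lexicon counts, negations, punctuation), and its tiny lexicons and feature names are assumptions made up for this example.

```python
import re

# Tiny hypothetical lexicons; real systems relied on much larger resources.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}
NEGATIONS = {"not", "no", "never"}

def normalize(text):
    """Pre-processing: lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def tokenize(text):
    """Split normalized text into word tokens."""
    return re.findall(r"[a-z']+", text)

def extract_features(review):
    """Feature extraction: turn a review into hand-crafted numeric features."""
    tokens = tokenize(normalize(review))
    return {
        "n_tokens": len(tokens),
        "n_positive": sum(t in POSITIVE for t in tokens),
        "n_negative": sum(t in NEGATIVE for t in tokens),
        "n_negations": sum(t in NEGATIONS for t in tokens),
        "n_exclamations": review.count("!"),
    }

features = extract_features("The screen is GREAT, but the battery is not good!")
print(features)
```

A supervised model would then be trained on such feature dictionaries paired with annotated sentiment labels. Every feature here was chosen by hand for this one task, which is the customization effort the feature-rich approach implies.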

This intensive preparation and task customization approach became old news in the late 2000s with the flourishing of deep learning – see part 2 of this post.


Silvia Quarteroni, Principal Data Scientist – Industry Collaborations, Swiss Data Science Center




References (non-exhaustive)

  1. Salton et al., 1975. A vector space model for automatic indexing. Communications of the ACM, Vol. 18 (11), 613-620.
  2. Tjong Kim Sang, E. F. and De Meulder, F., 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of CoNLL-2003, 142-147.