Embed your items, not just words!
Following our previous posts on recent progress in Natural Language Processing, we discuss a follow-up idea: can we extend the concept of word embeddings to any collection of items, possibly unordered? More precisely, can we learn representations from item sets, such as product baskets in online retail or music playlists on streaming platforms? As we will see, the answer is yes: representation learning can also be applied to such datasets.
While dense – i.e. compact and meaningful – representations for words have been around for two decades, two major works have drastically changed how we define and train them. In 2013, Mikolov et al. proposed word2vec, an efficient training procedure that approximates the objective without compromising quality. In 2018, “deeper” approaches were developed, including BERT, a transformer-based language model.
In this post, we focus on word2vec and the whole family of methods that it inspired thanks to its simplicity. Each new method builds on top of the same base approach, tackling different challenges. To cite only a few:
- Bojanowski et al. propose fastText, which also models sub-word information; this is particularly useful to handle words unseen during training, such as out-of-vocabulary colloquial expressions (“faaaabulousss”!) and typos found on social media;
- Iacobacci, Pilehvar and Navigli propose SensEmbed, in order to handle polysemy, the notion that a single word may relate to different concepts (e.g. bank means both the border of a river and the financial institution);
- Smith et al. show that multiple mono-lingual embedding sets can be aligned – one of the many ways to address multi-lingual scenarios, especially when multi-lingual dictionaries are expensive;
- Nickel and Kiela use a different distance metric, in order to capture hierarchical structures in the language, which can help generate taxonomies semi-automatically.
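To make the sub-word idea concrete, here is a minimal sketch of boundary-marked character trigrams; the function is hypothetical illustration code, not fastText’s actual implementation:

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, as in fastText's sub-word idea."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# An unseen, noisy variant still shares most of its n-grams
# with the canonical word, so it can reuse their vectors.
shared = set(char_ngrams("fabulous")) & set(char_ngrams("faaaabulousss"))
```

Because the representation of a word is the sum of its n-gram vectors, even a word never seen during training gets a sensible embedding from the n-grams it shares with known words.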
Now that we have an overview of common challenges, let us focus on our objective: training item embeddings from unordered collections. We will explore how word2vec can be extended and applied to question tags.
In its original form, word2vec handles words in an ordered manner, as language typically relies on the location of its building blocks. While the method does not model complex linguistic relationships (e.g. how a negation affects the upcoming statements), it does consider distances: two words will influence each other only up to a certain point. This field-of-view, typically referred to as a window, is key to handling long sentences where the meaning may evolve.
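As an illustration, the windowed field-of-view can be sketched as follows; the function and example sentence are hypothetical, not word2vec’s actual code:

```python
def window_pairs(tokens, window=2):
    """Yield (center, context) pairs, limited to a fixed field-of-view."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

pairs = list(window_pairs(["the", "cat", "sat", "on", "the", "mat"]))
```

With a window of 2, “cat” and “mat” never form a pair: words beyond the field-of-view do not influence each other.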
However, when generalizing to non-textual data, ordering and windowing may not be relevant, or not even properly defined. For instance, while grocery purchases may be registered in a specific order during checkout, such order is unlikely to convey any specific meaning¹. In the representation of movie preferences, where each user may be “seen” as a collection of viewed content, order might be relevant in the long run (taste does change), but we may get a more comprehensive representation by ignoring it – indeed, it is unlikely that past taste will become completely irrelevant².
Therefore, we argue in favor of item embeddings trained without any limitation on the field-of-view. This is equivalent to defining an infinite window that covers the entire item sequence and does not apply any attenuation over the distance. To make this more concrete, we recently proposed itembed as an approach to the creation of dense vectors for any collection of items. Itembed is a pure Python implementation of a word2vec variant for itemsets. It adapts concepts from previous approaches to item representation learning, such as item2vec  (developed at Tel Aviv University and Microsoft) and StarSpace  (developed at Facebook), both of which handle discrete labels³.
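Assuming itemsets are stored as plain Python lists, the infinite window amounts to enumerating every unordered pair in a set – a sketch of the idea, not itembed’s actual API:

```python
from itertools import combinations

def itemset_pairs(items):
    """All unordered co-occurrence pairs: an 'infinite' window, no attenuation."""
    return list(combinations(items, 2))

# Hypothetical grocery basket; checkout order carries no meaning.
basket = ["milk", "bread", "eggs", "butter"]
pairs = itemset_pairs(basket)
```

Note that a set of n items yields n(n-1)/2 pairs, which is why long sequences must not be overweighted (see footnote 3).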
A practical example
At the Swiss Data Science Center, we use itembed in a number of industrial contexts such as online retail and fragrance & flavor design; the resulting embeddings represent various kinds of items (jewelry, timepieces, chemical formulas). These dense item representations are useful to tackle higher-level tasks, including recommending relevant next items to view in e-commerce or searching for similar items in a catalog.
In this post, we illustrate itembed with a more generic example. We will base our item embeddings on data from Stack Overflow, a community-based question & answer website. Every day, “askers” post “questions” to get help on a wide range of technical areas. To help experts identify questions matching their competences, a set of tags can be specified (see Figure 1).
Figure 1: On Stack Overflow, users ask questions to the community. To increase visibility, both a short and a long version are available for each question, and tags are used to highlight and filter concepts. An upvote and reputation system regulates content quality on the whole platform.
Unlike taxonomy terms, tag values are not fixed by a central authority, hence they can be diverse and their number usually increases over time – dedicated tags may emerge to account for one-time events.
Regardless of the medium, askers need to maximize the visibility of their post, and fine-grained categorization tags have an obvious role to play in the information filtering process. When choosing tags, askers effectively deal with a trade-off between precision and coverage, typically solved by using both broad and narrow tags that are recognized by the community – a practice incentivizing tag co-optation.
In this experiment, we seek to obtain a dense representation of such tags. As is customary when building embeddings, we exploit the idea that a (key)word is defined by its surrounding context. While we could argue about the relevance of the order of tags in a question (e.g. is it from broader to narrower?), we follow the simple assumption that order (and therefore distance) does not matter in a set of tags.
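To make the training procedure concrete, here is a minimal, self-contained sketch of skip-gram with negative sampling over unordered itemsets. The toy tag sets and hyperparameters are purely illustrative, and this is a simplified re-implementation of the underlying idea, not itembed’s actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tag sets (hypothetical data, not the actual Stack Overflow dump).
itemsets = [
    ["python", "pandas", "dataframe"],
    ["python", "numpy", "array"],
    ["java", "swing", "gui"],
    ["java", "javafx", "gui"],
] * 50

vocab = sorted({tag for s in itemsets for tag in s})
index = {tag: i for i, tag in enumerate(vocab)}
dim, lr, n_negatives = 16, 0.05, 3

# Two embedding matrices, as in word2vec ("input" and "output" vectors).
syn0 = rng.normal(0.0, 0.1, (len(vocab), dim))
syn1 = rng.normal(0.0, 0.1, (len(vocab), dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(20):
    for itemset in itemsets:
        ids = [index[tag] for tag in itemset]
        for a in ids:
            for b in ids:  # infinite window: every pair in the set interacts
                if a == b:
                    continue
                # One positive target, plus a few random negative samples.
                targets = [(b, 1.0)]
                targets += [(int(rng.integers(len(vocab))), 0.0)
                            for _ in range(n_negatives)]
                for c, label in targets:
                    gradient = lr * (label - sigmoid(syn0[a] @ syn1[c]))
                    delta = gradient * syn1[c]
                    syn1[c] += gradient * syn0[a]
                    syn0[a] += delta

def similarity(tag_a, tag_b):
    """Cosine similarity between two trained tag vectors."""
    u, v = syn0[index[tag_a]], syn0[index[tag_b]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

After a few epochs, tags that share questions in the toy data (e.g. python and pandas) should end up closer in the embedding space than tags that never co-occur.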
As a picture is worth a thousand words, Figure 2 depicts the learnt embedding space for question tags on Stack Overflow. The first one million questions (by identifier) with at least four tags were extracted⁴. Itembed was then applied to train an embedding space.
Figure 2: 2-dimensional projection (using UMAP ) of the 64-dimensional embedding space, trained by itembed on Stack Exchange question tags. Distinct clusters are visible, which are associated to separate topics.
A quick glance at the latent space in Figure 2 reveals notable regions, such as:
- Java UI components: this isolated cluster is very specific, as it covers user interface development in Java (e.g. Swing, JavaFX).
- Project management: also pretty self-contained (but not too far from logging systems), code versioning (e.g. Git) is loosely related to project management theory.
- Low-level programming: part of a dense network, many low-level concepts (from an operating system point-of-view) are laid out together, including compilers, assembly code, and exploits.
To summarize, resorting to a dense representation seems to overcome the sparsity of individual tags and suggests a meaningful way to represent which questions “fit together”, based on their askers’ tags of choice.
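As an example of such a downstream use, and assuming a trained matrix of tag vectors is at hand, nearest neighbors can be queried by cosine similarity; the tags and 2-dimensional vectors below are hand-made for illustration (real embeddings would be 64-dimensional):

```python
import numpy as np

def nearest_tags(query, vocab, vectors, k=3):
    """Return the k tags closest to `query` by cosine similarity."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = normed[vocab.index(query)]
    scores = normed @ q
    order = np.argsort(-scores)
    return [vocab[i] for i in order if vocab[i] != query][:k]

# Hand-made toy vectors: the Java cluster vs. the Python cluster.
vocab = ["java", "swing", "python", "numpy"]
vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
```

Such a lookup is the building block behind “similar question” suggestions or tag auto-completion.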
In this post, we have discussed how co-occurrences can be leveraged to learn dense representations of unordered items. Collections of discrete items are widespread, from chemical recipes to product baskets. We invite you to explore with your own data, by applying one of the many word2vec variants.
Shallow word embeddings have proven very effective in practice, especially compared to sparse representations that do not share information between symbols. However, one should not forget the relative simplicity of this model, which cannot capture complex non-linear behaviors; the composition of molecules, for instance, is well known to be non-trivial. Still, embedding vectors are likely to exhibit useful properties and allow downstream tasks to generalize with fewer examples.
In this quest for representation learning, unannotated data should be leveraged as much as possible. While we have showcased a single population of symbols, namely question tags, the approach can be extended to multiple vocabularies. This is akin to multi-task learning, where multiple embedding sets are trained jointly, for instance to represent both customers and products.
In conclusion, we argue that the word2vec family can harvest low-hanging fruits from many collection-based datasets. Our proposed solution, itembed, is an easy-to-use and flexible implementation in Python, which takes inspiration from item2vec and StarSpace.
³ For example, care must be taken not to overweight long sequences, as the number of pairs increases quadratically with the length. In item2vec, the effect of a single pair is down-weighted, while itembed applies down-sampling to pairs. ↩︎
 Mikolov et al., 2013, “Efficient Estimation of Word Representations in Vector Space”, arXiv preprint arXiv:1301.3781. ↩︎
 Devlin et al., 2018, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, arXiv preprint arXiv:1810.04805. ↩︎
 Barkan and Koenigstein, 2016, “Item2Vec: Neural Item Embedding for Collaborative Filtering”, IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). ↩︎
 McInnes, Healy and Melville, 2018, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction”, arXiv e-prints 1802.03426. ↩︎
 Iacobacci, Pilehvar and Navigli, 2015, “SensEmbed: Learning Sense Embeddings for Word and Relational Similarity”, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). ↩︎
 Smith et al., 2017, “Offline bilingual word vectors, orthogonal transformations and the inverted softmax”, arXiv preprint arXiv:1702.03859. ↩︎
 Nickel and Kiela, 2017, “Poincaré embeddings for learning hierarchical representations”, arXiv preprint arXiv:1705.08039. ↩︎