We have worked on many Natural Language Processing (NLP) projects in which textual data served either as the primary data source or as an alternative one. In this blog post we focus on one particular NLP technique, text clustering, and on how this unsupervised method provides valuable insights about your data without having to label it first. The post has been written in collaboration with our NLP experts Luiza Sayfullina (PhD in NLP) and Jaakko Vainio (PhD in Theoretical and Mathematical Physics).
Textual data is everywhere: contracts, news, company filings, social media
Textual data has been underutilized for a long time. In most cases this type of data has required manual labour and therefore it has been slow and expensive to use. However, the development of advanced AI methods, notably NLP, has made this data a frequently used element in machine learning projects.
One of the many techniques used today is text clustering. Clustering means grouping objects into different sets, or “clusters”. In this article we focus on text, but the objects can also be images, audio, numerical data or mixed data. The idea is to categorize objects according to their type or characteristics so that objects within a cluster are more closely related to one another than to objects in a different cluster. In other words, pieces of text that are similar to each other should fall into the same cluster.
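To make "similar to each other" concrete, here is a minimal sketch of one common way to compare two texts: cosine similarity between their word-count vectors. The example sentences and the function names are our own illustrative assumptions. Real projects usually rely on richer representations such as TF-IDF weights or embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared_words)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is not similar to anything
    return dot / (norm_a * norm_b)


# Hypothetical example texts: two about invoices, one about a login problem.
invoice_1 = "Invoice payment is overdue"
invoice_2 = "Please send the overdue invoice payment"
login = "Cannot log in to my account"

same_topic = cosine_similarity(invoice_1, invoice_2)
different_topic = cosine_similarity(invoice_1, login)
print(same_topic > different_topic)  # True
```

Texts that share many words score close to 1, while texts with no words in common score 0. A clustering method then tries to put high-scoring pairs into the same cluster.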
Clustering is not to be confused with classification. Even though clustering can support classification-style tasks such as anomaly detection, the approach is quite different. Text classification, often done with convolutional or recurrent neural networks, is a supervised learning method in which the model learns from examples and their labels. Clustering, however, is an unsupervised method: you don’t need labels, because the model learns “without a teacher”. In a sense the data speaks for itself, which makes it possible to pick any set of data samples and learn directly from them.
What can we do with text clustering: typical use cases
As described above, clustering is an intuitive way of describing data: we try to identify regions with a high density of points and use them to group objects or data samples. One way to get value from clustering is to group the data you have and then use these groups as a basis for different scenarios.
For instance, your customer service department can save time by identifying key clusters (in other words, key topics or areas of concern) in the customer tickets and forwarding each ticket directly to the corresponding customer service agent. The same approach can be applied to almost any business function, from sales, invoice sorting and recruiting to analyzing long text documents and thousands of emails.
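As an illustration of the ticket example, the sketch below groups a few made-up tickets without any labels. It uses a deliberately simple "leader" clustering: each ticket joins the first cluster whose first member is similar enough, otherwise it starts a new cluster. The tickets, the function names and the 0.3 threshold are assumptions chosen for this example. In practice you would more likely use an established algorithm such as k-means or a density-based method.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def leader_clustering(texts, threshold=0.3):
    """Assign each text to the first cluster whose leader is similar enough."""
    leaders = []  # vector of the first text seen in each cluster
    labels = []
    for text in texts:
        vec = vectorize(text)
        for cluster_id, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing cluster is close enough: start a new one.
            leaders.append(vec)
            labels.append(len(leaders) - 1)
    return labels


# Hypothetical customer tickets.
tickets = [
    "invoice payment failed",
    "cannot reset my password",
    "payment for invoice was charged twice",
    "password reset link expired",
]
labels = leader_clustering(tickets)
print(labels)  # [0, 1, 0, 1]
```

Tickets with the same label could then be routed to the same team. Note that the result depends on the threshold and on the order of the tickets, which is one reason why more robust algorithms are preferred for real data.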
However, the clusters themselves do not describe their contents, so it is not obvious what kinds of objects a cluster represents. To understand the nature of a cluster you need to look at several samples from it. In the case of customer service, we could use clustering to discover various topics, but the topic names or descriptions would have to be assigned to each cluster manually or semi-manually.
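One semi-manual aid is to list the most frequent words in each cluster and let a person choose the topic name from them. The sketch below assumes that cluster labels have already been produced by some clustering step. The tickets, the labels and the tiny stopword list are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "my", "is", "for", "was", "a", "to"}


def top_terms(texts, labels, n_terms=2):
    """Return the most frequent non-stopword terms for each cluster label."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        words = re.findall(r"[a-z]+", text.lower())
        counts[label].update(w for w in words if w not in STOPWORDS)
    return {
        label: [word for word, _ in counter.most_common(n_terms)]
        for label, counter in counts.items()
    }


# Hypothetical tickets with cluster labels assumed to come from a clustering step.
tickets = [
    "invoice payment failed",
    "cannot reset my password",
    "payment for invoice was charged twice",
    "password reset link expired",
]
labels = [0, 1, 0, 1]

summary = top_terms(tickets, labels)
for label, terms in sorted(summary.items()):
    print(label, terms)
```

Here the frequent terms point to a billing topic and a password topic, but a person still decides what each cluster is called.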
Let’s explore some typical use cases for clustering.
1. Client segmentation
A classical use case for clustering is client segmentation according to either a narrow measure (type of products they prefer to buy) or broader criteria (the sum of demographic characteristics). After clustering the data into different client segments, it is easier to develop different strategies for them and customize the offering accordingly.
2. Finding anomalies without labeling in advance
Another important application is finding anomalies in the data without having any labels in advance. Candidates for data anomalies are typically data samples that fall outside of the main clusters. One example of this approach could be finding anomalous transactions from customer transaction data.
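A very simple version of this idea is to treat samples that end up in very small clusters as anomaly candidates. The sketch below assumes the cluster labels already exist. The labels and the `min_cluster_size` parameter are hypothetical values chosen for illustration, and any flagged sample would still need a human review.

```python
from collections import Counter


def flag_anomaly_candidates(labels, min_cluster_size=2):
    """Return indices of samples whose cluster is smaller than min_cluster_size."""
    sizes = Counter(labels)
    return [i for i, label in enumerate(labels)
            if sizes[label] < min_cluster_size]


# Hypothetical cluster labels for seven transactions.
labels = [0, 0, 1, 0, 1, 2, 1]
candidates = flag_anomaly_candidates(labels)
print(candidates)  # [5]
```

Density-based algorithms follow a related principle and mark points that do not belong to any dense region as noise.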
3. Data exploration and visualization
Clustering is also a good tool for data exploration and visualization. For example, instead of going through all the samples, you could look at only a few representative samples from each cluster. Since samples within the same cluster are assumed to be similar, exploring a handful of them can be enough. The key representative samples from a dataset can also serve as the basis of a template, for example for contract drafting.
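One possible way to choose a representative sample is to take, for each cluster, the member that is most similar to all the other members (often called a medoid). The sketch below assumes existing cluster labels and uses word-count cosine similarity. The texts, the labels and the function names are illustrative assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def representatives(texts, labels):
    """Pick, per cluster, the text with the highest total similarity to its cluster."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[label].append(text)
    result = {}
    for label, members in groups.items():
        vectors = [vectorize(t) for t in members]
        scores = [sum(cosine(v, other) for other in vectors) for v in vectors]
        result[label] = members[scores.index(max(scores))]
    return result


# Hypothetical texts with cluster labels assumed to come from a clustering step.
texts = [
    "invoice payment failed again",
    "invoice payment failed",
    "payment failed",
    "password reset link expired",
]
labels = [0, 0, 0, 1]

reps = representatives(texts, labels)
print(reps[0])  # invoice payment failed
```

Reading one such sample per cluster gives a quick overview of the dataset before looking at any cluster in more detail.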
Utilize AI in the right places
As we’ve seen, clustering is a great way to gain a better understanding of your data and to get additional information for data classification purposes.
However, the results of clustering depend on the distance metric used between the data samples and on the number of clusters chosen. Identifying the right number of clusters can be challenging, and a poor choice can group together objects that should have been in different clusters. In that case the clusters need to be refined and some objects regrouped.
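The effect of the distance metric can be seen in a small sketch. The vectors below are hypothetical word counts for the words ("invoice", "payment"): a short document, a long document on the same topic, and a document that mentions only invoices. Euclidean distance and cosine distance disagree about which document is closest to the short one.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; ignores vector length, compares direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return 1.0 - dot / norm


def nearest(query, candidates, distance):
    """Name of the candidate vector closest to the query under the given distance."""
    return min(candidates, key=lambda name: distance(query, candidates[name]))


# Hypothetical word counts for ("invoice", "payment").
short_doc = (1, 1)
candidates = {
    "long_doc": (10, 10),  # same word mix as short_doc, just much longer
    "other_doc": (2, 0),   # similar length, different word mix
}

by_euclidean = nearest(short_doc, candidates, euclidean)
by_cosine = nearest(short_doc, candidates, cosine_distance)
print(by_euclidean)  # other_doc
print(by_cosine)     # long_doc
```

Because cosine distance ignores document length, it is a common choice for text. Either way, the metric should be chosen deliberately, since it changes which samples end up in the same cluster.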
As with any machine learning solution, it is important to keep in mind that these technologies and models are not valuable per se. The value comes from how they are used. By utilizing AI in the right place, with a clear, measurable and impactful goal, you can reach real, long-lasting improvements in your business.
If you would like to discuss how clustering can help your business, get in touch with our NLP specialist Jaakko Vainio, email@example.com.