Using NLP to Find “Interesting” Collections of Hotels

Craig Schmidt posted October 5, 2015

There are a lot of hotels on TripAdvisor. At the moment, there are 1790 hotels listed for Paris, 1054 hotels for London, and 466 hotels in New York City. We have been working on better ways to explore these hotels, and find an interesting place to stay. In this blog post, I’ll describe an upcoming feature on TripAdvisor, that uses Natural Language Processing (NLP) to find groups of hotels that have an interesting theme. It is a semi-automated process that can be scaled across many cities. There is some interesting technology behind it.

For New York City, we might show a list of collections like this…


We have collections for “Times Square Views”, “Trendy Soho”, “Catch a Show”, and “Art Deco Classic”, among others. Each of those is a group of about 10 hotels that fit the theme. How can we automate finding these interesting groups, even for cities we’re not familiar with?

Topic Modeling is Boring

The most textbook approach might be to build a topic model based on the review text for the hotels. I first tried a Latent Dirichlet Allocation (LDA) model. It assumes that each review is a mixture of a number of latent topics. You can represent each review as a vector with a “bag of words” representation of the review text.

I tried building an LDA, model, but got very boring results. Here are some of the first topics for New York, which a mix of a bunch of general hotel related topics.

Single words just don’t have the richness to represent interesting topics. What about ‘breakfast’? Was it good, bad, free, or wonderful? Most hotel review mix up talking about the room, the location, the staff and more all at once,and these topics show that mixing. But we want to just focus on a single notable aspects of a hotel, not summarize it completely.

Phrases as raw material

Since single words aren’t very expressive, I wanted to find some good phrases in our review text. What is a phrase, exactly? In this case, we mean an n-gram of words that has a specfic semantic meaning. We tried the approach from the paper “Mining Quality Phrases from Massive Text Corpora”, by Liu et. al. at the Univeristy of Illinois and Microsoft (some interesting slides). Because they released an open source implementation of their SegPhrase algorithm, we were able to try it out.

It is a multiple step process, but the output is that you get a large set of phrases, and a score on how likely it is to be a phrase. We dubbed this score the “phrasiness” metric. So what are the strongest phrases in New York?

The number shown is the phrasiness score. All of these are clear semantic concepts, not just a sequence of words that frequently co-occur.

This is a list of “phrases” with much lower “phrasiness” scores:

You can imagine that these phrases frequently occur in reviews, but don’t really represent a single concept.

We used a threshold of 0.5 for phrasiness, which reduced the approximately 250 thousand potential phrases for New York to around 35 thousand. While people often talk about air_conditioning in a review, and it is clearly a phrase, that’s not the kind of aspirational topic that people will want to browse through. Importantly, there are plenty of phrases that have more interesting emotional content, like art_deco, fairy_tale, andradio_city_music_hall. So we have the raw material we need to find collections. In fact, we may have too much raw material, as there are tens of thousands of a good phrases just for New York hotels.

word2vec Party Tricks

We have a bunch of phrases, and many of them are very similar. These phrases …

seem like a great basis for making a hotel collection. But they are mixed in with tens of thousands of other phrases. It would be very helpful to cluster the related phrases into groups. But what features should we use to do the clustering? We need a similarity metric that puts these phrases close to each other.

The word2vec algorithm from Mikolov and other Google researchers is perfect for this case. Given a corpus of text, it maps each word into a numerical vector of say 100 dimensions. The mapping is done in a way that put words that are used in a similar way in the corpus are near each other in the vector space.

The papers on word2vec give lots of party tricks you can do by adding and subtracting the vectors. For example vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’), and vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to the vector(‘queen’).

I took all 9 billion words of review text at TripAdvisor, and build a word2vec model with around 1 million unique words. I used the C implementation of word2vec which can be easily run from the command line.

Here are some example words, and their nearest neighbors:

The numbers in the table are the cosine similarity of the top word with the near neighbors. We can see that word2vec is doing a nice job of mapping related words to be near each other.

For our short phrases, we just added the individual word vectors to create a phrase vector, ignoring stop words. So now we have a numerical vector of features for each phrase, and we can use any clustering technique.

With word2vec, people usually use the cosine distance, where a larger value is better. However, in clustering we usually work with a distance, where smaller is better. If we normalize our word2vec vectors so they have a unit norm, then we can just minimize the Euclidean distance, as this is equivalent to maximizing the cosine similarity (see the Properties section here for the simple derivation). So going forward, we will use the Euclidean distance of the two normalized word2vec phrase vectors as our distance metric.

Clustering phrases into tight topics

Usually, in a clustering problem we are interested in all of the clusters. Here, we have a slightly different situation. We want to group similar keywords into very focused groups, like our “view of central park” phrases, above. A small fraction of the clusters, say 100 or so, will be the basis for our collections. The rest of the clusters will be focused on more boring topics like air conditioning, and we’ll just ignore those.

What clustering technique will do a nice job of giving us those tight clusters? The default k-means approach for a fairly small k isn’t probably going to work too well. It will produce broad groups, and concentrate on getting each phrase into a good cluster, even though we don’t really care about most of the phrases.

I chose the agglomerative clustering routine in the python sklearn library to do the clustering. The “complete linkage” option will try the hardest to keep each cluster pure.

Agglomerative clustering starts with each phrase in its own cluster, so we have N clusters of size 1. It then goes through and tries to combine each pair of current clusters into a new combined one. With complete linkage, the score of the combined cluster is the maximum of the distances between every pair of phrases in the combined cluster. The two clusters with the smallest maximum score after combination are actually combined, leaving us with N-1 clusters. This process is repeated until we have 2 clusters left.

The overall combination process forms a tree of nodes. At internal nodes, a cluster is composed of the phrases that are leaves above the node in the tree. If we pick a threshold score, we can prune off subtrees where every cluster has a score better than the threshold. We only need to make the agglomerative tree once, and then we can quickly experiment with different thresholds.

Let’s look at an example for our New York city hotel data. I chose 18 phrases, such that they formed some natural clusters. There are 4 keywords about the view, 6 about being comfortable, 2 about the wi-fi, 1 about breakfast, 3 about friendly staff, and 2 about feeling safe. With luck, the phrases should be clustered into these natural clusters at some point. Here is the clustering, with the natural clusters shown as different colors.


The number shown in internal nodes provide the order of the merge, where smaller numbers were merged first, and hence more similar. The leaf nodes were given numbers 0 to 17, and so internal node 18 was the first combined cluster with comfortable_beds_and_pillows and comfy_bed_and_pillows. The next node 19 combined friendly_accommodating_staff and friendly_assistance, etc.

In general, the clustering did an excellent job on the test case. A phrase like sooooo_comfy is the least like the other phrases talking about comfort, so it goes in last among the red group.

With the right threshold, we can get natural clusters by themselves. In practice, we can get many useful clusters to use for collections.

We can also go back and find the hotels that frequently mention the phrases in a cluster. These hotels would be the member of the collection formed by a given cluster.

The human curation post-process

Up to this point we have had a totally automated process, that use phrase detection, word2vec features, and agglomerative clustering to find clusters of phrases. The remaining step is to find the interesting ones. At this point we rely on human curation to pick out the interesting ones, and give them a snappy name.

The manual process is fairly easy. We provide a list of phrases for a cluster, the hotels that are the most related to the cluster, even some examples sentences from the review text for those hotels. It is fairly easy to look at that information, and determine if it would a good way to explore the hotels of a city. It is hard to mathematically define what is interesting, but easy for a human to know when they see it. The human also comes up with a clever name, which is also simple given the list of phrases.

Some interesting collections

The “Catch a Show” collection has phrases like this:

My personal favorite when I’m in New York, the “Near The High Line” collection has:

The whole process provides insight into a particular city, picking out interesting neighborhoods, features of the hotels, and nearby attractions. All the things someone staying in the hotel might be interested in. Different cities will result in different collections.

Once the user has some ideas, they can focus in on a few hotels, and see if they would be a great place to stay. It wouldn’t be possible without the insights that Natural Language Processing provides.

One response to “Using NLP to Find “Interesting” Collections of Hotels”

  1. Arnab Dutta says:

    Nice post!Came across this post as I was searching for a POC for a large micro-blogging site. Just to add, did you consider adding Semantic Markup/Annotations to the concepts/entities extracted using other information sources?

Leave a Reply

Your email address will not be published. Required fields are marked *