January 15th, 2014

The Inherent Weakness of Likes and Star Ratings

People have rich knowledge about which content, products, services, people and items they prefer, find attractive or enjoy. A number of services on the web attempt to uncover this knowledge and turn it into products or features: Facebook likes, LinkedIn endorsements and restaurant reviews on Yelp or TripAdvisor are all examples of this.

In the data scientist’s mind, the sole purpose of these rating systems is to elicit users’ knowledge about the world, then incorporate this knowledge into a product or service. For example, Yelp is a utility that tells you which restaurant is best in your neighbourhood. To do this it needs to have information about the quality of restaurants. To build this knowledge, it collects restaurant reviews from users.

But is the way Yelp builds up its knowledge efficient? The main message of this post is this:

From a data perspective, most rating systems on the Internet are inefficient and poorly designed.
What all the examples mentioned above have in common is that they ask humans to rate things on an absolute scale, whether it be a simple binary like-or-not scale as on Medium and Facebook, or a star-rating system as on IMDB or TripAdvisor. There are multiple problems with this approach.

  1. People’s baseline level on the absolute scale may differ. My 4-star rating may describe the same level of satisfaction as someone else’s 5-stars. This makes aggregating opinions from many different users a non-trivial task.

  2. The variance of responses may differ across people: some more conservative reviewers would never use the extreme 1-star or 5-star ratings and really only use 2, 3 or 4 stars, whilst other respondents may see things in black and white, so their ratings are more polarised.

  3. To give informed ratings, the user has to know the distribution of the quality of items ahead of time. If I don’t know much about restaurants in Barcelona, I would not want to give a maximal 5-star rating to the first restaurant I visit, because I don’t know if better restaurants exist, or how others compare. Others may give 5 stars to a mediocre restaurant, because they have never seen a better alternative.

These problems always affect ratings when people are asked to rate things on an absolute scale. They can be accounted for using well-thought-out statistical machine learning techniques. For example, reviews for papers submitted to the NIPS (Neural Information Processing Systems) conference this year were post-processed using a bilinear Gaussian graphical model, which is able to cope with some of the problems mentioned above.

But let’s be honest, the sad reality is that what most internet products do these days is count or take averages. Medium ranks articles by number of recommendations, Yelp ranks by average review, and LinkedIn displays the total number of endorsements. When such simple counts and averages are taken, the problems above introduce biases into the estimates, which makes the whole process inefficient and inaccurate.
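To make the bias concrete, here is a minimal Python sketch with made-up data. The two users, the four restaurants and the helper names `plain_average` and `standardised_average` are all hypothetical. It compares a plain average of stars with an average of per-user standardised scores (each rating minus that user's mean, divided by that user's standard deviation). This is only a simple illustration of problems 1 and 2, not the graphical model mentioned above, and it does nothing about problem 3.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical data: a lenient reviewer and a strict reviewer.
ratings = {
    "lenient": {"bistro": 4, "cafe": 5, "diner": 5},
    "strict": {"trattoria": 3, "cafe": 2, "diner": 2},
}


def plain_average(ratings):
    """Average the raw stars each item received."""
    stars = defaultdict(list)
    for user_ratings in ratings.values():
        for item, value in user_ratings.items():
            stars[item].append(value)
    return {item: mean(values) for item, values in stars.items()}


def standardised_average(ratings):
    """Average per-user z-scores instead of raw stars."""
    scores = defaultdict(list)
    for user_ratings in ratings.values():
        baseline = mean(user_ratings.values())
        spread = pstdev(user_ratings.values())
        for item, value in user_ratings.items():
            # A user who gives everything the same score carries no signal.
            z = (value - baseline) / spread if spread > 0 else 0.0
            scores[item].append(z)
    return {item: mean(values) for item, values in scores.items()}


plain = plain_average(ratings)
standardised = standardised_average(ratings)
print(sorted(plain, key=plain.get, reverse=True))
print(sorted(standardised, key=standardised.get, reverse=True))
```

In this toy data the bistro is the lenient reviewer's least favourite place and the trattoria is the strict reviewer's favourite, yet the plain average puts the bistro (4 stars) ahead of the trattoria (3 stars). Standardising per user reverses that order.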

Another way: Relative rating
There is an alternative: ask users to rate items relative to each other. After a trip to Rome, TripAdvisor could have asked me: “We saw you visited 3 restaurants in Rome, which one did you like the most?” LinkedIn could show me two of my connections and ask me, “Whose expertise in Big Data do you rate more highly, John’s or Mary’s?” Or if you were reading this post on Medium, it could prompt you, “Did you like this post better than the one you read before?” instead of just displaying a Recommend button.

This process keeps the cognitive load on users relatively low, perhaps even lower than with star ratings. I can even envision a neat swish swipe-touch-tap interface for pairwise comparisons implemented as a mobile app. Sure, it requires a little more brainpower from the data team to turn the pairwise data into recommendations and absolute ratings, but it’s not rocket science. Here is an example, and another using very simple techniques to get started.
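As one possible starting point, the sketch below fits a Bradley-Terry model to pairwise outcomes, which is one standard way of turning "A beat B" data into a score per item. It is an assumption on my part that this is a reasonable fit here; the post's own linked examples may use different techniques. The restaurant names, the `bradley_terry` function and the comparison data are all hypothetical. The code assumes every comparison is between two distinct items and that every item wins at least once.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Estimate item strengths from (winner, loser) pairs.

    Uses the classic iterative update for the Bradley-Terry model:
    strength[i] <- wins[i] / sum_j n_ij / (strength[i] + strength[j]),
    followed by normalising the strengths to sum to 1.
    """
    wins = defaultdict(int)         # total wins per item
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denominator = sum(
                count / (strength[i] + strength[j])
                for pair, count in pair_counts.items()
                if i in pair
                for j in pair - {i}
            )
            updated[i] = wins[i] / denominator
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}
    return strength


# Hypothetical answers to "which one did you like the most?"
comparisons = (
    [("trattoria", "osteria")] * 3 + [("osteria", "trattoria")]
    + [("osteria", "pizzeria")] * 3 + [("pizzeria", "osteria")]
    + [("trattoria", "pizzeria")] * 3 + [("pizzeria", "trattoria")]
)

strengths = bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

The fitted strengths sum to 1 and can be read as scores on a common scale: under the model, the probability that item i is preferred to item j is strength[i] / (strength[i] + strength[j]). No user ever had to pick a number of stars, so individual baselines and spreads never enter the data.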