A Practical Guide to LambdaMART in LightGBM

LightGBM LambdaMART for learning to rank

Akash Dubey
DataDrivenInvestor


Introduction

LambdaMART is the boosted-tree version of LambdaRank, which is in turn based on RankNet. These algorithms, and LambdaMART in particular, have proven very successful at solving real-world learning-to-rank problems: for example, an ensemble of LambdaMART rankers won Track 1 of the 2010 Yahoo! Learning to Rank Challenge.

There are two popular, publicly available implementations of LambdaMART: one in the RankLib library, part of the Lemur Project, and one in LightGBM, provided by Microsoft.

Motivation

While there are plenty of posts on the internet explaining LightGBM classification and regression, I could only find a few that touch upon LGBMRanker. In this post, I'll show how we can leverage LightGBM to train learning-to-rank models, along with some pitfalls we could fall into while training our model and, of course, how to get out of them.

Implementation

1. Importing required libraries

2. Loading the data

So, consider that I have dummy data stored in a file called dummy_ltr_data.csv with the following columns: query, product_id, 8 feature columns, and a column for relevance.
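Here is a synthetic stand-in with that schema; the feature column names (feature_1 … feature_8), the row counts, and the 0–4 label range are my own illustration, since the post only describes the columns:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dummy_ltr_data.csv: a query id, a product id,
# 8 numeric features, and a graded relevance label per row.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "query": np.repeat(np.arange(100), 10),  # 100 queries, 10 products each
    "product_id": np.arange(1000),
    **{f"feature_{i}": rng.normal(size=1000) for i in range(1, 9)},
    "relevance": rng.integers(0, 5, 1000),   # graded relevance, 0 (bad) to 4 (perfect)
})
# With the real file this would simply be: df = pd.read_csv("dummy_ltr_data.csv")
```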

3. Splitting the data into train and test

Note that I'll split the data based on queries (qid's), not on a random split of rows. However, you can of course choose which queries are assigned to train and test at random.

The dummy data that I'm using has 100 queries. So, I'll straight away do an 80% train / 20% test split based on unique queries (qid's).
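A query-based split can be sketched as follows; the synthetic frame at the top stands in for the post's data, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dummy data: 100 queries, 10 rows each.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "query": np.repeat(np.arange(100), 10),
    **{f"feature_{i}": rng.normal(size=1000) for i in range(1, 9)},
    "relevance": rng.integers(0, 5, 1000),
})

# 80/20 split on unique query ids, not on rows; shuffling first assigns
# queries to train and test at random.
unique_qids = df["query"].unique()
rng.shuffle(unique_qids)
n_train = int(0.8 * len(unique_qids))  # 80 train queries, 20 test queries
train_df = df[df["query"].isin(unique_qids[:n_train])]
test_df = df[df["query"].isin(unique_qids[n_train:])]
```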

4. Preparing the data for LightGBM

Before we train the LightGBM LambdaMART model on our dummy data, we need to split the data into features and the relevance label, conventionally called (X_train, y_train) for the training set and (X_test, y_test) for the test set.

In addition, we also need the group array, both for the train set and the test (evaluation) set. group is a numpy array that associates rows of the feature matrix with their queries during training.

For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
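That example maps directly to a couple of lines of pandas; the group array is simply the per-query row counts, in the order the queries appear in the (query-sorted) data:

```python
import numpy as np
import pandas as pd

# Toy frame with 6 queries sized exactly as in the example above.
sizes = [10, 20, 40, 10, 10, 10]
df = pd.DataFrame({"query": np.repeat(np.arange(6), sizes)})

# The group array is the number of rows per query, in row order.
group = df.groupby("query", sort=False).size().to_numpy()
print(group)        # [10 20 40 10 10 10]
print(group.sum())  # 100 documents in total
```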

5. Training the LightGBM LambdaMART Model

Unlike RankLib, where you have a dependency on Java and jar files, LightGBM's LambdaMART is pretty easy to train. All we need to do is plug in the right set of parameters.

While most of the parameters are well documented on LightGBM's official website, I'll discuss a few here that aren't.

First things first: the ranking objective in LightGBM is lambdarank, and LambdaMART is the boosted-tree version of LambdaRank. So, in essence, the lambdarank objective together with the gbdt boosting_type is LambdaMART.

Second, ranking objectives in LightGBM use label_gain_ to store the gain of each label value, with label_gain_[i] = (1 << i) - 1 by default. The default label gain therefore only works up to a maximum label value of 31; if your label values exceed 31, you will have to specify a customized label_gain.
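The default gains are easy to reproduce, which makes it clear what a customized label_gain must look like (a list of one gain per possible label value):

```python
# LightGBM's default gain for ranking objectives: gain[i] = (1 << i) - 1,
# i.e. the standard exponential NDCG-style gain.
default_label_gain = [(1 << i) - 1 for i in range(6)]  # first few default gains
print(default_label_gain)  # [0, 1, 3, 7, 15, 31]

# If your relevance labels go higher than the default range, pass an explicit,
# longer list, e.g. label_gain=[2**i - 1 for i in range(64)], to the ranker.
```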

Third, in the fit method, when setting the eval_set and eval_group parameters: if you wish to evaluate your model only on the test set, you only have to pass (X_test, y_test) to eval_set and qids_test to eval_group.

6. Feature Importance

Another good reason to use LightGBM's LambdaMART is its out-of-the-box support for feature importances. All it takes is one line of code to look at the feature importance.

Note that the plot shows only 5 features because the rest of the features have a feature importance of 0. We can verify this with one more line of code: ranker.feature_importances_ returns the feature importance score of each feature.

Conclusion

I have used RankLib and LightGBM extensively for training my learning-to-rank models, and I can confidently say that LightGBM has made my life easier. Not only does LightGBM have a better implementation of LambdaMART than RankLib, it also has an easy-to-use scikit-learn-like API, out-of-the-box support for feature importances, plots, and SHAP values; the list is endless.

In the next post, I'll discuss hyperparameter tuning for learning to rank using grid search and Optuna, a state-of-the-art Bayesian optimization framework. I'll also discuss some important hyperparameters that deserve special attention when tuning LightGBM for learning to rank.

Credits and Sources

  1. https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html#
