I would like to know which features contributed the most to given results.
Can I parse the model leading to these results to find which features are dominant? Or is there another way?
In other words, how can I perform a feature ranking?
Thanks a lot!
Last edit: Nicolas Fiorini 2016-02-17
I don't actually know but have wondered the same thing myself.
If you look at the LambdaMART model definition, you'll notice that not all features are used by the model; in fact, some features appear quite frequently across multiple trees, which may hint at which features are more useful (in positive or negative ways).
I wonder if a simple frequency distribution of model feature use would mean anything?
I'll ask and look around if anyone else has a more definitive answer.
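A minimal sketch of that frequency idea in Python, assuming the saved LambdaMART ensemble uses an XML-like format in which each split node contains a `<feature>` element (the tag name is an assumption here; check it against your own model file):

```python
import re
from collections import Counter

def feature_frequencies(model_text):
    """Count how often each feature id appears in the split nodes of a
    LambdaMART ensemble dump (assumes <feature>ID</feature> tags)."""
    ids = re.findall(r"<feature>\s*(\d+)\s*</feature>", model_text)
    return Counter(int(i) for i in ids)

# toy model fragment, made up for illustration
sample = """
<ensemble>
  <tree id="1"><split><feature>3</feature><threshold>0.5</threshold>
    <split pos="left"><feature>1</feature></split>
  </split></tree>
</ensemble>
"""
freqs = feature_frequencies(sample)
print(freqs.most_common())  # most frequently used features first
```

Whether a high count really means "useful" is exactly the open question in this thread, but the counting itself is cheap to do.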
To further clarify, I was using LambdaMART as an example. Obviously, different models have different structures, but all of them associate weights with the features.
The problem with weight values is that they may not vary much, and the separation between weight values (positive or negative) may be too fine to be of much use.
But perhaps one could use methods to determine whether the weights are statistically different from one another, or statistically significant.
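One way to make that concrete (a sketch, not anything RankLib provides): retrain the model on several bootstrap samples, collect each feature's weight across runs, and apply a paired t-test to two features' weight series. The weight values below are made-up placeholders:

```python
from statistics import mean, stdev
from math import sqrt

def paired_t(a, b):
    """Paired t statistic for two features' weights across n training runs."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# hypothetical weights for feature 1 and feature 2 over five bootstrap retrainings
w1 = [0.41, 0.39, 0.44, 0.40, 0.42]
w2 = [0.12, 0.15, 0.10, 0.13, 0.11]
t = paired_t(w1, w2)
print(f"t = {t:.2f}")  # compare against a t table with n-1 degrees of freedom
```

With real data you would of course need enough retraining runs for the test to mean anything, and the retraining cost is the same objection raised against ablation below.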
Another issue is that any comparison of weights assumes that what you are weighting is comparable, e.g. that your features are normalized. If you have a feature that's always negative, a large negative weight would reflect its importance (not its weakness), or merely a negative correlation.
If you were to use "-norm zscore" and look at a linear model, such as coordinate ascent (ranker 4) or linear regression (ranker 9), then you might be able to treat the values in the output model as feature weights in a ranked sense. You may still have the negative-correlation problem.
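As a sketch of that reading, assuming the saved linear model lists `featureID:weight` pairs on one line (verify this against your own model file; the line below is invented):

```python
def rank_features(model_line):
    """Parse 'id:weight' pairs from a linear model line and rank by |weight|.
    With z-score normalized inputs, a larger |weight| suggests more influence;
    the sign only tells you the direction of the correlation."""
    pairs = []
    for tok in model_line.split():
        fid, w = tok.split(":")
        pairs.append((int(fid), float(w)))
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)

# hypothetical output line from a coordinate-ascent model
line = "1:0.02 2:-0.85 3:0.40 4:-0.01"
for fid, w in rank_features(line):
    print(f"feature {fid}: weight {w:+.2f}")
```

Here feature 2 ranks first despite being negative, which is exactly the negative-correlation caveat above.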
Typically, people train models leaving one feature out at a time to determine the performance lost when each feature is omitted (quantified with the measure of your choice). This also tells you which features are redundant, but it requires a lot of training time. In the end, you can say that your "title" feature accounts for 5% of your total MAP, or something like that.
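The leave-one-out loop is easy to script around RankLib's command line. A sketch, assuming the `-feature` flag (a file listing the feature ids to use) and ranker 6 for LambdaMART; the jar and data file names are placeholders for your setup:

```python
import subprocess
import tempfile

ALL_FEATURES = list(range(1, 11))  # hypothetical: features 1..10

def subset_without(drop):
    """Feature ids kept when one feature is ablated."""
    return [f for f in ALL_FEATURES if f != drop]

def ranklib_cmd(subset_path, model_path):
    """Command line for one ablation run (jar/file names are placeholders)."""
    return ["java", "-jar", "RankLib.jar",
            "-train", "train.txt", "-ranker", "6",   # 6 = LambdaMART
            "-metric2t", "MAP",
            "-feature", subset_path, "-save", model_path]

def run_ablation(drop):
    """Write the feature-subset file for one ablation and launch RankLib."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
        fh.write("\n".join(str(f) for f in subset_without(drop)))
    subprocess.run(ranklib_cmd(fh.name, f"model_without_{drop}.txt"), check=True)

# for f in ALL_FEATURES:
#     run_ablation(f)   # then compare each run's MAP with the full baseline
```

The loop is embarrassingly parallel, so if training time is the bottleneck you can run the ablations on separate machines.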
Thanks a lot for your guidance. I am using LambdaMART too. I've read an article that counted which features appear at which node of every tree in the model and represented the result as a heatmap. The idea is the same as the distribution you suggest: we try to find the most represented features.
Yeah, John's suggestion is what I was heading towards, seeing that apparently no such tool exists (and I'm not only talking about RankLib here). A lot of papers I've read simply define categories of features (when there are many of them; otherwise a category is one feature) and remove them for training, then study the performance. It takes time, but it's also quite easy to do, and at least we get a reliable estimate of the contribution of each feature (rather than summing frequencies up).
Thanks a LOT for your answers and your time!
Apparently, the frequency of occurrence of features has indeed been used to determine feature usefulness.
But that only works where a subset of features is selected by the model and given weights. Most other types of LTR algorithms assign weights to all the features, so it is less clear how to determine feature usefulness there.
Here's a summary of the solutions to this specific problem, since I've been asked in a private message to post one. Maybe it will be useful to other people.
1. Feature ablation: train/evaluate the model on the full set of features, then remove one of them, train/evaluate on the new set, put it back, remove another one, and so on. You can remove either one feature at a time or a group of similar features, to speed up the process at the cost of some precision. This basically gives an idea of how important a feature is within the set: if the score drops when you remove it, it is important, obviously. It particularly helps to figure out which features are useless (no change in the score) or redundant (another feature provides the same information).
2. Single feature test: this is just the other way around. Test the results with no re-ranking, then train the model with only one feature (or one group of features, as above). This way, you can assess the individual contribution of each feature. Be careful though: a feature with a low individual contribution is not necessarily useless, because its association with other features may convey a lot of information. So this experiment lets you detect good features but not bad ones.
3. Feature frequency: it still makes sense to study the frequency of the features in the model, although this is more complex. I'm not an expert in LambdaMART, so I didn't try this one; I was happy enough with 1 and 2. It would require understanding exactly how the model works (so you certainly need to dig into the code a bit) and making sure you're comparing comparable things.
4. Other packages provide some information as well. GBM in R, for instance, gives the relative contribution of features (or something like that), as well as other interesting plots. We're currently working with both RankLib and GBM in our experiments: one is more production-oriented, the other more analysis-oriented (although I'm sure you can do many things with RankLib; it's just not out of the box).
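Whichever of the retrain-many-models approaches you pick, you'll need to harvest the score from each run. A small sketch; the exact log wording is an assumption (it varies by RankLib version), so adjust the pattern to what your runs actually print:

```python
import re

def parse_metric(log_text, metric="MAP"):
    """Pull the last reported score for a metric from a RankLib run's console
    output. Assumes a line such as 'MAP on validation data: 0.4321'."""
    matches = re.findall(rf"{metric}\s+on\s+\S+\s+data:\s*([0-9.]+)", log_text)
    return float(matches[-1]) if matches else None

# invented log fragment for illustration
sample_log = """
Training starts...
MAP on training data: 0.5312
MAP on validation data: 0.4321
"""
print(parse_metric(sample_log))
```

Collecting these numbers into one table per ablated (or solo) feature gives you the ranking the original question asked for.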
I hope this helps anyone with the same question. One thing is sure: whatever solution you pick, it takes time. You generally have lots of features, and 1 and 2 require repeating the whole process many times. 3 requires good knowledge of the model, or a deep look into the files. 4 is tedious because it's R, and GBM isn't on CRAN, so you pretty much have to install it manually and then learn how to use it.
Nicolas, thank you for the suggestions. I have added this as a feature request: https://sourceforge.net/p/lemur/feature-requests/141/
It would be nice to add this to RankLib itself with proper command-line argument settings. It's probably easier to do as a script that makes repeated calls to RankLib with a different feature file for each run.
And yes, given the number of features that might be present, it will definitely be something the user would just kick off, then go home for the night and see the results the next day.
Nicolas, thank you very much for the feedback.
Very valuable!
I agree with Stephen, this could be an amazing feature.
I will see whether I have time to try to contribute it.
Not sure if it will be immediately, but maybe in the next month I'll have time.
I will update the ticket with some analysis and ideas.
Cheers
Thank you so much for your interest in the few points I highlighted. I'm really happy these suggestions could become new features of RankLib!
I'm sorry I don't have time to contribute, though; otherwise I would have made a proposal for this.
Again, thanks a lot. I think this is going to greatly improve an already fantastic library. :-)
For your reference, attached is a script that analyses the model and prints the frequencies of each feature.