Learning outside the Black-Box: The pursuit of interpretable models

(Jonathan Crabbé et al.)

General Summary

  • This paper extends the symbolic metamodelling work of Ahmed Alaa and Mihaela van der Schaar.

  • The main aim of the paper is to build symbolic regression models that give a global description of a black-box model, using generalized hypergeometric functions (Meijer G-functions).

  • The paper uses a projection-pursuit-based approach that builds the expression term by term, which keeps the symbolic expression parsimonious.

  • The authors also provide a GitHub implementation here.

Background

  • Projection pursuit: an extension of the additive model in which the smoothing function (ridge function) is applied to a projection of the data rather than to the data directly. The data is first projected onto an optimal direction, and a smoothing/non-linear function is applied to the projected values. Each projection reduces the dimensionality of the data, and the step is repeated until the required precision is reached. To minimize the projection loss, the optimization is generally performed term by term, with each new term fitted to the residual left by the previous ones.

  • Ridge functions: the smoothing/non-linear functions used in projection pursuit. In the general projection pursuit algorithm, they are chosen from a standard family of functions such as polynomials, splines, or trigonometric functions. The aim is to transform the projected values within some function space.

  • Meijer G-functions: a class of generalized hypergeometric functions. A G-function is characterized by four sets of hyperparameters, which allow the final expression to take many forms (polynomial, Bessel, trigonometric, exponential, logarithmic, and many more). Meijer G-functions are defined over the entire complex plane, with exceptions on the unit circle and at the origin.

  • Properties of Meijer G-functions: the paper proves some important properties of Meijer G-functions that reduce the hyperparameter search space to just 5 settings.
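
For reference, the two objects described above can be written out as follows. The symbols K, g_k and v_k are notation assumed here for the number of terms, the ridge functions and the projection directions; they are not taken from the paper. The contour-integral definition and the two special cases are the standard textbook ones, with 0 ≤ m ≤ q and 0 ≤ n ≤ p.

```latex
% Projection pursuit model: a sum of K ridge functions of 1-D projections
\hat{f}(x) = \sum_{k=1}^{K} g_k\!\left(v_k^{\top} x\right)

% Meijer G-function (standard Mellin-Barnes contour integral definition)
G^{m,n}_{p,q}\!\left(\left.\begin{matrix} a_1,\dots,a_p \\ b_1,\dots,b_q \end{matrix}\right| z\right)
= \frac{1}{2\pi i}\int_{L}
  \frac{\prod_{j=1}^{m}\Gamma(b_j - s)\,\prod_{j=1}^{n}\Gamma(1 - a_j + s)}
       {\prod_{j=m+1}^{q}\Gamma(1 - b_j + s)\,\prod_{j=n+1}^{p}\Gamma(a_j - s)}
  \, z^{s}\, ds

% Two familiar special cases
e^{-x} = G^{1,0}_{0,1}\!\left(\left.\begin{matrix} - \\ 0 \end{matrix}\right| x\right),
\qquad
\log(1+x) = G^{1,2}_{2,2}\!\left(\left.\begin{matrix} 1,\,1 \\ 1,\,0 \end{matrix}\right| x\right)
```

Changing the integers (m, n, p, q) and the parameter vectors a and b is what lets a single G-function stand in for many familiar closed-form functions.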

Method

  • As discussed above, projection pursuit uses ridge functions to transform the projected data into a given function space. In this work, the authors use Meijer G-functions as ridge functions, which removes the need to fix a specific class of functions by hand before optimization.

  • The input range of the Meijer G-function is restricted to (0, 1). This is achieved by dividing the projection by the norm of the projection vector and by the square root of the input dimensionality (a bound that follows from the Cauchy-Schwarz inequality).

  • The function is optimized term by term, with each new term fitted to the residual of the previous estimate. A back-fitting step is applied after each term is optimized to correct the parameters obtained so far.

  • This process of optimization is repeatedly applied until the required precision is achieved.

  • Since the resulting symbolic expression is continuous, it can be differentiated and expanded as a polynomial using a Taylor series. This is used to analyze feature importance and feature interactions.
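
The term-by-term loop with back-fitting described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: cubic polynomials stand in for the Meijer G-function ridge functions, directions are picked by random search rather than gradient-based optimization, and black_box is a hypothetical stand-in for a trained model. It assumes features scaled to [0, 1]^d, in which case the Cauchy-Schwarz rescaling keeps each projection in [-1, 1]; any further mapping into (0, 1) is left out.

```python
import numpy as np

def project(X, v):
    """Rescaled projection: |X @ v| <= ||v|| * sqrt(d) when X lies in [0, 1]^d."""
    d = X.shape[1]
    return X @ v / (np.linalg.norm(v) * np.sqrt(d))

def fit_ridge(z, target, degree=3):
    """Least-squares polynomial ridge function (stand-in for a Meijer G-function)."""
    return np.polyfit(z, target, degree)

def predict(X, terms):
    """Sum of all ridge functions evaluated on their own projections."""
    out = np.zeros(X.shape[0])
    for v, coef in terms:
        out += np.polyval(coef, project(X, v))
    return out

def fit_term(X, residual, rng, n_candidates=200):
    """Random search: keep the direction whose ridge fit best explains the residual."""
    best = None
    for _ in range(n_candidates):
        v = rng.normal(size=X.shape[1])
        z = project(X, v)
        coef = fit_ridge(z, residual)
        loss = np.mean((residual - np.polyval(coef, z)) ** 2)
        if best is None or loss < best[0]:
            best = (loss, v, coef)
    return best[1], best[2]

def pursue(X, y, max_terms=5, tol=1e-4, seed=0):
    """Add terms one at a time, back-fitting all ridge functions after each addition."""
    rng = np.random.default_rng(seed)
    terms = []
    for _ in range(max_terms):
        residual = y - predict(X, terms)
        terms.append(fit_term(X, residual, rng))
        # Back-fitting: refit each ridge function on its partial residual,
        # keeping the directions fixed.
        for i, (v, _) in enumerate(terms):
            others = terms[:i] + terms[i + 1:]
            partial = y - predict(X, others)
            terms[i] = (v, fit_ridge(project(X, v), partial))
        if np.mean((y - predict(X, terms)) ** 2) < tol:
            break
    return terms

if __name__ == "__main__":
    # Hypothetical black-box model; in practice this would be a trained MLP or SVM.
    def black_box(X):
        return np.sin(X[:, 0] + X[:, 1]) + X[:, 2] ** 2

    rng = np.random.default_rng(1)
    X = rng.uniform(size=(500, 3))
    y = black_box(X)
    terms = pursue(X, y)
    print(len(terms), np.mean((y - predict(X, terms)) ** 2))
```

Each refit in the back-fitting loop is a least-squares problem in which the current coefficients are already a feasible solution, so the training error cannot increase from one step to the next.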

Results

  • The efficacy of the proposed method is tested on multiple UCI datasets, with MLP- and SVM-based black-box models.

  • The performance of the proposed symbolic regression is close to that of the MLP and SVM; the evaluation metrics are MSE and the R2 coefficient.
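
As a reminder of what the two metrics measure, here is a minimal sketch of MSE and R2 in NumPy. The function names are my own. The same functions can be applied with either the ground-truth labels or the black-box outputs as the reference, and this summary does not assume which of the two the paper reports.

```python
import numpy as np

def mse(reference, prediction):
    """Mean squared error between a reference signal and a prediction."""
    reference = np.asarray(reference, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    return float(np.mean((reference - prediction) ** 2))

def r2(reference, prediction):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    reference = np.asarray(reference, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    ss_res = np.sum((reference - prediction) ** 2)
    ss_tot = np.sum((reference - reference.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

An R2 of 1 means the prediction reproduces the reference exactly, while an R2 of 0 means it does no better than always predicting the mean of the reference.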

Advantages, Limitations, and Future Directions

  • The proposed method yields a mathematical expression for any given data, which is interesting and can be used to analyze the data in greater detail.

  • The paper presents the method as a post-hoc interpretability method that explains any given black-box model. This doesn’t seem right: the method only fits the low-confidence ground truths produced by a black-box model on the input data, so it is hard to associate the resulting expression with the model rather than with an altered data-generating distribution.

  • The operating range appears to be chosen arbitrarily; the paper gives no reason and shows no ablation study to support (0, 1) as the operating range, which in turn results in a loss of information.

  • The experiments are quite one-dimensional. All the selected datasets are univariate regression tasks. An extension to classification and multi-dimensional analysis is needed.

  • The proposed method needs implementation changes to adapt it to larger datasets (perhaps mini-batch-based learning or massive parallelization).

  • In one of my projects, I’m exploring the effect of mini-batch-based optimization, higher-dimensional data, and a classification task formulation. The analysis report can be found here and the code implementation can be found here.

  • If interested in exploring any of the above ideas together, feel free to contact me @ koriavinash1@gmail.com
