
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_diabetes_variable_importance_example.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_diabetes_variable_importance_example.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_diabetes_variable_importance_example.py:


Variable Importance on diabetes dataset
=======================================

Variable importance estimates the influence of a given input variable on the
prediction made by a model. To assess variable importance in a prediction
problem, :footcite:t:`breimanRandomForests2001` introduced the permutation
approach, where the values are shuffled for one variable (column) at a time.
This permutation breaks the relationship between the variable of interest and
the outcome. The loss score is then compared before and after the shuffling: a
significant drop in performance reflects the significance of this variable for
predicting the outcome.

:footcite:t:`stroblConditionalVariableImportance2008` showed that this
easy-to-use solution is affected by the degree of correlation between the
variables: it is biased towards truly non-significant variables that are highly
correlated with the significant ones, thereby creating spuriously significant
variables. They introduced a solution for the Random Forest estimator based on
conditional sampling, which permutes within sub-groups obtained by bisecting
the space using the conditioning variables of the building process. However,
this solution is exclusive to the Random Forest and is costly in
high-dimensional settings.

:footcite:t:`Chamma_NeurIPS2023` introduced a new model-agnostic solution that
bypasses the limitations of the permutation approach by using a conditional
scheme. The variable of interest contains two types of information: 1) its
relationship with the remaining variables and 2) its relationship with the
outcome. The standard permutation, while breaking the relationship with the
outcome, also destroys the dependency on the remaining variables. Therefore,
instead of permuting the variable of interest directly, it is predicted from
the remaining variables, and the residuals of this prediction are permuted
before reconstructing the new version of the variable. This solution preserves
the dependency on the remaining variables.
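
The difference between the two sampling schemes can be sketched on toy data.
The snippet below is only an illustration under stated assumptions, not the
implementation used by the library: the synthetic data, the variable names and
the choice of a linear regression to predict the variable of interest from the
remaining ones are all hypothetical. It compares how much of the correlation
between two variables survives a plain shuffle versus a residual shuffle.

.. code-block:: Python

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Toy data (assumption): x1 is strongly correlated with x0
    n_samples = 2000
    x0 = rng.normal(size=n_samples)
    x1 = 0.9 * x0 + np.sqrt(1 - 0.9**2) * rng.normal(size=n_samples)
    X_toy = np.column_stack([x0, x1])

    j = 1  # index of the variable of interest
    X_minus_j = np.delete(X_toy, j, axis=1)

    # Standard permutation: shuffle the column directly
    x_standard = rng.permutation(X_toy[:, j])

    # Conditional permutation: predict column j from the remaining columns,
    # shuffle the residuals, then rebuild the column from both parts
    model = LinearRegression().fit(X_minus_j, X_toy[:, j])
    x_hat = model.predict(X_minus_j)
    residuals = X_toy[:, j] - x_hat
    x_conditional = x_hat + rng.permutation(residuals)

    # Correlation with the remaining variable before and after each scheme
    corr_original = np.corrcoef(x0, X_toy[:, j])[0, 1]
    corr_standard = np.corrcoef(x0, x_standard)[0, 1]
    corr_conditional = np.corrcoef(x0, x_conditional)[0, 1]
    print(f"original:    {corr_original:.2f}")
    print(f"standard:    {corr_standard:.2f}")
    print(f"conditional: {corr_conditional:.2f}")

In this sketch the plain shuffle should drive the correlation close to zero,
while the residual shuffle should keep it close to its original value. In both
cases the link between the rebuilt column and an outcome would be broken, which
is what the importance score then measures.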

In this example, we compare the standard permutation approach and its
conditional variant for variable importance on the diabetes dataset in the
single-level case. The aim is to see whether integrating the new
statistically controlled solution changes the results.

References
----------
.. footbibliography::

.. GENERATED FROM PYTHON SOURCE LINES 46-48

Imports needed for this script
------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 48-69

.. code-block:: Python


    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_diabetes

    from hidimstat.bbi import BlockBasedImportance
    from hidimstat import compute_loco

    plt.rcParams.update({"font.size": 14})

    # Fixing the random seed
    rng = np.random.RandomState(2024)

    diabetes = load_diabetes()
    X, y = diabetes.data, diabetes.target

    # Number of cross-validation folds used with the provided learner
    k_fold = 2
    # Identifying the categorical (nominal, binary & ordinal) variables
    variables_categories = {}








.. GENERATED FROM PYTHON SOURCE LINES 70-77

Standard Variable Importance
----------------------------
To apply the standard permutation, we use the implementation introduced by (Mi
et al., Nature, 2021), where significance is measured by the mean of
-log10(p_value). For this example, the inference estimator is set to the
Random Forest learner.
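
As a small illustration of this scale, the snippet below converts a few
p-values to -log10(p_value) and compares them with the usual 0.05 level. The
p-values are made up for illustration and are not results from this example;
the variable names are hypothetical.

.. code-block:: Python

    import numpy as np

    # Illustrative p-values (assumption), not outputs of the estimator
    example_pvals = np.array([0.5, 0.01, 0.001])

    # Smaller p-values map to larger scores on the -log10 scale
    example_scores = -np.log10(example_pvals)

    # The 0.05 significance level becomes a horizontal threshold near 1.30
    threshold = -np.log10(0.05)
    is_significant = example_scores > threshold
    print(example_scores, threshold, is_significant)

On this scale, a bar that rises above the threshold corresponds to a p-value
below 0.05.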


.. GENERATED FROM PYTHON SOURCE LINES 77-97

.. code-block:: Python


    bbi_permutation = BlockBasedImportance(
        estimator="RF",
        importance_estimator="residuals_RF",
        do_hypertuning=True,
        dict_hypertuning=None,
        conditional=False,
        group_stacking=False,
        problem_type="regression",
        k_fold=k_fold,
        variables_categories=variables_categories,
        n_jobs=2,
        verbose=0,
        n_permutations=100,
    )
    bbi_permutation.fit(X, y)
    print("Computing the importance scores with standard permutation")
    results_permutation = bbi_permutation.compute_importance()
    # The small offset avoids taking log10(0) when a p-value is exactly zero
    pvals_permutation = -np.log10(results_permutation["pval"] + 1e-10)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Processing: 1
    Processing: 2
    Computing the importance scores with standard permutation




.. GENERATED FROM PYTHON SOURCE LINES 98-104

Conditional Variable Importance
-------------------------------
For the conditional permutation importance based on the two blocks (inference
+ importance), both estimators are set to the Random Forest learner. As
before, significance is measured by the mean of -log10(p_value).


.. GENERATED FROM PYTHON SOURCE LINES 104-124

.. code-block:: Python


    bbi_conditional = BlockBasedImportance(
        estimator="RF",
        importance_estimator="residuals_RF",
        do_hypertuning=True,
        dict_hypertuning=None,
        conditional=True,
        group_stacking=False,
        problem_type="regression",
        k_fold=k_fold,
        variables_categories=variables_categories,
        n_jobs=2,
        verbose=0,
        n_permutations=100,
    )
    bbi_conditional.fit(X, y)
    print("Computing the importance scores with conditional permutation")
    results_conditional = bbi_conditional.compute_importance()
    # The small offset avoids taking log10(0) when a p-value is exactly zero
    pvals_conditional = -np.log10(results_conditional["pval"] + 1e-5)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Processing: 1
    Processing: 2
    Computing the importance scores with conditional permutation




.. GENERATED FROM PYTHON SOURCE LINES 125-133

Leave-One-Covariate-Out (LOCO)
------------------------------
We compare the previous permutation-based approaches with a removal-based
approach, LOCO (Williamson et al., Journal of the American Statistical
Association, 2021), in which the variable of interest is removed and the
inference estimator is retrained on the remaining features. The loss of the
retrained model is then compared with that of the full model to detect any
drop in performance.
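
The removal-and-refit idea can be sketched with scikit-learn alone. This is a
minimal illustration under stated assumptions, not the procedure run by
``compute_loco``: the linear model, the single train/test split and the
variable names are hypothetical choices, and the sketch only reports the
increase in test loss, without the p-values computed in this example.

.. code-block:: Python

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X_demo, y_demo = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=0
    )

    # Reference loss of the model trained on all the features
    full_model = LinearRegression().fit(X_train, y_train)
    full_loss = mean_squared_error(y_test, full_model.predict(X_test))

    loss_increase = []
    for j in range(X_demo.shape[1]):
        # Drop column j, retrain on the remaining features, and re-evaluate
        X_train_j = np.delete(X_train, j, axis=1)
        X_test_j = np.delete(X_test, j, axis=1)
        reduced_model = LinearRegression().fit(X_train_j, y_train)
        reduced_loss = mean_squared_error(y_test, reduced_model.predict(X_test_j))
        loss_increase.append(reduced_loss - full_loss)

    print(np.round(loss_increase, 1))

A large positive entry suggests that the model loses predictive performance
without the corresponding feature. Values near zero, or slightly negative,
suggest that the remaining features carry the same information.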


.. GENERATED FROM PYTHON SOURCE LINES 133-137

.. code-block:: Python


    results_loco = compute_loco(X, y, use_dnn=False)
    pvals_loco = -np.log10(results_loco["p_value"] + 1e-5)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Processing col: 1
    Processing col: 2
    Processing col: 3
    Processing col: 4
    Processing col: 5
    Processing col: 6
    Processing col: 7
    Processing col: 8
    Processing col: 9
    Processing col: 10




.. GENERATED FROM PYTHON SOURCE LINES 138-140

Plotting the comparison
-----------------------

.. GENERATED FROM PYTHON SOURCE LINES 140-164

.. code-block:: Python


    list_res = {"Permutation": [], "Conditional": [], "LOCO": []}
    for index, _ in enumerate(diabetes.feature_names):
        list_res["Permutation"].append(pvals_permutation[index][0])
        list_res["Conditional"].append(pvals_conditional[index][0])
        list_res["LOCO"].append(pvals_loco[index])

    x = np.arange(len(diabetes.feature_names))
    width = 0.25  # the width of the bars
    multiplier = 0
    fig, ax = plt.subplots(figsize=(10, 10), layout="constrained")

    for attribute, measurement in list_res.items():
        offset = width * multiplier
        rects = ax.bar(x + offset, measurement, width, label=attribute)
        multiplier += 1

    ax.set_ylabel(r"$-log_{10}p_{val}$")
    ax.set_xticks(x + width / 2, diabetes.feature_names)
    ax.legend(loc="upper left", ncols=3)
    ax.set_ylim(0, 3)
    ax.axhline(y=-np.log10(0.05), color="r", linestyle="-")
    plt.show()




.. image-sg:: /auto_examples/images/sphx_glr_plot_diabetes_variable_importance_example_001.png
   :alt: plot diabetes variable importance example
   :srcset: /auto_examples/images/sphx_glr_plot_diabetes_variable_importance_example_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 165-174

Analysis of the results
-----------------------
While the standard permutation flags multiple variables as significant for
this prediction, the conditional permutation (the controlled alternative)
agrees on "bmi", "bp" and "s6" but also highlights the importance of "sex",
thus reducing the input space to four significant variables. LOCO underlines
the importance of a single variable, "bp", for this prediction problem.



.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 54.153 seconds)

**Estimated memory usage:**  620 MB


.. _sphx_glr_download_auto_examples_plot_diabetes_variable_importance_example.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_diabetes_variable_importance_example.ipynb <plot_diabetes_variable_importance_example.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_diabetes_variable_importance_example.py <plot_diabetes_variable_importance_example.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_diabetes_variable_importance_example.zip <plot_diabetes_variable_importance_example.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
