.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_40_advanced_example_resampling.py:

=====================
Resampling Strategies
=====================

In *auto-sklearn* it is possible to use different resampling strategies
by specifying the arguments ``resampling_strategy`` and
``resampling_strategy_arguments``. The following example shows common
settings for the ``AutoSklearnClassifier``.

.. code-block:: default

    import numpy as np
    import sklearn.model_selection
    import sklearn.datasets
    import sklearn.metrics

    import autosklearn.classification

Data Loading
============

.. code-block:: default

    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, random_state=1
    )

Holdout
=======

.. code-block:: default

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        per_run_time_limit=30,
        tmp_folder="/tmp/autosklearn_resampling_example_tmp",
        disable_evaluator_output=False,
        # 'holdout' with 'train_size'=0.67 is the default argument setting
        # for AutoSklearnClassifier. It is explicitly specified in this example
        # for demonstration purposes.
        resampling_strategy="holdout",
        resampling_strategy_arguments={"train_size": 0.67},
    )
    automl.fit(X_train, y_train, dataset_name="breast_cancer")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    AutoSklearnClassifier(ensemble_class=,
                          per_run_time_limit=30,
                          resampling_strategy_arguments={'train_size': 0.67},
                          time_left_for_this_task=120,
                          tmp_folder='/tmp/autosklearn_resampling_example_tmp')

Get the Score of the final ensemble
===================================

.. code-block:: default

    predictions = automl.predict(X_test)
    print("Accuracy score holdout: ", sklearn.metrics.accuracy_score(y_test, predictions))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy score holdout:  0.958041958041958

Cross-validation
================

.. code-block:: default

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        per_run_time_limit=30,
        tmp_folder="/tmp/autosklearn_resampling_example_tmp",
        disable_evaluator_output=False,
        resampling_strategy="cv",
        resampling_strategy_arguments={"folds": 5},
    )
    automl.fit(X_train, y_train, dataset_name="breast_cancer")

    # One can use models trained during cross-validation directly to predict
    # for unseen data. For this, all k models trained during k-fold
    # cross-validation are considered as a single soft-voting ensemble inside
    # the ensemble constructed with ensemble selection.
    print("Before re-fit")
    predictions = automl.predict(X_test)
    print("Accuracy score CV", sklearn.metrics.accuracy_score(y_test, predictions))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Before re-fit
    Accuracy score CV 0.965034965034965

Perform a refit
===============
During ``fit()``, models are fit on individual cross-validation folds.
To use all available data, we call ``refit()``, which trains all models in
the final ensemble on the whole dataset.

.. code-block:: default

    print("After re-fit")
    automl.refit(X_train.copy(), y_train.copy())
    predictions = automl.predict(X_test)
    print("Accuracy score CV", sklearn.metrics.accuracy_score(y_test, predictions))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    After re-fit
    Accuracy score CV 0.958041958041958

scikit-learn splitter objects
=============================
It is also possible to use scikit-learn's splitter classes to further
customize the resampling. If one needs full control over the splitting,
it is possible to use scikit-learn's
``sklearn.model_selection.PredefinedSplit``.

Below is an example of using a predefined split. We split the training
data by the first feature. In practice, one would choose a split
according to the use case at hand.

.. code-block:: default

    selected_indices = (X_train[:, 0] < np.mean(X_train[:, 0])).astype(int)
    resampling_strategy = sklearn.model_selection.PredefinedSplit(
        test_fold=selected_indices
    )

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        per_run_time_limit=30,
        tmp_folder="/tmp/autosklearn_resampling_example_tmp",
        disable_evaluator_output=False,
        resampling_strategy=resampling_strategy,
    )
    automl.fit(X_train, y_train, dataset_name="breast_cancer")

    print(automl.sprint_statistics())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    auto-sklearn results:
      Dataset name: breast_cancer
      Metric: accuracy
      Best validation score: 0.964789
      Number of target algorithm runs: 25
      Number of successful target algorithm runs: 25
      Number of crashed target algorithm runs: 0
      Number of target algorithms that exceeded the time limit: 0
      Number of target algorithms that exceeded the memory limit: 0

For custom resampling strategies (i.e., resampling strategies that are not
defined as strings by *auto-sklearn*), it is necessary to perform a refit:

.. code-block:: default

    automl.refit(X_train, y_train)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    AutoSklearnClassifier(ensemble_class=,
                          per_run_time_limit=30,
                          resampling_strategy=PredefinedSplit(test_fold=array([0, 0, ..., 1, 1])),
                          time_left_for_this_task=120,
                          tmp_folder='/tmp/autosklearn_resampling_example_tmp')

Get the Score of the final ensemble (again)
===========================================

One might expect this score to be poor because we "destroyed" the dataset by
splitting it on the first feature. In this run, however, the refit ensemble
reaches the same test accuracy as the holdout example above.

.. code-block:: default

    predictions = automl.predict(X_test)
    print(
        "Accuracy score custom split", sklearn.metrics.accuracy_score(y_test, predictions)
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy score custom split 0.958041958041958

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 6 minutes 35.274 seconds)
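
To make the predefined split from the splitter-objects example more concrete,
the following pure-Python sketch mimics how a ``test_fold`` array is turned
into train/validation index pairs. It follows the documented behaviour of
scikit-learn's ``PredefinedSplit`` (samples sharing a non-negative fold id form
one validation set, and ``-1`` keeps a sample out of every validation set), but
it is not scikit-learn's implementation. The helper name
``predefined_split_indices`` and the feature values are made up for
illustration.

.. code-block:: python

    def predefined_split_indices(test_fold):
        """Yield (train_indices, validation_indices), one pair per fold id.

        Hypothetical helper that only illustrates the ``test_fold`` semantics.
        """
        # Every distinct fold id except -1 defines one validation set.
        fold_ids = sorted({fold for fold in test_fold if fold != -1})
        for fold_id in fold_ids:
            validation = [i for i, fold in enumerate(test_fold) if fold == fold_id]
            # All remaining samples, including those marked -1, are used for training.
            train = [i for i, fold in enumerate(test_fold) if fold != fold_id]
            yield train, validation


    # Made-up values standing in for the first feature of the training data.
    first_feature = [14.2, 20.1, 11.5, 17.9, 12.3, 19.4]
    mean_value = sum(first_feature) / len(first_feature)

    # Same rule as in the example: 1 if the value is below the mean, else 0.
    test_fold = [int(value < mean_value) for value in first_feature]

    splits = list(predefined_split_indices(test_fold))
    for train, validation in splits:
        print("train:", train, "validation:", validation)

Because the rule produces exactly two fold ids, there are two splits: in one,
the below-mean samples are held out for validation; in the other, the
remaining samples are.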