
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/40_advanced/example_text_preprocessing.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_examples_40_advanced_example_text_preprocessing.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_40_advanced_example_text_preprocessing.py:


==================
Text preprocessing
==================

The following example shows how to fit a classifier for a simple NLP problem
with *auto-sklearn*.

For an introduction to text preprocessing you can follow these links:

    1. https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
    2. https://machinelearningmastery.com/clean-text-machine-learning-python/

.. GENERATED FROM PYTHON SOURCE LINES 14-22

.. code-block:: default

    from pprint import pprint

    import pandas as pd
    import sklearn.metrics
    from sklearn.datasets import fetch_20newsgroups

    import autosklearn.classification


.. GENERATED FROM PYTHON SOURCE LINES 23-25

Data Loading
============

.. GENERATED FROM PYTHON SOURCE LINES 25-40

.. code-block:: default

    cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]
    X_train, y_train = fetch_20newsgroups(
        subset="train",  # select train set
        shuffle=True,  # shuffle the data set for unbiased validation results
        random_state=42,  # set a random seed for reproducibility
        categories=cats,  # select only 2 out of 20 labels
        return_X_y=True,  # 20NG dataset consists of 2 columns X: the text data, y: the label
    )  # load these two columns separately as numpy arrays

    X_test, y_test = fetch_20newsgroups(
        subset="test",  # select test set for unbiased evaluation
        categories=cats,  # select only 2 out of 20 labels
        return_X_y=True,  # 20NG dataset consists of 2 columns X: the text data, y: the label
    )  # load these two columns separately as numpy arrays

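
The posts loaded above are raw strings, which a classifier cannot consume
directly; text preprocessing turns them into numeric features. As a rough
illustration of that idea only, here is a minimal bag-of-words sketch that uses
nothing but the standard library. The helper ``bag_of_words`` and the two sample
sentences are made up for this illustration, and the sketch is not necessarily
how *auto-sklearn* encodes text internally.

.. code-block:: python

    import re
    from collections import Counter


    def bag_of_words(text):
        """Lowercase the text, split it into alphabetic tokens, and count them."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return Counter(tokens)


    # Two made-up documents standing in for newsgroup posts (hypothetical data).
    docs = [
        "The pitcher threw the ball.",
        "The motherboard needs a new BIOS.",
    ]
    counts = [bag_of_words(doc) for doc in docs]

    # A shared, sorted vocabulary gives every document the same feature columns.
    vocabulary = sorted(set().union(*counts))

    # One row per document, one column per vocabulary word, values are counts.
    matrix = [[c[word] for word in vocabulary] for c in counts]

    print(vocabulary)
    print(matrix)

Each row of ``matrix`` is a fixed-length numeric vector, which is the kind of
input a classifier expects. Real pipelines usually add steps such as n-grams or
TF-IDF weighting on top of plain counts.
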

.. GENERATED FROM PYTHON SOURCE LINES 41-47

Creating a pandas dataframe
===========================
Both categorical and text features are often strings. pandas stores
Python strings in the generic `object` type. Please ensure that the correct
``dtype`` is applied to the correct column.

.. GENERATED FROM PYTHON SOURCE LINES 47-54

.. code-block:: default

    # create a pandas dataframe for training, labeling the "Text" column as string
    X_train = pd.DataFrame({"Text": pd.Series(X_train, dtype="string")})

    # create a pandas dataframe for testing, labeling the "Text" column as string
    X_test = pd.DataFrame({"Text": pd.Series(X_test, dtype="string")})


.. GENERATED FROM PYTHON SOURCE LINES 55-57

Build and fit a classifier
==========================

.. GENERATED FROM PYTHON SOURCE LINES 57-66

.. code-block:: default

    # create an autosklearn Classifier or Regressor depending on your task at hand.
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=60,
        per_run_time_limit=30,
    )

    automl.fit(X_train, y_train, dataset_name="20_Newsgroups")  # fit the automl model


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    AutoSklearnClassifier(ensemble_class=,
                          per_run_time_limit=30, time_left_for_this_task=60)


.. GENERATED FROM PYTHON SOURCE LINES 67-69

View the models found by auto-sklearn
=====================================

.. GENERATED FROM PYTHON SOURCE LINES 69-72

.. code-block:: default

    print(automl.leaderboard())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

              rank  ensemble_weight           type      cost   duration
    model_id
    3            1             0.34            mlp  0.022959  12.225609
    2            2             0.56  random_forest  0.040816  12.765663
    4            3             0.10    extra_trees  0.079082  11.489445


.. GENERATED FROM PYTHON SOURCE LINES 73-75

Print the final ensemble constructed by auto-sklearn
====================================================

.. GENERATED FROM PYTHON SOURCE LINES 75-78

.. code-block:: default

    pprint(automl.show_models(), indent=4)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {   2: {   'balancing': Balancing(random_state=1),
               'classifier': ,
               'cost': 0.04081632653061229,
               'data_preprocessor': ,
               'ensemble_weight': 0.56,
               'feature_preprocessor': ,
               'model_id': 2,
               'rank': 1,
               'sklearn_classifier': RandomForestClassifier(max_features=10, n_estimators=512, n_jobs=1,
                           random_state=1, warm_start=True)},
        3: {   'balancing': Balancing(random_state=1, strategy='weighting'),
               'classifier': ,
               'cost': 0.022959183673469385,
               'data_preprocessor': ,
               'ensemble_weight': 0.34,
               'feature_preprocessor': ,
               'model_id': 3,
               'rank': 2,
               'sklearn_classifier': MLPClassifier(activation='tanh', alpha=1.103855734598575e-05, beta_1=0.999,
                  beta_2=0.9, early_stopping=True,
                  hidden_layer_sizes=(229, 229, 229),
                  learning_rate_init=0.00014375616988222174, max_iter=32,
                  n_iter_no_change=32, random_state=1, verbose=0, warm_start=True)},
        4: {   'balancing': Balancing(random_state=1),
               'classifier': ,
               'cost': 0.07908163265306123,
               'data_preprocessor': ,
               'ensemble_weight': 0.1,
               'feature_preprocessor': ,
               'model_id': 4,
               'rank': 3,
               'sklearn_classifier': ExtraTreesClassifier(max_features=9, min_samples_split=4, n_estimators=512,
                         n_jobs=1, random_state=1, warm_start=True)}}


.. GENERATED FROM PYTHON SOURCE LINES 79-81

Get the Score of the final ensemble
===================================

.. GENERATED FROM PYTHON SOURCE LINES 81-84

.. code-block:: default

    predictions = automl.predict(X_test)
    print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Accuracy score: 0.982256020278834


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 1 minutes 5.383 seconds)


.. _sphx_glr_download_examples_40_advanced_example_text_preprocessing.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/automl/auto-sklearn/master?urlpath=lab/tree/notebooks/examples/40_advanced/example_text_preprocessing.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: example_text_preprocessing.py <example_text_preprocessing.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: example_text_preprocessing.ipynb <example_text_preprocessing.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_