Text preprocessing

The following example shows how to fit a classifier to a simple NLP problem with auto-sklearn.

For an introduction to text preprocessing you can follow these links:

from pprint import pprint

import pandas as pd
import sklearn.metrics
from sklearn.datasets import fetch_20newsgroups

import autosklearn.classification

Data Loading

cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]
X_train, y_train = fetch_20newsgroups(
    subset="train",  # select train set
    shuffle=True,  # shuffle the data set for unbiased validation results
    random_state=42,  # set a random seed for reproducibility
    categories=cats,  # select only 2 out of 20 labels
    return_X_y=True,  # return the text data (X) and the labels (y) separately
)  # X is a list of strings, y is a numpy array

X_test, y_test = fetch_20newsgroups(
    subset="test",  # select test set for unbiased evaluation
    categories=cats,  # select only 2 out of 20 labels
    return_X_y=True,  # return the text data (X) and the labels (y) separately
)  # X is a list of strings, y is a numpy array

Creating a pandas dataframe

Both categorical and text features are often stored as strings. pandas stores Python strings in the generic object dtype, so please ensure that the correct dtype is applied to each column.

# create a pandas dataframe for training, giving the "Text" column the string dtype
X_train = pd.DataFrame({"Text": pd.Series(X_train, dtype="string")})

# create a pandas dataframe for testing, giving the "Text" column the string dtype
X_test = pd.DataFrame({"Text": pd.Series(X_test, dtype="string")})
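
As a minimal sketch of the dtype point above, the snippet below builds a small, made-up dataframe and casts one column to the pandas "string" dtype and another to "category". The column names and rows are hypothetical and are not part of the 20 Newsgroups data. The assumption, following the paragraph above, is that auto-sklearn relies on these dtypes to tell text features from categorical ones.

```python
import pandas as pd

# Hypothetical toy data: both columns hold plain Python strings.
raw = pd.DataFrame(
    {
        "Text": ["the pitcher threw a strike", "my IDE controller failed"],
        "Topic": ["baseball", "hardware"],
    }
)

# Cast each column explicitly so text and categorical features are distinguishable.
typed = raw.astype({"Text": "string", "Topic": "category"})

print(typed.dtypes)
```

Checking `typed.dtypes` before calling `fit` is a cheap way to confirm that each column carries the dtype you intended.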

Build and fit a classifier

# create an auto-sklearn classifier or regressor, depending on the task at hand
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,  # total time budget for the search, in seconds
    per_run_time_limit=30,  # time limit for a single model fit, in seconds
)

automl.fit(X_train, y_train, dataset_name="20_Newsgroups")  # fit the automl model

AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      per_run_time_limit=30, time_left_for_this_task=60)

View the models found by auto-sklearn

print(automl.leaderboard())

          rank  ensemble_weight           type      cost   duration
model_id
3            1             0.34            mlp  0.022959  12.225609
2            2             0.56  random_forest  0.040816  12.765663
4            3             0.10    extra_trees  0.079082  11.489445
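
To give a feel for what the ensemble_weight column means, here is a rough sketch that combines per-model class probabilities with a weighted average. The weights are copied from the leaderboard above, but the probabilities are invented for illustration, and the assumption that the ensemble averages predicted probabilities in this way is a simplification rather than a description of auto-sklearn's internals.

```python
import numpy as np

# Ensemble weights taken from the leaderboard above (mlp, random_forest, extra_trees).
weights = np.array([0.34, 0.56, 0.10])

# Hypothetical class probabilities for ONE sample, one row per model.
probas = np.array(
    [
        [0.9, 0.1],  # mlp
        [0.4, 0.6],  # random_forest
        [0.3, 0.7],  # extra_trees
    ]
)

# Weighted average over models, then pick the most probable class.
ensemble_proba = weights @ probas
predicted_class = int(np.argmax(ensemble_proba))

print(ensemble_proba, predicted_class)
```

In this made-up case the two tree models favour class 1, yet the confident MLP prediction tips the weighted average towards class 0.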

Print the final ensemble constructed by auto-sklearn

pprint(automl.show_models(), indent=4)

{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f05d2001d60>,
           'cost': 0.04081632653061229,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f05d4452cd0>,
           'ensemble_weight': 0.56,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f05d2001eb0>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': RandomForestClassifier(max_features=10, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)},
    3: {   'balancing': Balancing(random_state=1, strategy='weighting'),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f05d684e550>,
           'cost': 0.022959183673469385,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f05d0fd7e50>,
           'ensemble_weight': 0.34,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f05d3f33d30>,
           'model_id': 3,
           'rank': 2,
           'sklearn_classifier': MLPClassifier(activation='tanh', alpha=1.103855734598575e-05, beta_1=0.999,
              beta_2=0.9, early_stopping=True,
              hidden_layer_sizes=(229, 229, 229),
              learning_rate_init=0.00014375616988222174, max_iter=32,
              n_iter_no_change=32, random_state=1, verbose=0, warm_start=True)},
    4: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f05d3dfde50>,
           'cost': 0.07908163265306123,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f05e9b267f0>,
           'ensemble_weight': 0.1,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f05d198f2b0>,
           'model_id': 4,
           'rank': 3,
           'sklearn_classifier': ExtraTreesClassifier(max_features=9, min_samples_split=4, n_estimators=512,
                     n_jobs=1, random_state=1, warm_start=True)}}

Get the score of the final ensemble

predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))

Accuracy score: 0.982256020278834
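
Accuracy alone hides which class the errors fall in. The sketch below, using hypothetical labels rather than the real predictions from this example, recomputes accuracy by hand and adds a confusion matrix via scikit-learn's `confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for eight test documents (0 and 1 are the two classes).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Accuracy is the fraction of predictions that match the true labels.
manual_accuracy = float(np.mean(y_true == y_pred))
sk_accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(manual_accuracy, sk_accuracy)
print(cm)
```

The same two calls can be applied to `y_test` and `predictions` from the example to see how the errors split between the two newsgroups.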

Total running time of the script: 1 minute 5.383 seconds
