Data Discussion and Preprocessing

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • What dataset is being used?

  • How must we organize our data such that it can be used in the machine learning libraries?

Objectives
  • Briefly describe the dataset.

  • Prepare the dataset for machine learning.

Data Set Used

The dataset we will use in this tutorial is simulated ATLAS data. Each event corresponds to four detected leptons: some events correspond to a Higgs boson decay (signal) and others do not (background). Various physical quantities, such as lepton charge and transverse momentum, are recorded for each event. The analysis in this tutorial loosely follows the discovery of the Higgs boson.

Setting up the dataset for machine learning

Here we will format the dataset \((x_i, y_i)\) so that it can be used for machine learning with scikit-learn and TensorFlow. First we need to open our dataset and separate it into a training set and a test set.

import pandas as pd

df = pd.read_pickle('data.pkl')
df.head()  # inspect the first few events

# first 800,000 events for training, the remainder for testing
df_train = df.iloc[0:800000]
df_test = df.iloc[800000:]
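The hard-coded slice above assumes the events are already shuffled. If they were not, scikit-learn's train_test_split could be used instead to shuffle before splitting. Here is a minimal sketch on synthetic data (the column names and values are purely illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (illustrative columns only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'lep_pt_1': rng.normal(50.0, 10.0, size=1000),
    'type': rng.integers(0, 2, size=1000),
})

# 80/20 split with shuffling; random_state makes the split reproducible
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
print(len(df_train), len(df_test))  # 800 200
```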

The data is currently stored in a pandas DataFrame: we now need to convert it into a numpy array so that it can be used by scikit-learn and TensorFlow during the machine learning process. Note that there are many ways to do this: in this tutorial we will use scikit-learn's Pipeline functionality to format our dataset. For more information, please see Chapter 2 of Géron (pp. 70-71) and the scikit-learn documentation. We will briefly walk through the code in this tutorial.

First we import the required classes from scikit-learn.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Next a special DataFrameSelector class is defined. This class is written so that it can operate with scikit-learn's Pipeline functionality: it takes in a pandas DataFrame and outputs a numpy array. The fit and transform methods are defined so that the class fits into scikit-learn's transformer interface. For detailed information on transformers in scikit-learn, see Chapter 2 of Géron or the scikit-learn documentation.

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select named columns from a DataFrame and return them as a numpy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self  # nothing to learn from the data
    def transform(self, X):
        return X[self.attribute_names].values
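To see what the transformer does in isolation, here is a minimal sketch on a toy DataFrame (the column names 'a', 'b', 'c' are made up for illustration; the class is repeated so the snippet runs on its own):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select named columns from a DataFrame and return them as a numpy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self  # nothing to learn from the data
    def transform(self, X):
        return X[self.attribute_names].values

df_toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'c': [5.0, 6.0]})
# fit_transform is provided by TransformerMixin: fit, then transform
arr = DataFrameSelector(['a', 'c']).fit_transform(df_toy)
print(type(arr).__name__, arr.shape)  # ndarray (2, 2)
```

Only the requested columns survive, and the result is a plain numpy array rather than a DataFrame.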

Now a pipeline is created. It takes in df_train or df_test and outputs a numpy array consisting of only the DataFrame columns whose names are listed in attribs. Note also the StandardScaler: this ensures that all numerical attributes are scaled to have a mean of 0 and a standard deviation of 1 before they are fed to the machine learning model. This type of preprocessing is common before feeding data into machine learning models and is especially important for neural networks.

attribs = ['lep_pt_1', 'lep_pt_2', 'mllll']
pipeline = Pipeline([
    ('selector', DataFrameSelector(attribs)),  # DataFrame -> numpy array
    ('std_scaler', StandardScaler())           # zero mean, unit variance
])
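To see what the StandardScaler stage does on its own, here is a quick sketch on random numbers (the values are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Raw values far from zero mean / unit variance, e.g. momenta in GeV
raw = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))

scaled = StandardScaler().fit_transform(raw)
# Each column now has mean 0 and standard deviation 1
print(scaled.mean(axis=0).round(6))  # ~[0. 0. 0.]
print(scaled.std(axis=0).round(6))   # ~[1. 1. 1.]
```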

Now we will use the pipeline to generate the subset \((x_i, y_i)\) used for training and the subset \((x_i, y_i)\) used for testing the model. Note that fit_transform is called on the training dataset but transform is called on the test dataset. We keep the features \(x_i\) separate from the targets (i.e. signal/background labels) \(y_i\).

X_train = pipeline.fit_transform(df_train)
y_train = df_train['type'].values

X_test = pipeline.transform(df_test)
y_test = df_test['type'].values
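The fit_transform/transform distinction matters: the scaling statistics (mean and standard deviation) are learned from the training set only, then reused on the test set, so no information leaks from test to train. A tiny sketch with hand-picked numbers (not the real dataset) makes this concrete:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [10.0]])   # mean 5, standard deviation 5
test = np.array([[5.0], [15.0]])

scaler = StandardScaler()
X_train = scaler.fit_transform(train)  # statistics learned here
X_test = scaler.transform(test)        # the same statistics reused

print(X_train.ravel())  # [-1.  1.]
print(X_test.ravel())   # [0. 2.]
```

Note that the scaled test values are not forced to have mean 0 and standard deviation 1; they are scaled with the training statistics, which is exactly what we want.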

Now we are ready to examine various models \(f\) for predicting whether an event corresponds to a Higgs decay or a background event.

Key Points

  • One must properly format data before any machine learning takes place.

  • Data can be formatted using scikit-learn functionality; using it effectively may take time to master.