Data Discussion and Preprocessing
Overview
Teaching: 10 min
Exercises: 5 min
Questions
What dataset is being used?
How must we organize our data so that it can be used by the machine learning libraries?
Objectives
Briefly describe the dataset.
Prepare the dataset for machine learning.
Data Set Used
The dataset we will use in this tutorial is simulated ATLAS data. Each event corresponds to 4 detected leptons: some events correspond to a Higgs Boson decay and others do not (background). Various physical quantities such as lepton charge and transverse momentum are recorded for each event. The analysis in this tutorial loosely follows the discovery of the Higgs Boson.
Setting up the dataset for machine learning
Here we will format the dataset \((x_i, y_i)\) so that it can be used for machine learning with scikit-learn and TensorFlow. First we need to open our dataset and split it into a training set and a test set.
import pandas as pd

df = pd.read_pickle('data.pkl')
df_train = df.iloc[0:800000]
df_test = df.iloc[800000:]
df.head()
The data are currently stored in a pandas DataFrame: we now need to convert them into a NumPy array so that they can be used by scikit-learn and TensorFlow during the machine learning process. Note that there are many ways this can be done; in this tutorial we will use the scikit-learn pipeline functionality to format our dataset. For more information, please see Chapter 2 of Géron (pp. 70-71) and the scikit-learn documentation. We will briefly walk through the code in this tutorial.
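As a point of comparison, note that selecting columns from a DataFrame and taking its .values attribute already yields the underlying NumPy array. A minimal sketch, using a toy DataFrame with illustrative values in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for the ATLAS data (values are illustrative)
df_toy = pd.DataFrame({'lep_pt_1': [45.0, 60.0],
                       'lep_pt_2': [30.0, 25.0]})

# Selecting columns and taking .values gives a plain NumPy array
arr = df_toy[['lep_pt_1', 'lep_pt_2']].values
print(type(arr))   # <class 'numpy.ndarray'>
print(arr.shape)   # (2, 2)
```

The pipeline approach used in this tutorial performs the same column selection, but also bundles in a scaling step, which is why we prefer it here.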
First we import the required modules from scikit-learn.
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
Next a special DataFrameSelector class is defined. This class is written so that it can operate with the scikit-learn pipeline functionality: it takes in a pandas DataFrame and outputs a NumPy array. The essential fit and transform methods are defined so that the class fits into the object-oriented framework of scikit-learn. For detailed information on transformers in scikit-learn, see Chapter 2 of Géron or the scikit-learn documentation.
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values
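As a quick sanity check, the selector can be applied to a small toy DataFrame (the class definition is repeated here so the snippet runs on its own; the column values are illustrative):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        return X[self.attribute_names].values  # DataFrame -> NumPy array

df_toy = pd.DataFrame({'lep_pt_1': [45.0, 60.0],
                       'lep_pt_2': [30.0, 25.0],
                       'type': [1, 0]})

selector = DataFrameSelector(['lep_pt_1', 'lep_pt_2'])
X_toy = selector.fit_transform(df_toy)  # fit is a no-op; transform selects columns
print(X_toy)
# [[45. 30.]
#  [60. 25.]]
```

Note that the 'type' column is dropped because it is not in the list of attribute names passed to the selector.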
Now a pipeline is created. It takes in df_train or df_test and outputs a NumPy array consisting of only the DataFrame columns whose headers appear in attribs. Note also the StandardScaler: it ensures that each numerical attribute is scaled to have a mean of 0 and a standard deviation of 1 before being fed to the machine learning model. This type of preprocessing is common before feeding data into machine learning models and is especially important for neural networks.
attribs = ['lep_pt_1', 'lep_pt_2', 'mllll']
pipeline = Pipeline([
    ('selector', DataFrameSelector(attribs)),
    ('std_scaler', StandardScaler())
])
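The effect of the StandardScaler step can be verified on a small array: after fitting, each column has mean 0 and standard deviation 1 (up to floating-point precision). The numbers below are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three events, two features (illustrative values)
X = np.array([[45.0, 30.0],
              [60.0, 25.0],
              [50.0, 35.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column is now standardised to mean 0 and standard deviation 1
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True
```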
Now we will use the pipeline to generate the subset \((x_i, y_i)\) used for training and the subset \((x_i, y_i)\) used for testing the model. Note that fit_transform is called on the training dataset but transform is called on the test dataset. We keep the features \(x_i\) separate from the targets (i.e. signal/background) \(y_i\).
X_train = pipeline.fit_transform(df_train)
y_train = df_train['type'].values
X_test = pipeline.transform(df_test)
y_test = df_test['type'].values
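To see why fit_transform is used on the training set but only transform on the test set: fitting learns the scaling statistics (mean and standard deviation), and those must come from the training data alone so that no information about the test set leaks into the preprocessing. A small sketch with synthetic data (the numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_data = rng.normal(loc=100.0, scale=10.0, size=(1000, 1))
test_data = rng.normal(loc=100.0, scale=10.0, size=(200, 1))

scaler = StandardScaler()
scaler.fit(train_data)  # statistics (mean_, scale_) learned from training data only

# transform() reuses the training statistics; it never re-fits on the test set
test_scaled = scaler.transform(test_data)
manual = (test_data - scaler.mean_) / scaler.scale_
print(np.allclose(test_scaled, manual))  # True
```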
Now we are ready to examine various models \(f\) for predicting whether an event corresponds to a Higgs decay or a background event.
Key Points
One must properly format data before any machine learning takes place.
Data can be formatted using scikit-learn functionality; using it effectively may take time to master.