Part B is a report on the models built from the data set, including any imputations and transformations you may want to perform. We will build two sets of models, with different data partitions.
To the Data node from above, add a ‘Manage Variables’ node. Viya requires this node if we want to impute and/or transform variables; we do not need to set anything within it.
To the ‘Manage Variables’ node, add nodes for imputation (if desired) and transformation. Make whatever changes to the data you see fit, then execute the pipeline. The transformation node should come last, after the imputation node.
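As a point of reference, the impute-then-transform ordering above can be sketched outside Viya with scikit-learn. The data frame, the column names, and the mean-imputation and log-transform choices here are all hypothetical stand-ins for whatever you configure in the pipeline nodes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values.
df = pd.DataFrame({"income": [50000.0, np.nan, 82000.0, 61000.0],
                   "age":    [34.0, 41.0, np.nan, 29.0]})

# Imputation first: replace each missing value with the column mean.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Transformation second (the node "on the bottom"): for example,
# log-transform a right-skewed variable such as income.
imputed["log_income"] = np.log(imputed["income"])
print(imputed)
```

The ordering matters: transforming before imputing would either propagate the missing values or impute on the transformed scale, which is usually not what is intended.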
To the transformation node, add one node for each of the models that we want to examine:
- Decision tree
- Forest (often referred to as a Random Forest)
- Neural Network (4 total, using different parameters):
  - 1 hidden layer, 50 neurons per layer, TANH hidden layer activation function
  - 5 hidden layers, 50 neurons per layer, TANH hidden layer activation function
  - 1 hidden layer, 100 neurons per layer, TANH hidden layer activation function
  - 1 hidden layer, 50 neurons per layer, ReLU hidden layer activation function
- Logistic Regression (4 total, with different variable selection methods):
  - Forward
  - Backward
  - Stepwise
  - (none) – this method forces in all the variables

This gives a total of 10 models on the data, with a training/validation partition ratio of 50/50.
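The four neural-network configurations above can be sketched with scikit-learn's MLPClassifier in place of the Viya Neural Network node. The synthetic data and the 50/50 split are stand-ins for the course data set and partition settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the course data set.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
# 50/50 training/validation partition, as in the first project.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.5, random_state=0)

# The four parameter settings listed above.
configs = {
    "NN 1x50 tanh":  dict(hidden_layer_sizes=(50,),     activation="tanh"),
    "NN 5x50 tanh":  dict(hidden_layer_sizes=(50,) * 5, activation="tanh"),
    "NN 1x100 tanh": dict(hidden_layer_sizes=(100,),    activation="tanh"),
    "NN 1x50 relu":  dict(hidden_layer_sizes=(50,),     activation="relu"),
}

rates = {}
for name, params in configs.items():
    model = MLPClassifier(max_iter=2000, random_state=0, **params).fit(X_tr, y_tr)
    # Misclassification rate = 1 - accuracy on the validation partition.
    rates[name] = 1 - model.score(X_va, y_va)

for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

Only `hidden_layer_sizes` and `activation` change between the four fits, which is exactly the comparison the assignment asks you to discuss.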
Prepare a summary table with each method used and the misclassification rate on the validation partition. Which model is the champion, having the lowest misclassification on the validation partition?
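A minimal sketch of the requested summary table, again using scikit-learn models and synthetic data as stand-ins for the Viya pipeline: fit several models, compute the misclassification rate on the validation partition, and report the champion (lowest rate).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the course data set, 50/50 partition.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.5, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Forest": RandomForestClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

# One row per model: name and validation misclassification rate.
summary = pd.DataFrame(
    [(name, 1 - m.fit(X_tr, y_tr).score(X_va, y_va)) for name, m in models.items()],
    columns=["Model", "Misclassification (validation)"],
)
champion = summary.loc[summary["Misclassification (validation)"].idxmin(), "Model"]
print(summary.to_string(index=False))
print("Champion:", champion)
```

In your report the table would of course cover all 10 models from the pipeline; only three are shown here to keep the sketch short.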
Discuss any observations you have on the results. Did the results for the neural network change with the different parameter settings? Did all the models identify the same variables as predictive? Were there any differences among the various regression methods?
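To see why the regression methods can disagree, here is a hedged sketch of forward versus backward variable selection using scikit-learn's SequentialFeatureSelector, a stand-in for Viya's selection options (Viya's stepwise method has no direct scikit-learn equivalent, so it is omitted).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data where only a few features are truly informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

selected = {}
for direction in ("forward", "backward"):
    sfs = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=3,
        direction=direction,
    ).fit(X, y)
    selected[direction] = set(sfs.get_support(indices=True))

# The two directions need not pick the same variables.
print(selected)
```

Forward selection adds variables one at a time starting from none, while backward selection removes them starting from all, so the two paths through the model space can end at different variable sets.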
Next, create a new project as before, with the same data set. This time, set the training partition to 60 and the validation partition to 40, and keep the test partition equal to zero. By increasing the relative size of the training partition, we increase the amount of data available for training while (hopefully) still leaving enough data for validation so that overfitting will not be a problem.
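The partition comparison can be sketched as follows: the same model evaluated under a 50/50 and a 60/40 training/validation split. The synthetic data and the choice of a random forest are stand-ins for your data set and champion model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course data set.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

rates = {}
for train_frac in (0.5, 0.6):  # the test partition stays at zero
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, train_size=train_frac, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    # Validation misclassification under each partition scheme.
    rates[train_frac] = 1 - model.score(X_va, y_va)

print(rates)
```

Note that the two validation rates are computed on different (and differently sized) validation partitions, which is worth keeping in mind when comparing them.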
Except for the data exploration node, rebuild your pipeline in the new project just as you did before and execute it.
Which model is the champion here? Compare the champions of the two different data partitions – are they the same method? How do the misclassification rates for the two partition ratios compare? Are any trends noticeable?
End Part B – model building and evaluation