The paper “Classification of Chances of Defaulting to Pay” is an informative example of a lab report on logic & programming. The random forest is an ensemble classifier that consists of several decision tree classifiers, hence the name forest. As an ensemble of many trees, it is considered computationally efficient and operates quickly over large datasets (). As an ensemble technique, the random forest classifier constructs many decision trees, which are then used to classify a new instance by majority vote. The classifier combines bagging with a random selection of attributes.
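The majority-vote idea can be sketched in a few lines. This is an illustrative Python sketch using scikit-learn, not the KNIME workflow from the report; the toy dataset and the choice of eleven trees are assumptions made for the example.

```python
# Illustrative sketch: a random forest as a majority vote over decision trees,
# each trained on a bootstrap sample with a random subset of attributes.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Toy data standing in for the loan-default dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Bagging: each tree sees a bootstrap sample and a random feature subset.
trees = []
for seed in range(11):
    Xb, yb = resample(X, y, random_state=seed)  # draw with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(Xb, yb))

def forest_predict(x):
    """Classify a new instance by the majority vote of all trees."""
    votes = [int(t.predict([x])[0]) for t in trees]
    return Counter(votes).most_common(1)[0][0]
```

An odd number of trees is used here simply to avoid tied votes in the binary case.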
Figuratively speaking, this process can be thought of as a group of weak learners coming together to form a stronger whole. Each learner receives a bootstrapped construct of the same information, put differently. In our case, the bootstraps are constructed by randomly drawing, with replacement, from the training data set; each bootstrap has the same number of instances as the training set. To bring out the aspect of the same information put differently, each bootstrap is fed as input to a base decision tree learner, which uses a subset of attributes randomly selected from the original set of attributes, hence the name random.
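The bootstrap step described above can be sketched with the standard library alone; the row names below are purely illustrative.

```python
# Sketch of bootstrapping: draw, with replacement, a sample the same size
# as the training set. Some rows repeat, others are left out ("out-of-bag").
import random

random.seed(42)
training_set = ["row%d" % i for i in range(10)]

# The bootstrap has exactly as many instances as the training set.
bootstrap = [random.choice(training_set) for _ in training_set]

# Rows never drawn form the out-of-bag sample for this tree.
out_of_bag = set(training_set) - set(bootstrap)
```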
Below is a detailed procedure that was followed in the random construction, as guided by (). The random forest classification model was then built using KNIME, as shown in Figure 2: A construction of the KNIME nodes for a random forest classification model. The steps involved in running the model were: pre-processing, classifier construction, test and evaluation, and evaluation of the random forest method.
Pre-processing Process
CSV Reader Node: The node uploads the dataset for further processing.
Column Filter Node: The node deletes irrelevant and redundant attributes.
Feature selection occurs naturally as part of random forest data mining, but the Column Filter node improves efficiency.
Missing Value Node: The node applies the list-wise deletion method, so only instances with data on all the variables were analyzed. It was noted that this would reduce the statistical power of the data; however, this was done as a routine KNIME process, and otherwise all the data was available.
Partitioning Node: The node divides the dataset into two parts, 70% for training and 30% for testing.
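The pre-processing nodes above have a direct analogue in code. The following is a hypothetical pandas sketch, not the actual KNIME workflow; the column names and the tiny inline CSV are invented for illustration.

```python
# Hypothetical pandas equivalent of the KNIME pre-processing nodes.
import io

import pandas as pd

# Stand-in for the CSV file the CSV Reader node would upload.
csv = io.StringIO(
    "id,age,income,default\n"
    "1,25,30000,0\n"
    "2,40,,1\n"        # missing income -> dropped by list-wise deletion
    "3,35,52000,0\n"
    "4,50,61000,1\n"
)

df = pd.read_csv(csv)           # CSV Reader node
df = df.drop(columns=["id"])    # Column Filter node: drop irrelevant attribute
df = df.dropna()                # Missing Value node: list-wise deletion

# Partitioning node: random 70/30 split.
train = df.sample(frac=0.7, random_state=1)
test = df.drop(train.index)
```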
The training set was used by the RandomForest node to construct a random forest classifier, while the test set was used by the Weka Predictor node to evaluate it; the test set was used to estimate the accuracy of the model. To ensure that the test set is independent of the training set, “Draw randomly” was selected in the settings. This avoided the problem of over-fitting. Based on Breiman & Cutler's advice, the test set error was estimated internally by the out-of-bag error.
Classifier Construction Process
Random Forest Node: This node was used to create the random forest classifier.
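The out-of-bag estimate mentioned above can be reproduced outside KNIME. This sketch uses scikit-learn as a stand-in for the Weka-based node, with an assumed toy dataset; only the OOB mechanism itself is the point.

```python
# Sketch: estimating error internally via the out-of-bag (OOB) estimate,
# per Breiman & Cutler. Each instance is scored using only the trees
# whose bootstrap sample did not contain it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

oob_error = 1.0 - forest.oob_score_  # internal estimate of the test-set error
```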
As seen in the figure below, some very important features need to be set for the classifier.
Max depth: This is a mechanism to keep an individual decision tree learner from building an over-complex tree, which would otherwise create an over-fitting problem.
Number of features: Using the Breiman method of log M / log 2, this feature helps in determining the number of randomly selected variables. A smaller subset produces less correlation between the individual classifiers but also indicates lower predictive power.
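The two settings above map directly onto classifier parameters. This is a hedged sketch using scikit-learn rather than the KNIME node; the attribute count M = 16 and the depth limit of 5 are assumptions for illustration. Note that log M / log 2 is simply log2 of M.

```python
# Sketch of the two key settings: max_depth caps tree complexity to avoid
# over-fitting, and the feature-subset size follows Breiman's log M / log 2
# rule (i.e. log base 2 of the number of attributes M).
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

M = 16                             # assumed number of attributes
n_features = int(math.log2(M))     # log M / log 2 = log2(16) = 4

X, y = make_classification(n_samples=200, n_features=M, random_state=0)
forest = RandomForestClassifier(
    n_estimators=50,
    max_depth=5,              # limits over-complex individual trees
    max_features=n_features,  # smaller subset: less correlation, less power
    random_state=0,
)
forest.fit(X, y)
```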