The paper "Data Warehousing and Analytics" is a wonderful example of an assignment on logic and programming.

PART A
This part consists of three short-answer questions and does not require the use of a computer. Answer all THREE questions.

QUESTION A1 (6 marks)
Explain the point of partitioning a dataset from which we want to build a model into training and testing (or training/validation/testing) datasets.

When fitting models to a large dataset, it is advisable to partition the data into training, validation, and testing datasets. This is done because a model normally has three levels of parameters.
The first level is the model class (e.g. decision tree, random forest), the second is the hyperparameters or regularization parameters (e.g. neural network structure, choice of kernel), and the third is what is generally referred to simply as the parameters. Given a model class and a choice of regularization parameters, the parameters are chosen to minimize the error on the training set. Given a model class, the regularization parameters are tuned by minimizing the error on the validation set.
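The three-way split that these levels rely on can be sketched in a few lines of Python. This is a minimal illustration only: the function name `partition`, its default fractions, and the fixed seed are assumptions made for the example, not part of the assignment.

```python
import random


def partition(rows, fractions=(0.70, 0.15, 0.15), seed=42):
    """Randomly split rows into training, validation and testing lists.

    The default fractions and the seed are illustrative assumptions.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    # Shuffle a copy so every subset is a random sample of the whole dataset.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])

    train = shuffled[:n_train]                   # used to fit the parameters
    valid = shuffled[n_train:n_train + n_valid]  # used to tune hyperparameters
    test = shuffled[n_train + n_valid:]          # held back until the end
    return train, valid, test


train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Fixing the seed makes the split reproducible, so models built at different times are compared on the same three subsets.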
Finally, the model class is chosen by comparing performance on the test set, which took no part in fitting or tuning and therefore gives an honest estimate of performance on unseen data. The partitioning is usually done randomly so that each subset is representative of the whole collection of observations. Typical splits are 70/15/15 or 40/30/30 (training/validation/testing).

QUESTION A2 (7 marks)
To identify interesting association rules, the concepts of support, confidence, and lift are used. Explain the three terms.

Support: The support of an itemset is the proportion of transactions in the dataset that contain the itemset. For an association rule, it is the percentage of groups (transactions) that contain all of the items listed in the rule.
The percentage is taken over all the groups that were considered. This value indicates how frequently the antecedent and the consequent occur together among all the considered groups.

Confidence: The confidence of an association rule is a percentage value that indicates how often the consequent occurs among the groups containing the antecedent. The confidence value shows how reliable the rule is.

Lift: The lift of an association rule is the ratio of the confidence of the rule to its expected confidence. The expected confidence of a rule is the product of the support values of the antecedent and the consequent divided by the support of the antecedent, which is simply the support of the consequent.
The confidence value is the support of the antecedent and the consequent taken together, divided by the support of the antecedent.

Explain how you would use the concepts of support, confidence, and lift to identify interesting rules.
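To make the three measures concrete, the sketch below computes them for one rule over a handful of transactions. The basket contents and the function names are invented for illustration; exact fractions are used so the ratios can be read off without rounding.

```python
from fractions import Fraction


def support(transactions, items):
    """Proportion of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(items <= t for t in transactions)  # subset test per transaction
    return Fraction(hits, len(transactions))


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together over support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by expected confidence (the consequent's support)."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical toy data: six shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"milk", "eggs"},
    {"butter"},
]

rule = ({"bread"}, {"butter"})
print("support   ", support(baskets, rule[0] | rule[1]))
print("confidence", confidence(baskets, *rule))
print("lift      ", lift(baskets, *rule))
```

In this toy data, butter appears in two of the three baskets that contain bread but in only half of all baskets, so the rule's confidence exceeds its expected confidence and the lift comes out above 1.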