Data Warehousing and Analytics - Assignment Example


STUDENT ID NUMBER:
SCHOOL OF INFORMATION SCIENCES AND ENGINEERING
ASSIGNMENT 2, 2013
DATA ANALYTICS & BUSINESS INTELLIGENCE (UG)

PERMITTED MATERIALS: any materials.

INSTRUCTIONS:
1. Answer all questions.
2. Write your answers to the questions in a single Word document, clearly labelling each answer. You may cut and paste items from software into your document, and use Ctrl-Alt-PrtScn (on Windows) to create a screenshot for your document.
3. Submit the single Word document under Assignment 2.

MARKS: Marks for each part of each question are shown. Total marks = 80. There are 8 pages in this examination booklet, including this cover page.

PART A
This part consists of three short answer questions and does not require the use of the computers. Answer all THREE questions.

QUESTION A1 (6 marks)
Explain the point of partitioning a dataset from which we want to build a model into training and testing (or training/validation/testing) datasets.

When fitting models to a large dataset, it is advisable to partition the data into training, validation and testing datasets. This is done because models normally have three levels of parameters. The first level is the model class (e.g. decision tree, random forest), the second level is the hyperparameters or regularization parameters (e.g. neural network structure, choice of kernel), and the third level is what are generally referred to simply as the parameters. Given a model class and a choice of regularization parameters, one chooses the parameters by selecting those that minimise the error on the training set. Given a model class, one tunes the regularization parameters by minimising the error on the validation set. One then chooses the model class by its performance on the test set. The partitioning of the dataset is usually done randomly, to make sure that each partition is representative of the whole collection of observations. Usual splits are 70/15/15 or 40/30/30 [GWi11].

QUESTION A2 (7 marks)
To identify interesting association rules the concepts of support, confidence and lift are used.

(1) Explain the three terms:

a. Support: The support of an item set is the proportion of transactions in the dataset that contain the item set, that is, the percentage of all the groups considered that contain every item listed in an association rule. This value indicates how frequently the antecedent and the consequent occur together among all the considered groups.

b. Confidence: The confidence of an association rule is a percentage value that indicates how often the consequent occurs among the groups that contain the antecedent. The confidence value shows how reliable the rule is.

c. Lift: The lift of an association rule is the ratio of the confidence of the rule to its expected confidence. The expected confidence of a rule is the product of the support values of the antecedent and the consequent divided by the support of the antecedent (i.e. the support of the consequent), while the confidence is the support of the joined antecedent and consequent divided by the support of the antecedent [IBM12].

(2) Explain how you would use the concepts of support, confidence and lift to identify interesting rules.
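To make the support, confidence and lift definitions in Question A2 concrete, the following is a minimal R sketch (R being the language underlying Rattle) that mines rules with the arules package, which Rattle uses for its Associate tab. The tiny basket list and the thresholds are made-up illustrations, not part of any assignment dataset.

library(arules)  # provides apriori() and the transactions class

# A small, hypothetical set of shopping-basket transactions (illustration only)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)
trans <- as(baskets, "transactions")

# Keep only rules that meet minimum support and confidence thresholds
rules <- apriori(trans,
                 parameter = list(supp = 0.4, conf = 0.6, minlen = 2))

# Each rule is reported with its support, confidence and lift; sorting by lift
# surfaces rules whose antecedent and consequent co-occur more often than
# independence would predict (lift > 1), one way to flag "interesting" rules.
inspect(sort(rules, by = "lift"))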
QUESTION A3 (7 marks)
Cluster analysis is a data mining technique used to divide data into meaningful groups.

(a) Describe (in your own words) the steps that the k-means algorithm goes through to generate clusters.
The algorithm is made up of the following steps:
1. K points are placed into the space represented by the objects being clustered. These points represent the initial group centroids.
2. Each object is assigned to the group with the closest centroid.
3. After all the objects have been assigned, the positions of the K centroids are recalculated.
4. Steps 2 and 3 are repeated until the centroids no longer move. This results in a separation of the objects into groups, from which the metric to be minimised can be calculated.

(b) Describe any pre-processing steps you might need to complete before generating a k-means cluster analysis (using the Euclidian distance measure).
Pre-processing steps that might be needed before generating a k-means cluster analysis include rescaling (normalising) the numeric variables so that no single variable dominates the Euclidean distance, deciding on the number of clusters, and randomly selecting the initial centre for each cluster.

(c) Imagine you are performing a k-means cluster analysis using Rattle. Describe the steps you would go through to determine the optimal number of clusters.
To determine the optimal number of clusters, one can plot the within-groups sum of squares against the number of clusters extracted and look for a bend (an "elbow") in the plot. Rattle also provides an Iterate Clusters option to help with identifying a good number of clusters.

(d) Describe the measures / characteristics you would use to evaluate a k-means cluster analysis you have created in Rattle.
Characteristics I would use to evaluate a k-means analysis include: the cluster assignments, the matrix of cluster centres, the total sum of squares, the number of points in each cluster, the between-cluster sum of squares, and the total within-cluster sum of squares.
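As a complement to the description in Question A3, here is a minimal R sketch of a k-means run together with the "elbow" plot used in part (c); the matrix df is a hypothetical, pre-scaled numeric dataset invented for illustration, not data supplied with the assignment.

# Hypothetical numeric data, rescaled so that Euclidean distances are comparable
set.seed(42)
df <- scale(matrix(rnorm(200 * 4), ncol = 4))

# Fit k-means for k = 1..10 and record the total within-cluster sum of squares
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)

# The bend ("elbow") in this curve suggests a reasonable number of clusters
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Final model at the chosen k; these components are the characteristics
# mentioned in part (d)
fit <- kmeans(df, centers = 3, nstart = 25)
fit$size          # number of points in each cluster
fit$centers       # matrix of cluster centres
fit$betweenss     # between-cluster sum of squares
fit$tot.withinss  # total within-cluster sum of squares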
PART B
This part consists of two practical data analysis questions, which should be answered using software. Answer BOTH questions.

QUESTION B1 (6 + 7 + 7 = 20 marks)
Data mining techniques have been widely applied in the medical domain to assist in the diagnosis of various medical conditions. Researchers wish to be able to classify tissue samples taken from tumours as either benign or malignant samples. The following dataset scores each tissue sample on 10 characteristics. These characteristics have been established as differing between benign and malignant samples. Each sample is scored on a scale of 1 to 10 (with 1 being the closest to benign, and 10 being the most malignant) for each characteristic. No single characteristic or pattern of characteristics has been identified that can distinguish between benign and malignant samples. A neural network is a good candidate technique for identifying the complex relationship between the 10 characteristics and the actual classification of the tumour (benign or malignant). The dataset can be found in the file CancerWisconsin.csv on Moodle. The variables in the file are as follows:

Data Description
Variable Name                  Values
Clump Thickness                1-10
Uniformity of Cell Size        1-10
Uniformity of Cell Shape       1-10
Marginal Adhesion              1-10
Single Epithelial Cell Size    1-10
Bare Nuclei                    1-10
Bland Chromatin                1-10
Normal Nucleoli                1-10
Mitoses                        1-10
Class                          Benign or Malignant

(a) Load the CancerWisconsin.csv dataset into Rattle. Set Class as the target variable. Partition your data using the default settings. Create a neural network, leaving the number of hidden layer nodes at the default value of 10. Record the performance of the network for the validation partition (choosing appropriate measures from the Evaluate tab, and the validation radio button).

Error matrix for the Neural Net model on CancerWisconsin.csv [validate] (counts):
           Predicted
Actual      Benign  Malignant
Benign          61          3
Malignant        1         35

Error matrix for the Neural Net model on CancerWisconsin.csv [validate] (%):
           Predicted
Actual      Benign  Malignant
Benign          59          3
Malignant        1         34

Overall error: 0.04

Create another neural network model, this time with the number of hidden layer nodes set to 5. Record the performance of the network.

Error matrix for the Neural Net model on CancerWisconsin.csv [validate] (counts):
           Predicted
Actual      Benign  Malignant
Benign          62          2
Malignant        1         35

Error matrix for the Neural Net model on CancerWisconsin.csv [validate] (%):
           Predicted
Actual      Benign  Malignant
Benign          60          2
Malignant        1         34

Overall error: 0.03

Cut and paste the performance measures into your Word document. Comment on the differences in the performance between the two models, and explain the likely cause(s) of any differences.
The two models differ only slightly: the 5-node network correctly classifies one more benign sample (62 versus 61 true negatives, and 2 versus 3 false positives), giving a lower overall validation error (0.03 versus 0.04). The likely cause is the different number of hidden layer nodes: the 10-node network has more weights to fit and is therefore slightly more prone to overfitting the training partition, while the smaller 5-node network generalises marginally better on the validation data.

(b) Comment on the false positive and false negative rate (as provided by the Error Matrix) of the model with the best performance from part (a) above. Comment on whether it is most important to minimise the false positive or the false negative rate for this particular dataset.
The best-performing model (5 hidden nodes) misclassified 2 benign samples as malignant and 1 malignant sample as benign; taking malignant as the positive class, that is 2 false positives and 1 false negative on the validation partition. For this dataset it is most important to minimise the false negative rate, because classifying a malignant tumour as benign could leave a cancer untreated, whereas a false positive would normally only trigger further testing.

(c) Experiment with different numbers of hidden layer nodes to identify the optimum size of the hidden layer. How many samples are required for a network of the size you have determined is optimal? Calculate the number of samples required, and provide your working.
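Rattle builds its neural networks with the nnet package, so the experimentation asked for in Question B1(c) can also be scripted directly in R. The sketch below is an illustration only: it assumes CancerWisconsin.csv is in the working directory with a factor column Class, simplifies Rattle's default partition to a single 70/30 training/validation split (the random partitioning discussed in Question A1), and loops over candidate hidden-layer sizes.

library(nnet)  # the package Rattle uses for its neural network models

cancer <- read.csv("CancerWisconsin.csv")   # assumed file location
cancer <- na.omit(cancer)                   # drop incomplete rows for this simple sketch
cancer$Class <- as.factor(cancer$Class)

# Random training/validation partition (cf. Question A1)
set.seed(42)
train_idx <- sample(nrow(cancer), size = round(0.7 * nrow(cancer)))
train <- cancer[train_idx, ]
valid <- cancer[-train_idx, ]

# Try several hidden-layer sizes and record the validation error of each
for (h in c(2, 5, 10, 15)) {
  fit  <- nnet(Class ~ ., data = train, size = h, maxit = 200, trace = FALSE)
  pred <- predict(fit, valid, type = "class")
  cat("hidden nodes:", h,
      "validation error:", round(mean(pred != valid$Class), 3), "\n")
}

For the sample-size calculation in part (c), a common rule of thumb is to allow on the order of ten training samples per weight in the network, where a network with i inputs, h hidden nodes and one output has roughly (i + 1) * h + (h + 1) weights.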
QUESTION B2 (4 + 4 + 4 + 6 + 6 + 8 + 8 = 40 marks)
This dataset was collected to analyse the success of a marketing campaign conducted by a banking institution. The bank wishes to use the information to develop a model to predict whether a customer is going to sign up for a term deposit.

Variable Name    Variable Type
ID               Numeric
Age              Numeric
Job              Categorical
Marital          Categorical
Education        Categorical
Default          Categorical
Balance          Numeric
Housing          Categorical
Loan             Categorical
Contact          Categorical
Day              Numeric
Month            Categorical
Duration         Numeric
Campaign         Numeric
Pdays            Numeric
Previous         Numeric
Poutcome         Categorical
TermDeposit      Categorical

(a) Load the bank.csv dataset into Rattle. Make sure TermDeposit is set as the target, ID as identity and the rest of the variables as inputs. Partition your data using the default settings. Create a decision tree. Change the value for "Complexity" to 0.0 before pressing Execute (note that in some versions of Rattle, complexities of 0.0 are changed to 0.0001). Press "Draw" to draw the resulting tree. Copy and paste this chart into your answer Word document (one way to do this is to press Ctrl + Print Screen while the chart is active, to copy a screen dump to your clipboard). Lowering the value for Complexity has caused Rattle to create a "full" decision tree, which is very complex.

(b) Generate the Error Matrix for your model, using the validation data partition. Copy and paste the matrix into your Word document. Write down the overall error shown in your output.

Error matrix for the Decision Tree model on bank.csv [validate] (counts):
        Predicted
Actual     no   yes
no        582    27
yes        49    20

Error matrix for the Decision Tree model on bank.csv [validate] (%):
        Predicted
Actual     no   yes
no         86     4
yes         7     3

Overall error: 0.1120944

(c) Generate a ROC curve for your model, using the validation data. Copy and paste the ROC chart into your Word document. Compare the error rate with the area under the ROC curve in a couple of sentences.
The area under the ROC curve is considerably larger than the error rate, but the two are not directly comparable: the error rate measures performance at a single classification threshold, whereas the area under the curve summarises how well the model ranks "yes" cases above "no" cases across all possible thresholds.

(d) Prune your tree to optimise your model. Press "Draw" to draw the resulting tree. Copy and paste this chart into your answer Word document. Describe the process you followed to arrive at the optimal decision tree model. Include any performance measures you used to evaluate the models.

(e) Compare the performance of the full decision tree and your pruned tree on the validation data partition. Using your knowledge of the decision tree algorithm, explain the difference you have observed.

(f) Examine the distribution of the target variable TermDeposit (untick 'Partition' to examine the full dataset, click 'Execute', then go to Explore -> Distributions -> Bar Plot). Comment on whether the distribution of the target variable may have affected the ratio between false positives and false negatives in the pruned decision tree created in (d). Explain why there is / is not an effect.

(g) Create a Random Forest model (leave the tuning parameters at their default values). Compare the performance of the Random Forest model and your pruned decision tree. Using your knowledge of the Random Forest algorithm, explain any difference in performance you observe.

Summary of the Random Forest Model
==================================
Number of observations used to build the model: 4521
Missing value imputation is active.

Call:
 randomForest(formula = TermDeposit ~ ., data = crs$dataset[, c(crs$input, crs$target)],
     ntree = 500, mtry = 4, importance = TRUE, replace = FALSE, na.action = na.roughfix)

               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of error rate: 9.95%
Confusion matrix:
      no  yes class.error
no  3862  138   0.0345000
yes  312  209   0.5988484

Analysis of the Area Under the Curve (AUC)
==========================================
Call:
 roc.default(response = crs$rf$y, predictor = crs$rf$votes)
Data: crs$rf$votes in 8000 controls (crs$rf$y no) < 1042 cases (crs$rf$y yes).
Area under the curve: 0.5
95% CI: 0.489-0.511 (DeLong)

Variable Importance
===================
                no     yes  MeanDecreaseAccuracy  MeanDecreaseGini
duration     57.90  101.44                 97.35            162.48
month        42.61   14.39                 45.56             66.38
poutcome     22.60   17.66                 30.94             34.24
day          28.53    4.58                 28.41             50.36
contact      25.47    0.81                 26.18             10.49
pdays        19.10    6.69                 20.33             24.66
age          16.53    3.20                 16.94             52.23
previous     14.25   10.24                 15.26             14.27
housing      11.63    2.45                 11.91              8.08
education    10.63   -1.77                  8.96             14.45
job          10.27   -0.64                  8.76             43.07
marital       6.65    3.51                  7.68             13.14
campaign      3.96    2.60                  4.81             21.40
loan          2.25    4.03                  4.00              3.99
default       3.32    1.86                  3.90              1.72
balance       2.32    1.24                  2.82             53.82
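The decision tree and Random Forest in Question B2 are driven from Rattle's GUI, but the underlying calls come from the rpart and randomForest packages. The following sketch shows a comparable workflow scripted directly in R; it is an illustration under assumptions: bank.csv is read from the working directory with variable names as described in the question, and pruning here uses rpart's built-in cross-validation table rather than Rattle's validation partition.

library(rpart)         # decision trees (the package behind Rattle's Tree model)
library(randomForest)  # random forests (the package behind Rattle's Forest model)

bank <- read.csv("bank.csv", stringsAsFactors = TRUE)   # assumed file location
inputs <- bank[, setdiff(names(bank), "ID")]            # ID is an identifier, not an input

# (a) A "full" tree: a near-zero complexity parameter lets the tree grow very deep
full_tree <- rpart(TermDeposit ~ ., data = inputs, method = "class",
                   control = rpart.control(cp = 0.0001))

# (d) Prune back to the complexity value with the lowest cross-validated error
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(full_tree, cp = best_cp)

# (g) Random Forest with default-style tuning parameters
set.seed(42)
rf <- randomForest(TermDeposit ~ ., data = na.omit(inputs),
                   ntree = 500, importance = TRUE)
print(rf)        # OOB error estimate and confusion matrix
importance(rf)   # variable importance, as in the summary above

Evaluating the pruned tree and the forest on the same validation partition gives the comparison part (g) asks for; because the forest averages many de-correlated trees, it usually suffers less from the variance that makes the unpruned tree overfit.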
Works Cited
GWi11: Williams (2011), p. 60.
IBM12: IBM (2012).
