Data Mining and How It Can Be Addressed - Assignment Example

Summary
The paper "Data Mining and How It Can Be Addressed" describes data mining as the extraction of information from large data sets using association, segmentation, clustering, and classification for the purposes of analysis and prediction, by means of algorithms and other statistical techniques.

Extract of sample "Data Mining and How It Can Be Addressed"

REVIEW QUESTIONS (Pg 22)

1. Data mining
Data mining refers to the extraction of information from large data sets using association, segmentation, clustering, and classification for the purposes of analysis and prediction. This is done by means of algorithms and other statistical techniques. Data mining collects, searches, and analyzes large databases in order to identify the relationships and patterns within them.

2. Difference between a database, a data set, and a data warehouse
A database is an organized set of information within a particular structure. A database contains several tables, and the number of tables depends on the size of the organization. A data warehouse, on the other hand, is a store of data designed to aid the analysis and retrieval of information; the data it holds is merged from various systems in order to make analysis and retrieval easier and quicker. A data set is a collection of data within a storage device; here the data is denormalized into a single table, unlike in a database, where several tables are involved. In short, a data warehouse is an aggregation of data that supports data mining, a data set is a subset of a data warehouse, and a database is an organized set of data in a given structure.

3. Limitations of data mining and how they can be addressed
Data mining is the actual process of sorting and analyzing data within a data warehouse. The technique is emerging and has become very popular; however, it has some limitations that need to be solved. These include:

a. Reliability. Data mining is a very powerful tool, but it cannot stand alone. The tool needs an expert to feed accurate data into it and to interpret the output in order to reach appropriate conclusions.

b. Quality of data. The data mining process is affected by the quality of the data, which depends entirely on the accuracy of the data entered. Wrong data entry due to human error will greatly distort the output and hence yield unreliable forecasts. Since everything depends on the data entered, entry errors also undermine reliability, decision making, and the knowledge discovery process. Data quality is further affected by missing values, duplicate records, a lack of proper data standards, unneeded data fields, and a lack of timely updates. Quality can be improved by the following steps (sketched in code after this list):
• Entering appropriate values for missing records
• Removing duplicate records and unneeded data fields
• Standardizing data formats
• Identifying and removing logically wrong values
• Updating data fields in a timely manner

c. Interoperability. Interoperability is a challenge to data mining because the information is collected from a variety of sources, and the differences in data formats make the data difficult to handle.

d. Security and privacy. Privacy and security are a concern, given that a large amount of information is stored for data mining purposes. The information may be private or even sensitive, yet susceptible to illegal access or disclosure. Moreover, data mining needs a highly skilled expert to select the best mining method; this makes it very hard for a novice user to apply security controls and choose the correct combination of data mining algorithms so as to produce the best output without compromising data security.
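The quality-improvement steps above can be sketched in a few lines of pandas. This is a minimal illustration on made-up records; the DataFrame and column names are hypothetical, not taken from the assignment's data set.

```python
import pandas as pd

# Hypothetical patient-visit records with typical quality problems:
# a missing name, a duplicate record, messy text casing, an impossible age.
df = pd.DataFrame({
    "name": ["alice ", "Bob", "Bob", None],
    "visit_date": ["2014-01-05", "2014-02-11", "2014-02-11", "2014-03-02"],
    "age": [34, -1, -1, 29],
})

df = df.drop_duplicates()                        # remove duplicate records
df = df.dropna(subset=["name"])                  # handle records missing a key value
df["name"] = df["name"].str.strip().str.title()  # standardize data formats
df["visit_date"] = pd.to_datetime(df["visit_date"])
# Identify logically wrong values and replace them with a plausible one.
df.loc[df["age"] < 0, "age"] = df["age"][df["age"] >= 0].median()
print(df)
```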
4. Difference between operational and organizational data, with the pros and cons of each

a. Operational data. This is data generated during the day-to-day running of an organization. It is produced by the organization's transaction systems, which capture records of all daily operational activities.
Pros:
• It is primary data with full documentation
• It provides a good starting point for data mining activities
Cons:
• Daily transactions are prone to human error, which undermines data quality
• The data is limited to a given operation

b. Organizational data. This is data collected by an organization for the purposes of its day-to-day operations, held in summary or aggregate form. It may originate from operational data or from secondary sources such as questionnaires, surveys, or tests.
Pros:
• The data is collected from a variety of sources, supporting effective analysis and output
• The data can be summarized for easier understanding
• The data can address cross-functional processes
Cons:
• Being secondary data, organizational data may lack detailed documentation
• The context in which the data was created is dynamic and may have changed

5. Ethical issues in data mining and how they can be addressed

Ethical issues. Ethics refers to a set of guidelines or a code of conduct that an individual develops in order to make decisions and judgments that are morally acceptable. Data mining uses people's data, which represents their lives: customer buying behaviors, ethnic background, medical records, financial status, and even marital status. A user of this information may exploit it for discrimination or favoritism, for example on the basis of gender, ethnicity, religion, or political orientation. Practices such as using decision trees to predict individuals' ethnic backgrounds with the intention of profiling them for discrimination are not only unethical but also illegal. Some countries oppose data mining precisely because of its capability to collect, analyze, and produce information trends likely to profile certain individuals, which is regarded as unethical.

How to address the ethical issues. People's data may be private and confidential, so data mining specialists need moral standards concerning the use of and access to such information. Understanding ethical issues in data mining is the responsibility of management, staff, and other web users; when ethical guidelines are identified and applied, individuals are treated equally and fairly. Data mining specialists should be cautious when handling models capable of branding an individual as a certain risk, and they need to be sensitive to personal rights and feelings. Individuals whose data is to be collected and used should be informed of this, and the information should then be used only for the specific purposes stated.

6. Out-of-synch data and how the situation can be remedied

Out-of-synch data is data copied out of a transactional system into the data warehouse after denormalization. A record becomes out-of-synch when subsequent changes to the original record in the source database are not reflected in the copied record, which makes any data mining executed on such records misleading and worthless. The problem can be solved by adopting the alternative method of archiving, where records are moved out of the transactional system so that no further updating or alteration occurs and the data cannot get out of synch. (A code sketch of detecting out-of-synch records follows.)
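As an illustration of spotting out-of-synch records, here is a minimal pandas sketch. The tables, columns, and timestamps are all hypothetical; the idea is simply to compare each source record's last-modified time against the time its warehouse copy was taken.

```python
import pandas as pd

# Hypothetical source (transactional) records with a last-modified timestamp.
source = pd.DataFrame({
    "id": [1, 2, 3],
    "balance": [100, 250, 75],
    "modified_at": pd.to_datetime(["2014-06-01", "2014-06-20", "2014-06-05"]),
})

# Warehouse copy taken on 2014-06-10: record 2 has changed in the source since then.
warehouse = pd.DataFrame({
    "id": [1, 2, 3],
    "balance": [100, 200, 75],
    "copied_at": pd.to_datetime(["2014-06-10"] * 3),
})

merged = warehouse.merge(source[["id", "modified_at"]], on="id")
stale = merged[merged["modified_at"] > merged["copied_at"]]
print(stale["id"].tolist())  # ids whose warehouse copies are out-of-synch -> [2]
```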
7. Normalization: why it suits OLTP systems but not OLAP systems

Normalization is a process in relational database systems whereby data is broken into small, related tables to eliminate redundancy and improve performance. The related tables share common columns and are linked through them. An online transaction processing (OLTP) system is an example of a relational database system designed to handle large volumes of data with frequent updating and retrieval; such systems are commonly used by supermarkets and banks because of the high volume of data involved. Normalization suits OLTP because the data collected within a very short time, through various channels, is enormous. Normalization is not a good fit for online analytical processing (OLAP) systems, which are data analysis and data mining systems: the data warehouse instead performs denormalization, merging all tables so that retrieval and analysis do not require looking things up across multiple tables at the same time.

EXERCISE 1 (PG 22)

1. Relational database
[Figure: a relational database with three related tables.]

2. Data warehouse
The three tables above have been denormalized and merged to form one large data set. Denormalized data in a data warehouse facilitates faster analysis by avoiding look-ups across different tables. Data in the warehouse may be duplicated after denormalization; however, the main concern here is analysis, which is key in this context. (A join-based sketch of this step appears at the end of this exercise.)
[Figure: the denormalized data set, merging the columns ID, Name, Reg No; ID, Code; and ID, Hosp, Bed Capacity.]

3. Data security and privacy sites
a) http://business.ftc.gov/privacy-and-security
b) http://www.mofo.com/practices/services/litigation-trials--appeals/privacy--data-security
c) http://www.theguardian.com/news/datablog/2013/jul/31/data-security-privacy-can-we-have-both
Application to data mining: data mining is the extraction of information from large data sets using association, segmentation, clustering, and classification for the purposes of analysis and prediction; it collects, searches, and analyzes large databases in order to identify the relationships and patterns within them.

4. Summary of a web article, with an explanation of how it might relate to data mining
The internet being a very large search space, information can be collected, analyzed, and interpreted to yield the most relevant results. This can be done by means of algorithms and other statistical techniques.

5. Data set description
a. Contents: the file holds medical records of patients in a government hospital.
b. Purpose: the main purpose of the document is to store daily data on patient visits.
c. Size: the file is 500 KB, with over 500 rows and 12 columns.
d. Age: the document was created three years ago.
Classification: given that the file holds clients' medical records, it is ethical to classify it as confidential and keep it under lock and key, to prevent records from being accessed by unauthorized persons. Anyone needing to access the file must therefore prove authorization, either through system rights or through permission from an authorizing person.
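The denormalization in item 2 can be sketched with a join. A minimal example, assuming three small hypothetical tables keyed on a shared ID column (the column names merely echo the figure above):

```python
import pandas as pd

# Three hypothetical normalized tables sharing the key column "id".
patients = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"], "reg_no": ["R01", "R02"]})
codes = pd.DataFrame({"id": [1, 2], "code": ["C10", "C20"]})
hospitals = pd.DataFrame({"id": [1, 2], "hosp": ["Central", "North"], "bed_capacity": [120, 80]})

# Denormalize: merge everything into one wide data set ready for analysis,
# so no query has to look up several tables at once.
warehouse = patients.merge(codes, on="id").merge(hospitals, on="id")
print(warehouse)
```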
6. Application for a grocery store shopping card

Requirements:
a. National identification card, capturing:
a) Name
b) Residence
c) Permanent address
d) Secret question
e) Next-of-kin details
f) Telephone number
g) Zip code
h) Date

The data above can aid data mining where, for example, the grocery store wants to analyze the number of applicants from a given region. Similarly, the days of the month on which the most applications were received can be identified for future planning.

Privacy concerns: the information given is private and confidential. If it were exposed to the public, individuals with malicious intentions could use it either to swindle money or to profile the applicants based on their political, religious, or ethnic orientation.

REVIEW QUESTIONS (PG 55)

1. The main processes of data preparation, what they accomplish, and their importance

Data preparation is the third phase of data mining, coming after data collection. This phase transforms the data so that data mining algorithms can be applied effectively. (Steps d, e, and f are sketched in code after this list.)

a. Collation. This is the process of merging data from multiple tables of a database into a single location or data set. The resulting data set is easier to mine because its information is richer and more consistent. Collation prepares and organizes the data in a relational database for effective data mining.

b. Data integration. This is the combination of data or records from different data sources. It involves merging and appending.

c. Data scrubbing. Data scrubbing, also referred to as data cleansing, is the process of removing or correcting incorrect, improperly formatted, incomplete, or duplicated data in a database. Collected data contains discrepancies such as missing values, inconsistent values, and other anomalies introduced at some point in the collection process, and these affect quality and integrity. Scrubbing corrects or handles such anomalies to avoid misleading output during mining. It is done in several ways, including handling missing and inconsistent data, reducing attributes, and reducing data. Once the discrepancies are eliminated, whether by algorithms, rules, or look-up tables that correct specific anomalies and identify missing records, the time needed for data preparation is reduced.

d. Handling missing data. Some records in the collected data may be missing, leaving null fields in the data set. The data mining objective determines whether the null fields should be filled with values or left as they are. Handling missing data matters because it avoids misleading output in the mining results.

e. Data reduction. Large data sets are complex and can make the mining process overwhelming and confusing. Data reduction cuts them down to manageable sizes for effective use; through effective filtering, observations can be removed so that only relevant data is worked on.

f. Handling inconsistent data. Inconsistent data occurs when a value in a data set is invalid and not meaningful. This degrades the quality of the mining process, since the expected output will not be achieved, and there is also a chance that the software will interpret the data wrongly.
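Steps d, e, and f can be sketched in a few lines of pandas on made-up data (the column names and the median fill strategy are illustrative assumptions, not the assignment's):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "region": ["East", "West", "East", "South"],
    "visits": ["12", "n/a", "7", "-3"],  # inconsistent: a non-numeric and a negative entry
})

# d. Handle missing data: coercing to numbers turns invalid entries into nulls,
#    which are then filled according to the mining objective (here: the median).
df["visits"] = pd.to_numeric(df["visits"], errors="coerce")

# f. Handle inconsistent data: a negative visit count is not meaningful.
df.loc[df["visits"] < 0, "visits"] = np.nan
df["visits"] = df["visits"].fillna(df["visits"].median())

# e. Data reduction: filter observations down to the relevant subset only.
east = df[df["region"] == "East"]
print(east)
```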
g. Attribute reduction. Data sets may contain attributes that are irrelevant to the task at hand. These attributes should be removed, not because they are useless in general, but so that the attributes most relevant to the given task can be handled faster. Determining how much an attribute is worth calls for a statistical assessment of its correlation with the data being evaluated; attribute reduction therefore proceeds by evaluating the correlation, or the magnitude of the relationship, between attributes. The removed attributes are not deleted completely, because they may be useful in tasks other than the one at hand.

2. Ways to collate data from a relational database
Data from a relational database can be denormalized, or collated by creating a database view. Either way organizes the data in readiness for data mining; several tables in the relational database can be merged to create a large data set, stored in the data warehouse for faster access.

3. The need for data set scrubbing
Data sets need to be scrubbed for the following reasons:
• To handle missing data
• To reduce data or observations
• To handle inconsistent data
• To reduce attributes

4. Reasons for performing reduction using operators rather than excluding attributes or observations
Data and attribute reduction is better performed using operators than by exclusion or deletion, because data that is unimportant in the current task may be needed later. Operators and filters eliminate the data from the current task while preserving it for later tasks.

5. The data repository in RapidMiner and how it is created
The Repository area is where you connect to each data set you wish to mine; it is a storage location that holds all the data sets. Creating a repository: once the RapidMiner application is launched, a message prompts you to set up a data repository, followed by another prompt asking whether to set up a local or a remote repository. The choice depends on the organization's operations: for an organization running an online business a remote repository is appropriate, but for learning purposes we use a local repository.

6. Problems of inconsistent data in data mining
Inconsistent data is data in a data set whose values are not meaningful within the normal range, because of their differing nature or format. Data sets with inconsistent data will not produce the expected output; they yield misleading information on analysis, and such information is unreliable and inaccurate.

EXERCISE 2 (PG 55)

1-2. Downloaded the data set.
3. Imported the worksheet into the RapidMiner repository. [Screenshots: importing the Excel file into the repository and running the imported file.]
4. Created a new, blank process stream in RapidMiner and dragged the data set into the process window. [Screenshot: the data set dragged into the new process.]
5. Ran the process and examined the data set in both Meta Data View and Data View, noting whether any attributes had missing or inconsistent data.
6. Checked for missing or inconsistent data. [Screenshot.]
7. Filtered out some observations based on an attribute's value, and filtered out some attributes. [Screenshots: filtering observations, the filtering process, and the filtered report.]

REVIEW QUESTIONS (PG 68)

1. Limitations of correlation models
Correlation is a measure of the strength of the relationship between the attributes of a data set. One of the major limitations of correlation models is that they can be conducted only on numeric data. (A brief sketch of this restriction follows.)
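A minimal sketch of that numeric-only restriction, on a hypothetical data set: like the exercise below, where the athletes' names are filtered out first, non-numeric attributes have to be set aside before a correlation matrix can be computed.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],  # non-numeric: cannot enter a correlation model
    "age": [19, 18, 22, 25],
    "salary": [4, 29, 29, 15],
    "pay": [30, 31, 32, 20],
})

# Keep only numeric attributes, then compute the correlation matrix.
numeric = df.select_dtypes(include="number")
print(numeric.corr())  # coefficients between -1 and 1 for each attribute pair
```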
2. The correlation coefficient and its interpretation
Correlation coefficients are the tool that identifies relationships among attributes, together with the strength of those relationships. The correlation coefficient is a measure of the strength of the relationship between attribute pairs in a data set.

3. Difference between a positive and a negative correlation
A correlation can be either positive or negative. A positive correlation means the two attributes move in the same direction: when one attribute's value rises, the other rises too, and when one falls, so does the other, giving a positive coefficient. A negative correlation means the two attributes move in opposite directions. Coefficients from 0 to 1 are positive correlations, while coefficients from 0 to -1 are negative correlations.
[Figures: examples of a negative and a positive correlation.]

4. Measuring correlation strength and the ranges of strength
Correlation strength is measured by the correlation coefficient. Since all correlation coefficients fall between -1 and 1, the closer a coefficient is to 1 or -1, the stronger the correlation.

5. Heating-oil-consuming devices
Another very interesting attribute that could be added is international consumer market segmentation. As the economy keeps tightening, every organization strives to expand its market in order to make a profit. Dynamic market conditions, stiff competition, and a multicultural workforce have led to internationalization and to technological innovations that attempt to counter some of these market challenges. Organizations strive to understand market trends, customer behavior, and competitors' actions, and this is where data mining comes in handy, assisting with analysis and correlation in order to understand the market situation.

EXERCISE 3 (PG 68)

1. Best-paid athletes: http://www.forbes.com/athletes/list/#tab:overall

2. A portion of the best-paid-athletes data:

Rank  Name             Pay_M  Salary_M  Age  Sport
1     Mahendra Singh   30     4         19   Cricket
2     Fernando Alonso  31     29        18   Racing
3     Lewis Hamilton   32     29        22   Racing

3. [Screenshot: the filtered best-paid-athletes report.]
4. Imported the best-paid-athletes data set.
5. Filtered out the athletes' names so that the correlation matrix works on numeric attributes only.
6-7. Applied the Correlation Matrix operator and ran the model. [Screenshots: the operator, the model running, and the resulting correlation coefficients.]
8. Interpretation of the correlation coefficients displayed on the matrix tab: correlation strength is measured by the coefficient, and since all coefficients fall between -1 and 1, the closer a coefficient is to 1 or -1, the stronger the correlation.
9. [Screenshot: two-dimensional scatterplot.]

REVIEW QUESTIONS (PG 88)

1. Association rules and their importance
Association rules are a data mining methodology that seeks to establish frequent relationships between attributes in a data set. Using this methodology, data miners can establish which products are regularly purchased together, making it easier to place those items in adjacent positions to ease selection. (A small sketch of the underlying calculation follows; the metrics themselves are defined in the next question.)
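A minimal sketch of the calculation behind an association rule, on a made-up basket list (the item names are illustrative); it works out the support and confidence percentages defined in the next question for the rule "bread implies milk":

```python
# Hypothetical market baskets: which items are purchased together?
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

# Support: times both items occurred together / total observations in the data set.
both = sum(1 for b in baskets if {"bread", "milk"} <= b)
support = both / len(baskets)

# Confidence: times both occurred / times the rule's premise (bread) occurred,
# i.e. the times the attributes coincided over the times they could have.
bread = sum(1 for b in baskets if "bread" in b)
confidence = both / bread

print(f"support = {support:.0%}, confidence = {confidence:.0%}")  # 50%, 67%
```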
2. The main metrics calculated in association rules, and how they are calculated

The two main metrics calculated in association rules are the confidence percentage and the support percentage.

The confidence percentage measures how confident we are that once one item is selected, the related item will also be selected; it is also a measure of the likelihood of false positives in predictions, and the collective confidence percentages should always be 100%. To calculate the confidence percentage, we divide the number of instances where the attributes coincided by the number of instances where they could have coincided.

The support percentage of an association rule is the percentage of times two attributes are found together, relative to the total number of times they could have been found together. It is calculated by taking the number of times the rule occurred, divided by the number of observations in the data set; the absolute number of times the association could have occurred is the number of observations in the data set. (The sketch in the previous question works both percentages out on a small example.)

3. The data type required of a data set's attributes in order to use the Frequent Pattern operators in RapidMiner
Every attribute in a data set must be assigned a data type, which determines how the attribute's data is stored; data types include dates, characters, and numbers, among others. RapidMiner likewise has several data types, including Polynominal and Binominal in the character area and Real and Integer in the numeric area. For the Frequent Pattern operators, attribute data types must be changed in RapidMiner from numeric to binominal, because the association rule operators need this data type in order to function properly.

4. Interpretation of the results
Rule interpretation is somewhat complex; however, with the confidence percentage and support percentage metrics it becomes much easier. Through this scenario we were able to identify the connections between various attributes, and hence to advise Roger on the linkages that exist between types of community groups: the exercise found that the community's churches, family organizations, and hobby organizations share some common members. The association rule models in RapidMiner, together with a new operator that changed the attributes' data types, showed how data mining can identify linkages in data that have a practical application.

EXERCISE 3 (PG 88)

1.
2. [Screenshot: the shopping-basket-analysis process window.]
3.

References
North, Matthew. (2012). Data Mining for the Masses. Retrieved from https://sites.google.com/site/dataminingforthemasses/
Ram, Sudha. (2002). "Data Mining." Computer Sciences. Retrieved June 26, 2014, from Encyclopedia.com: http://www.encyclopedia.com/doc/1G2-3401200509.html
