Data Exploration and Processing - Assignment Example

Summary
The assignment "Data Exploration and Processing" focuses on the critical analysis of data exploration and processing. When doing data modeling using such a large and diverse dataset, it can be hard to get the true picture of what it depicts and the trends that are associated with it…

Data Pre-processing

When doing data modeling with such a large and diverse dataset, it can be hard to get a true picture of what it really depicts and of the trends associated with it. That is why it is often necessary to perform a variety of pre-processing activities to obtain the best possible dataset before it is used for modeling. Pre-processing techniques that can be employed on a given dataset include discretization, binning, linearization, and normalization (Kamiran & Calders, 2012).

Binning

To bin the dataset, the Age attribute was selected and used in the analysis. Two types of binning procedures were performed on it.

Equi-Width binning

The equi-width binning involved the following steps:

1. The range was found to be 81, the difference between the ages of the oldest and youngest subjects in the survey. This was calculated in Excel with the formula =B5-B3, where B5 and B3 are the cells containing the maximum and minimum values, respectively.

2. Using the formula =MIN(A2:A2001), the minimum age was found to be 17 years. This is a logical starting point, since the target sample group for this survey consists of people who are in one form of employment or another. People below this age are categorized as minors and hence are not eligible for any form of employment, so it would be inappropriate to include them in the survey.

3. Based on the number of data points, the number of bins was set at 7. Because the range of values is approximately 80 years, dividing it into 7 bins gives intervals of roughly a decade, which are easier to handle and can be used effectively in different kinds of calculations. A further decision was that, instead of converting each value to the mean of its bin boundaries, each data point would be transformed into the nearest bin boundary. This keeps the bin width at 11 years, a whole number, which is more appropriate for the purposes of this analysis and allows accurate results to be produced.

4. As mentioned in step 3, the bin width used was 11, so this quantity was used to calculate the upper and lower boundaries of each bin within the Excel spreadsheet. These boundaries were placed in an unused column so that they could be drawn on for further calculations when needed.

5. The analysis focuses on the ages of the participants, so they had to be represented in a manner that is both efficient and easy to interpret. This was done with a VLookup formula that determines which lower bin boundaries are closest to the age being transformed: it compares the age with the upper and lower boundaries and returns the value of the one that is closest. The formula is =VLookup(D2, $A$2:$A$2001, 1, FALSE). The lookup range is absolutely referenced because it does not change, while the cell under consideration is referenced relatively (D2) so that the formula can be copied down across the records.

6. This formula was applied to all the relevant records and the results were tabulated in the corresponding Excel tables.
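As a rough cross-check outside the spreadsheet, the same equi-width transformation can be sketched in Python. This is only a minimal illustration of the procedure described in the steps above, assuming a minimum age of 17, a bin width of 11 and 7 bins as quoted; the function names and the sample ages are invented for the example and do not come from the assignment workbook.

```python
# Minimal sketch of the equi-width binning described above (assumed values:
# minimum age 17, bin width 11, 7 bins); not the original Excel workbook.

def equi_width_boundaries(min_value, width, n_bins):
    """Lower boundary of each bin plus the final upper boundary."""
    return [min_value + i * width for i in range(n_bins + 1)]

def snap_to_nearest_boundary(age, boundaries):
    """Transform an age into the nearest bin boundary, as in steps 3 and 5."""
    return min(boundaries, key=lambda boundary: abs(age - boundary))

if __name__ == "__main__":
    ages = [17, 23, 35, 44, 58, 61, 72, 98]            # illustrative ages only
    boundaries = equi_width_boundaries(min_value=17, width=11, n_bins=7)
    print("Boundaries:", boundaries)                    # 17, 28, 39, ..., 94
    for age in ages:
        print(age, "->", snap_to_nearest_boundary(age, boundaries))
```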
Equi-Depth binning

This is the other type of binning that was used and, like the first, it involves a number of steps before the final outcome is obtained:

1. From the given dataset, it was determined that the first and last records fall on the 2nd and 2001st rows, respectively, and that the number of records in the dataset is 2000.

2. The number of bins was again set at 7, so the average bin size is approximately 286 records (from 2000/7).

3. Using this background knowledge, the bin boundaries were calculated. One assumption made was that when a value near a boundary ties with a cluster of identical values, it is incorporated into the lower bin, which means that each new boundary begins with the next lowest distinct value.

4. Even after the steps above, it is still necessary to transform all the ages into a form appropriate for this situation. This offers a way of identifying the best matching tabulated value and assigning it to the corresponding age.

Normalization

Normalization is an essential pre-processing step that places the values in the dataset on a common, comparable scale so that no single attribute dominates and related records can be handled in a logical, useful manner. That is why the two normalization techniques discussed below were used.

Min/Max Normalization

This technique makes it easier not only to model but also to visualize the dataset used in creating the models. It does this by changing the scale upon which the data is based, making it easier to interpret. Many analysts favor it because, unlike some other normalization techniques, it does not introduce any bias, and it maintains the integrity of the dataset since it does not interfere with the existing relationships between values. Despite this, it also has drawbacks, the main one being its tendency to produce errors if a value outside the range of the normalized data is introduced in the future. The formula normally associated with min/max normalization is:

A' = ((A - minA) / (maxA - minA)) * (newMax - newMin) + newMin

Z-score normalization

The mean and standard deviation of a dataset are among its most important parameters and can therefore be used in place of the maximum and minimum in the normalization calculation. The major advantage of this technique is that even when the minimum and maximum are unknown or the data contains outliers, the normalization can still be done precisely, without hindrance. This is why it was used alongside the other technique. It is based on the following formula:

A' = (A - mean) / standardDeviation
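The equi-depth split and the two normalization formulas can likewise be sketched in Python. This is a hedged illustration rather than the method used in the spreadsheet: the sample ages are invented, the equal-sized groups are formed here by simple sorting and slicing (with the last bin absorbing any remainder), and the two functions implement the min/max and z-score formulas exactly as written above.

```python
import statistics

# Illustrative sketch of equi-depth binning and the two normalizations
# described above; the values and function names are invented for the example.

def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins              # roughly 286 for 2000 records and 7 bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(ordered[(n_bins - 1) * size:])  # last bin absorbs any remainder
    return bins

def min_max_normalize(a, values, new_min=0.0, new_max=1.0):
    """A' = ((A - min) / (max - min)) * (newMax - newMin) + newMin"""
    lo, hi = min(values), max(values)
    return (a - lo) / (hi - lo) * (new_max - new_min) + new_min

def z_score_normalize(a, values):
    """A' = (A - mean) / standard deviation"""
    return (a - statistics.mean(values)) / statistics.stdev(values)

if __name__ == "__main__":
    ages = [17, 21, 25, 25, 30, 34, 41, 47, 52, 58, 63, 70, 84, 98]
    for bin_number, contents in enumerate(equi_depth_bins(ages, 7), start=1):
        print("Bin", bin_number, contents)
    print("Min/max normalized 41:", round(min_max_normalize(41, ages), 3))
    print("Z-score normalized 41:", round(z_score_normalize(41, ages), 3))
```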
Discretization

From the previous sections, it is clear that this is a large dataset comprising participants spread across a diverse range of ages. To make visualization and manipulation easier, it can be broken down into distinct categorical values through discretization. The dataset can be discretized with reference to the Age attribute using nested IF statements. The following formula was used on the first record and then applied to the rest:

=IF(A3>65, "Old", IF(A3>45, "Mature", IF(A3>30, "Mid-Age", IF(A3>20, "Young", "Teenager"))))

To obtain the actual picture depicted by the results of this nested IF, the frequencies of the different age brackets had to be obtained. This was done as follows for the teenage category and varied accordingly for the rest:

=COUNTIF(B2:B2001, "Teenager")

Binarization

Certain types of data models require a very rigid set of data, which eases their creation because the results can be interpreted in a straightforward way. One technique used for such an analysis is binarization, in which the different values are assigned a weighted binary quantity, either 0 or 1. In this case, it was decided that a university degree would be quantified as 1 while the rest of the education qualifications would be quantified as 0. The following formula was used on the first record before being applied to the rest:

=IF(I2="university.degree", 1, 0)

Furthermore, the frequencies of the two sets of education qualifications were calculated in the following manner:

=COUNTIF(J2:J2001, 1)
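For completeness, the nested-IF discretization and the education binarization can be mirrored in Python. The cut-offs and the "university.degree" test follow the spreadsheet formulas above; the sample records and the Counter-based tallies are illustrative stand-ins for the COUNTIF frequencies.

```python
from collections import Counter

# Sketch of the nested-IF discretization and the education binarization shown
# above; the sample records are invented for illustration.

def age_category(age):
    """Same cut-offs as =IF(A3>65,"Old",IF(A3>45,"Mature",...))."""
    if age > 65:
        return "Old"
    if age > 45:
        return "Mature"
    if age > 30:
        return "Mid-Age"
    if age > 20:
        return "Young"
    return "Teenager"

def has_university_degree(education):
    """Same test as =IF(I2="university.degree", 1, 0)."""
    return 1 if education == "university.degree" else 0

if __name__ == "__main__":
    records = [(19, "high.school"), (27, "university.degree"),
               (43, "basic.9y"), (55, "university.degree"), (71, "basic.4y")]
    age_frequencies = Counter(age_category(age) for age, _ in records)
    degree_frequencies = Counter(has_university_degree(edu) for _, edu in records)
    print(age_frequencies)     # frequency of each age bracket, like the COUNTIF above
    print(degree_frequencies)  # counts of 1 (degree) and 0 (no degree)
```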
Summary

In the course of the data analysis, several observations were made. These reveal a lot of information about this specific dataset and are summarized as follows:

Age

With over 1000 candidates, a majority of the subjects were middle aged. This is logical given that they are by far the most productive category of employees: more experienced than their younger counterparts, yet energetic enough to perform tasks that would be hard for individuals in the mature and old age categories. Teenagers account for the lowest number of employees, possibly because most are still in school and hence cannot be actively employed. The young and mature age brackets comprised slightly differing numbers, with the mature ones being more numerous; most of them are more established than the younger ones and are therefore more stable in their positions, while the young employees are still rising in the hierarchy. There are only about 30 old people under employment, given that most are likely to have attained the retirement age.

Job

Various job categories were covered within the scope of the study. These varied in how frequently they occurred, depending on the complexity and level of education required to perform the associated tasks effectively. Most of the employees (498) perform administrative roles, followed by the 451 blue-collar employees. This is understandable, as these are the lower employment categories that traditionally require a larger number of employees. The number of managers, 152, is nearly equal to the total number of entrepreneurs under study, reflecting the fact that the number of managers, and of those courageous enough to chart their own paths in entrepreneurship, will always be lower. There is a small number of retired individuals, largely made up of members of the older generation.

Marital status

More than half of the individuals under study were married. This is logical given that, from the analysis of the age brackets, most of the participants were middle-aged and are therefore settled and in stable relationships. 568 of the participants were single, which is roughly equal to the combined number of young and teenage participants, who are unlikely to have settled down. A smaller number comprises divorcees, who are largely members of the mature and old generations.

Education

The education qualifications were broken into binary form, which revealed that only slightly above 25% of the participants had university degrees. This indicates that a large chunk of the employees had only the lower levels of qualification and were therefore unlikely to qualify for positions that required such a degree. The workforce is mostly made up of basic 4-9y and high school graduates. It also reinforces the pyramidal nature of employment, in which a small number of highly educated or skilled employees occupy the top positions while the less educated ones form the bulk of the organization at the lower levels.

Housing and loans

The data was further analyzed to determine how the participants fared as far as housing was concerned. Slightly more than half of them had housing, and hence they could be assumed to have achieved one of the core objectives of any employee. Furthermore, the small number of loanees (318) within the group reveals that most of them are financially stable. It can therefore be concluded that they are in stable employment or are enjoying their retirement benefits. It also shows that, regardless of level of education, age or type of employment, it is possible to use one's income to purchase a good house while also saving for financial stability.

Reference

Kamiran, F., & Calders, T. (2012). Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems.
