What is Data Science Math Problem Example | Topics and Well Written Essays

ITECH 2201 Cloud Computing
School of Science, Information Technology & Engineering
Workbook for Week 6
Part A (4 Marks)
Exercise 1: Data Science (1 mark)
Read the article at http://datascience.berkeley.edu/about/what-is-data-science/ and answer the following:
What is Data Science?
There is no consensus on what data science means, as the term means different things to different people. Nonetheless, data science refers to an emerging interdisciplinary field at the intersection of several disciplines, including social science, statistics, design, and information and computer science (Datascience@berkeley, n.d.).
According to IBM's estimate, what percentage of the data in the world today has been created in the past two years?
IBM estimates that 2.5 quintillion bytes of data are created every day, a rate so unprecedented that a full 90 percent of all the data that exists in the world today has been generated in the last two years. This data originates from diverse sources.
What is the value of petabyte storage?
One petabyte is 1,024 terabytes, or roughly one million gigabytes (1,048,576 GB when counted in binary units).
For each course, both foundation and advanced, that you find at http://datascience.berkeley.edu/academics/curriculum/, briefly state (in 2 to 3 lines) what it offers, based on the given course description as well as the video.
Research Design and Application for Data and Analysis is a foundation course that teaches students to apply disciplined, creative methods so that they can ask better questions, collect data efficiently, interpret findings, and present those findings to different audiences.
Data Science W203: Exploring and Analyzing Data introduces learners to several quantitative research methods and statistical techniques used in data analysis, and covers inferential statistics, sampling, experimental design, tests of differences, measurement, and general linear models.
Data Science W205: Storing and Retrieving Data covers the data storage, management and retrieval needed for analysis. It aims to give learners the theoretical knowledge and practical experience to master big data management, storage and retrieval.
Data Visualization and Communication is a foundation course in which students learn to communicate patterns found in data clearly and effectively. It focuses on the design and implementation of complementary visual and verbal representations of analyses for presenting results, responding to questions, and driving decisions.
Data Science W251: Scaling Up! Really Big Data gives students an overview of the complementary toolkits for solving problems associated with big data and cloud computing.
Data Science W231: Behind the Data: Humans and Values is an advanced course that introduces students to the legal, policy, and ethical implications of data, and covers related issues including data privacy, surveillance, security, classification, and discrimination, among others.
Experiments and Causal Inference is an advanced course offering skills in experimental design, statistical analysis, communicating findings, cleaning data, and mining and exploring data. It also introduces learners to experimentation and design-based inference.
Data Science W271: Applied Regression and Time Series Analysis is an advanced course that offers skills in applying more advanced methods derived from regression analysis and time series models. It stresses the selection, application, and implementation of statistical techniques to detect significant patterns and develop insights from data.
Machine Learning at Scale offers skills in coding machine learning algorithms for both single machines and clusters of machines, and covers parallel computing, algorithmic design, and working on problems involving terabytes of data, among others.
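As a quick check of the storage figures quoted in Exercise 1, the unit arithmetic can be sketched in Python. This is only an illustration: the helper name petabyte_in is hypothetical, and the choice between binary steps of 1,024 and decimal steps of 1,000 is an assumption about which convention the question intends.

```python
# Two common conventions for storage units (assumption: the workbook
# answer mixes them, so both are shown side by side).
BINARY_STEP = 1024   # each unit is 1024 of the next smaller unit
DECIMAL_STEP = 1000  # each unit is 1000 of the next smaller unit


def petabyte_in(unit: str, binary: bool = True) -> int:
    """Return how many `unit`s ("TB" or "GB") make up one petabyte."""
    steps_below_petabyte = {"TB": 1, "GB": 2}[unit]
    base = BINARY_STEP if binary else DECIMAL_STEP
    return base ** steps_below_petabyte


print(petabyte_in("TB"))                # 1024
print(petabyte_in("GB"))                # 1048576
print(petabyte_in("GB", binary=False))  # 1000000

# The IBM figure of 2.5 quintillion bytes per day, in decimal petabytes.
BYTES_PER_DAY = 25 * 10**17
petabytes_per_day = BYTES_PER_DAY // DECIMAL_STEP**5
print(petabytes_per_day)                # 2500
```

Under the binary reading a petabyte is 1,024 terabytes; the "one million gigabytes" figure only holds exactly under the decimal reading.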
Exercise 2: Characteristics of Big Data (2 marks)
Read the following research paper from IEEE Xplore Digital Library: Ali-ud-din Khan, M.; Uddin, M.F.; Gupta, N., "Seven V's of Big Data understanding Big Data to extract value," 2014 Zone 1 Conference of the American Society for Engineering Education (ASEE Zone 1), pp. 1-5, 3-5 April 2014, and answer the following questions:
Summarise the motivation of the author (in one paragraph)
Ali-ud-din Khan, Uddin and Gupta (2014) were motivated by the great potential that Big Data presents, particularly in real-world industry and among scientific researchers. More specifically, the authors were motivated by the power of Big Data to help humans solve real-world problems, as it offers the "raw ingredients" for constructing the future. Their major concern, however, is the need for a better understanding of the concept of Big Data. They therefore sought to understand the term from the perspective of the "7 V's."
What are the 7 V's mentioned in the paper? Briefly describe each V in one paragraph.
Volume (1st V) is the amount of data generated from diverse sources, such as text, audio, video, social networking, research studies, and medical reports, among others. A further issue arising from volume is that Big Data is often disorganized and unknown, so it cannot be managed with traditional methods such as SQL.
Velocity (2nd V) is the speed at which data is transferred or shared, which makes working with Big Data difficult. The unprecedented speed at which humans create data challenges human control. High speed also creates large data volumes, which compounds the problem of Big Data. There is therefore a need for new technology that can manage Big Data efficiently.
Variety (3rd V) relates to the wide range of forms in which data appears, including audio, video, digital content, text, images, and so on.
This variety complicates Big Data because it is a great challenge to develop a single system that can directly integrate such varied data. People use different software, applications, platforms, and browsers, and they share content to the cloud in different ways, which gives rise to the problem of Big Data variety.
Veracity (4th V) is the integrity or truthfulness of data. It concerns the level of certainty people have about the data they work with. Veracity extends to the meaningfulness of the findings people derive from the data they are manipulating or analyzing in order to solve a given issue. Today, data is not only big but also disorganized and unstructured, which makes it difficult for people to trust the data they are exploring.
Validity (5th V) is the correctness and accuracy of data with respect to its intended outcome or usage. Validity concerns how data is understood, irrespective of whether it has veracity issues. This calls for substantiating the relationships between the variables in the data at hand in order to validate it against the target purpose.
Volatility (6th V) is the expiry of data. That is, once data has been used and has satisfied its intended purpose, and the retention period expires, the data can be destroyed or relegated. Big Data, like other real-world data, is no exception.
Value (the special 7th V) is the particular value users want or expect from any data they search for or retrieve. It is the desired outcome of manipulating data. People often seek to derive maximum value from the data sets at their disposal. Given the volume and variety of Big Data, it becomes imperative to look for the actual value of the data at hand.
Explore the author's future work by using reference [4] in the research paper. Summarise your understanding of how Big Data can improve the healthcare sector in 300 words.
Big data can improve the healthcare sector by becoming the intelligence behind electronic or digital health data and records, largely because it has the potential to link financial, operational, and clinical analytic mechanisms. In addition to this intelligence role, big data can support evidence-based healthcare. Evidence-based healthcare is a contemporary practice in which previous healthcare information is systematically reviewed in order to give decision makers the critical information they need to direct their decisions and policies. Big data can also improve the healthcare sector by enhancing the detection of health conditions. Indeed, big data can be used to detect illnesses and, specifically, can play a critical role in the clinical genome analysis of patients suffering from cancer, hypertension, diabetes, and HIV, among others. With careful data management, big data analytics can deliver accurate and reliable forecasts or detection of such diseases. Similarly, big data can be used to map potential risk factors for various diseases, or outbreaks of disease, within different populations. There are a number of healthcare-related information sources and databases that big data analytics could use to detect and forecast likely outbreaks of illness among the public. Big data makes this possible as the data is mapped out. These operations can improve healthcare because many types of clinical information, including genomics, patient data and clinical trials, can be effectively analysed by big data analytics to help healthcare practitioners make well-informed decisions. Additionally, big data is a critical resource and a valuable tool that stimulates research, innovation and development in the healthcare sector. The increased application and usefulness of big data in the healthcare system can further the understanding of the complex processes and issues currently facing the sector.
Exercise 3: Big Data Platform (1 mark)
In order to build a big data platform one has to acquire, organize and analyse the big data. Go through the following links and answer the questions that follow the links:
http://www.infochimps.com/infochimps-cloud/how-it-works/
http://www.youtube.com/watch?v=TfuhuA_uaho
http://www.youtube.com/watch?v=IC6jVRO2Hq4
http://www.youtube.com/watch?v=2yf_jrBhz5w
Please note: You are encouraged to watch all the videos in that series from Oracle.
How to acquire big data for enterprises and how can it be used?
Big data for the enterprise is acquired by capturing and storing a large variety of data. When carefully distilled, analysed and blended with traditional enterprise information, big data can be used to build a more comprehensive understanding of an organization's business operations, which can translate into improved productivity, a better strategic and competitive position, and greater innovation.
How to organize and handle the big data?
Big data is organized and handled by transforming and integrating the different types of data captured.
What are the analyses that can be done using big data?
The three kinds of analysis that can be done using big data are: (i) running queries; (ii) modelling; and (iii) algorithms. Undertaking these analyses effectively can help an enterprise gain new insights into its business.
Part B (4 Marks)
Part B answers should be based on well cited articles/videos; name the references used in your answer. For more information read the guidelines as given in Assignment 1.
Exercise 4: Big Data Products (1 mark)
Google is a master at creating data products. Below are a few examples from Google. Describe the products below and explain how large scale data is used effectively in these products.
a. Google's PageRank
PageRank is a system for ranking web pages: it counts link votes and establishes the most important web pages based on those votes.
Combined with other elements, the scores are used to establish whether a page will rank well in a search. Big data is used effectively in PageRank because the system counts and ranks the links people post on sites to help establish which websites offer content of value.
b. Google's Spell Checker
Google Spell Checker is an application used to check and detect spelling errors in different types of typed text, including Word and PDF documents as well as emails and webpages. To check spelling, one highlights the content, copies it, taps the "Google Spell Checker" icon and chooses the right spelling. Big data is used effectively in this tool because it has a large reservoir of correct words against which it detects errors in the text.
c. Google's Flu Trends
Google Flu Trends is a site that presents estimates of influenza activity in several countries by aggregating search queries to forecast flu activity. The product was designed primarily to predict outbreaks of flu. It uses big data effectively because it integrates many different types of data, including social media data, data from health organizations such as the CDC, and structural models, to infer the spatial and temporal outbreak and spread of flu.
d. Google's Trends
Google Trends is a Google Search product that indicates how often a certain search term is searched relative to the whole volume of searches, across different locations of the world and in different languages. Google Trends searches a large volume of data (big data) in order to show the most searched terms.
Like Google, Facebook and LinkedIn also use large scale data effectively. How?
LinkedIn and Facebook use big data effectively by capturing and organizing millions of individual user postings, shares, likes, comments and the like. According to IBM's estimate, the number of active Facebook users had surpassed 1 billion (Rainie, Smith & Duggan, 2013).
This data is also acquired, organized and analysed so that the enterprise can use it in its decision-making processes.
Exercise 5: Big Data Tools (2 marks)
Briefly explain why a traditional relational database (RDBMS) is not effectively used to store big data?
Despite powering many different business applications, RDBMSs have several shortcomings. Their major weakness is their limited ability to support big data. They are characterized by limited scalability, availability and fault tolerance, and a limited ability to meet the requirements of today's business models and applications (Pokorny, 2013). Data complexity is also a major concern with RDBMSs, because their data exists in multiple tables connected to each other via shared key values. Working with this complexity calls for experience. Furthermore, the system is limited by broken keys and records, as relational databases need shared keys to connect data spread across various tables (Padhy, Patra & Satapathy, 2011). If a table or record lacks a unique key, the database may return inaccurate results. Additionally, some relational databases require more powerful servers to return results within an acceptable response time.
What is NoSQL Database?
A NoSQL database is a non-relational, distributed database that allows rapid, ad hoc acquisition, organization, and analysis of very large volumes of disparate data types, and was specifically designed to satisfy the unique demands of current big data creation, storage and analysis (Pokorny, 2013).
Name and briefly describe at least 5 NoSQL Databases
(i) Document databases are a type of NoSQL database that pairs each key with a complex data structure referred to as a document, which may hold varied key-value pairs or nested documents, among others.
(ii) Graph stores are NoSQL databases used to store data about networks, for example social connections, and include Neo4J and Giraph (Pokorny, 2013).
(iii) Key-value stores are the simplest NoSQL databases: every item in the database is stored as a key, or attribute name, alongside its value. Berkeley DB and Riak are good examples of key-value stores.
(iv) Wide-column stores are NoSQL databases optimized specifically for queries over large-volume datasets; instead of storing data in rows, they keep data in columns (Pokorny, 2013). Examples are Cassandra and HBase.
(v) Multi-model databases are a NoSQL database type that, unlike most database management systems grounded in a single data model, is designed to support several data models against one integrated backend.
What is MapReduce and how it works?
MapReduce is a key element of the Apache Hadoop software framework. It can be viewed as a specific form of directed acyclic graph (DAG) of processing steps that is used in a variety of use cases. MapReduce is built around a "map" function that converts an element of data into some number of key-value pairs. These key-value pairs are then sorted by key so that all pairs with the same key reach the same node, where a "reduce" function is applied to combine the values for that key into a single result (Dyer, Cordova, Mont, & Lin, 2008).
Briefly describe some notable MapReduce products (at least 5)
Inverted Index pattern is commonly used as an example for MapReduce, and creates an index from a data set to enable quicker data search or data enrichment. It is used when faster search query responses are needed.
Counting with Counters uses the MapReduce framework's counters utility to compute a global sum entirely on the map side without necessarily producing any output. It is an efficient way of retrieving count summarizations of big data or large data sets.
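The map, sort-by-key and reduce flow described above can be sketched for the inverted index pattern in plain Python. This is a single-process illustration under stated assumptions, not Hadoop code: the function names map_phase, shuffle, reduce_phase and inverted_index are hypothetical, and the shuffle step only imitates what a real framework would do across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(doc_id: str, text: str) -> Iterable[Tuple[str, str]]:
    """Map: turn one document into (word, doc_id) key-value pairs."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Shuffle/sort: group values by key, standing in for routing to nodes."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(word: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Reduce: combine all values for one key into a single result."""
    return word, sorted(set(doc_ids))


def inverted_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Run map, shuffle and reduce in sequence over a small corpus."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical two-document corpus, used only for illustration.
docs = {"d1": "big data tools", "d2": "big data value"}
index = inverted_index(docs)
print(index["big"])    # ['d1', 'd2']
print(index["tools"])  # ['d1']
```

In a real cluster the map calls would run in parallel on separate input splits and the grouping would happen over the network; the sketch keeps everything in memory to show only the data flow.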
Distributed Grep is a common text filtering application that scans through a file line by line and outputs only the lines that match a particular pattern.
Top ten users by reputation is a MapReduce pattern used to determine the top ten records of a given data set. Every mapper finds the top ten records of its own input split and passes them to the reduce stage.
Amazon's S3 service lets users store large chunks of data on an online service. List some 5 features for Amazon's S3 service.
Access Control Lists (ACLs); CloudFront; versioning; object lifecycle management; hosting static websites; logging; and Reduced Redundancy Storage (RRS), tagging and pricing.
Getting concise, valuable information from a sea of data can be challenging. We need statistical analysis tools to deal with Big Data. Name and describe some (at least 3) statistical analysis tools.
Apache Hadoop is an open source analytical software tool specifically designed to handle large volumes of data. It comprises the Hadoop Distributed File System and MapReduce. It stores data by dividing files into large blocks and spreading them across nodes.
Apache Spark is an open source analytical framework built for cluster computing. It is often used as an alternative to MapReduce because it has the potential to analyse data up to 100 times faster for particular applications, and it is commonly used in data streaming, machine learning and interactive analysis.
Apache Hive is an analytical data processing engine that excels at batch processing of ETL and SQL queries and uses a query language called HiveQL (Padhy, Patra & Satapathy, 2011).
NoSQL Database is a non-relational, distributed database that allows rapid, ad hoc acquisition, organization, and analysis of very large volumes of disparate data types (Pokorny, 2013).
Exercise 6: Big Data Application (1 mark)
Name 3 industries that should use Big Data and justify your claim in 250 words for each industry using proper references.
The marketing industry should use big data. The use of big data has implications for a large number of practices and processes in marketing. Many definitions of marketing conceptualize it in terms of the four Ps: promotion, product, place and price (Linoff & Berry, 2011). Some professionals add an extra P for packaging. Big data can help businesses gain a better understanding of the particular needs and preferences of their clients, which is critical to creating the kind of packaging that attracts more clients and translates more effectively into sales. Also, by bringing together different real-time data sets, supplier and inventory data, models of consumers' probability of buying, and financial predictions, a business can develop dynamic pricing that enables it to quote different prices at different times, in different locations, to different clients, with a view to optimizing revenue. Moreover, one of the significant uses of big data is to derive product insights (Linoff & Berry, 2011). Businesses can easily undertake both qualitative and quantitative market analysis on the Internet at a relatively lower cost than two decades ago, when big data did not exist. Online survey tools coupled with videoconferencing facilities also make focus groups, surveys and other data gathering techniques with large sample sizes much easier to carry out (Linoff & Berry, 2011). Institutions can monitor their websites and social media for comments, complaints, mentions, and reviews of their products and services by consumers. Additionally, marketers can use big data to establish the best channels in which to place their brands, while helping consumers pinpoint the stores where they are most likely to purchase their products.
Besides that, the healthcare sector can use big data. Big data analytics are essential in linking financial, operational, and clinical operations in the industry.
Furthermore, the healthcare industry should use big data in conducting research to promote evidence-based practice. The sector can also use large datasets and big data analytics to predict clinical, behavioral and psychological outcomes in patients, thereby improving treatment. On top of that, data can be used to increase the accuracy and efficiency of detecting diseases, such as in clinical genome analysis. With careful data management, big data analytics can deliver accurate and reliable forecasts or detection of such diseases (Raghupathi & Raghupathi, 2014). Similarly, there are a number of healthcare-related information sources and databases that big data analytics could use to detect and forecast likely outbreaks of illness among the public. Big data makes this possible as the data is mapped out. These operations can improve healthcare because many types of clinical information, including genomics, patient data and clinical trials, can be effectively analysed by big data analytics to help healthcare practitioners make well-informed decisions (Raghupathi & Raghupathi, 2014). Moreover, the industry can find in big data a critical resource and a valuable tool that stimulates research, innovation and development in the healthcare sector. The increased application and usefulness of big data in the healthcare system can further the understanding of the complex processes and issues currently facing the sector.
Equally, institutions of education, especially in higher learning, are today dealing with a much larger number of internet-connected machines on their premises, a trend that is increasingly generating data that their conventional methods cannot manage (Kalota, 2015). This justifies the need for using big data in the education industry.
Schools and colleges have large datasets streaming in from all directions through digital applications, software-based and e-learning classroom activities and assessments, social media, blogs, and student surveys (Kalota, 2015). In addition, there is a growing surge of data from the government, parents and the public at large, with online benchmarking of learners, instructors and existing curriculum performance becoming more popular among all these groups. All these streams of data are exerting pressure on current IT infrastructure, leading CIOs to demand new, improved and effective architecture. Traditional software solutions, applications and databases, which have been around for several decades, are not specifically designed to cope with the current demand for data (Kalota, 2015). This may explain why a great many education institutions have yet to harness big data: they lack the capacity needed to realize its advantages. This challenge is bound to worsen as data rates and volumes continue to mount. Therefore, to meet the demand for data effectively, the education sector needs to fully embrace big data. Institutions that fail to adopt big data inevitably risk lagging behind their counterparts that do, as the potential of data-informed strategic decisions propels those counterparts forward.
References
Datascience@berkeley (n.d.). What is Data Science. Accessed on 12 May 2016, from: https://datascience.berkeley.edu/about/what-is-data-science/
Dyer, C., Cordova, A., Mont, A., & Lin, J. (2008, June). Fast, easy, and cheap: Construction of statistical machine translation models with MapReduce. In Proceedings of the Third Workshop on Statistical Machine Translation (pp. 199-207). Association for Computational Linguistics.
Garfinkel, S. (2007). An evaluation of Amazon's grid computing services: EC2, S3, and SQS.
Kalota, F. (2015). Applications of Big Data in Education.
World Academy of Science, Engineering and Technology, International Journal of Social, Behavioral, Educational, Economic, Business and Industrial Engineering, 9(5), 1567-1572.
Linoff, G. S., & Berry, M. J. (2011). Data mining techniques: for marketing, sales, and customer relationship management. John Wiley & Sons.
Padhy, R. P., Patra, M. R., & Satapathy, S. C. (2011). RDBMS to NoSQL: Reviewing some next-generation non-relational databases. International Journal of Advanced Engineering Science and Technologies, 11(1), 15-30.
Pokorny, J. (2013). NoSQL databases: a step to database scalability in web environment. International Journal of Web Information Systems, 9(1), 69-82.
Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 3.
Rainie, L., Smith, A., & Duggan, M. (2013). Coming and going on Facebook. Pew Research Center's Internet and American Life Project.
