2.1 Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning.
a. Supervised - Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers). The prior customers' known outcomes serve as labels.
b. Unsupervised - In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions. This is affinity (association rule) analysis with no outcome variable.
c. Supervised - Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to other packets whose threat status is known.
d. Unsupervised - Identifying segments of similar customers (clustering).
e. Supervised - Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and non-bankrupt firms.
f. Supervised - Estimating the repair time required for an aircraft based on a trouble ticket. This is a prediction task: historical tickets with known repair times provide the training labels.
g. Supervised - Automated sorting of mail by zip code scanning. Scanned digits are classified against examples of known digits.
h. Unsupervised - Printing of custom discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought previously (association rules).
2.3 Consider the sample from a database of credit applicants in Figure 2.13. Comment on the likelihood that it was sampled randomly, and whether it is likely to be a useful sample. The sample does not look random: the observation numbers are all multiples of 8, which suggests a systematic sample of every 8th record rather than a random draw. Whether it is useful depends on how the database is ordered. If the records are sorted by something related to credit risk (for example, application score or income), taking every 8th record could bias the sample; if the ordering is arbitrary, such as by application date or ID, the sample may still behave much like a random one, though a sample this small is of limited use on its own.
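The difference between the two sampling schemes can be sketched in plain Python. The record IDs below are hypothetical (the figure shows only a handful of rows); the point is that a systematic every-8th sample is perfectly evenly spaced, while a simple random sample is not.

```python
import random

# Hypothetical pool of 1000 applicant record IDs (illustrative only).
records = list(range(1, 1001))

# Systematic sample: every 8th record, as the observation numbers
# in Figure 2.13 suggest. Starts at record 8.
systematic = records[7::8]

# Simple random sample of the same size, for comparison.
random.seed(42)  # fixed seed so the sketch is reproducible
simple_random = sorted(random.sample(records, len(systematic)))

print(systematic[:5])     # [8, 16, 24, 32, 40] -- perfectly regular spacing
print(simple_random[:5])  # irregular spacing, as expected of a random draw
```

If the list were sorted by a risk-related variable, the systematic sample would sweep evenly across that variable, which can be harmless or harmful depending on the analysis.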
2.5 Using the concept of overfitting, explain why when a model is fit to training data, zero error with those data is not necessarily good. Zero error on the training data usually means the model has fit not only the underlying relationship but also the noise in that particular sample. Such an overfit model looks perfect on the data it was trained on but generalizes poorly, producing larger errors on new data. The goal is good performance on unseen data, not a perfect fit to the training set.

2.7 A dataset has 1000 records and 50 variables with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect would be removed? Each record has 50 values, so the probability that a record is complete is 0.95^50 ≈ 0.077. We would therefore expect about 1000 × (1 − 0.077) ≈ 923 records to be removed. Nearly all of the data would be discarded, which shows why dropping every record with any missing value is usually a poor strategy here.
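The arithmetic for 2.7 can be checked with a short calculation. The variable names are ours; the assumption, as the exercise states, is that missing values fall independently at random across cells.

```python
# Expected number of records removed when dropping any record
# that contains at least one missing value.
n_records = 1000
n_vars = 50
p_missing = 0.05  # each cell is missing independently with probability 5%

# P(a given record has no missing values at all)
p_complete = (1 - p_missing) ** n_vars

expected_removed = n_records * (1 - p_complete)

print(round(p_complete, 4))    # ~0.0769
print(round(expected_removed)) # ~923
```

So only about 77 of the 1000 records would be expected to survive, confirming that record-wise deletion is wasteful when missingness is scattered thinly across many variables.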