# Looking into the Preparation of Data

Before the logistic regression and the decision tree can be estimated, the data must be preprocessed. First, the data set was converted from *.arff to *.csv with the ArffViewer tool in Weka so that it could be opened in Excel. A check of the Excel sheet found no missing values.

The z-score rule was then used for outlier detection. Six attributes were tested with the Z value: birthDate, zipCode, svcStartDt, incomeCode, peakMinAv and peakMinDiff. If the Z value is larger than 3 (Zj > 3), the original value is replaced by the mean plus 3 times the standard deviation of that attribute. If the Z value is smaller than -3 (Zj < -3), the original value is replaced by the mean minus 3 times the standard deviation of that attribute.

The curPlan attribute turned out to be a nominal variable whose values 1, 2, 3 and 4 correspond to different peak-minute plans, so it has to be recoded. In this paper, dummy variables were chosen for the coding: curPlan is replaced by four dummy attributes named Dummy1, Dummy2, Dummy3 and Dummy4. The value settings of the four new attributes are shown in the following table.

The adjusted data set was then opened in Weka and converted back to an *.arff file. Because the subscriberID attribute only holds the customers' ID numbers, it is not related to the churn attribute and was removed in the Weka Explorer. To make sure that all attributes except birthDate, zipCode, svcStartDt, incomeCode, peakMinAv and peakMinDiff are of nominal type, the NumericToNominal filter was applied to change the numeric values to nominal ones.

After that, discretization and grouping were carried out to group the values of the variables for a more robust analysis. Equal interval binning was adopted here. The six numeric attributes that had been tested with the z-score were discretized with the Discretize filter in Weka. With its default properties, each of the six attributes is divided into 10 equal-interval bins. Finally, the whole data set was split into two sets: a training set with the first two thirds of the data (2666 rows) and a test set with the remaining data (1334 rows).
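The outlier rule and the equal interval binning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Excel and Weka procedure itself: the function names `cap_outliers` and `equal_width_bins` are hypothetical, the population standard deviation is assumed, and the sample values are made up.

```python
from statistics import mean, pstdev


def cap_outliers(values, z_max=3.0):
    """Cap values whose z-score lies outside [-z_max, z_max]."""
    mu = mean(values)
    # Assumption: population standard deviation; the text does not say which one was used.
    sigma = pstdev(values)
    lower, upper = mu - z_max * sigma, mu + z_max * sigma
    # Clamping to [lower, upper] is the same as replacing Z > 3 by mean + 3*std
    # and Z < -3 by mean - 3*std.
    return [min(max(v, lower), upper) for v in values]


def equal_width_bins(values, n_bins=10):
    """Assign each value to one of n_bins equal-interval bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value is placed in the last bin rather than in a bin of its own.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical peak-minute values with one extreme observation.
minutes = [10.0] * 20 + [1000.0]
capped = cap_outliers(minutes)
bins = equal_width_bins(capped)
```

In this made-up sample only the extreme value is changed: it is pulled back to the mean plus three standard deviations, and the binning is then applied to the capped values.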

## Logistic regression

First, a logistic regression classifier was estimated on the training set, with the churn indicator as the target variable. In the Classify tab in Weka, the SimpleLogistic classifier was selected. To make the estimation more reliable, maxBoostingIterations was set to 1000. The test option was "Use training set". The results are shown in the output window; the main parameters are as follows.

According to the above table, the most predictive input is PeakMinDiff='(-∞ ~ -446.02]', with a parameter of -1.84. This means that, for observations in this interval, the log-odds decrease by 1.84, i.e. the odds are multiplied by exp(-1.84) ≈ 0.16. Of the four dummies, only Dummy2 shows a strong relationship with the target, with a parameter of 1.37.

In the results window, the confusion matrix can be found as follows. From it, the classification accuracy is 72.0930%, the sensitivity is 72.9242% and the specificity is 70.7171% on the training set, assuming a cut-off of 0.5. Using the test set through the "Supplied test set" option in Weka, the confusion matrix can be found as follows. According to the equations and the confusion matrix, the classification accuracy on the test set is 63.5682%, the sensitivity is 67.9675% and the specificity is 59.8053%. The ROC curve on the test set can also be drawn, and the area under the ROC curve (AUC) is 0.7153. Because Accuracy Ratio = 2 × AUC - 1, the accuracy ratio on the test set is 0.4306.
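The measures quoted above all follow from the four cells of a confusion matrix. The sketch below shows the arithmetic; since the confusion matrices themselves are not reproduced here, the counts passed to `classification_metrics` are hypothetical, and only the AUC of 0.7153 and the parameter of -1.84 come from the text.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of actual positives predicted positive
        "specificity": tn / (tn + fp),   # share of actual negatives predicted negative
    }


def accuracy_ratio(auc):
    """Accuracy Ratio = 2 * AUC - 1."""
    return 2 * auc - 1


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=80, fn=20, fp=30, tn=70)

# Values taken from the text.
ar_test = accuracy_ratio(0.7153)      # about 0.4306
odds_multiplier = math.exp(-1.84)     # about 0.16
```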

## Decision tree

To estimate a decision tree, the same data set was used again; however, the data do not need to be split into two sets beforehand. The J48 classifier was chosen in the Classify tab to estimate the tree, with a 66% percentage split as the test option. To prevent overfitting, pruning and the related property settings were adjusted. By varying confidenceFactor, numFolds and reducedErrorPruning, a smaller tree was found with confidenceFactor set to 0.1, numFolds set to 5 and reducedErrorPruning set to false. The tree has 25 leaves and a size of 33. According to the confusion matrix for the test set of the decision tree, the classification accuracy is 68.5294%, the sensitivity is 67.4822% and the specificity is 71.2401%. The AUC is 0.6967 and the AR is 0.3934.

## Comparison

The classification accuracy of the logistic regression on the test set (63.5682%) is lower than that of the decision tree (68.5294%). However, the AUC of the logistic regression (0.7153) is larger than that of the decision tree (0.6967). The most predictive input in the logistic regression is PeakMinDiff='(-∞ ~ -446.02]', with a parameter of -1.84, whereas the most predictive leaf in the decision tree is Dummy2 = 1: 0 (322.0/9.0), which in J48 notation means that 322 instances reach the leaf and 9 of them are misclassified. In the logistic regression, each parameter expresses the contribution of an attribute to the model. Comparing the training and test results shows that the classification accuracy on the training set (72.093%) is larger than that on the test set (63.5682%), which suggests that this logistic regression model does not generalize very well. The decision tree, in contrast, shows the most important branches directly; it is easy to interpret and has no parameters. However, the tree is still somewhat too large to be used as a rule set.

## Question 2

Title: A Typology of Irrigated Farms as a Tool for Sustainable Agricultural Development in Irrigation Schemes: The Case of the East Mitidja Scheme, Algeria

Authors: Khaled Laoubi and Masahiro Yamao

Citation: International Journal of Social Economics, Vol. 36, No. 8, 2009, pp. 813-831

## The data mining problem considered

In this article, the authors use data mining to propose a typology of irrigated farms. The typology is meant to connect the characterization of the irrigated lands with their structural and functional aspects. The East Mitidja (Hamiz) scheme in the Mitidja valley was selected as the data source, and the data come only from the water users and members of the irrigation agency ONID. A questionnaire was designed to collect the data. The samples were then divided by farm type into four groups: EAC, EAI, private and others. The structural characteristics of a farm were defined as landownership status (EAC, EAI and private farm), agricultural land area, groundwater assets, family labor, farm income, off-farm income, irrigation technique used, subsidies, type of marketing channel and farm equipment. The functional characteristics include the percentage of irrigated area out of total agricultural area (TAA), irrigated citrus area, irrigated orchard area, irrigated grape area, irrigated vegetable area, irrigated industrial culture area, irrigated greenhouse area, cereal area, abandoned area, the source of water used and the investment in the farm.

## The data mining techniques used

Two main data mining techniques were used in this article. The first is multiple correspondence analysis (MCA), an extension of correspondence analysis to more than two variables. Before the MCA was applied, the selected data were converted to class variables. The Euclidean cloud of points was then constructed and the number of axes was restricted. The second is ascendant hierarchical classification (AHC), a clustering method that is carried out after the MCA and groups the individuals by their factorial coordinates. The authors used Statistical Package for Social Science 15 for the data preparation and Portable System for Data Analysis Software v5.5 to apply the MCA and AHC.

## The results

The MCA showed that the first three factors explain more than 60 percent of the total variance. The first axis (32%) opposes the large EAC farms and the small farms. The second factor (23%) opposes the EAC farms and the farms whose family labor varies from 6 to 9. The third factor accounts for 10.21 percent; however, no supplementary variables were found to contribute to this axis.

In the AHC, class 1 represents 16.42% of the sample, and EAC farm land is its principal ownership type. Class 2 represents as much of the sample as class 1. Class 3 represents 5.22% of the sample; it includes all types of ownership, and its main characteristics are the use of water-saving irrigation technology and the conjunctive use of water resources. Class 4 is the smallest of all, at only 3.7%; its principal ownership type is private farms. Class 5 represents 15.67% and has the same principal ownership type as class 4; these farms have no extension services and no investments have been made in them. Class 6 represents 20.15% and consists mainly of EAC farms that are interested in and in contact with the extension service. Class 7 represents 22.39%, and its principal ownership type is private farms.

According to the results, the agricultural irrigation policies have not benefited all types of farms. Farmers of different types face different problems, so different measures are needed to solve them. The results also show that the agricultural and irrigation policies in Algeria have had different effects on the various types of farmers and factors.

## Critical discussion

The authors collected the data with a questionnaire, so the data might deviate from the actual situation. What is more, the article gives few reasons for the choice of characteristics, although different characteristics would lead to different results. The MCA did not play an important role in the article. The authors also divided the data by type of ownership without clearly explaining the reason. Finally, as the authors themselves write at the end, "the results fail to meet expectations" and "the results are insignificant in terms of the adoption of irrigation techniques".

## 1. FICO credit score

A credit score is a method of assigning a score so that a bank can rank loan applicants by risk. It is based on the historical performance of borrowers on their loans (Mester, 1997). FICO is short for the Fair Isaac Company, which developed the well-known software that helps other companies score their customers. The software was adopted as a standard by the three main credit bureaus: Experian, TransUnion and Equifax. The FICO credit score is the credit score calculated with this software. It ranges between 300 and 850; the lower the score, the higher the risk. The score is essentially made up of five parts: payment history (35%), total amounts owed (30%), length of credit history (15%), new credit (10%) and type of credit in use (10%). A person can usually obtain three different FICO credit scores, one from each of the three bureaus (Vohwinkle, 2010).

## 2. Unexpected loss in a Basel II context

In credit risk management, the loss that can occur if defaults exceed expectations and deviate from the average is called the "unexpected loss" (UL). It is considered a large but sustainable loss. Basel II is the revised capital adequacy framework released by the Basel Committee in June 2004. It is designed to measure the riskiness of banks' loan portfolios. Under Basel II, the credit risk capital is estimated with either the standardized approach or the internal ratings-based (IRB) approach (Resti and Sironi, 2007). The IRB approach is based on the expected and the unexpected loss (Prenio, 2005). There are four key components in credit risk management: probability of default (PD), loss given default (LGD), exposure at default (EAD) and maturity (M). The expected loss is EL = PD × LGD × EAD. What is more, Basel II wants bank capital to cover the whole amount of the value-at-risk (VaR = EL + UL). If a bank covers only the expected loss, Basel II imposes a capital charge for the unexpected loss, which fills the remaining VaR gap.
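The two identities above, EL = PD × LGD × EAD and VaR = EL + UL, can be illustrated with a short calculation. The figures below are hypothetical and serve only to show the arithmetic; the function names are illustrative and do not reproduce the Basel II risk-weight formulas.

```python
def expected_loss(pd, lgd, ead):
    """EL = PD x LGD x EAD, with PD and LGD given as fractions between 0 and 1."""
    if not (0.0 <= pd <= 1.0 and 0.0 <= lgd <= 1.0):
        raise ValueError("PD and LGD must lie between 0 and 1")
    return pd * lgd * ead


def unexpected_loss(value_at_risk, el):
    """UL is the part of the value-at-risk not covered by EL (VaR = EL + UL)."""
    return value_at_risk - el


# Hypothetical exposure: 2% default probability, 45% loss given default,
# 1,000,000 exposure at default, and an assumed value-at-risk of 50,000.
el = expected_loss(pd=0.02, lgd=0.45, ead=1_000_000)   # about 9,000
ul = unexpected_loss(value_at_risk=50_000, el=el)      # about 41,000
```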

## 3. Information Value of a variable

Information Value (IV) is a measure of predictive power that is used to assess the appropriateness of the classing and to select predictive variables. It is increasingly popular because it offers a good alternative for approximating non-linearity in the data. It is computed as follows:

$$IV = \sum_i (\text{Dist Good}_i - \text{Dist Bad}_i) \times \text{WOE}_i$$

where the sum runs over the classes of the variable and WOE is the weight of evidence of each class. The information value of a variable is thus relevant to its evaluation for modeling and analysis purposes. Hence, maximizing the information value of a variable is a recommendable criterion in data mining (Hababou, Cheng and Falk, 2006).
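As a small worked example of the formula, the sketch below computes the IV of one variable from the numbers of goods and bads in each class. It assumes the usual definition WOE = ln(Dist Good / Dist Bad), which the text above does not spell out, and it uses made-up counts; every class needs at least one good and one bad for the logarithm to be defined.

```python
import math


def information_value(goods, bads):
    """IV = sum over classes of (Dist Good - Dist Bad) * WOE.

    goods[i] and bads[i] are the counts of goods and bads in class i.
    Assumption: WOE_i = ln(Dist Good_i / Dist Bad_i).
    """
    total_goods, total_bads = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        dist_good = g / total_goods
        dist_bad = b / total_bads
        woe = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe
    return iv


# Hypothetical two-class variable.
iv_uninformative = information_value(goods=[50, 50], bads=[50, 50])  # 0.0
iv_informative = information_value(goods=[80, 20], bads=[20, 80])    # 1.2 * ln(4)
```

A variable whose classes contain goods and bads in the same proportions has an IV of zero, and the IV grows as the two distributions move apart.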

## 4. AUC based pruning

AUC is the area under the ROC curve, which provides a simple figure of merit for the performance of the constructed classifier. Pruning, as the name implies, involves removing branches of a decision tree to prevent the tree from overfitting. AUC-based pruning is a method of improving performance in this way. It starts from a model (e.g. logistic regression) with all n inputs. The next step is to remove each input in turn and re-estimate the model. Finally, the input whose removal gives the best AUC is dropped, and the procedure is repeated until the AUC decreases significantly. In general, the average AUC is the criterion for deciding whether a pruning step should be accepted. If the AUC after pruning is at least as large as before, the pruning is accepted. If not, the pruning method should be adjusted (McGovern and Jensen, 2008).
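The input-removal loop described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `evaluate_auc` stands for whatever routine re-estimates the model on a list of inputs and returns its AUC, the `tolerance` that defines a "significant" decrease is an assumed parameter, and the toy evaluator with its contribution values is entirely made up.

```python
def auc_based_pruning(inputs, evaluate_auc, tolerance=0.01):
    """Backward input removal driven by the AUC.

    evaluate_auc(list_of_inputs) must re-estimate the model and return its AUC.
    Removal stops when the best achievable AUC falls more than `tolerance`
    below the best AUC seen so far.
    """
    current = list(inputs)
    current_auc = evaluate_auc(current)
    reference_auc = current_auc
    while len(current) > 1:
        # Re-estimate the model once for each input left out in turn.
        scored = []
        for name in current:
            subset = [x for x in current if x != name]
            scored.append((evaluate_auc(subset), subset))
        best_auc, best_subset = max(scored, key=lambda item: item[0])
        if best_auc < reference_auc - tolerance:
            break  # AUC decreases significantly: keep the current inputs
        current, current_auc = best_subset, best_auc
        reference_auc = max(reference_auc, best_auc)
    return current, current_auc


# Toy evaluator (hypothetical): each input adds a fixed amount to the AUC.
contribution = {"a": 0.15, "b": 0.10, "noise1": 0.0, "noise2": 0.0}


def toy_auc(subset):
    return 0.5 + sum(contribution[name] for name in subset)


kept, final_auc = auc_based_pruning(["a", "b", "noise1", "noise2"], toy_auc)
```

With the toy evaluator the two uninformative inputs are removed first, and the loop stops as soon as removing either remaining input would lower the AUC by more than the tolerance.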

## a) The cut-off is 175

To calculate the required values, the confusion matrix must be drawn up first. The results then follow directly from the equations used in Question 1.

## b) All possible cut-offs

The possible cut-offs can be set at intervals of 10, from 45 to 345. Repeating the process from a) for each cut-off gives the following results table.

## c) & d) Kolmogorov-Smirnov Curve

To draw the Kolmogorov-Smirnov curve, two quantities must be calculated first, P(s|G) and P(s|B):

$$P(s|G) = \sum_{x \le s} p(x|G) \quad \text{(equals 1 - sensitivity)}$$

$$P(s|B) = \sum_{x \le s} p(x|B) \quad \text{(equals the specificity)}$$

The K-S curve can then be plotted in Excel on the basis of the above table. The Kolmogorov-Smirnov (KS) distance is

$$\text{KS} = \max_s \, |P(s|G) - P(s|B)|$$

so the KS distance is 0.5556, reached at a cut-off of 145.
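The KS distance can also be computed directly from the scores. The sketch below evaluates the two cumulative distributions P(s|G) and P(s|B) at every observed score and returns the largest gap together with the cut-off at which it occurs. The eight scores are hypothetical and are not the data of this question.

```python
def ks_distance(scores, is_good):
    """Return (KS distance, cut-off) for a scored sample.

    P(s|G) and P(s|B) are the shares of goods and bads with a score <= s.
    """
    goods = [s for s, g in zip(scores, is_good) if g]
    bads = [s for s, g in zip(scores, is_good) if not g]
    best_distance, best_cutoff = 0.0, None
    for s in sorted(set(scores)):
        p_good = sum(1 for x in goods if x <= s) / len(goods)
        p_bad = sum(1 for x in bads if x <= s) / len(bads)
        distance = abs(p_good - p_bad)
        if distance > best_distance:
            best_distance, best_cutoff = distance, s
    return best_distance, best_cutoff


# Hypothetical sample: four bads and four goods.
scores = [100, 110, 120, 150, 130, 140, 160, 170]
is_good = [False, False, False, False, True, True, True, True]
ks, cutoff = ks_distance(scores, is_good)  # 0.75 at a cut-off of 120
```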

## e) ROC Curve

The ROC curve is a two-dimensional graphical illustration of the sensitivity on the Y-axis versus (1 - specificity) on the X-axis for various values of the classification threshold. A straight line through (0, 0) and (1, 1) represents a classifier that guesses the class at random and serves as a benchmark. According to the above ROC curve, the area under the scorecard's curve is clearly much larger than that under the random scorecard. Hence, the scorecard performs better than a random scorecard. Moreover, the coordinates of the points on the curve can be used to estimate the area under the ROC curve, which gives an AUC of about 0.7869.
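Estimating the area from the coordinates of the points amounts to applying the trapezoid rule. The sketch below builds the ROC points from a scored sample and sums the trapezoids. It assumes that an applicant is classified as good when the score is above the cut-off, and the eight scores are hypothetical rather than the data of this question.

```python
def roc_points(scores, is_good):
    """Return sorted (1 - specificity, sensitivity) points, one per cut-off."""
    n_good = sum(1 for g in is_good if g)
    n_bad = len(is_good) - n_good
    points = [(1.0, 1.0)]  # cut-off below every score: everyone is classified as good
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, g in zip(scores, is_good) if g and s > cutoff)
        fp = sum(1 for s, g in zip(scores, is_good) if not g and s > cutoff)
        points.append((fp / n_bad, tp / n_good))
    return sorted(points)


def auc_trapezoid(points):
    """Area under a curve given as points sorted by x, using the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical sample: four bads and four goods.
scores = [100, 110, 120, 150, 130, 140, 160, 170]
is_good = [False, False, False, False, True, True, True, True]
auc = auc_trapezoid(roc_points(scores, is_good))   # 0.875
auc_random = auc_trapezoid([(0.0, 0.0), (1.0, 1.0)])  # 0.5, the benchmark line
```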

## f) CAP/Lorenz curve and Gini coefficient (accuracy ratio)

In the given data, there are 12 actual bads and 18 actual goods. A perfect model has an AR of 1, while a random model has an AR of 0. Because AR = 2 × AUC - 1, the scorecard has an AR of 57.38% (2 × 0.7869 - 1). The CAP curve can then be drawn as follows.
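The accuracy ratio can also be read off the CAP curve itself, as the area between the model's curve and the random line divided by the area between the perfect curve and the random line. The sketch below is one way to do this under stated assumptions: a lower score is taken to mean higher risk, ties are left in input order, and the four-applicant samples are hypothetical.

```python
def cap_points(scores, is_bad):
    """CAP curve: share of bads captured versus share of population, riskiest first."""
    # Assumption: a lower score means higher risk; ties keep their input order.
    ranked = [bad for _, bad in sorted(zip(scores, is_bad), key=lambda pair: pair[0])]
    n, n_bad = len(ranked), sum(ranked)
    points, captured = [(0.0, 0.0)], 0
    for i, bad in enumerate(ranked, start=1):
        captured += bad
        points.append((i / n, captured / n_bad))
    return points


def accuracy_ratio_from_cap(scores, is_bad):
    """AR = (area model - area random) / (area perfect - area random)."""
    points = cap_points(scores, is_bad)
    area_model = sum((x1 - x0) * (y0 + y1) / 2
                     for (x0, y0), (x1, y1) in zip(points, points[1:]))
    bad_rate = sum(is_bad) / len(is_bad)
    area_perfect = 1 - bad_rate / 2  # perfect model captures all bads first
    return (area_model - 0.5) / (area_perfect - 0.5)


# Hypothetical samples.
ar_perfect = accuracy_ratio_from_cap([100, 110, 150, 160], [True, True, False, False])
ar_mixed = accuracy_ratio_from_cap([100, 120, 140, 160], [True, False, True, False])

# Value from the text: AR = 2 * AUC - 1 with AUC = 0.7869.
ar_scorecard = 2 * 0.7869 - 1
```

In the second sample three of the four bad/good pairs are ranked correctly, so the AUC is 0.75 and the AR from the CAP curve is 0.5, in line with AR = 2 × AUC - 1.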

## g) Relationship

Setting the cut-off to different values yields different values for the classification accuracy, error rate, sensitivity and specificity. The K-S curve and the ROC curve are both based on these results; they are two different ways of measuring performance. The ROC plot, however, usually also includes the curve of the random scorecard, which is a fixed line. The CAP curve, finally, compares the predicted scores with the actual outcomes.


Looking into the preparation of data. (2017, Jun 26). Retrieved June 14, 2021 , from
https://studydriver.com/looking-into-the-preparation-of-data/

This paper was written and submitted by a fellow student
