wensui's profileWENSUI'S BLOG IN STATIST...PhotosBlogListsMore ![]() | Help |
WENSUI'S BLOG IN STATISTICAL COMPUTINGTough Times Never Last. But Tough People Do. - Robert Schuller RealMatrix++: A Real-Valued Matrix Class and LibraryRealMatrix++ A Real-Valued Matrix Class and Library Daniel Hanson Email: HoserAnalytics (at) gmail (dot) com June 1, 2009 Introduction and Purpose A frequent frustration for quant development in finance is the lack of a matrix class and matrix algebra library in the C++ Standard Library. Third party libraries do exist, both open source and commercial, but open source libraries are often time consuming to learn and implement, and commercial libraries can carry a heavy price tag. Furthermore, for quant finance, as in other domains such as predictive modeling, the vast majority of linear algebra calculations are real-valued. The intent of the RealMatrix++ Library is to provide an easy-to-use, non-templated matrix class and library for real-valued matrices and vectors, with the ultimate goal of providing functionality on par with that of the JAMA library for Java [1]. The code is not optimized for performance to the extent of, say, the Matrix Template Library [2], or Blitz [3], but it will likely offer more rapid development, and performance sufficient at least for prototyping models. Current Release Features and Design This initial release contains a real matrix class, and matrix operations +, -, and *, where (*) is overloaded for standard matrix multiplication, vector by matrix and matrix by vector multiplication, and scalar multiplication (see MatrixOperations.h for details). There is also a set of validation conditions outlined in MatrixConditions.h, which will check if two matrices are of the same dimensions (necessary for matrix addition and subtraction), if a matrix is square, or if a matrix is symmetric. The Matrix class itself is built as a wrapper around the valarray container in the C++ standard library. As Stroustrup[4] says, "The standard library does not provide a matrix class. Instead, the intent is for valarray to provide the tools for building matrices optimized for a variety of needs." Also, as Josuttis [5] points out, "a valarray can be used as a base both for vector and matrix operations.with good performance." On the other hand, he notes that "you often need some inconvenient and time-consuming type conversions." This relates primarily to the need to create a separate valarray instance to use the valarray slice function. For this reason, its use is avoided in the Matrix class here, although it may be used minimally going forward, in cases where the benefit outweighs the inconvenience. The current release has been unit tested, but not completely stress tested. It should work as designed, but at this stage it should probably be best considered a beta release. Your comments are invited (email address given at the top of this document). Future Additions/Enhancements The following functionality is being considered for future releases (not necessarily an exhaustive list): Matrix Decompositions o Cholesky Decomposition of square and symmetric matrices o Singular Value Decomposition o LU Decomposition Eigenvalue calculations Principal component analysis Solutions of linear systems Matrix inverse Licensing This software is released under the terms of the BSD License [6]: Copyright (c) 2009, Daniel Hanson All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the copyright holder nor the names of any contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. References [1] http://math.nist.gov/javanumerics/jama/ [2] http://www.osl.iu.edu/research/mtl/ [3] http://www.oonumerics.org/blitz/ [4] Bjarne Stroustrup, The C++ Programming Language, 3E (Addison Wesley) [5] Nicolai Josuttis, The C++ Standard Library: A Tutorial and Reference (Addison Wesley) [6] http://www.opensource.org/licenses/bsd-license.php A MACRO TO IMPUTE NUMERIC VARIABLES BASED UPON BAD RATESlibname data 'E:\sas'; data test; set data.credit; if ranuni(2) < 0.1 then x2 = .; if ranuni(3) < 0.3 then x3 = .; if ranuni(4) < 0.2 then x4 = .; run; proc means data = test n nmiss nonobs; class y; var x2 - x4; run; /* N Y Variable Label N Miss 0 X2 X2 5217 591 X3 X3 4011 1797 X4 X4 4675 1133 1 X2 X2 343 29 X3 X3 268 104 X4 X4 311 61 */ %macro impute(data = , y = , x = , outfile = ); *********************************************; * THIS MACRO IS TO IMPUTE NUMERIC VARIABLES *; * BASED UPON BAD RATES *; * ----------------------------------------- *; * PARAMETERS: *; * DATA : INPUT DATA SET *; * Y : RESPONSE WITH 0 / 1 *; * X : A LIST OF NUMERIC PREDICTORS *; * WITH MISSING VALUES *; * OUTFILE: FILEREF FOR IMPUTING RECODE *; *********************************************; %if %sysfunc(fexist(&outfile)) %then %do; data _null_; rc = fdelete("&outfile"); if rc = 0 then put "THE FILE HAS BEEN DELETE."; run; %end; ods output position = _pos1; ods listing close; proc contents data = &data varnum; run; ods listing; proc sql noprint; select upcase(variable) into :x2 separated by ' ' from _pos1 where compress(upcase(type), ' ') = 'NUM' and index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0; select count(variable) into :xcnt from _pos1 where compress(upcase(type), ' ') = 'NUM' and index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0; quit; data _tmp1; retain &x2 &y; set &data; where &y in (0, 1); keep &x2 &y; run; ods output position = _pos2; ods listing close; proc contents data = _tmp1 varnum; run; ods listing; %let loop = 0; %do i = 1 %to &xcnt; proc sql noprint; select upcase(variable) into :var from _pos2 where num = &i; select count(distinct &var) into :vcnt from _tmp1 where &var ~= . and &y in (1, 0); select nmiss(&var) / count(*) * 100 into :mpct from _tmp1 where &y in (1, 0); select count(distinct &y) into :yflg from _tmp1 where &y in (1, 0); quit; %if &vcnt > 1 & &mpct > 0 & &mpct < 99 & &yflg = 2 %then %do; %let loop = %eval(&loop + 1); proc sql noprint; select mean(&y) format = 10.6 into :mis_rate from _tmp1 where &var = . and &y in (0, 1); select count(*) into :mcnt from _tmp1 where &var = . and &y in (1, 0); create table _tmp2 as select distinct "%upcase(%trim(&var))" as variable, &var as value, mean(&y) as bad_rate, &mis_rate as mis_rate, count(*) as freq, &mcnt as mis_cnt, abs(mean(&y) - &mis_rate) as dif from _tmp1 where &var ~= . and &y in (0, 1) group by &var; quit; proc sort data = _tmp2 sortsize = max; by dif descending freq; run; data _tmp3; set _tmp2; by dif descending freq; if _n_ = 1 then output; run; %if &loop = 1 %then %do; data _result; set _tmp3; run; %end; %else %then %do; data _result; set _result _tmp3; run; %end; %end; %end; data _null_; set _result end = eof; file impute mod; if _n_ = 1 then do; put @1 93 * "*" ";"; put @1 "* IMPUTED VARIABLE" @32 "| IMPUTED VALUE" @52 "| BAD RATE" @65 "| MIS. COUNT" @80 "| MIS. RATE" @93 "*;"; end; put @1 "*" 91 * "-" @93 "*;"; put @1 "*" @3 variable @32 "|" @34 value @52 "|" @54 bad_rate percent8.2 @65 "|" @67 mis_cnt @80 "|" @82 mis_rate percent8.2 @93 "*;"; if eof then do; put @1 93 * "*" ";"; end; run; data _null_; set _result; file impute mod; if _n_ = 1 then do; put " "; put @10 3 * "*" " RECODING OF IMPUTATION " 3 * "*" ";"; end; put @3 "if " @6 variable " = . " @40 "then " variable " = " value ";"; run; %mend impute; filename impute 'E:\sas\impute.sas'; %impute(data = test, y = y, x = x2 x3 x4, outfile = impute); /* $ more impute.sas **************************************************************; *IMPUTED VARIABLE|IMPUTED VALUE|BAD RATE|MIS. COUNT|MIS. RATE*; *------------------------------------------------------------*; * X2 | -0.68 | 4.55%| 620 | 4.68% *; *------------------------------------------------------------*; * X3 | -0.685 | 5.67%| 1901 | 5.47% *; *------------------------------------------------------------*; * X4 | 1.728 | 5.00%| 1194 | 5.11% *; **************************************************************; *** RECODING OF IMPUTATION ***; if X2 = . then X2 = -0.683 ; if X3 = . then X3 = -0.685 ; if X4 = . then X4 = 1.728 ; */ *** SHOW HOW TO USE THE IMPUTING RECODE ***; data test2; set test; %inc impute / source2; run; proc means data = test2 n nmiss nonobs; class y; var x2 - x4; run; /* N Y Variable Label N Miss 0 X2 X2 5808 0 X3 X3 5808 0 X4 X4 5808 0 1 X2 X2 372 0 X3 X3 372 0 X4 X4 372 0 */ READ FIXED-WIDTH DATA FILE WITH read.fwf()############################################## # READ FIXED-WIDTH DATA FILE WITH read.fwf() # # ------------------------------------------ # # EQUIVALENT SAS CODE: # # filename data 'E:\sas\fixed.txt'; # # data test; # # infile data truncover; # # input @1 city $ 1 - 22 @23 population; # # run; # ############################################## # OPEN A CONNECTION TO THE DATA FILE data <- file(description = "e:\\sas\\fixed.txt", open = "r") # width = c(...) ==> SPECIFIES COLUMN WIDTHS # col.names = c(...) ==> GIVES COLUMN NAMES # colClasses = c(...) ==> DEFINES COLUMN CLASSES test <- read.fwf(data, header = FALSE, width = c(22, 10), col.names = c("city", "population"), colClasses = c("character", "numeric")) close(data) READ DATA FROM A FILE CONNECTION WITH SCAN()################################################# # READ DATA FROM A FILE CONNECTION WITH SCAN() # #-----------------------------------------------# # EQUIVALENT SAS CODE: # # filename data 'E:\sas\Kyphosis'; # # data test; # # infile data; # # input x1 x2 x3 y @@; # # run; # ################################################# # OPEN A CONNECTION TO THE DATA FILE data <- file(description = "e:\\sas\\kyphosis", open = "r") # READ DATA FROM THE CONNECTION test <- data.frame(scan(data, what = list(x1 = 0, x2 = 0, x3 = 0, y = ""), quiet = TRUE)) # CLOSE THE CONNECTION close(data) EVALUATION OF Logistic() FUNCTION IN RWeka############################################## # EVALUATION OF Logistic() FUNCTION IN RWeka # ############################################## library(rpart) data(kyphosis) # MODEL FROM glm() FUNCTION model1 <- glm(Kyphosis ~., data = kyphosis, family = binomial()) # PRINT OUT MODEL SUMMARY summary(model1) # Estimate Std. Error z value Pr(>|z|) # (Intercept) -2.036934 1.449575 -1.405 0.15996 # Age 0.010930 0.006446 1.696 0.08996 . # Number 0.410601 0.224861 1.826 0.06785 . # Start -0.206510 0.067699 -3.050 0.00229 ** # MODEL FROM Logistic() FUNCTION IN RWeka library(RWeka) model2 <- Logistic(Kyphosis ~., data = kyphosis) # PRINT OUT ESTIMATES print(model2) # Class # Variable absent # ==================== # Age -0.0109 # Number -0.4106 # Start 0.2065 # Intercept 2.0369 # MODEL EVALUATION BASED ON 5-FOLD CROSS-VALIDATION summary(logit, newdata = kyphosis, class = TRUE, numFolds = 5, seed = 2009) # === 5 Fold Cross Validation === # === Summary === # Correctly Classified Instances 65 80.2469 % # Incorrectly Classified Instances 16 19.7531 % # Kappa statistic 0.3481 # Mean absolute error 0.2503 # Root mean squared error 0.3624 # Relative absolute error 74.2609 % # Root relative squared error 88.868 % # Total Number of Instances 81 # === Detailed Accuracy By Class === # TP Rate FP Rate Precision Recall F-Measure ROC Area Class # 0.906 0.588 0.853 0.906 0.879 0.835 absent # 0.412 0.094 0.538 0.412 0.467 0.835 present # Weighted Avg. 0.802 0.484 0.787 0.802 0.792 0.835 # === Confusion Matrix === # a b <-- classified as # 58 6 | a = absent # 10 7 | b = present DO DATA INPUT IN R CONSOLE WITH SCAN()############################################ # DO DATA INPUT IN R CONSOLE WITH SCAN() # #------------------------------------------# # COMPARABLE SAS CODE: # #------------------------------------------# # data test; # # input x1 x2 x3 y $ @@; # # cards; # # 71 . 3 0 158 14 3 0 # # 128 5 4 1 # # ; # # run; # ############################################ test <- data.frame(scan(file = "", what = list(x1 = 0, x2 = 0, x3 = 0, y = ""))) 71 NA 3 0 158 14 3 0 128 5 4 1 A MACRO TO DO MONOTONIC WOE TRANSFORMATION FOR THE NUMERIC PREDICTORlibname data 'E:\sas'; data test; set data.credit; keep y x2 x3 x8 x9; run; %macro monowoe(data = , y = , x = , maxbin = ); ******************************************************; * THIS MACRO IS TO DO A MONOTONIC WEIGHT OF EVIDENCE *; * (WOE) TRANSFORMATION FOR THE NUMERIC PREDICTOR. *; *----------------------------------------------------*; * PARAMETERS: *; * DATA : INPUT DATA *; * Y : GOOD/BAD RESPONSE (BAD ==> 1) *; * X : A NUMERIC PREDICTOR *; * MAXBIN: MAX. NUMBER OF BINS USED TO CALCULATE WOE *; ******************************************************; %let minbad = 10; data _tmp1(keep = &y &x); set &data; where &y in (0, 1) and &x ~= .; run; proc sql noprint; select min(&maxbin, count(distinct &x)) into :nbin from _tmp1; select count(*) into :cnt from _tmp1; select sum(&y) into :bad from _tmp1; quit; %do i = &nbin %to 2 %by -1; proc rank data = _tmp1 out = _tmp2 ties = low groups = &i; var &x; ranks rank; run; proc summary data = _tmp2 nway; class rank; output out = _tmp3(drop = _type_ rename = (_freq_ = freq)) sum(&y) = bad mean(&y) = bad_rate mean(&x) = x min(&x) = minx max(&x) = maxx; run; proc sql noprint; select min(bad) into :minbad2 from _tmp3; select min(bad_rate) into :minrate from _tmp3; select max(bad_rate) into :maxrate from _tmp3; quit; %if &minbad2 >= &minbad & &minrate > 0 & &maxrate < 1 %then %do; ods output SpearmanCorr = _corr(rename = (x = cor)); ods listing close; proc corr data = _tmp3 spearman; var x; with bad_rate; run; ods listing; proc sql noprint; select abs(cor) into :cor from _corr; quit; %if &cor = 1 %then %goto outside; %end; %end; %outside: proc sort data = _tmp3; by rank; run; data _tmp4; retain bin minx maxx bad freq bad_rate gpct bpct woe iv; set _tmp3; by rank; bin + 1; bpct = bad / &bad; gpct = (freq - bad) / (&cnt - &bad); woe = log(gpct / bpct); iv = (gpct - bpct) * log(gpct / bpct); keep bin minx maxx bad freq bad_rate gpct bpct woe iv; run; options nocenter nodate ls = 100; title; proc report data = _tmp4 box spacing = 1 split = "*"; column("MOTONOTIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(&x)" bin minx maxx bad freq woe iv); define bin / "BIN*LEVEL" width = 5 format = z2. order order = data; define minx / "LOWER*BOUND" width = 10 format = 9.2; define maxx / "UPPER*BOUND" width = 10 format = 9.2; define bad / "#BADS" width = 10 format = 9.; define freq / "#FREQ" width = 10 format = 9.; define woe / "WOE" width = 10 format = 10.4; define iv / "INFO.*VALUE" width = 8 format = 8.4; run; %mend monowoe; options mprint mlogic nocenter; %monowoe(data = test, y = y, x = x3, maxbin = 20); /* +---------------------------------------------------------------------+ | MOTONOTIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X3 | | BIN LOWER UPPER INFO.| |LEVEL BOUND BOUND #BADS #FREQ WOE VALUE| |---------------------------------------------------------------------| | 01| -1.19| -0.81| 134| 1576| -0.3722| 0.0417| |---------------------------------------------------------------------| | 02| -0.69| -0.31| 111| 1693| -0.0912| 0.0024| |---------------------------------------------------------------------| | 03| -0.18| 0.32| 67| 1419| 0.2565| 0.0135| |---------------------------------------------------------------------| | 04| 0.45| 2.97| 60| 1492| 0.4244| 0.0362| +---------------------------------------------------------------------+ */ THE EQUIVALENT FUNCTION OF PROC RANK WITH WEIGHT VARIABLE***********************************************; * A DEMO ON HOW TO DO THE EQUIVALENT FUNCTION *; * OF PROC RANK WITH WEIGHT VARIABLE *; * ------------------------------------------- *; * SPECIAL THANKS TO MY FRIEND AND COWORKER *; * Mr. Jonas Bilenas *; ***********************************************; data one; do i = 1 to 1000; x = rannor(1); if x < 0 then wt = 1; else wt = 10; output; end; run; %macro rank(data = , var = , weight = , groups = , output = ); data _tmp1(keep = &var &weight); set &data; run; proc univariate data = &data noprint; var &var; weight &weight; output out = _tmp2 pctlpre = decile_ pctlpts = 0 to 100 by %sysevalf(100 / &groups); run; data &output(keep = &var lower upper rank &weight); set _tmp1; if _n_ = 1 then do; set _tmp2; end; array d[*] decile_:; rank = 999; do i = 2 to dim(d); if &var <= d[i] and rank = 999 then do; lower = round(d[i - 1], 0.0001); upper = round(d[i], 0.0001); rank = i - 1; end; end; run; %mend rank; %rank(data = one, var = x, weight = wt, groups = 4, output = two); proc means data = two nonobs n min max missing sumwgt; class rank; var x; weight wt; run; /* OUTPUT: rank N Minimum Maximum Sum Wgts ------------------------------------------------------------- 1 591 -3.3158398 0.2201053 1365.00 2 137 0.2276076 0.6209279 1370.00 3 136 0.6211922 1.0996207 1360.00 4 136 1.1021029 3.2444009 1360.00 */ A SAS MACRO TO CREATE FREQUENCY SUMMARY TABLE*************************************************; * A MACRO TO CREATE FREQUENCY SUMMARY TABLE FOR *; * ALL INPUT CATEGORICAL VARIABLES *; *************************************************; libname data 'z:\XXX'; %let cat = x10 - x11; data tmp1; set data.credit_train; run; %macro freq(data = , y = , cat_x = ); data _tmp1; retain &cat_x &y; set &data; where &y in (0, 1); keep &cat_x &y; run; ods listing close; ods output position = _pos; proc contents data = _tmp1 varnum; run; ods listing; proc sql noprint; select count(*) - 1 into :nx from _pos; select variable into :xlist separated by ' ' from _pos where num <= &nx; select count(*) into :obs from _tmp1; run; %do i = 1 %to &nx; %let x2 = %scan(&xlist, &i, ' '); data _tmp2(keep = i var level &y); set _tmp1; i = &i; var = upcase("&x2"); level = input(&x2, $50.); run; proc summary data = _tmp2 nway missing; class i var level; output out = _sum1(drop = _type_ rename = (_freq_ = freq)) sum(&y) = bads mean(&y) = bad_rate; run; %if &i = 1 %then %do; data _sum2; set _sum1; run; %end; %else %do; data _sum2; set _sum2 _sum1; run; %end; %end; data _final; set _sum2; if compress(level, ' ') = '' then level = '999. MISSING'; pct = freq / &obs; run; options nocenter nodate; title; proc report data = _final box spacing = 2 headskep split = "*"; column("FREQUENCY SUMMARY OF CATEGORICAL VARIABLES* " var level freq pct bads bad_rate); define var / "ATTRIBUTE" order order = data width = 20 format = $20.; define level / "LEVEL" width = 20 format = $20.; define freq / "FREQUENCY" width = 10 format = 10.; define pct / "PERCENT" width = 10 format = percent10.4; define bads / "# BADS" width = 10 format = 9.; define bad_rate / "BAD RATE" width = 10 format = percent10.4; run; %mend freq; %freq(data = tmp1, y = y, cat_x = &cat); /* ----------------------------------------------------------------------- | FREQUENCY SUMMARY OF CATEGORICAL VARIABLES | | | |ATTRIBUTE LEVEL FREQUENCY PERCENT # BADS BAD RATE| | | |---------------------------------------------------------------------| |X10 | 1 | 1352| 28.8889% | 115| 8.5059% | | |----------+-----------+-----------+-----------+-----------| | | 2 | 3328| 71.1111% | 171| 5.1382% | |----------+----------+-----------+-----------+-----------+-----------| |X11 | 1 | 825| 17.6282% | 101| 12.2424% | | |----------+-----------+-----------+-----------+-----------| | | 2 | 3855| 82.3718% | 185| 4.7990% | ----------------------------------------------------------------------- */ LOGISTIC MODEL IN ENTERPRISE MINER WITH DMREG PROCEDURE*************************************************; * LOGISTIC MODEL IN ENTERPRISE MINER WITH *; * DMREG PROCEDURE *; *************************************************; libname data 'z:\XXX'; data tmp1; set data.credit_train; run; *** REGULAR WAY TO MODEL LOGISTIC REGRESSION ***; proc logistic data = tmp1 desc; class x10 x11 x13 x17 x18 X19 X22 / param = glm; model y = X2 X3 X8 X9 x10 x11 x13 x17 x18 X19 X22; run; /* Type 3 Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq X2 1 5.2833 0.0215 X3 1 26.7384 <.0001 ... ... Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -2.9224 0.4685 38.9145 <.0001 X2 1 0.2162 0.0941 5.2833 0.0215 X3 1 -0.4701 0.0909 26.7384 <.0001 ... ... */ *** E-MINER WAY TO MODEL LOGISTIC REGRESSION ***; proc dmdb data = tmp1 out = tmp_db dmdbcat = tmp_cl; class x10 x11 x13 x17 x18 X19 X22 y(desc); var X2 X3 X8 X9; target y; run; proc dmreg data = tmp_db dmdbcat = tmp_cl; class x10 x11 x13 x17 x18 X19 X22 y; model y = X2 X3 X8 X9 x10 x11 x13 x17 x18 X19 X22 / coding = glm; code file = 'z:\XXX\score_code.sas'; run; /* Type 3 Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq X2 1 5.2832 0.0215 X3 1 26.7383 <.0001 ... ... Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -2.9224 0.4685 38.92 <.0001 X2 1 0.2162 0.0941 5.28 0.0215 X3 1 -0.4701 0.0909 26.74 <.0001 ... ... */ MODEL SEARCH BY BUMPING*************************************************; * REFERENCE: *; * model search and inference by bootstrap *; * -- Tibshirani and Knight (1997) *; *************************************************; libname data 'z:\XXXX'; %include 'z:\XXXX\ks.sas'; %include 'z:\XXXX\roc.sas'; data train; set data.credit_train; run; proc sql noprint; select count(*) into :size from train; quit; %macro bumping; %let seed = 2009; %let n = 500; data _null_; do i = 1 to &n; random = put(ranuni(&seed) * (10 ** 8), 8.); name = compress("random"||put(i, 3.), ' '); call symput(name, random); end; run; proc datasets library = data nolist; delete catalog / memtype = catalog; run; quit; %do i = 1 %to &n; %put &&random&i; proc surveyselect data = train method = urs n = &size seed = &&random&i out = sample&i(rename = (NumberHits = _hits)) noprint; run; proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i; class x10 - x24 y; var x2 - x9; target y; freq _hits; run; filename out_tree catalog "data.catalog.out_tree.source"; proc split data = db_sample&i dmdbcat = cl_sample&i criterion = gini assess = impurity maxbranch = 2 splitsize = 100 subtree = 30 nsurrs = 0; code file = out_tree; input x2 - x9 / level = interval; input x10 - x24 / level = nominal; target y / level = binary; freq _hits; run; filename in_tree catalog "data.catalog.tree&i..source"; data _null_; infile out_tree; input; file in_tree; if _n_ > 3 then put _infile_; run; data _tmp1(keep = p_y1 p_y0 y random); set data.credit_train; random = ranuni(1); %include in_tree; run; proc sort data = _tmp1 sortsize = max; by descending p_y1 random; run; data _tmp2; set _tmp1; where p_y1 ~= . and y in (0, 1); by descending p_y1 random; i + 1; run; proc rank data = _tmp2 out = _tmp3 groups = 10; var i; run; proc sql noprint; create table _tmp4 as select i + 1 as decile, count(*) as cnt, sum(y) as bad_cnt from _tmp3 group by i; select sum(cnt) into :cnt from _tmp4; select sum(bad_cnt) into :bad_cnt from _tmp4; quit; data _tmp5; set _tmp4; retain cum_cnt cum_bcnt cum_gcnt; bpct = bad_cnt / cnt; cum_cnt + cnt; cum_bcnt + bad_cnt; cum_gcnt + (cnt - bad_cnt); cum_pct = cum_cnt / &cnt; cum_bpct = cum_bcnt / &bad_cnt; cum_gpct = cum_gcnt / (&cnt - &bad_cnt); ks = (max(cum_bpct, cum_gpct) - min(cum_bpct, cum_gpct)) * 100; run; proc sql noprint; select max(ks) into :ks from _tmp5; quit; data _kstmp; tree = &i; ks = &ks; run; %if &i = 1 %then %do; data _ks; set _kstmp; run; %end; %else %do; data _ks; set _ks _kstmp; run; %end; %end; proc sql noprint; select max(ks) into :ks from _ks; select tree into :best from _ks where round(ks, 0.01) = round(&ks., 0.01); quit; filename best catalog "data.catalog.tree%trim(&best).source"; filename output "z:\XXXX\BestTree.txt"; data _null_; infile best; input; file output; if _n_ = 1 then do; put " ******************************************************; "; put " ***** BEST TREE: TREE %trim(&best) WITH KS = &KS *****; "; put " ******************************************************; "; end; put _infile_; run; %mend bumping; %bumping; data valid(keep = p_y1 p_y0 y); set data.credit_valid; %include output; run; %ks(data = valid, score = p_y1, y = y); %roc(data = valid, score = p_y0, y = y); HERE COMES THE SIMPSON RULEfunction [result] = simpson(f, a, b, N) %*************************************************************** %* A FUNCTION OF NUMERICAL INTEGRATION WITH SIMPSON RULE * %*=============================================================* %* PARAMETERS: * %* f: A FUNCTION INPUT * %* a: LOWER BOUND OF THE FUNCTION INPUT * %* b: UPPER BOUND OF THE FUNCTION INPUT * %* N: # OF INTERVALS * %*=============================================================* %* AN EXAMPLE: * %* y = simpson('sin(x)', 0, 1, 1000); * %*=============================================================* %* AUTHOR: Wensui Liu (liuwensui@hotmail.com) * %*************************************************************** dx = (b - a) / N; func = inline(f); result = 0; for i = 1:N min = a + (i - 1) * dx; med = a + (i - 1 / 2) * dx; max = a + i * dx; res = dx / 6 * (func(min) + 4 * func(med) + func(max)); result = result + res; end return BAGGING CART (CLASSIFICATION AND REGRESSION TREE)libname data 'z:\XXX'; data train; set data.credit_train; run; proc sql noprint; select count(*) into :size from train; quit; %macro bagging; %let seed = 2009; %let n = 100; data _null_; do i = 1 to &n; random = put(ranuni(&seed) * (10 ** 8), 8.); name = compress("random"||put(i, 3.), ' '); call symput(name, random); end; run; %do i = 1 %to &n; %put &&random&i; proc surveyselect data = train method = urs n = &size seed = &&random&i out = sample&i(rename = (NumberHits = _hits)) noprint; run; proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i; class x10 - x24 y; var x2 - x9; target y; freq _hits; run; filename out_tree catalog "data.catalog.out_tree.source"; proc split data = db_sample&i dmdbcat = cl_sample&i criterion = gini assess = impurity maxbranch = 2 splitsize = 100 nsurrs = 0; code file = out_tree noleafid; input x2 - x9 / level = interval; input x10 - x24 / level = nominal; target y / level = binary; freq _hits; run; filename in_tree catalog "data.catalog.tree&i..source"; data _null_; infile out_tree; input; file in_tree; if _n_ > 3 then put _infile_; run; data train_score&i(keep = id p_y1 p_y0 y); set data.credit_train; %include in_tree; run; data valid_score&i(keep = id p_y1 p_y0 y); set data.credit_valid; %include in_tree; run; %if &i = 1 %then %do; data train_score; set train_score&i; run; data valid_score; set valid_score&i; run; %end; %else %do; data train_score; set train_score train_score&i; run; data valid_score; set valid_score valid_score&i; run; %end; %end; proc summary data = train_score nway; class id; output out = train_predict(drop = _type_ rename = (_freq_ = freq)) mean(p_y1) = p_1 mean(p_y0) = p_0 mean(y) = y; run; proc summary data = valid_score nway; class id; output out = valid_predict(drop = _type_ rename = (_freq_ = freq)) mean(p_y1) = p_1 mean(p_y0) = p_0 mean(y) = y; run; %mend bagging; %bagging; data data.scored; set train_predict(in = a) valid_predict(in = b); if a then train = 1; else train = 0; run; proc freq data = data.scored; tables train * y / list missing; run; %include 'z:\XXXX\ks.sas'; %include 'z:\XXXX\roc.sas'; data valid train; set data.scored; if train = 1 then output train; else if train = 0 then output valid; run; title 'KS statistics for trainning dataset'; %ks(data = train, score = p_1, y = y); title 'KS statistics for validation dataset'; %ks(data = valid, score = p_1, y = y); title 'ROC statistics for trainning dataset'; %roc(data = train, score = p_0, y = y); title 'ROC statistics for validation dataset'; %roc(data = valid, score = p_0, y = y); /* OUTPUT KS FOR TRAINING DATA CU. CU. MIN MAX DECILE FREQ BAD# BAD% FREQ BAD# CU. BAD% SCORE SCORE %OF BAD KS 1 468 156 33.33% 468 156 33.33% 0.13 0.35 54.55% 47.44 2 468 53 11.32% 936 209 22.33% 0.09 0.13 73.08% 56.53 3 468 34 7.26% 1404 243 17.31% 0.07 0.09 84.97% 58.54 4 468 26 5.56% 1872 269 14.37% 0.06 0.07 94.06% 57.57 5 468 9 1.92% 2340 278 11.88% 0.05 0.06 97.20% 50.28 6 468 7 1.50% 2808 285 10.15% 0.04 0.05 99.65% 42.23 7 468 0 0.00% 3276 285 8.70% 0.03 0.04 99.65% 31.58 8 468 0 0.00% 3744 285 7.61% 0.02 0.03 99.65% 20.93 9 468 1 0.21% 4212 286 6.79% 0.02 0.02 100.00% 10.65 10 468 0 0.00% 4680 286 6.11% 0.00 0.02 100.00% 0.00 KS FOR VALIDATION DATA CU. CU. MIN MAX DECILE FREQ BAD# BAD% FREQ BAD# CU. BAD% SCORE SCORE %OF BAD KS 1 150 31 20.67% 150 31 20.67% 0.12 0.36 36.05% 27.63 2 150 13 8.67% 300 44 14.67% 0.09 0.12 51.16% 33.06 3 150 12 8.00% 450 56 12.44% 0.07 0.09 65.12% 37.25 4 150 5 3.33% 600 61 10.17% 0.06 0.07 70.93% 32.81 5 150 7 4.67% 750 68 9.07% 0.05 0.06 79.07% 30.84 6 150 4 2.67% 900 72 8.00% 0.04 0.05 83.72% 25.16 7 150 2 1.33% 1050 74 7.05% 0.03 0.04 86.05% 17.02 8 150 6 4.00% 1200 80 6.67% 0.03 0.03 93.02% 13.82 9 150 5 3.33% 1350 85 6.30% 0.02 0.03 98.84% 9.37 10 150 1 0.67% 1500 86 5.73% 0.00 0.02 100.00% 0.00 AREA UNDER ROC FOR TRAINING DATA -------- 0.888135 AREA UNDER ROC FOR VALIDATION DATA -------- 0.726259 */ ANOTHER FUNCTION FOR NUMERICAL INTEGRATIONfunction result = trape(f, a, b, N) %*************************************************************** %* A FUNCTION OF NUMERICAL INTEGRATION WITH TRAPEZOIDAL RULE * %*=============================================================* %* PARAMETERS: * %* f: A FUNCTION INPUT * %* a: LOWER BOUND OF THE FUNCTION INPUT * %* b: UPPER BOUND OF THE FUNCTION INPUT * %* N: # OF INTERVALS * %*=============================================================* %* AN EXAMPLE: * %* y = trape('sin(x)', 0, 1, 1000); * %*=============================================================* %* AUTHOR: Wensui Liu (liuwensui@hotmail.com) * %*************************************************************** dx = (b - a) / N; func = inline(f); result = 0; for i = 1:N res = dx * (func(a + (i - 1) * dx) + func(a + i * dx)) / 2; result = result + res; end return A PRACTICE IN MATLABfunction [result] = mpoint(f, a, b, N) %*************************************************************** %* A FUNCTION OF NUMERICAL INTEGRATION WITH MIDPOINT RULE * %*=============================================================* %* PARAMETERS: * %* f: A FUNCTION INPUT * %* a: LOWER BOUND OF THE FUNCTION INPUT * %* b: UPPER BOUND OF THE FUNCTION INPUT * %* N: # OF INTERVALS * %*=============================================================* %* AN EXAMPLE: * %* y = mpoint('sin(x)', 0, 1, 1000); * %*=============================================================* %* AUTHOR: Wensui Liu (liuwensui@hotmail.com) * %*************************************************************** dx = (b - a) / N; func = inline(f); result = 0; for i = 1:N res = dx * (func(a + (i - 1/2) * dx)); result = result + res; end return SAS MACRO TO CALCULATE AUC (AREA UNDER CURVE)%macro roc(data = , score =, y = ); *******************************************; * THIS MACRO IS TO CALCULATE AUC (AREA *; * UNDER CURVE) USING TRAPEZOIDAL RULE *; *-----------------------------------------*; * INPUT PARAMETERS: *; * DATA : INPUT DATASET *; * SCORE: SCORE (HIGHER ==> BETTER) *; * Y : 0/1 TARGET (1 ==> BAD) *; *******************************************; data _tmp1; set &data; where &y in (0, 1) and &score ~= .; random = ranuni(1); keep &y &score random; run; proc sql noprint; select sum(&y) into :bcnt from _tmp1; select count(*) - sum(&y) into :gcnt from _tmp1; quit; proc sort data = _tmp1 sortsize = max; by &score random; run; data _tmp2; set _tmp1; by &score random; retain gcum bcum; gcum + (1 - &y); bcum + &y; gpct = gcum / &gcnt; bpct = bcum / &bcnt; run; proc sort data = _tmp2 sortsize = max; by gpct bpct; run; data _tmp3; set _tmp2; by gpct bpct; if last.gpct then output; run; title 'AREA UNDER ROC'; proc sql; create table _tmp4 as select a.gcum as gcum, (b.gpct - a.gpct) * (b.bpct + a.bpct) / 2 as dx from _tmp3 as a, _tmp3 as b where a.gcum + 1 = b.gcum; select sum(dx) from _tmp4; quit; title; proc datasets library = work nolist; delete _tmp:/memtype = data; run; quit; %mend roc; data test; do i = 1 to 10000; x = ranuni(1); if x + rannor(1) * 0.2 > 0.7 then y = 1; else y = 0; output; end; run; proc logistic data = test; model y = x; score data = test out = predicted; run; %roc(data = predicted, score = p_0, y = y); /* -- STANDARD AUC RESULT FROM PROC LOGISTIC: Association of Predicted Probabilities and Observed Responses Percent Concordant 91.1 Somers' D 0.824 Percent Discordant 8.8 Gamma 0.825 Percent Tied 0.1 Tau-a 0.352 Pairs 21336604 c 0.912 -- CALCULATED AUC RESULT FROM MY MACRO: AREA UNDER ROC 0.912021 */ SAS MACRO TO CALCULATE GAINS CHART WITH KS%macro ks(data = , score = , y = ); options nocenter mprint nodate; data _tmp1; set &data; where &score ~= . and y in (1, 0); random = ranuni(1); keep &score &y random; run; proc sort data = _tmp1 sortsize = max; by descending &score random; run; data _tmp2; set _tmp1; by descending &score random; i + 1; run; proc rank data = _tmp2 out = _tmp3 groups = 10; var i; run; proc sql noprint; create table _tmp4 as select i + 1 as decile, count(*) as cnt, sum(&y) as bad_cnt, min(&score) as min_scr format = 8.2, max(&score) as max_scr format = 8.2 from _tmp3 group by i; select sum(cnt) into :cnt from _tmp4; select sum(bad_cnt) into :bad_cnt from _tmp4; quit; data _tmp5; set _tmp4; retain cum_cnt cum_bcnt cum_gcnt; cum_cnt + cnt; cum_bcnt + bad_cnt; cum_gcnt + (cnt - bad_cnt); cum_pct = cum_cnt / &cnt; cum_bpct = cum_bcnt / &bad_cnt; cum_gpct = cum_gcnt / (&cnt - &bad_cnt); ks = (max(cum_bpct, cum_gpct) - min(cum_bpct, cum_gpct)) * 100; format cum_bpct percent9.2 cum_gpct percent9.2 ks 6.2; label decile = 'DECILE' cnt = '#FREQ' bad_cnt = '#BAD' min_scr = 'MIN SCORE' max_scr = 'MAX SCORE' cum_gpct = 'CUM GOOD%' cum_bpct = 'CUM BAD%' ks = 'KS'; run; title "%upcase(&score) KS"; proc print data = _tmp5 label noobs; var decile cnt bad_cnt min_scr max_scr cum_bpct cum_gpct ks; run; title; proc datasets library = work nolist; delete _: / memtype = data; run; quit; %mend ks; data test; do i = 1 to 1000; score = ranuni(1); if score * 2 + rannor(1) * 0.3 > 1.5 then y = 1; else y = 0; output; end; run; %ks(data = test, score = score, y = y); /* SCORE KS MIN MAX DECILE #FREQ #BAD SCORE SCORE CUM BAD% CUM GOOD% KS 1 100 87 0.91 1.00 34.25% 1.74% 32.51 2 100 78 0.80 0.91 64.96% 4.69% 60.27 3 100 49 0.69 0.80 84.25% 11.53% 72.72 4 100 25 0.61 0.69 94.09% 21.58% 72.51 5 100 11 0.51 0.60 98.43% 33.51% 64.91 6 100 3 0.40 0.51 99.61% 46.51% 53.09 7 100 1 0.32 0.40 100.00% 59.79% 40.21 8 100 0 0.20 0.31 100.00% 73.19% 26.81 9 100 0 0.11 0.19 100.00% 86.60% 13.40 10 100 0 0.00 0.10 100.00% 100.00% 0.00 */ SAS MACRO TO ASSIGN RANDOM INTEGERS TO A GROUP OF MACRO VARIABLES%macro seeds(n = , prefix = , seed = 1); data _null_; do i = 1 to &n; random = put(ranuni(&seed) * (10 ** 10), 10.); name = compress("prefix"||put(i, 4.), ' '); call symput(name, random); end; run; %do i = 1 %to &n; %put SEED VALUE &i : &prefix&i = &&prefix&i.; %end; %mend seeds; %seeds(n = 50, prefix = random); /* SEED VALUE 1 : random1 = 1849625698 SEED VALUE 2 : random2 = 9700887157 ... ... SEED VALUE 49 : random49 = 5113178829 SEED VALUE 50 : random50 = 4332045635 */ AN EXAMPLE OF USING DMDB PROCEDURE/************************************************ * AN EXAMPLE TO USE DMDB PROCEDURE IN SAS * * ENTERPRISE MINER CREATING DATA MINING DB WITH * * DMDB PROCEDURE * ************************************************/ data test; input X1 X2 X3 Y; datalines; 1 -0.04926 -0.61599 0 2 0.31526 -1.65430 1 3 0.64517 -0.06878 0 1 0.02592 -0.71203 1 2 1.45284 -0.39064 1 3 0.22533 -0.44507 0 1 -1.24641 -0.73156 0 1 0.88542 -1.63595 1 2 -1.35613 -1.58209 0 3 -1.22333 1.98124 1 2 -0.36128 0.77962 0 1 -0.01054 -0.76720 0 2 1.43263 0.53820 0 3 -0.17737 0.25381 0 2 -1.64183 -0.34028 0 1 -0.83537 -2.00245 0 3 0.00284 -0.75016 0 2 -0.28847 -0.77437 0 3 -0.21167 -0.53833 0 1 -1.23276 0.11452 1 ; run; *** CREATE OF DATA-MINING DATABASE WITH DMDB PROCEDURE ***; proc dmdb batch data = test /* SPECIFY INPUT DATA */ out = out_data /* SPECIFY OUTPUT DATA */ dmdbcat = out_cat /* SPECIFY OUTPUT CATALOG */ classout = out_class /* OUTPUT SUMMARY OF CLASS VARIABLES */ varout = out_var /* OUTPUT SUMMARY OF NUMERIC VARIABLES*/; class x1(asc) y; var x2 x3; target y; run; title "SUMMARY OF CLASS VARIABLES"; proc print data = out_class noobs; run; /* NAME LEVEL FREQUENCY TYPE CRAW NRAW X1 1 7 N 1 X1 2 7 N 2 X1 3 6 N 3 Y 0 14 N 0 Y 1 6 N 1 */ title "SUMMARY OF CONTINUOUS VARIABLES"; proc print data = out_var noobs; run; /* NAME NMISS MIN MAX MEAN STD SKEWNESS KURTOSIS X2 0 -1.64183 1.45284 -0.18245 0.88684 0.17469 -0.49032 X3 0 -2.00245 1.98124 -0.46709 0.92896 0.72040 1.38541 */ THE BEAUTY OF RECURSION IN F##light // GAMMA FUNCTION: AN EXAMPLE USING RECURSION IN F# let rec gamma x = match x with |1 -> 1 |x -> (x - 1) * gamma (x - 1) printfn "%i" (gamma 5) A EDA MACRO TO RANK BOTH NUMERIC AND CATEGORICAL VARIABLES%macro eda(data = , vlist = , dep = , out = out, groups = 100); /****************************************************************************** * EXPLORATORY DATA ANALYSIS MACRO FOR BINARY RESPONSE * * --------------------------------------------------------------------------- * * PURPOSE * * rank all independent variables to be used in the model * * --------------------------------------------------------------------------- * * MACRO VARIABLES * * data : input dataset; * * vlist : list of independent variables; * * dep : dependent variables, must be either 0 or 1; * * out : output dataset, default is out; * * groups : number of bins, default is 100; * * --------------------------------------------------------------------------- * * CALCULATED MEASURES * * 1) DISTANCE MEASURE : KS * * 2) INFORMATION MEASURE : ENTROPY, INFORMATION VALUE, GINI INDEX * * 3) DEPENDENCE MEASURE : LIKELIHOOD RATIO TEST * * 4) CLASSIFY ERROR MEASURE: MISCLASSIFICATION RATE OF VALIDATION SET * * --------------------------------------------------------------------------- * * AUTHOR: WENSUI LIU * ******************************************************************************/ /* read data in and keep the variables */ data _tmp1; set &data; keep &vlist &dep validate; if &dep = 1 then do; validate = (ranuni(1) < 0.2); end; else do; validate = (ranuni(0) < 0.2); end; run; proc sql noprint; /* count number of indep variables */ select count(*) - 2 into :nvar from sashelp.vcolumn where upcase(libname) = 'WORK' and upcase(memname) = '_TMP1'; /* read the names of indep variables into a macro variable */ select name into :vars separated by ' ' from sashelp.vcolumn where upcase(libname) = 'WORK' and upcase(memname) = '_TMP1' and upcase(name) not in ("%upcase(&dep.)", "VALIDATE"); /* create an empty table to contain measures */ create table &out ( var char(50) label = 'Predictors', bin num format = 5.0 label = '# of Bins', /* DISTANCE MEASURES */ ks num format = 12.8 label = 'KS-statistics' , /* INFORMATION MEASURES */ en num format = 12.8 label = 'Entropy', iv num format = 12.8 label = 'Info Values', gn num format = 12.8 label = 'Gini Index', /* DEPENDENCE MEASURES */ lr num format = 12.8 label = 'Likelihood Ratio Test', /* MISCLASSIFICATION MEASURES */ pi num format = 12.8 label = 'Predictive Index', cr num format = 12.8 label = 'Misclassification Rate' ); quit; %do i = 1 %to &nvar; /* get the indep variable one by one */ %let var = %scan(&vars, &i, ' '); %put ===========================; %put Variable &i: &var; %put ===========================; /* sort the indep variable and create a sequential index */ proc sort data = _tmp1 out = _tmp2(keep = &var &dep validate) sortsize = max; by &var; run; data _tmp3; set _tmp2; by &var; if first.%trim(&var) then index + 1; run; /* count distinct level of index */ proc sql noprint; select count(distinct index) into :level from _tmp3; quit; %if &level > 2 %then %do; /* bin the indep variable based on index */ proc rank data = _tmp3 groups = &groups ties = low out = _tmp4; var index; ranks level; run; /* get the # of binned levels */ proc sql noprint; select count(distinct level) into :nbin from _tmp4; quit; /* run a logistic regression with a random sample set */ ods listing close; ods output association = _auc1 (where = (label2 = 'c')); ods output FitStatistics = _lr1 (where = (index(Criterion, 'Log') > 0)); proc logistic data = _tmp4 desc; where validate = 0; class level; model &dep = level; score data = _tmp4(where = (validate = 1)) out = _score1; run; ods listing; proc sql noprint; select sum(&dep) into :nvalidate from _score1; quit; proc sort data = _score1 sortsize = max; by p_0; run; data _score2; set _score1; by p_0; if _n_ <= &nvalidate then pred = 1; else pred = 0; run; /* get the misclassification rate of validation data */ proc sql noprint; select sum(case when pred ~= &dep then 1 else 0 end) / count(*) into :cr from _score2; quit; /* get the AUC */ data _auc2; set _auc1; call symput('pi', put((nValue2 - 0.5) / 0.5, 8.6)); run; /* get the Likelihood Ratio test */ data _lr2; set _lr1; lr = 1 - probchi(InterceptOnly - InterceptAndCovariates, &nbin - 1); /* assign P-value of Likelihood Ratio test to a macro variable */ call symput('lr', put(lr, 8.6)); run; /* get the ks-statistics of indep variable */ proc npar1way data = _tmp4 missing noprint edf; class level; var &dep; output out = _ks1 edf; run; proc sql noprint; /* assign ks to a macro variable */ select _ks_ into :ks from _ks1; /* count numbers of goods and bads by binning levels */ create table _iv1 as select level, sum(case when &dep = 0 then 1 else 0 end) as g_cnt, sum(case when &dep = 1 then 1 else 0 end) as b_cnt from _tmp4 group by level; /* calculate information value of indep variable */ create table _iv2 as select level, b_cnt, g_cnt, g_cnt / sum(g_cnt) as g_pct, b_cnt / sum(b_cnt) as b_pct, (calculated b_pct - calculated g_pct) * log(max(calculated b_pct, constant("small")) / max(calculated g_pct, constant("small"))) as iv_tmp from _iv1; select sum(iv_tmp) into :iv from _iv2; /* calculate entropy of the indep variable */ create table _en1 as select level, - g_cnt * log(max(constant("small"), g_cnt / (g_cnt + b_cnt))) as num, g_cnt + b_cnt as den from _iv1; /* assign entropy into macro variable */ select -log(sum(num) / sum(den)) into :en from _en1; quit; proc iml; sort _iv2 by descending b_pct; use _iv2; read all var {level b_cnt g_cnt} into x; close _iv2; n_event = x[, 2]; n_noevent = x[, 3]; gini = 0; do i = 2 to nrow(x); _a = 0; do j = 1 to i - 1; _a = sum(_a, n_noevent[j, 1]); end; gini = sum(gini, n_event[i, 1] * _a); end; gini = (1 - (gini * 2 + (n_event # n_noevent)[+, ]) / (n_event[+, ] * n_noevent[+, ])) * 100; /* assign gini index into macro variable */ call symput('gn', char(gini)); quit; /* combine all measures in one table */ data _measure ; var = "%trim(&var.)"; bin = &nbin; ks = &ks; en = &en; iv = &iv; gn = &gn; lr = &lr; pi = π cr = &cr; run; /* insert measures into output table */ proc sql; insert into &out select * from _measure; quit; %end; %end; /* delete unwanted tables */ proc datasets nolist library = work; delete _: / memtype = data; run; quit; /* sort output table by ks-statistics */ proc sort data = &out; by descending ks; run; /* print out meassures */ proc print data = &out label; run; %mend eda; 当我老了 -给我们的父母當我老了,不再是原來的我…… 請理解我,對我有一點耐心。 當我把菜湯灑到自己的衣服上時,當我忘記怎樣繫鞋帶時,請想一想當初我是如何手把手地教你。 當我一遍又一遍地重複你早已聽膩的話語,請耐心地聽我說,不要打斷我。你小的時候,我不得不重複那個講過千百遍的故事,直到你進入夢鄉。 當我需要你幫我洗澡時,請不要責備我。還記得小時候我千方百計哄你洗澡的情形嗎? 當我對新科技和新事物不知所措時,請不要嘲笑我。想一想當初我怎樣耐心地回答你的每一個「為什麼」。 當我由於雙腿疲勞而無法行走時,請伸出你年輕有力的手攙扶我。就像你小時候學習走路時,我扶你那樣。 當我忽然忘記我們談話的主題,請給我一些時間讓我回想。其實對我來說,談論什麼並不重要,只要你能在一旁聽我說,我就很滿足……… 當你看著老去的我,請不要悲傷…… 理解我,支持我,就像你剛才開始學習如何生活時我對你那樣。 當初我引導你走上人生路,如今請陪伴我走完最後的路。 給我 ~ 你的愛和耐心,我會抱以感激的微笑,這微笑中 ~ 凝結著我對你無限的愛。 A MACRO TO CHECK MODEL INVERSIONdata kyphosis; input Age StartVert NumVert Kyphosis @@; datalines; 71 5 3 0 158 14 3 0 128 5 4 1 2 1 5 0 1 15 4 0 1 16 2 0 61 17 2 0 37 16 3 0 113 16 2 0 59 12 6 1 82 14 5 1 148 16 3 0 18 2 5 0 1 12 4 0 243 8 8 0 168 18 3 0 1 16 3 0 78 15 6 0 175 13 5 0 80 16 5 0 27 9 4 0 22 16 2 0 105 5 6 1 96 12 3 1 131 3 2 0 15 2 7 1 9 13 5 0 12 2 14 1 8 6 3 0 100 14 3 0 4 16 3 0 151 16 2 0 31 16 3 0 125 11 2 0 130 13 5 0 112 16 3 0 140 11 5 0 93 16 3 0 1 9 3 0 52 6 5 1 20 9 6 0 91 12 5 1 73 1 5 1 35 13 3 0 143 3 9 0 61 1 4 0 97 16 3 0 139 10 3 1 136 15 4 0 131 13 5 0 121 3 3 1 177 14 2 0 68 10 5 0 9 17 2 0 139 6 10 1 2 17 2 0 140 15 4 0 72 15 5 0 2 13 3 0 120 8 5 1 51 9 7 0 102 13 3 0 130 1 4 1 114 8 7 1 81 1 4 0 118 16 3 0 118 16 4 0 17 10 4 0 195 17 2 0 159 13 4 0 18 11 4 0 15 16 5 0 158 15 4 0 127 12 4 0 87 16 4 0 206 10 4 0 11 15 3 0 178 15 4 0 157 13 3 1 26 13 7 0 120 13 2 0 42 6 7 1 36 13 4 0 ; run; proc logistic data = kyphosis desc; model kyphosis = Age StartVert NumVert / lackfit; score data = kyphosis out = prediction; run; %macro inversion(data = , score = score, x = ); /********************************************************* * PURPOSE OF THE MACRO * * CHECK MODEL INVERSION FOR EACH NUMERIC PREDICTOR * * ------------------------------------------------------ * * MACRO INPUTS * * DATA : input dataset * * SCORE: predicted probability * * X : list of numeric predictors * * ------------------------------------------------------ * * AUTHOR: WENSUI LIU, RISK MODELER AT CHASE * *********************************************************/ options orientation = landscape center pageno = 1 nodate ls = 100 ps = 80; data _tmp1; retain &score &x; set &data; keep &score &x; run; ods listing close; ods output position = _pos; proc contents data = _tmp1 varnum; run; ods listing; proc sql noprint; select count(*) - 1 into :nx from _pos; quit; %do i = 1 %to &nx; %let var = %scan(&x, &i); proc rank data = _tmp1 groups = 10 ties = low out = _tmp2(keep = &score rank) ; var &var; ranks rank; run; proc summary nway data = _tmp2; class rank; output out = _tmp3 (drop = _freq_ _type_) mean(&score) = prob; run; data _tmp4; set _tmp3; decile = put(rank + 1, z2.); attrib decile label = "Decile of &var"; run; title "Inversion Check for &var"; proc chart data = _tmp4; vbar decile / sumvar = prob type = mean width = 8 symbol = "o" noheader space = 0 nospace axis = 0 to 1 by 0.2 ref = 0 to 1 by 0.2; run; title; %end; proc datasets library = work; delete _:; run; quit; %mend; %inversion(data = prediction, score = p_1, x = Age StartVert NumVert); /* OUTPUT: Inversion Check for Age Predicted Probability: Kyphosis=1 Mean 1 +-------------------------------------------------------------------------------- | | | 0.8 +-------------------------------------------------------------------------------- | | | 0.6 +-------------------------------------------------------------------------------- | | | 0.4 +-------------------------------------------------------------------------------- | oooooooo | oooooooo | oooooooo oooooooooooooooo oooooooo 0.2 +--------oooooooo--------oooooooooooooooo--------oooooooooooooooooooooooooooooooo | oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo |oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo |oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo +-------------------------------------------------------------------------------- 01 02 03 04 05 06 07 08 09 10 Decile of Age Inversion Check for StartVert Predicted Probability: Kyphosis=1 Mean 1 +------------------------------------------------------------------------ | | | 0.8 +------------------------------------------------------------------------ | | | 0.6 +oooooooo---------------------------------------------------------------- |oooooooo |oooooooooooooooo |oooooooooooooooo 0.4 +oooooooooooooooo-------------------------------------------------------- |oooooooooooooooooooooooo |oooooooooooooooooooooooo |oooooooooooooooooooooooo 0.2 +oooooooooooooooooooooooo------------------------------------------------ |oooooooooooooooooooooooooooooooooooooooo |oooooooooooooooooooooooooooooooooooooooooooooooooooooooo |oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo +------------------------------------------------------------------------ 01 02 03 04 05 06 07 08 10 Decile of StartVert Inversion Check for NumVert Predicted Probability: Kyphosis=1 Mean 1 +------------------------------------------------ | | | oooooooo 0.8 +----------------------------------------oooooooo | oooooooo | oooooooo | oooooooo 0.6 +----------------------------------------oooooooo | oooooooo | oooooooo | oooooooo 0.4 +----------------------------------------oooooooo | oooooooooooooooo | oooooooooooooooo | oooooooooooooooooooooooo 0.2 +----------------oooooooooooooooooooooooooooooooo | oooooooooooooooooooooooooooooooo |oooooooooooooooooooooooooooooooooooooooooooooooo |oooooooooooooooooooooooooooooooooooooooooooooooo +------------------------------------------------ 01 02 05 07 09 10 Decile of NumVert */ A MACRO FOR MEDIAN IMPUTATIONdata tmp1; do i = 1 to 1000; x1 = rannor(1); x2 = rannor(2); if i < 10 then do; x1 = .; x2 = .; end; output; end; run; proc means data = tmp1 n nmiss median; var x1 x2; run; %macro impute(in = , out = output, varlist = ); /********************************************************** * PURPOSE OF THE MACRO: * * PERFORM MEDIAN IMPUTATION FOR NUMERIC VARIABLES * * ------------------------------------------------------- * * PARAMETERS: * * IN : INPUT DATASET * * OUT : OUTPUT DATASET * * VARLIST: A LIST OF VARIABLES TO BE IMPUTED * * ------------------------------------------------------- * * AUTHOR: WENSUI LIU, RISK MODELER AT CHASE * **********************************************************/ options mprint; /* CREATE TEMPORARY DATASETS FOR THE MACRO */ data _origin _tmp1(keep = _id &varlist); set ∈ _id + 1; run; proc sort data = _origin sortsize = max; _id; run; /* CALCULATE MEDIANS */ proc summary data = _tmp1 nway; output out = _med1(drop = _freq_ _type_) median(&varlist) =; run; ods listing close; ods output position = _pos; proc contents data = _med1 varnum; run; ods listing; /* REPLACE MISSINGS BY MEDIAN */ proc sql noprint; select "case when a."||compress(variable, ' ')||" = . then b."|| compress(variable, ' ')||" else a."||compress(variable, ' ')|| " end as imp_"||compress(variable, ' ') into :replace separated by ", " from _pos; create table _tmp2 as select a._id, &replace from _tmp1 as a, _med1 as b; quit; proc sort data = _tmp2 sortsize = max; by _id; run; /* MERGE BACK WITH THE ORIGIN DATA */ data &out(drop = _id); merge _origin _tmp2; by _id; run; %mend impute; %impute(in = , varlist = x1 x2); proc means data = output n nmiss median; var imp_:; run; A MACRO PERFORMING 1-SIDE CAPPING OF NUMERIC VARIABLESdata tmp1; do i = 1 to 1000; x1 = rannor(1); x2 = rannor(2); crap = 1; output; end; run; proc means data = tmp1 p99 min median max n; var x1 x2; run; %macro cap(in = , out = output, varlist = , p = 99); /********************************************************** * PURPOSE OF THE MACRO: * * PERFORM 1-SIDE CAPPING OF NUMERIC VARIABLES * * ------------------------------------------------------- * * PARAMETERS: * * IN : INPUT DATASET * * OUT : OUTPUT DATASET * * VARLIST: A LIST OF VARIABLES TO BE CAPPED * * P : CAPPING PERCENTAGE * * ------------------------------------------------------- * * AUTHOR: WENSUI LIU, RISK MODELER AT CHASE * **********************************************************/ options mprint; /* CREATE TEMPORARY DATASETS FOR THE MACRO */ data _origin _tmp1(keep = _id &varlist); set ∈ _id + 1; run; proc sort data = _origin sortsize = max; _id; run; /* CALCULATE CAPPING LIMITS */ proc summary data = _tmp1 nway; output out = _sum1(drop = _freq_ _type_) P%trim(&p)(&varlist) =; run; ods listing close; ods output position = _pos; proc contents data = _sum1 varnum; run; ods listing; /* REPLACE EXTREMES BY CAPPED LIMITS */ proc sql noprint; select "min(a."||compress(variable, ' ')||", b."|| compress(variable, ' ')||") as cap_"||compress(variable, ' ') into :replace separated by ", " from _pos; create table _tmp2 as select a._id, &replace from _tmp1 as a, _sum1 as b; quit; proc sort data = _tmp2 sortsize = max; by _id; run; /* MERGE BACK WITH THE ORIGIN DATA */ data &out(drop = _id); merge _origin _tmp2; by _id; run; %mend cap; %cap(in = tmp1, varlist = x1 x2); proc means data = output min median max n; var cap_:; run; |
Thanks for visiting!
Ajaywrote:
Lovely helpful
keep it simple site Needed to know how to use SQL server for storing data and R for building scoring model do you know any stuff on that ? Ajay,India ohri2007@gmail.com www.decisionstats.com
July 7
(no name)
wrote:
Hello WenSui. Thank you so much for your blog. I've added your site to my reference for statistics. Your posts on count data models in SAS is fantastic. Especially since I have an older version without proc countreg. Your blog is like an ideal statistic textbook which succintly explains the theory behind the text and then immediately provides the code and output. I will definitely recommend your site to my colleagues.
Have a wonderful day
Apr. 17
慧闽 刘wrote:
It is my first time visiting your stattistical blog. I need to say you are amazing. You must be a very smart guy and also hard working. Can not imagine how much you have done from March 2006.
I am very interested in your paper 'Improving credit scoring by GAM' and also classification using GAM.
Thanks for your great job. I am sure this will help me a lot with my GAM project after I struggle with reading your paper and codes for a few days.
Thanks again. Have a great day.
Dec. 5
No namewrote:
老大,
从你的博客中学到好多东西, 而且老大是个热心人, 从你的代码里
得到不少启发。
能不能再写个关于评分卡的代码, 比如你作出了模型, 想再进一步作出评分卡,
如以age 为变量之一, 不同年龄段的分数 这样的代码是找不到的。
好期待啊
June 7
|
||||
|
|