wensui's profileWENSUI'S BLOG IN STATIST...PhotosBlogListsMore ![]() | Help |
RealMatrix++: A Real-Valued Matrix Class and LibraryRealMatrix++ A Real-Valued Matrix Class and Library Daniel Hanson Email: HoserAnalytics (at) gmail (dot) com June 1, 2009 Introduction and Purpose A frequent frustration for quant development in finance is the lack of a matrix class and matrix algebra library in the C++ Standard Library. Third party libraries do exist, both open source and commercial, but open source libraries are often time consuming to learn and implement, and commercial libraries can carry a heavy price tag. Furthermore, for quant finance, as in other domains such as predictive modeling, the vast majority of linear algebra calculations are real-valued. The intent of the RealMatrix++ Library is to provide an easy-to-use, non-templated matrix class and library for real-valued matrices and vectors, with the ultimate goal of providing functionality on par with that of the JAMA library for Java [1]. The code is not optimized for performance to the extent of, say, the Matrix Template Library [2], or Blitz [3], but it will likely offer more rapid development, and performance sufficient at least for prototyping models. Current Release Features and Design This initial release contains a real matrix class, and matrix operations +, -, and *, where (*) is overloaded for standard matrix multiplication, vector by matrix and matrix by vector multiplication, and scalar multiplication (see MatrixOperations.h for details). There is also a set of validation conditions outlined in MatrixConditions.h, which will check if two matrices are of the same dimensions (necessary for matrix addition and subtraction), if a matrix is square, or if a matrix is symmetric. The Matrix class itself is built as a wrapper around the valarray container in the C++ standard library. As Stroustrup[4] says, "The standard library does not provide a matrix class. Instead, the intent is for valarray to provide the tools for building matrices optimized for a variety of needs." Also, as Josuttis [5] points out, "a valarray can be used as a base both for vector and matrix operations.with good performance." On the other hand, he notes that "you often need some inconvenient and time-consuming type conversions." This relates primarily to the need to create a separate valarray instance to use the valarray slice function. For this reason, its use is avoided in the Matrix class here, although it may be used minimally going forward, in cases where the benefit outweighs the inconvenience. The current release has been unit tested, but not completely stress tested. It should work as designed, but at this stage it should probably be best considered a beta release. Your comments are invited (email address given at the top of this document). Future Additions/Enhancements The following functionality is being considered for future releases (not necessarily an exhaustive list): Matrix Decompositions o Cholesky Decomposition of square and symmetric matrices o Singular Value Decomposition o LU Decomposition Eigenvalue calculations Principal component analysis Solutions of linear systems Matrix inverse Licensing This software is released under the terms of the BSD License [6]: Copyright (c) 2009, Daniel Hanson All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the copyright holder nor the names of any contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. References [1] http://math.nist.gov/javanumerics/jama/ [2] http://www.osl.iu.edu/research/mtl/ [3] http://www.oonumerics.org/blitz/ [4] Bjarne Stroustrup, The C++ Programming Language, 3E (Addison Wesley) [5] Nicolai Josuttis, The C++ Standard Library: A Tutorial and Reference (Addison Wesley) [6] http://www.opensource.org/licenses/bsd-license.php A MACRO TO IMPUTE NUMERIC VARIABLES BASED UPON BAD RATESlibname data 'E:\sas'; data test; set data.credit; if ranuni(2) < 0.1 then x2 = .; if ranuni(3) < 0.3 then x3 = .; if ranuni(4) < 0.2 then x4 = .; run; proc means data = test n nmiss nonobs; class y; var x2 - x4; run; /* N Y Variable Label N Miss 0 X2 X2 5217 591 X3 X3 4011 1797 X4 X4 4675 1133 1 X2 X2 343 29 X3 X3 268 104 X4 X4 311 61 */ %macro impute(data = , y = , x = , outfile = ); *********************************************; * THIS MACRO IS TO IMPUTE NUMERIC VARIABLES *; * BASED UPON BAD RATES *; * ----------------------------------------- *; * PARAMETERS: *; * DATA : INPUT DATA SET *; * Y : RESPONSE WITH 0 / 1 *; * X : A LIST OF NUMERIC PREDICTORS *; * WITH MISSING VALUES *; * OUTFILE: FILEREF FOR IMPUTING RECODE *; *********************************************; %if %sysfunc(fexist(&outfile)) %then %do; data _null_; rc = fdelete("&outfile"); if rc = 0 then put "THE FILE HAS BEEN DELETE."; run; %end; ods output position = _pos1; ods listing close; proc contents data = &data varnum; run; ods listing; proc sql noprint; select upcase(variable) into :x2 separated by ' ' from _pos1 where compress(upcase(type), ' ') = 'NUM' and index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0; select count(variable) into :xcnt from _pos1 where compress(upcase(type), ' ') = 'NUM' and index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0; quit; data _tmp1; retain &x2 &y; set &data; where &y in (0, 1); keep &x2 &y; run; ods output position = _pos2; ods listing close; proc contents data = _tmp1 varnum; run; ods listing; %let loop = 0; %do i = 1 %to &xcnt; proc sql noprint; select upcase(variable) into :var from _pos2 where num = &i; select count(distinct &var) into :vcnt from _tmp1 where &var ~= . and &y in (1, 0); select nmiss(&var) / count(*) * 100 into :mpct from _tmp1 where &y in (1, 0); select count(distinct &y) into :yflg from _tmp1 where &y in (1, 0); quit; %if &vcnt > 1 & &mpct > 0 & &mpct < 99 & &yflg = 2 %then %do; %let loop = %eval(&loop + 1); proc sql noprint; select mean(&y) format = 10.6 into :mis_rate from _tmp1 where &var = . and &y in (0, 1); select count(*) into :mcnt from _tmp1 where &var = . and &y in (1, 0); create table _tmp2 as select distinct "%upcase(%trim(&var))" as variable, &var as value, mean(&y) as bad_rate, &mis_rate as mis_rate, count(*) as freq, &mcnt as mis_cnt, abs(mean(&y) - &mis_rate) as dif from _tmp1 where &var ~= . and &y in (0, 1) group by &var; quit; proc sort data = _tmp2 sortsize = max; by dif descending freq; run; data _tmp3; set _tmp2; by dif descending freq; if _n_ = 1 then output; run; %if &loop = 1 %then %do; data _result; set _tmp3; run; %end; %else %then %do; data _result; set _result _tmp3; run; %end; %end; %end; data _null_; set _result end = eof; file impute mod; if _n_ = 1 then do; put @1 93 * "*" ";"; put @1 "* IMPUTED VARIABLE" @32 "| IMPUTED VALUE" @52 "| BAD RATE" @65 "| MIS. COUNT" @80 "| MIS. RATE" @93 "*;"; end; put @1 "*" 91 * "-" @93 "*;"; put @1 "*" @3 variable @32 "|" @34 value @52 "|" @54 bad_rate percent8.2 @65 "|" @67 mis_cnt @80 "|" @82 mis_rate percent8.2 @93 "*;"; if eof then do; put @1 93 * "*" ";"; end; run; data _null_; set _result; file impute mod; if _n_ = 1 then do; put " "; put @10 3 * "*" " RECODING OF IMPUTATION " 3 * "*" ";"; end; put @3 "if " @6 variable " = . " @40 "then " variable " = " value ";"; run; %mend impute; filename impute 'E:\sas\impute.sas'; %impute(data = test, y = y, x = x2 x3 x4, outfile = impute); /* $ more impute.sas **************************************************************; *IMPUTED VARIABLE|IMPUTED VALUE|BAD RATE|MIS. COUNT|MIS. RATE*; *------------------------------------------------------------*; * X2 | -0.68 | 4.55%| 620 | 4.68% *; *------------------------------------------------------------*; * X3 | -0.685 | 5.67%| 1901 | 5.47% *; *------------------------------------------------------------*; * X4 | 1.728 | 5.00%| 1194 | 5.11% *; **************************************************************; *** RECODING OF IMPUTATION ***; if X2 = . then X2 = -0.683 ; if X3 = . then X3 = -0.685 ; if X4 = . then X4 = 1.728 ; */ *** SHOW HOW TO USE THE IMPUTING RECODE ***; data test2; set test; %inc impute / source2; run; proc means data = test2 n nmiss nonobs; class y; var x2 - x4; run; /* N Y Variable Label N Miss 0 X2 X2 5808 0 X3 X3 5808 0 X4 X4 5808 0 1 X2 X2 372 0 X3 X3 372 0 X4 X4 372 0 */ READ FIXED-WIDTH DATA FILE WITH read.fwf()############################################## # READ FIXED-WIDTH DATA FILE WITH read.fwf() # # ------------------------------------------ # # EQUIVALENT SAS CODE: # # filename data 'E:\sas\fixed.txt'; # # data test; # # infile data truncover; # # input @1 city $ 1 - 22 @23 population; # # run; # ############################################## # OPEN A CONNECTION TO THE DATA FILE data <- file(description = "e:\\sas\\fixed.txt", open = "r") # width = c(...) ==> SPECIFIES COLUMN WIDTHS # col.names = c(...) ==> GIVES COLUMN NAMES # colClasses = c(...) ==> DEFINES COLUMN CLASSES test <- read.fwf(data, header = FALSE, width = c(22, 10), col.names = c("city", "population"), colClasses = c("character", "numeric")) close(data) READ DATA FROM A FILE CONNECTION WITH SCAN()################################################# # READ DATA FROM A FILE CONNECTION WITH SCAN() # #-----------------------------------------------# # EQUIVALENT SAS CODE: # # filename data 'E:\sas\Kyphosis'; # # data test; # # infile data; # # input x1 x2 x3 y @@; # # run; # ################################################# # OPEN A CONNECTION TO THE DATA FILE data <- file(description = "e:\\sas\\kyphosis", open = "r") # READ DATA FROM THE CONNECTION test <- data.frame(scan(data, what = list(x1 = 0, x2 = 0, x3 = 0, y = ""), quiet = TRUE)) # CLOSE THE CONNECTION close(data) EVALUATION OF Logistic() FUNCTION IN RWeka############################################## # EVALUATION OF Logistic() FUNCTION IN RWeka # ############################################## library(rpart) data(kyphosis) # MODEL FROM glm() FUNCTION model1 <- glm(Kyphosis ~., data = kyphosis, family = binomial()) # PRINT OUT MODEL SUMMARY summary(model1) # Estimate Std. Error z value Pr(>|z|) # (Intercept) -2.036934 1.449575 -1.405 0.15996 # Age 0.010930 0.006446 1.696 0.08996 . # Number 0.410601 0.224861 1.826 0.06785 . # Start -0.206510 0.067699 -3.050 0.00229 ** # MODEL FROM Logistic() FUNCTION IN RWeka library(RWeka) model2 <- Logistic(Kyphosis ~., data = kyphosis) # PRINT OUT ESTIMATES print(model2) # Class # Variable absent # ==================== # Age -0.0109 # Number -0.4106 # Start 0.2065 # Intercept 2.0369 # MODEL EVALUATION BASED ON 5-FOLD CROSS-VALIDATION summary(logit, newdata = kyphosis, class = TRUE, numFolds = 5, seed = 2009) # === 5 Fold Cross Validation === # === Summary === # Correctly Classified Instances 65 80.2469 % # Incorrectly Classified Instances 16 19.7531 % # Kappa statistic 0.3481 # Mean absolute error 0.2503 # Root mean squared error 0.3624 # Relative absolute error 74.2609 % # Root relative squared error 88.868 % # Total Number of Instances 81 # === Detailed Accuracy By Class === # TP Rate FP Rate Precision Recall F-Measure ROC Area Class # 0.906 0.588 0.853 0.906 0.879 0.835 absent # 0.412 0.094 0.538 0.412 0.467 0.835 present # Weighted Avg. 0.802 0.484 0.787 0.802 0.792 0.835 # === Confusion Matrix === # a b <-- classified as # 58 6 | a = absent # 10 7 | b = present DO DATA INPUT IN R CONSOLE WITH SCAN()############################################ # DO DATA INPUT IN R CONSOLE WITH SCAN() # #------------------------------------------# # COMPARABLE SAS CODE: # #------------------------------------------# # data test; # # input x1 x2 x3 y $ @@; # # cards; # # 71 . 3 0 158 14 3 0 # # 128 5 4 1 # # ; # # run; # ############################################ test <- data.frame(scan(file = "", what = list(x1 = 0, x2 = 0, x3 = 0, y = ""))) 71 NA 3 0 158 14 3 0 128 5 4 1 A MACRO TO DO MONOTONIC WOE TRANSFORMATION FOR THE NUMERIC PREDICTORlibname data 'E:\sas'; data test; set data.credit; keep y x2 x3 x8 x9; run; %macro monowoe(data = , y = , x = , maxbin = ); ******************************************************; * THIS MACRO IS TO DO A MONOTONIC WEIGHT OF EVIDENCE *; * (WOE) TRANSFORMATION FOR THE NUMERIC PREDICTOR. *; *----------------------------------------------------*; * PARAMETERS: *; * DATA : INPUT DATA *; * Y : GOOD/BAD RESPONSE (BAD ==> 1) *; * X : A NUMERIC PREDICTOR *; * MAXBIN: MAX. NUMBER OF BINS USED TO CALCULATE WOE *; ******************************************************; %let minbad = 10; data _tmp1(keep = &y &x); set &data; where &y in (0, 1) and &x ~= .; run; proc sql noprint; select min(&maxbin, count(distinct &x)) into :nbin from _tmp1; select count(*) into :cnt from _tmp1; select sum(&y) into :bad from _tmp1; quit; %do i = &nbin %to 2 %by -1; proc rank data = _tmp1 out = _tmp2 ties = low groups = &i; var &x; ranks rank; run; proc summary data = _tmp2 nway; class rank; output out = _tmp3(drop = _type_ rename = (_freq_ = freq)) sum(&y) = bad mean(&y) = bad_rate mean(&x) = x min(&x) = minx max(&x) = maxx; run; proc sql noprint; select min(bad) into :minbad2 from _tmp3; select min(bad_rate) into :minrate from _tmp3; select max(bad_rate) into :maxrate from _tmp3; quit; %if &minbad2 >= &minbad & &minrate > 0 & &maxrate < 1 %then %do; ods output SpearmanCorr = _corr(rename = (x = cor)); ods listing close; proc corr data = _tmp3 spearman; var x; with bad_rate; run; ods listing; proc sql noprint; select abs(cor) into :cor from _corr; quit; %if &cor = 1 %then %goto outside; %end; %end; %outside: proc sort data = _tmp3; by rank; run; data _tmp4; retain bin minx maxx bad freq bad_rate gpct bpct woe iv; set _tmp3; by rank; bin + 1; bpct = bad / &bad; gpct = (freq - bad) / (&cnt - &bad); woe = log(gpct / bpct); iv = (gpct - bpct) * log(gpct / bpct); keep bin minx maxx bad freq bad_rate gpct bpct woe iv; run; options nocenter nodate ls = 100; title; proc report data = _tmp4 box spacing = 1 split = "*"; column("MOTONOTIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(&x)" bin minx maxx bad freq woe iv); define bin / "BIN*LEVEL" width = 5 format = z2. order order = data; define minx / "LOWER*BOUND" width = 10 format = 9.2; define maxx / "UPPER*BOUND" width = 10 format = 9.2; define bad / "#BADS" width = 10 format = 9.; define freq / "#FREQ" width = 10 format = 9.; define woe / "WOE" width = 10 format = 10.4; define iv / "INFO.*VALUE" width = 8 format = 8.4; run; %mend monowoe; options mprint mlogic nocenter; %monowoe(data = test, y = y, x = x3, maxbin = 20); /* +---------------------------------------------------------------------+ | MOTONOTIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X3 | | BIN LOWER UPPER INFO.| |LEVEL BOUND BOUND #BADS #FREQ WOE VALUE| |---------------------------------------------------------------------| | 01| -1.19| -0.81| 134| 1576| -0.3722| 0.0417| |---------------------------------------------------------------------| | 02| -0.69| -0.31| 111| 1693| -0.0912| 0.0024| |---------------------------------------------------------------------| | 03| -0.18| 0.32| 67| 1419| 0.2565| 0.0135| |---------------------------------------------------------------------| | 04| 0.45| 2.97| 60| 1492| 0.4244| 0.0362| +---------------------------------------------------------------------+ */ THE EQUIVALENT FUNCTION OF PROC RANK WITH WEIGHT VARIABLE***********************************************; * A DEMO ON HOW TO DO THE EQUIVALENT FUNCTION *; * OF PROC RANK WITH WEIGHT VARIABLE *; * ------------------------------------------- *; * SPECIAL THANKS TO MY FRIEND AND COWORKER *; * Mr. Jonas Bilenas *; ***********************************************; data one; do i = 1 to 1000; x = rannor(1); if x < 0 then wt = 1; else wt = 10; output; end; run; %macro rank(data = , var = , weight = , groups = , output = ); data _tmp1(keep = &var &weight); set &data; run; proc univariate data = &data noprint; var &var; weight &weight; output out = _tmp2 pctlpre = decile_ pctlpts = 0 to 100 by %sysevalf(100 / &groups); run; data &output(keep = &var lower upper rank &weight); set _tmp1; if _n_ = 1 then do; set _tmp2; end; array d[*] decile_:; rank = 999; do i = 2 to dim(d); if &var <= d[i] and rank = 999 then do; lower = round(d[i - 1], 0.0001); upper = round(d[i], 0.0001); rank = i - 1; end; end; run; %mend rank; %rank(data = one, var = x, weight = wt, groups = 4, output = two); proc means data = two nonobs n min max missing sumwgt; class rank; var x; weight wt; run; /* OUTPUT: rank N Minimum Maximum Sum Wgts ------------------------------------------------------------- 1 591 -3.3158398 0.2201053 1365.00 2 137 0.2276076 0.6209279 1370.00 3 136 0.6211922 1.0996207 1360.00 4 136 1.1021029 3.2444009 1360.00 */ A SAS MACRO TO CREATE FREQUENCY SUMMARY TABLE*************************************************; * A MACRO TO CREATE FREQUENCY SUMMARY TABLE FOR *; * ALL INPUT CATEGORICAL VARIABLES *; *************************************************; libname data 'z:\XXX'; %let cat = x10 - x11; data tmp1; set data.credit_train; run; %macro freq(data = , y = , cat_x = ); data _tmp1; retain &cat_x &y; set &data; where &y in (0, 1); keep &cat_x &y; run; ods listing close; ods output position = _pos; proc contents data = _tmp1 varnum; run; ods listing; proc sql noprint; select count(*) - 1 into :nx from _pos; select variable into :xlist separated by ' ' from _pos where num <= &nx; select count(*) into :obs from _tmp1; run; %do i = 1 %to &nx; %let x2 = %scan(&xlist, &i, ' '); data _tmp2(keep = i var level &y); set _tmp1; i = &i; var = upcase("&x2"); level = input(&x2, $50.); run; proc summary data = _tmp2 nway missing; class i var level; output out = _sum1(drop = _type_ rename = (_freq_ = freq)) sum(&y) = bads mean(&y) = bad_rate; run; %if &i = 1 %then %do; data _sum2; set _sum1; run; %end; %else %do; data _sum2; set _sum2 _sum1; run; %end; %end; data _final; set _sum2; if compress(level, ' ') = '' then level = '999. MISSING'; pct = freq / &obs; run; options nocenter nodate; title; proc report data = _final box spacing = 2 headskep split = "*"; column("FREQUENCY SUMMARY OF CATEGORICAL VARIABLES* " var level freq pct bads bad_rate); define var / "ATTRIBUTE" order order = data width = 20 format = $20.; define level / "LEVEL" width = 20 format = $20.; define freq / "FREQUENCY" width = 10 format = 10.; define pct / "PERCENT" width = 10 format = percent10.4; define bads / "# BADS" width = 10 format = 9.; define bad_rate / "BAD RATE" width = 10 format = percent10.4; run; %mend freq; %freq(data = tmp1, y = y, cat_x = &cat); /* ----------------------------------------------------------------------- | FREQUENCY SUMMARY OF CATEGORICAL VARIABLES | | | |ATTRIBUTE LEVEL FREQUENCY PERCENT # BADS BAD RATE| | | |---------------------------------------------------------------------| |X10 | 1 | 1352| 28.8889% | 115| 8.5059% | | |----------+-----------+-----------+-----------+-----------| | | 2 | 3328| 71.1111% | 171| 5.1382% | |----------+----------+-----------+-----------+-----------+-----------| |X11 | 1 | 825| 17.6282% | 101| 12.2424% | | |----------+-----------+-----------+-----------+-----------| | | 2 | 3855| 82.3718% | 185| 4.7990% | ----------------------------------------------------------------------- */ LOGISTIC MODEL IN ENTERPRISE MINER WITH DMREG PROCEDURE*************************************************; * LOGISTIC MODEL IN ENTERPRISE MINER WITH *; * DMREG PROCEDURE *; *************************************************; libname data 'z:\XXX'; data tmp1; set data.credit_train; run; *** REGULAR WAY TO MODEL LOGISTIC REGRESSION ***; proc logistic data = tmp1 desc; class x10 x11 x13 x17 x18 X19 X22 / param = glm; model y = X2 X3 X8 X9 x10 x11 x13 x17 x18 X19 X22; run; /* Type 3 Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq X2 1 5.2833 0.0215 X3 1 26.7384 <.0001 ... ... Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -2.9224 0.4685 38.9145 <.0001 X2 1 0.2162 0.0941 5.2833 0.0215 X3 1 -0.4701 0.0909 26.7384 <.0001 ... ... */ *** E-MINER WAY TO MODEL LOGISTIC REGRESSION ***; proc dmdb data = tmp1 out = tmp_db dmdbcat = tmp_cl; class x10 x11 x13 x17 x18 X19 X22 y(desc); var X2 X3 X8 X9; target y; run; proc dmreg data = tmp_db dmdbcat = tmp_cl; class x10 x11 x13 x17 x18 X19 X22 y; model y = X2 X3 X8 X9 x10 x11 x13 x17 x18 X19 X22 / coding = glm; code file = 'z:\XXX\score_code.sas'; run; /* Type 3 Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq X2 1 5.2832 0.0215 X3 1 26.7383 <.0001 ... ... Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -2.9224 0.4685 38.92 <.0001 X2 1 0.2162 0.0941 5.28 0.0215 X3 1 -0.4701 0.0909 26.74 <.0001 ... ... */ MODEL SEARCH BY BUMPING*************************************************; * REFERENCE: *; * model search and inference by bootstrap *; * -- Tibshirani and Knight (1997) *; *************************************************; libname data 'z:\XXXX'; %include 'z:\XXXX\ks.sas'; %include 'z:\XXXX\roc.sas'; data train; set data.credit_train; run; proc sql noprint; select count(*) into :size from train; quit; %macro bumping; %let seed = 2009; %let n = 500; data _null_; do i = 1 to &n; random = put(ranuni(&seed) * (10 ** 8), 8.); name = compress("random"||put(i, 3.), ' '); call symput(name, random); end; run; proc datasets library = data nolist; delete catalog / memtype = catalog; run; quit; %do i = 1 %to &n; %put &&random&i; proc surveyselect data = train method = urs n = &size seed = &&random&i out = sample&i(rename = (NumberHits = _hits)) noprint; run; proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i; class x10 - x24 y; var x2 - x9; target y; freq _hits; run; filename out_tree catalog "data.catalog.out_tree.source"; proc split data = db_sample&i dmdbcat = cl_sample&i criterion = gini assess = impurity maxbranch = 2 splitsize = 100 subtree = 30 nsurrs = 0; code file = out_tree; input x2 - x9 / level = interval; input x10 - x24 / level = nominal; target y / level = binary; freq _hits; run; filename in_tree catalog "data.catalog.tree&i..source"; data _null_; infile out_tree; input; file in_tree; if _n_ > 3 then put _infile_; run; data _tmp1(keep = p_y1 p_y0 y random); set data.credit_train; random = ranuni(1); %include in_tree; run; proc sort data = _tmp1 sortsize = max; by descending p_y1 random; run; data _tmp2; set _tmp1; where p_y1 ~= . and y in (0, 1); by descending p_y1 random; i + 1; run; proc rank data = _tmp2 out = _tmp3 groups = 10; var i; run; proc sql noprint; create table _tmp4 as select i + 1 as decile, count(*) as cnt, sum(y) as bad_cnt from _tmp3 group by i; select sum(cnt) into :cnt from _tmp4; select sum(bad_cnt) into :bad_cnt from _tmp4; quit; data _tmp5; set _tmp4; retain cum_cnt cum_bcnt cum_gcnt; bpct = bad_cnt / cnt; cum_cnt + cnt; cum_bcnt + bad_cnt; cum_gcnt + (cnt - bad_cnt); cum_pct = cum_cnt / &cnt; cum_bpct = cum_bcnt / &bad_cnt; cum_gpct = cum_gcnt / (&cnt - &bad_cnt); ks = (max(cum_bpct, cum_gpct) - min(cum_bpct, cum_gpct)) * 100; run; proc sql noprint; select max(ks) into :ks from _tmp5; quit; data _kstmp; tree = &i; ks = &ks; run; %if &i = 1 %then %do; data _ks; set _kstmp; run; %end; %else %do; data _ks; set _ks _kstmp; run; %end; %end; proc sql noprint; select max(ks) into :ks from _ks; select tree into :best from _ks where round(ks, 0.01) = round(&ks., 0.01); quit; filename best catalog "data.catalog.tree%trim(&best).source"; filename output "z:\XXXX\BestTree.txt"; data _null_; infile best; input; file output; if _n_ = 1 then do; put " ******************************************************; "; put " ***** BEST TREE: TREE %trim(&best) WITH KS = &KS *****; "; put " ******************************************************; "; end; put _infile_; run; %mend bumping; %bumping; data valid(keep = p_y1 p_y0 y); set data.credit_valid; %include output; run; %ks(data = valid, score = p_y1, y = y); %roc(data = valid, score = p_y0, y = y); HERE COMES THE SIMPSON RULEfunction [result] = simpson(f, a, b, N) %*************************************************************** %* A FUNCTION OF NUMERICAL INTEGRATION WITH SIMPSON RULE * %*=============================================================* %* PARAMETERS: * %* f: A FUNCTION INPUT * %* a: LOWER BOUND OF THE FUNCTION INPUT * %* b: UPPER BOUND OF THE FUNCTION INPUT * %* N: # OF INTERVALS * %*=============================================================* %* AN EXAMPLE: * %* y = simpson('sin(x)', 0, 1, 1000); * %*=============================================================* %* AUTHOR: Wensui Liu (liuwensui@hotmail.com) * %*************************************************************** dx = (b - a) / N; func = inline(f); result = 0; for i = 1:N min = a + (i - 1) * dx; med = a + (i - 1 / 2) * dx; max = a + i * dx; res = dx / 6 * (func(min) + 4 * func(med) + func(max)); result = result + res; end return BAGGING CART (CLASSIFICATION AND REGRESSION TREE)libname data 'z:\XXX'; data train; set data.credit_train; run; proc sql noprint; select count(*) into :size from train; quit; %macro bagging; %let seed = 2009; %let n = 100; data _null_; do i = 1 to &n; random = put(ranuni(&seed) * (10 ** 8), 8.); name = compress("random"||put(i, 3.), ' '); call symput(name, random); end; run; %do i = 1 %to &n; %put &&random&i; proc surveyselect data = train method = urs n = &size seed = &&random&i out = sample&i(rename = (NumberHits = _hits)) noprint; run; proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i; class x10 - x24 y; var x2 - x9; target y; freq _hits; run; filename out_tree catalog "data.catalog.out_tree.source"; proc split data = db_sample&i dmdbcat = cl_sample&i criterion = gini assess = impurity maxbranch = 2 splitsize = 100 nsurrs = 0; code file = out_tree noleafid; input x2 - x9 / level = interval; input x10 - x24 / level = nominal; target y / level = binary; freq _hits; run; filename in_tree catalog "data.catalog.tree&i..source"; data _null_; infile out_tree; input; file in_tree; if _n_ > 3 then put _infile_; run; data train_score&i(keep = id p_y1 p_y0 y); set data.credit_train; %include in_tree; run; data valid_score&i(keep = id p_y1 p_y0 y); set data.credit_valid; %include in_tree; run; %if &i = 1 %then %do; data train_score; set train_score&i; run; data valid_score; set valid_score&i; run; %end; %else %do; data train_score; set train_score train_score&i; run; data valid_score; set valid_score valid_score&i; run; %end; %end; proc summary data = train_score nway; class id; output out = train_predict(drop = _type_ rename = (_freq_ = freq)) mean(p_y1) = p_1 mean(p_y0) = p_0 mean(y) = y; run; proc summary data = valid_score nway; class id; output out = valid_predict(drop = _type_ rename = (_freq_ = freq)) mean(p_y1) = p_1 mean(p_y0) = p_0 mean(y) = y; run; %mend bagging; %bagging; data data.scored; set train_predict(in = a) valid_predict(in = b); if a then train = 1; else train = 0; run; proc freq data = data.scored; tables train * y / list missing; run; %include 'z:\XXXX\ks.sas'; %include 'z:\XXXX\roc.sas'; data valid train; set data.scored; if train = 1 then output train; else if train = 0 then output valid; run; title 'KS statistics for trainning dataset'; %ks(data = train, score = p_1, y = y); title 'KS statistics for validation dataset'; %ks(data = valid, score = p_1, y = y); title 'ROC statistics for trainning dataset'; %roc(data = train, score = p_0, y = y); title 'ROC statistics for validation dataset'; %roc(data = valid, score = p_0, y = y); /* OUTPUT KS FOR TRAINING DATA CU. CU. MIN MAX DECILE FREQ BAD# BAD% FREQ BAD# CU. BAD% SCORE SCORE %OF BAD KS 1 468 156 33.33% 468 156 33.33% 0.13 0.35 54.55% 47.44 2 468 53 11.32% 936 209 22.33% 0.09 0.13 73.08% 56.53 3 468 34 7.26% 1404 243 17.31% 0.07 0.09 84.97% 58.54 4 468 26 5.56% 1872 269 14.37% 0.06 0.07 94.06% 57.57 5 468 9 1.92% 2340 278 11.88% 0.05 0.06 97.20% 50.28 6 468 7 1.50% 2808 285 10.15% 0.04 0.05 99.65% 42.23 7 468 0 0.00% 3276 285 8.70% 0.03 0.04 99.65% 31.58 8 468 0 0.00% 3744 285 7.61% 0.02 0.03 99.65% 20.93 9 468 1 0.21% 4212 286 6.79% 0.02 0.02 100.00% 10.65 10 468 0 0.00% 4680 286 6.11% 0.00 0.02 100.00% 0.00 KS FOR VALIDATION DATA CU. CU. MIN MAX DECILE FREQ BAD# BAD% FREQ BAD# CU. BAD% SCORE SCORE %OF BAD KS 1 150 31 20.67% 150 31 20.67% 0.12 0.36 36.05% 27.63 2 150 13 8.67% 300 44 14.67% 0.09 0.12 51.16% 33.06 3 150 12 8.00% 450 56 12.44% 0.07 0.09 65.12% 37.25 4 150 5 3.33% 600 61 10.17% 0.06 0.07 70.93% 32.81 5 150 7 4.67% 750 68 9.07% 0.05 0.06 79.07% 30.84 6 150 4 2.67% 900 72 8.00% 0.04 0.05 83.72% 25.16 7 150 2 1.33% 1050 74 7.05% 0.03 0.04 86.05% 17.02 8 150 6 4.00% 1200 80 6.67% 0.03 0.03 93.02% 13.82 9 150 5 3.33% 1350 85 6.30% 0.02 0.03 98.84% 9.37 10 150 1 0.67% 1500 86 5.73% 0.00 0.02 100.00% 0.00 AREA UNDER ROC FOR TRAINING DATA -------- 0.888135 AREA UNDER ROC FOR VALIDATION DATA -------- 0.726259 */ ANOTHER FUNCTION FOR NUMERICAL INTEGRATIONfunction result = trape(f, a, b, N) %*************************************************************** %* A FUNCTION OF NUMERICAL INTEGRATION WITH TRAPEZOIDAL RULE * %*=============================================================* %* PARAMETERS: * %* f: A FUNCTION INPUT * %* a: LOWER BOUND OF THE FUNCTION INPUT * %* b: UPPER BOUND OF THE FUNCTION INPUT * %* N: # OF INTERVALS * %*=============================================================* %* AN EXAMPLE: * %* y = trape('sin(x)', 0, 1, 1000); * %*=============================================================* %* AUTHOR: Wensui Liu (liuwensui@hotmail.com) * %*************************************************************** dx = (b - a) / N; func = inline(f); result = 0; for i = 1:N res = dx * (func(a + (i - 1) * dx) + func(a + i * dx)) / 2; result = result + res; end return A PRACTICE IN MATLABfunction [result] = mpoint(f, a, b, N) %*************************************************************** %* A FUNCTION OF NUMERICAL INTEGRATION WITH MIDPOINT RULE * %*=============================================================* %* PARAMETERS: * %* f: A FUNCTION INPUT * %* a: LOWER BOUND OF THE FUNCTION INPUT * %* b: UPPER BOUND OF THE FUNCTION INPUT * %* N: # OF INTERVALS * %*=============================================================* %* AN EXAMPLE: * %* y = mpoint('sin(x)', 0, 1, 1000); * %*=============================================================* %* AUTHOR: Wensui Liu (liuwensui@hotmail.com) * %*************************************************************** dx = (b - a) / N; func = inline(f); result = 0; for i = 1:N res = dx * (func(a + (i - 1/2) * dx)); result = result + res; end return SAS MACRO TO CALCULATE AUC (AREA UNDER CURVE)%macro roc(data = , score =, y = ); *******************************************; * THIS MACRO IS TO CALCULATE AUC (AREA *; * UNDER CURVE) USING TRAPEZOIDAL RULE *; *-----------------------------------------*; * INPUT PARAMETERS: *; * DATA : INPUT DATASET *; * SCORE: SCORE (HIGHER ==> BETTER) *; * Y : 0/1 TARGET (1 ==> BAD) *; *******************************************; data _tmp1; set &data; where &y in (0, 1) and &score ~= .; random = ranuni(1); keep &y &score random; run; proc sql noprint; select sum(&y) into :bcnt from _tmp1; select count(*) - sum(&y) into :gcnt from _tmp1; quit; proc sort data = _tmp1 sortsize = max; by &score random; run; data _tmp2; set _tmp1; by &score random; retain gcum bcum; gcum + (1 - &y); bcum + &y; gpct = gcum / &gcnt; bpct = bcum / &bcnt; run; proc sort data = _tmp2 sortsize = max; by gpct bpct; run; data _tmp3; set _tmp2; by gpct bpct; if last.gpct then output; run; title 'AREA UNDER ROC'; proc sql; create table _tmp4 as select a.gcum as gcum, (b.gpct - a.gpct) * (b.bpct + a.bpct) / 2 as dx from _tmp3 as a, _tmp3 as b where a.gcum + 1 = b.gcum; select sum(dx) from _tmp4; quit; title; proc datasets library = work nolist; delete _tmp:/memtype = data; run; quit; %mend roc; data test; do i = 1 to 10000; x = ranuni(1); if x + rannor(1) * 0.2 > 0.7 then y = 1; else y = 0; output; end; run; proc logistic data = test; model y = x; score data = test out = predicted; run; %roc(data = predicted, score = p_0, y = y); /* -- STANDARD AUC RESULT FROM PROC LOGISTIC: Association of Predicted Probabilities and Observed Responses Percent Concordant 91.1 Somers' D 0.824 Percent Discordant 8.8 Gamma 0.825 Percent Tied 0.1 Tau-a 0.352 Pairs 21336604 c 0.912 -- CALCULATED AUC RESULT FROM MY MACRO: AREA UNDER ROC 0.912021 */ SAS MACRO TO CALCULATE GAINS CHART WITH KS%macro ks(data = , score = , y = ); options nocenter mprint nodate; data _tmp1; set &data; where &score ~= . and y in (1, 0); random = ranuni(1); keep &score &y random; run; proc sort data = _tmp1 sortsize = max; by descending &score random; run; data _tmp2; set _tmp1; by descending &score random; i + 1; run; proc rank data = _tmp2 out = _tmp3 groups = 10; var i; run; proc sql noprint; create table _tmp4 as select i + 1 as decile, count(*) as cnt, sum(&y) as bad_cnt, min(&score) as min_scr format = 8.2, max(&score) as max_scr format = 8.2 from _tmp3 group by i; select sum(cnt) into :cnt from _tmp4; select sum(bad_cnt) into :bad_cnt from _tmp4; quit; data _tmp5; set _tmp4; retain cum_cnt cum_bcnt cum_gcnt; cum_cnt + cnt; cum_bcnt + bad_cnt; cum_gcnt + (cnt - bad_cnt); cum_pct = cum_cnt / &cnt; cum_bpct = cum_bcnt / &bad_cnt; cum_gpct = cum_gcnt / (&cnt - &bad_cnt); ks = (max(cum_bpct, cum_gpct) - min(cum_bpct, cum_gpct)) * 100; format cum_bpct percent9.2 cum_gpct percent9.2 ks 6.2; label decile = 'DECILE' cnt = '#FREQ' bad_cnt = '#BAD' min_scr = 'MIN SCORE' max_scr = 'MAX SCORE' cum_gpct = 'CUM GOOD%' cum_bpct = 'CUM BAD%' ks = 'KS'; run; title "%upcase(&score) KS"; proc print data = _tmp5 label noobs; var decile cnt bad_cnt min_scr max_scr cum_bpct cum_gpct ks; run; title; proc datasets library = work nolist; delete _: / memtype = data; run; quit; %mend ks; data test; do i = 1 to 1000; score = ranuni(1); if score * 2 + rannor(1) * 0.3 > 1.5 then y = 1; else y = 0; output; end; run; %ks(data = test, score = score, y = y); /* SCORE KS MIN MAX DECILE #FREQ #BAD SCORE SCORE CUM BAD% CUM GOOD% KS 1 100 87 0.91 1.00 34.25% 1.74% 32.51 2 100 78 0.80 0.91 64.96% 4.69% 60.27 3 100 49 0.69 0.80 84.25% 11.53% 72.72 4 100 25 0.61 0.69 94.09% 21.58% 72.51 5 100 11 0.51 0.60 98.43% 33.51% 64.91 6 100 3 0.40 0.51 99.61% 46.51% 53.09 7 100 1 0.32 0.40 100.00% 59.79% 40.21 8 100 0 0.20 0.31 100.00% 73.19% 26.81 9 100 0 0.11 0.19 100.00% 86.60% 13.40 10 100 0 0.00 0.10 100.00% 100.00% 0.00 */ SAS MACRO TO ASSIGN RANDOM INTEGERS TO A GROUP OF MACRO VARIABLES%macro seeds(n = , prefix = , seed = 1); data _null_; do i = 1 to &n; random = put(ranuni(&seed) * (10 ** 10), 10.); name = compress("prefix"||put(i, 4.), ' '); call symput(name, random); end; run; %do i = 1 %to &n; %put SEED VALUE &i : &prefix&i = &&prefix&i.; %end; %mend seeds; %seeds(n = 50, prefix = random); /* SEED VALUE 1 : random1 = 1849625698 SEED VALUE 2 : random2 = 9700887157 ... ... SEED VALUE 49 : random49 = 5113178829 SEED VALUE 50 : random50 = 4332045635 */ AN EXAMPLE OF USING DMDB PROCEDURE/************************************************ * AN EXAMPLE TO USE DMDB PROCEDURE IN SAS * * ENTERPRISE MINER CREATING DATA MINING DB WITH * * DMDB PROCEDURE * ************************************************/ data test; input X1 X2 X3 Y; datalines; 1 -0.04926 -0.61599 0 2 0.31526 -1.65430 1 3 0.64517 -0.06878 0 1 0.02592 -0.71203 1 2 1.45284 -0.39064 1 3 0.22533 -0.44507 0 1 -1.24641 -0.73156 0 1 0.88542 -1.63595 1 2 -1.35613 -1.58209 0 3 -1.22333 1.98124 1 2 -0.36128 0.77962 0 1 -0.01054 -0.76720 0 2 1.43263 0.53820 0 3 -0.17737 0.25381 0 2 -1.64183 -0.34028 0 1 -0.83537 -2.00245 0 3 0.00284 -0.75016 0 2 -0.28847 -0.77437 0 3 -0.21167 -0.53833 0 1 -1.23276 0.11452 1 ; run; *** CREATE OF DATA-MINING DATABASE WITH DMDB PROCEDURE ***; proc dmdb batch data = test /* SPECIFY INPUT DATA */ out = out_data /* SPECIFY OUTPUT DATA */ dmdbcat = out_cat /* SPECIFY OUTPUT CATALOG */ classout = out_class /* OUTPUT SUMMARY OF CLASS VARIABLES */ varout = out_var /* OUTPUT SUMMARY OF NUMERIC VARIABLES*/; class x1(asc) y; var x2 x3; target y; run; title "SUMMARY OF CLASS VARIABLES"; proc print data = out_class noobs; run; /* NAME LEVEL FREQUENCY TYPE CRAW NRAW X1 1 7 N 1 X1 2 7 N 2 X1 3 6 N 3 Y 0 14 N 0 Y 1 6 N 1 */ title "SUMMARY OF CONTINUOUS VARIABLES"; proc print data = out_var noobs; run; /* NAME NMISS MIN MAX MEAN STD SKEWNESS KURTOSIS X2 0 -1.64183 1.45284 -0.18245 0.88684 0.17469 -0.49032 X3 0 -2.00245 1.98124 -0.46709 0.92896 0.72040 1.38541 */ THE BEAUTY OF RECURSION IN F##light // GAMMA FUNCTION: AN EXAMPLE USING RECURSION IN F# let rec gamma x = match x with |1 -> 1 |x -> (x - 1) * gamma (x - 1) printfn "%i" (gamma 5) |
|
|