wensui's profileWENSUI'S BLOG IN STATIST...PhotosBlogListsMore Tools Help

WENSUI'S BLOG IN STATISTICAL COMPUTING

Tough Times Never Last. But Tough People Do. - Robert Schuller

RealMatrix++: A Real-Valued Matrix Class and Library

RealMatrix++
A Real-Valued Matrix Class and Library

Daniel Hanson
Email: HoserAnalytics (at) gmail (dot) com
June 1, 2009

Introduction and Purpose

A frequent frustration for quant development in finance is the lack of a matrix class and matrix algebra library in the C++ Standard Library.  Third party libraries do exist, both open source and commercial, but open source libraries are often time consuming to learn and implement, and commercial libraries can carry a heavy price tag.  Furthermore, for quant finance, as in other domains such as predictive modeling, the vast majority of linear algebra calculations are real-valued.

The intent of the RealMatrix++ Library is to provide an easy-to-use, non-templated matrix class and library for real-valued matrices and vectors, with the ultimate goal of providing functionality on par with that of the JAMA library for Java [1].  The code is not optimized for performance to the extent of, say, the Matrix Template Library [2], or Blitz [3], but it will likely offer more rapid development, and performance sufficient at least for prototyping models.

Current Release Features and Design

This initial release contains a real matrix class, and matrix operations +, -, and *, where (*) is overloaded for standard matrix multiplication, vector by matrix and matrix by vector multiplication, and scalar multiplication (see MatrixOperations.h for details).

There is also a set of validation conditions outlined in MatrixConditions.h, which will check if two matrices are of the same dimensions (necessary for matrix addition and subtraction), if a matrix is square, or if a matrix is symmetric.

The Matrix class itself is built as a wrapper around the valarray container in the C++ standard library.  As Stroustrup[4] says, "The standard library does not provide a matrix class.  Instead, the intent is for valarray to provide the tools for building matrices optimized for a variety of needs."

Also, as Josuttis [5] points out, "a valarray can be used as a base both for vector and matrix operations.with good performance."  On the other hand, he notes that "you often need some inconvenient and time-consuming type conversions." This relates primarily to the need to create a separate valarray instance to use the valarray slice function.  For this reason, its use is avoided in the Matrix class here, although it may be used minimally going forward, in cases where the benefit outweighs the inconvenience.

The current release has been unit tested, but not completely stress tested.  It should work as designed, but at this stage it should probably be best considered a beta release. Your comments are invited (email address given at the top of this document).

Future Additions/Enhancements

The following functionality is being considered for future
releases (not necessarily an exhaustive list):
        Matrix Decompositions
        o    Cholesky Decomposition of square and symmetric matrices
        o    Singular Value Decomposition
        o    LU Decomposition
        Eigenvalue calculations
        Principal component analysis
        Solutions of linear systems
        Matrix inverse

Licensing

This software is released under the terms of the BSD License [6]:

Copyright (c) 2009, Daniel Hanson
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.  Neither the name of the copyright holder nor the names of any contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT  SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

References

[1]
http://math.nist.gov/javanumerics/jama/
[2] http://www.osl.iu.edu/research/mtl/
[3] http://www.oonumerics.org/blitz/
[4] Bjarne Stroustrup, The C++ Programming Language, 3E (Addison Wesley)
[5] Nicolai Josuttis, The C++ Standard Library: A Tutorial and Reference  (Addison Wesley)
[6]
http://www.opensource.org/licenses/bsd-license.php

A MACRO TO IMPUTE NUMERIC VARIABLES BASED UPON BAD RATES

libname data 'E:\sas';

data test;
  set data.credit;
  if ranuni(2) < 0.1 then x2 = .;
  if ranuni(3) < 0.3 then x3 = .;
  if ranuni(4) < 0.2 then x4 = .;
run;

proc means data = test n nmiss nonobs;
  class y;
  var x2 - x4;
run;  

/*                                       N
     Y    Variable    Label       N    Miss
     0    X2          X2       5217     591
          X3          X3       4011    1797
          X4          X4       4675    1133
     1    X2          X2        343      29
          X3          X3        268     104
          X4          X4        311      61
*/
    
%macro impute(data = , y = , x = , outfile = );
*********************************************;
* THIS MACRO IS TO IMPUTE NUMERIC VARIABLES *;
* BASED UPON BAD RATES                      *;
* ----------------------------------------- *;
* PARAMETERS:                               *;
*  DATA   : INPUT DATA SET                  *;
*  Y      : RESPONSE WITH 0 / 1             *;
*  X      : A LIST OF NUMERIC PREDICTORS    *;
*           WITH MISSING VALUES             *;    
*  OUTFILE: FILEREF FOR IMPUTING RECODE     *;
*********************************************;
    
%if %sysfunc(fexist(&outfile)) %then %do;
  data _null_;
    rc = fdelete("&outfile");
    if rc = 0 then put "THE FILE HAS BEEN DELETE.";
  run;
%end;

ods output position = _pos1;
ods listing close;
proc contents data = &data varnum;
run;
ods listing;

proc sql noprint;
  select
    upcase(variable) into :x2 separated by ' '
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;

  select
    count(variable) into :xcnt
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;    
quit;

data _tmp1;
  retain &x2 &y;
  set &data;
  where &y in (0, 1);
  keep &x2 &y;
run;  

ods output position = _pos2;
ods listing close;
proc contents data = _tmp1 varnum;
run;
ods listing;

%let loop = 0;

%do i = 1 %to &xcnt;
  proc sql noprint;
    select
      upcase(variable) into :var
    from
      _pos2
    where
      num = &i;

    select
      count(distinct &var) into :vcnt
    from
      _tmp1
    where
      &var ~= . and
      &y in (1, 0);

    select
      nmiss(&var) / count(*) * 100 into :mpct
    from
      _tmp1
    where
      &y in (1, 0);

    select
      count(distinct &y) into :yflg
    from
      _tmp1
    where
      &y in (1, 0);
  quit;
  
  %if &vcnt > 1 & &mpct > 0 & &mpct < 99 & &yflg = 2 %then %do;        
    %let loop = %eval(&loop + 1);
    
    proc sql noprint;
      select
        mean(&y) format = 10.6 into :mis_rate
      from
        _tmp1
      where
        &var = .     and
        &y in (0, 1);

      select
        count(*) into :mcnt
      from
        _tmp1
      where
        &var = .     and
        &y in (1, 0);

      create table
        _tmp2 as
      select distinct
        "%upcase(%trim(&var))"    as variable,
        &var                      as value,
        mean(&y)                  as bad_rate,
        &mis_rate                 as mis_rate,
        count(*)                  as freq,
        &mcnt                     as mis_cnt,
        abs(mean(&y) - &mis_rate) as dif
      from
        _tmp1
      where
        &var ~= .    and
        &y in (0, 1)
      group by
        &var;
    quit;

    proc sort data = _tmp2 sortsize = max;
      by dif descending freq;
    run;

    data _tmp3;
      set _tmp2;
      by dif descending freq;
      if _n_ = 1 then output;
    run;

    %if &loop = 1 %then %do;
      data _result;
        set _tmp3;
      run;
    %end;
    %else %then %do;
      data _result;
        set _result _tmp3;
      run;
    %end;    
  %end;        
%end;    

data _null_;
  set _result end = eof;
  file impute mod;
  if _n_ = 1 then do;
    put @1 93 * "*" ";";
    put @1 "* IMPUTED VARIABLE" @32 "| IMPUTED VALUE" @52 "| BAD RATE"
        @65 "| MIS. COUNT" @80 "| MIS. RATE" @93 "*;";
  end;
  put @1 "*" 91 * "-" @93 "*;";
  put @1 "*"  @3 variable @32 "|" @34 value @52 "|" @54 bad_rate percent8.2
      @65 "|" @67 mis_cnt @80 "|" @82 mis_rate percent8.2 @93 "*;";
  if eof then do;
    put @1 93 * "*" ";";
  end;  
run;

data _null_;
  set _result;
  file impute mod;
  if _n_ = 1 then do;
    put " ";
    put @10 3 * "*" " RECODING OF IMPUTATION " 3 * "*" ";";
  end;
  put @3  "if " @6 variable " = . " @40 "then " variable
      " = " value ";";
run;

%mend impute;

filename impute 'E:\sas\impute.sas';

%impute(data = test, y = y, x = x2 x3 x4, outfile = impute);

/*
$ more impute.sas
**************************************************************;
*IMPUTED VARIABLE|IMPUTED VALUE|BAD RATE|MIS. COUNT|MIS. RATE*;
*------------------------------------------------------------*;
* X2             | -0.68       |   4.55%| 620      |   4.68% *;
*------------------------------------------------------------*;
* X3             | -0.685      |   5.67%| 1901     |   5.47% *;
*------------------------------------------------------------*;
* X4             | 1.728       |   5.00%| 1194     |   5.11% *;
**************************************************************;
 
         *** RECODING OF IMPUTATION ***;
  if X2  = .                           then X2  = -0.683 ;
  if X3  = .                           then X3  = -0.685 ;
  if X4  = .                           then X4  = 1.728 ;
*/

*** SHOW HOW TO USE THE IMPUTING RECODE ***;
data test2;
  set test;
  %inc impute / source2;
run;

proc means data = test2 n nmiss nonobs;
  class y;
  var x2 - x4;
run;  

/*                                       N
     Y    Variable    Label       N    Miss
     0    X2          X2       5808       0
          X3          X3       5808       0
          X4          X4       5808       0
     1    X2          X2        372       0
          X3          X3        372       0
          X4          X4        372       0
*/

READ FIXED-WIDTH DATA FILE WITH read.fwf()

##############################################
# READ FIXED-WIDTH DATA FILE WITH read.fwf() #
# ------------------------------------------ #
# EQUIVALENT SAS CODE:                       #
# filename data 'E:\sas\fixed.txt';          #
# data test;                                 #
#   infile data truncover;                   #
#   input @1 city $ 1 - 22 @23 population;   #
# run;                                       #
##############################################

# OPEN A CONNECTION TO THE DATA FILE
data <- file(description = "e:\\sas\\fixed.txt", open = "r")

# width = c(...)      ==> SPECIFIES COLUMN WIDTHS
# col.names = c(...)  ==> GIVES COLUMN NAMES
# colClasses = c(...) ==> DEFINES COLUMN CLASSES
test <- read.fwf(data, header = FALSE, width = c(22, 10), 
                 col.names = c("city", "population"),
                 colClasses = c("character", "numeric"))

close(data)

READ DATA FROM A FILE CONNECTION WITH SCAN()

#################################################
# READ DATA FROM A FILE CONNECTION WITH SCAN()  #
#-----------------------------------------------#
# EQUIVALENT SAS CODE:                          #
# filename data 'E:\sas\Kyphosis';              #
# data test;                                    #
#   infile data;                                #
#   input x1 x2 x3 y @@;                        #
# run;                                          #
#################################################

# OPEN A CONNECTION TO THE DATA FILE
data <- file(description = "e:\\sas\\kyphosis", open = "r")

# READ DATA FROM THE CONNECTION
test <- data.frame(scan(data, what = list(x1 = 0, x2 = 0, x3 = 0, y = ""),
                        quiet = TRUE))

# CLOSE THE CONNECTION
close(data) 

EVALUATION OF Logistic() FUNCTION IN RWeka

##############################################
# EVALUATION OF Logistic() FUNCTION IN RWeka #
##############################################

library(rpart)
data(kyphosis)

# MODEL FROM glm() FUNCTION
model1 <- glm(Kyphosis ~., data = kyphosis, family = binomial())

# PRINT OUT MODEL SUMMARY
summary(model1)

#               Estimate Std. Error z value Pr(>|z|)  
#  (Intercept) -2.036934   1.449575  -1.405  0.15996  
#  Age          0.010930   0.006446   1.696  0.08996 .
#  Number       0.410601   0.224861   1.826  0.06785 .
#  Start       -0.206510   0.067699  -3.050  0.00229 **

# MODEL FROM Logistic() FUNCTION IN RWeka
library(RWeka)
model2 <- Logistic(Kyphosis ~., data = kyphosis)

# PRINT OUT ESTIMATES
print(model2)

#                Class
# Variable      absent
# ====================
# Age          -0.0109
# Number       -0.4106
# Start         0.2065
# Intercept     2.0369

# MODEL EVALUATION BASED ON 5-FOLD CROSS-VALIDATION
summary(logit, newdata = kyphosis, class = TRUE, numFolds = 5, seed = 2009)

# === 5 Fold Cross Validation ===
# === Summary ===
# Correctly Classified Instances          65               80.2469 %
# Incorrectly Classified Instances        16               19.7531 %
# Kappa statistic                          0.3481
# Mean absolute error                      0.2503
# Root mean squared error                  0.3624
# Relative absolute error                 74.2609 %
# Root relative squared error             88.868  %
# Total Number of Instances               81    
# === Detailed Accuracy By Class ===
#                TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
#                  0.906     0.588      0.853     0.906     0.879      0.835    absent
#                  0.412     0.094      0.538     0.412     0.467      0.835    present
# Weighted Avg.    0.802     0.484      0.787     0.802     0.792      0.835
# === Confusion Matrix ===
#   a  b   <-- classified as
#  58  6 |  a = absent
#  10  7 |  b = present

DO DATA INPUT IN R CONSOLE WITH SCAN()

############################################
# DO DATA INPUT IN R CONSOLE WITH SCAN()   #
#------------------------------------------#
# COMPARABLE SAS CODE:                     #
#------------------------------------------#
# data test;                               #
#   input x1 x2 x3 y $ @@;                 #
# cards;                                   #
# 71 . 3 0 158 14 3 0                      #
# 128 5 4 1                                #
# ;                                        #
# run;                                     #
############################################

test <- data.frame(scan(file = "", 
                        what = list(x1 = 0, x2 = 0, x3 = 0, y = "")))
71 NA 3 0 158 14 3 0
128 5 4 1

A MACRO TO DO MONOTONIC WOE TRANSFORMATION FOR THE NUMERIC PREDICTOR

libname data 'E:\sas';

data test;
  set data.credit;
  keep y x2 x3 x8 x9;
run;  

%macro monowoe(data = , y = , x = , maxbin = );
******************************************************;
* THIS MACRO IS TO DO A MONOTONIC WEIGHT OF EVIDENCE *;
* (WOE) TRANSFORMATION FOR THE NUMERIC PREDICTOR.    *;
*----------------------------------------------------*;
* PARAMETERS:                                        *;
*  DATA  : INPUT DATA                                *;
*  Y     : GOOD/BAD RESPONSE (BAD ==> 1)             *;
*  X     : A NUMERIC PREDICTOR                       *;
*  MAXBIN: MAX. NUMBER OF BINS USED TO CALCULATE WOE *;
******************************************************;    
    
%let minbad = 10;

data _tmp1(keep = &y &x);
  set &data;
  where &y in (0, 1) and &x ~= .;
run;  

proc sql noprint;
  select min(&maxbin, count(distinct &x)) into :nbin from _tmp1;

  select count(*) into :cnt from _tmp1;

  select sum(&y) into :bad from _tmp1;
quit;

%do i = &nbin %to 2 %by -1;

  proc rank data = _tmp1 out = _tmp2 ties = low groups = &i;
    var &x;
    ranks rank;
  run;

  proc summary data = _tmp2 nway;
    class rank;
    output out = _tmp3(drop = _type_ rename = (_freq_ = freq))
    sum(&y)  = bad mean(&y) = bad_rate
    mean(&x) = x min(&x) = minx max(&x) = maxx;
  run;
  
  proc sql noprint;
    select min(bad) into :minbad2 from _tmp3;
    
    select min(bad_rate) into :minrate from _tmp3;

    select max(bad_rate) into :maxrate from _tmp3;    
  quit;

  %if &minbad2 >= &minbad & &minrate > 0 & &maxrate < 1 %then %do;
    ods output SpearmanCorr = _corr(rename = (x = cor));
    ods listing close;
    proc corr data = _tmp3 spearman;
      var x;
      with bad_rate;
    run;
    ods listing;

    proc sql noprint;
      select abs(cor) into :cor from _corr;
    quit;
    
    %if &cor = 1 %then %goto outside;
  %end;
  
%end;    

%outside:

proc sort data = _tmp3;
  by rank;
run;

data _tmp4;
  retain bin minx maxx bad freq bad_rate
         gpct bpct woe iv;
  set _tmp3;  
  by rank;

  bin + 1;
  bpct = bad / &bad;
  gpct = (freq - bad) / (&cnt - &bad);
  woe  = log(gpct / bpct);
  iv   = (gpct - bpct) * log(gpct / bpct);
  keep bin minx maxx bad freq bad_rate
       gpct bpct woe iv;  
run;

options nocenter nodate ls = 100;
title;
proc report data = _tmp4 box spacing = 1 split = "*";
  column("MOTONOTIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(&x)"
         bin minx maxx bad freq woe iv);
  define bin      / "BIN*LEVEL"    width = 5  format = z2. order order = data;
  define minx     / "LOWER*BOUND"  width = 10 format = 9.2;
  define maxx     / "UPPER*BOUND"  width = 10 format = 9.2;
  define bad      / "#BADS"        width = 10 format = 9.;
  define freq     / "#FREQ"        width = 10 format = 9.;
  define woe      / "WOE"          width = 10 format = 10.4;
  define iv       / "INFO.*VALUE"  width = 8  format = 8.4;
run;

%mend monowoe;

options mprint mlogic nocenter;
%monowoe(data = test, y = y, x = x3, maxbin = 20);

/*
 +---------------------------------------------------------------------+
 |         MOTONOTIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X3          |
 |  BIN      LOWER      UPPER                                     INFO.|
 |LEVEL      BOUND      BOUND      #BADS      #FREQ        WOE    VALUE|
 |---------------------------------------------------------------------|
 |   01|     -1.19|     -0.81|       134|      1576|   -0.3722|  0.0417|
 |---------------------------------------------------------------------|
 |   02|     -0.69|     -0.31|       111|      1693|   -0.0912|  0.0024|
 |---------------------------------------------------------------------|
 |   03|     -0.18|      0.32|        67|      1419|    0.2565|  0.0135|
 |---------------------------------------------------------------------|
 |   04|      0.45|      2.97|        60|      1492|    0.4244|  0.0362|
 +---------------------------------------------------------------------+
*/ 

THE EQUIVALENT FUNCTION OF PROC RANK WITH WEIGHT VARIABLE

***********************************************;
* A DEMO ON HOW TO DO THE EQUIVALENT FUNCTION *;
* OF PROC RANK WITH WEIGHT VARIABLE           *;
* ------------------------------------------- *;
* SPECIAL THANKS TO MY FRIEND AND COWORKER    *;
* Mr. Jonas Bilenas                           *;
***********************************************;

data one;
  do i = 1 to 1000;
    x  = rannor(1);
    if x < 0 then wt = 1;
    else wt = 10;
    output;
  end;
run;

%macro rank(data = , var = , weight = , groups = , output = );

data _tmp1(keep = &var &weight);
  set &data;
run;

proc univariate data = &data noprint;
  var &var;
  weight &weight;
  output out = _tmp2
  pctlpre = decile_
  pctlpts = 0 to 100 by %sysevalf(100 / &groups);
run;

data &output(keep = &var lower upper rank &weight);
  set _tmp1;

  if _n_ = 1 then do;
    set _tmp2;
  end;
  array d[*] decile_:;
  rank = 999;
  do i = 2 to dim(d);
    if &var <= d[i] and rank = 999 then do;
      lower  = round(d[i - 1], 0.0001);
      upper  = round(d[i], 0.0001);
      rank = i - 1;
    end;
  end;        
run;

%mend rank;

%rank(data = one, var = x, weight = wt, groups = 4, output = two);

proc means data = two nonobs n min max missing sumwgt;
  class rank;
  var x;
  weight wt;
run;

/* OUTPUT:

 rank       N         Minimum         Maximum        Sum Wgts
-------------------------------------------------------------    
    1     591      -3.3158398       0.2201053         1365.00
    2     137       0.2276076       0.6209279         1370.00
    3     136       0.6211922       1.0996207         1360.00
    4     136       1.1021029       3.2444009         1360.00
*/

A SAS MACRO TO CREATE FREQUENCY SUMMARY TABLE

*************************************************;
* A MACRO TO CREATE FREQUENCY SUMMARY TABLE FOR *;
* ALL INPUT CATEGORICAL VARIABLES               *;
*************************************************;

libname data 'z:\XXX';

%let cat = x10 - x11;

data tmp1;
  set data.credit_train;
run;

%macro freq(data = , y = , cat_x = );

data _tmp1;
  retain &cat_x &y;
  set &data;
  where &y in (0, 1);
  keep &cat_x &y;
run;

ods listing close;
ods output position = _pos;
proc contents data = _tmp1 varnum;
run;
ods listing;

proc sql noprint;
  select count(*) - 1 into :nx from _pos;

  select variable into :xlist separated by ' ' from _pos where num <= &nx;

  select count(*) into :obs from _tmp1;
run;

%do i = 1 %to &nx;
  %let x2 = %scan(&xlist, &i, ' ');

  data _tmp2(keep = i var level &y);
    set _tmp1;
    i   = &i;
    var = upcase("&x2");
    level = input(&x2, $50.);
  run;
  
  proc summary data = _tmp2 nway missing;
    class i var level;
    output out = _sum1(drop = _type_ rename = (_freq_ = freq))
    sum(&y) = bads mean(&y) = bad_rate;
  run;

  %if &i = 1 %then %do;
    data _sum2;
      set _sum1;
    run;
  %end;
  %else %do;
    data _sum2;
      set _sum2 _sum1;
    run;
  %end;    
%end;    

data _final;
  set _sum2;
  if compress(level, ' ') = '' then level = '999. MISSING';
  pct = freq / &obs;
run;  

options nocenter nodate;
title;
proc report data = _final box spacing = 2 headskep split = "*";
  column("FREQUENCY SUMMARY OF CATEGORICAL VARIABLES* "
         var level freq pct bads bad_rate);  
  define var      / "ATTRIBUTE" order order = data width = 20 format = $20.;
  define level    / "LEVEL"     width = 20 format = $20.;
  define freq     / "FREQUENCY" width = 10 format = 10.;
  define pct      / "PERCENT"   width = 10 format = percent10.4;
  define bads     / "# BADS"    width = 10 format = 9.;
  define bad_rate / "BAD RATE"  width = 10 format = percent10.4;
run;
  
%mend freq;

%freq(data = tmp1, y = y, cat_x = &cat);

/*                                                                                                  
-----------------------------------------------------------------------
|          FREQUENCY SUMMARY OF CATEGORICAL VARIABLES                 |
|                                                                     |
|ATTRIBUTE     LEVEL     FREQUENCY     PERCENT      # BADS    BAD RATE|
|                                                                     |
|---------------------------------------------------------------------|
|X10       | 1        |       1352|  28.8889% |        115|   8.5059% |  
|          |----------+-----------+-----------+-----------+-----------|  
|          | 2        |       3328|  71.1111% |        171|   5.1382% |  
|----------+----------+-----------+-----------+-----------+-----------|  
|X11       | 1        |        825|  17.6282% |        101|  12.2424% |  
|          |----------+-----------+-----------+-----------+-----------|  
|          | 2        |       3855|  82.3718% |        185|   4.7990% |  
-----------------------------------------------------------------------  
*/                                                                                                

LOGISTIC MODEL IN ENTERPRISE MINER WITH DMREG PROCEDURE

*************************************************;
* LOGISTIC MODEL IN ENTERPRISE MINER WITH       *;
* DMREG PROCEDURE                               *;
*************************************************;

libname data 'z:\XXX';

data tmp1;
  set data.credit_train;
run;

*** REGULAR WAY TO MODEL LOGISTIC REGRESSION ***;
proc logistic data = tmp1 desc;
  class x10 x11 x13 x17 x18 X19 X22 / param = glm;
  model y = X2 X3 X8 X9 x10 x11 x13 x17 x18 X19 X22;
run;  
/*      Type 3 Analysis of Effects
                        Wald
Effect      DF    Chi-Square    Pr > ChiSq
X2           1        5.2833        0.0215
X3           1       26.7384        <.0001
 ... ...
               Analysis of Maximum Likelihood Estimates
                                  Standard          Wald
Parameter       DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept        1     -2.9224      0.4685       38.9145        <.0001
X2               1      0.2162      0.0941        5.2833        0.0215
X3               1     -0.4701      0.0909       26.7384        <.0001
 ... ...
*/

*** E-MINER WAY TO MODEL LOGISTIC REGRESSION ***;
proc dmdb data = tmp1 out = tmp_db dmdbcat = tmp_cl;
  class x10 x11 x13 x17 x18 X19 X22 y(desc);
  var X2 X3 X8 X9;
  target y;
run;

proc dmreg data = tmp_db dmdbcat = tmp_cl;
  class x10 x11 x13 x17 x18 X19 X22 y;
  model y = X2 X3 X8 X9 x10 x11 x13 x17 x18 X19 X22 / coding = glm;
  code file = 'z:\XXX\score_code.sas';
run;
/*         Type 3 Analysis of Effects
                          Wald
Effect        DF    Chi-Square    Pr > ChiSq
X2             1        5.2832        0.0215
X3             1       26.7383        <.0001
 ... ...
                         Analysis of Maximum Likelihood Estimates
                                  Standard          Wald              
Parameter       DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept        1     -2.9224      0.4685         38.92        <.0001
X2               1      0.2162      0.0941          5.28        0.0215
X3               1     -0.4701      0.0909         26.74        <.0001
 ... ...
*/

MODEL SEARCH BY BUMPING

*************************************************;
* REFERENCE:                                    *;
*   model search and inference by bootstrap     *;
*   -- Tibshirani and Knight (1997)             *;
*************************************************;

libname data 'z:\XXXX';

%include 'z:\XXXX\ks.sas';
%include 'z:\XXXX\roc.sas';

data train;
  set data.credit_train;
run;

proc sql noprint;
  select
    count(*) into :size
  from
    train;    
quit;

%macro bumping;

%let seed = 2009;

%let n = 500;

data _null_;
  do i = 1 to &n;
    random = put(ranuni(&seed) * (10 ** 8), 8.);
    name   = compress("random"||put(i, 3.), ' ');
    call symput(name, random);
  end;
run;    

proc datasets library = data nolist;
  delete catalog / memtype = catalog;
run;
quit;

%do i = 1 %to &n;
    
  %put &&random&i;

  proc surveyselect data = train method = urs n = &size seed = &&random&i
    out = sample&i(rename = (NumberHits = _hits)) noprint;
  run;
  
  proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i;
    class x10 - x24 y;
    var x2 - x9;
    target y;
    freq _hits;
  run;

  filename out_tree catalog "data.catalog.out_tree.source";
  
  proc split data = db_sample&i dmdbcat = cl_sample&i
    criterion    = gini
    assess       = impurity
    maxbranch    = 2
    splitsize    = 100
    subtree      = 30
    nsurrs       = 0;
    code file    = out_tree;
    input x2 - x9   / level = interval;
    input x10 - x24 / level = nominal;
    target y        / level = binary;
    freq _hits;
  run;  

  filename in_tree catalog "data.catalog.tree&i..source";

  data _null_;
    infile out_tree;
    input;
    file in_tree;
    if _n_ > 3 then put _infile_;
  run;

  data _tmp1(keep = p_y1 p_y0 y random);
    set data.credit_train;
    random = ranuni(1);
    %include in_tree;
  run;

  proc sort data = _tmp1 sortsize = max;
    by descending p_y1 random;
  run;

  data _tmp2;
    set _tmp1;
    where p_y1 ~= . and y in (0, 1);
    by descending p_y1 random;
    i + 1;
  run;

  proc rank data = _tmp2 out = _tmp3 groups = 10;
    var i;
  run;

  proc sql noprint;
    create table
      _tmp4 as
    select
      i + 1       as decile,
      count(*)    as cnt,
      sum(y)      as bad_cnt
    from
      _tmp3
    group by
      i;

    select sum(cnt) into :cnt from  _tmp4;

    select sum(bad_cnt) into :bad_cnt from _tmp4;    
  quit;
  
  data _tmp5;
    set _tmp4;
    retain cum_cnt cum_bcnt cum_gcnt;
    bpct     = bad_cnt / cnt;
    cum_cnt  + cnt;
    cum_bcnt + bad_cnt;
    cum_gcnt + (cnt - bad_cnt);
    cum_pct  = cum_cnt  / &cnt;
    cum_bpct = cum_bcnt / &bad_cnt;
    cum_gpct = cum_gcnt / (&cnt - &bad_cnt);
    ks       = (max(cum_bpct, cum_gpct) - min(cum_bpct, cum_gpct)) * 100;
  run;

  proc sql noprint;
    select max(ks) into :ks from _tmp5;
  quit;
  
  data _kstmp;
    tree = &i;
    ks   = &ks;
  run;

  %if &i = 1 %then %do;
    data _ks;
      set _kstmp;
    run;
  %end;    
  %else %do;
    data _ks;
      set _ks _kstmp;
    run;
  %end;        
  
%end;

proc sql noprint;
  select max(ks) into :ks from _ks;
  
  select tree into :best from _ks where round(ks, 0.01) = round(&ks., 0.01);
quit;

filename best catalog "data.catalog.tree%trim(&best).source";
filename output "z:\XXXX\BestTree.txt";

data _null_;
  infile best;
  input;
  file output;
  if _n_ = 1 then do;
    put " ******************************************************; ";
    put " ***** BEST TREE: TREE %trim(&best) WITH KS = &KS *****; ";
    put " ******************************************************; ";    
  end;
  put _infile_;
run;

%mend bumping;

%bumping;

data valid(keep = p_y1 p_y0 y);
  set data.credit_valid;
  %include output;
run;

%ks(data = valid, score = p_y1, y = y);
%roc(data = valid, score = p_y0, y = y);

HERE COMES THE SIMPSON RULE

function [result] = simpson(f, a, b, N)
%***************************************************************
%* A FUNCTION OF NUMERICAL INTEGRATION WITH SIMPSON RULE       *
%*=============================================================*
%* PARAMETERS:                                                 *
%*   f: A FUNCTION INPUT                                       *                  
%*   a: LOWER BOUND OF THE FUNCTION INPUT                      *
%*   b: UPPER BOUND OF THE FUNCTION INPUT                      *
%*   N: # OF INTERVALS                                         *
%*=============================================================*
%* AN EXAMPLE:                                                 *
%*   y = simpson('sin(x)', 0, 1, 1000);                        *
%*=============================================================*
%* AUTHOR: Wensui Liu (liuwensui@hotmail.com)                  *
%***************************************************************

dx = (b - a) / N;
func = inline(f);
result = 0;

for i = 1:N
    min = a + (i - 1) * dx;
    med = a + (i - 1 / 2) * dx;
    max = a + i * dx;
    res = dx / 6 * (func(min) + 4 * func(med) + func(max));
    result = result + res;
end
return

BAGGING CART (CLASSIFICATION AND REGRESSION TREE)

libname data 'z:\XXX';

data train;
  set data.credit_train;
run;

proc sql noprint;
  select
    count(*) into :size
  from
    train;    
quit;

%macro bagging;

%let seed = 2009;

%let n = 100;

data _null_;
  do i = 1 to &n;
    random = put(ranuni(&seed) * (10 ** 8), 8.);
    name   = compress("random"||put(i, 3.), ' ');
    call symput(name, random);
  end;
run;    

%do i = 1 %to &n;
    
  %put &&random&i;

  proc surveyselect data = train method = urs n = &size seed = &&random&i
    out = sample&i(rename = (NumberHits = _hits)) noprint;
  run;
  
  proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i;
    class x10 - x24 y;
    var x2 - x9;
    target y;
    freq _hits;
  run;

  filename out_tree catalog "data.catalog.out_tree.source";
  
  proc split data = db_sample&i dmdbcat = cl_sample&i
    criterion    = gini
    assess       = impurity
    maxbranch    = 2
    splitsize    = 100
    nsurrs       = 0;
    code file    = out_tree noleafid;
    input x2 - x9   / level = interval;
    input x10 - x24 / level = nominal;
    target y        / level = binary;
    freq _hits;
  run;  

  filename in_tree catalog "data.catalog.tree&i..source";

  data _null_;
    infile out_tree;
        input;
        file in_tree;
        if _n_ > 3 then put _infile_;
  run;

  data train_score&i(keep = id p_y1 p_y0 y);
    set data.credit_train;
    %include in_tree;
  run;

  data valid_score&i(keep = id p_y1 p_y0 y);
    set data.credit_valid;
    %include in_tree;
  run;

  %if &i = 1 %then %do;
    data train_score;
      set train_score&i;
    run;

    data valid_score;
      set valid_score&i;
    run;
  %end;    
  %else %do;
    data train_score;
      set train_score train_score&i;
    run;

    data valid_score;
      set valid_score valid_score&i;
    run;
  %end;    

%end;

proc summary data = train_score nway;
  class id;
  output out = train_predict(drop = _type_ rename = (_freq_ = freq))
  mean(p_y1) = p_1 mean(p_y0) = p_0 mean(y) = y;
run;

proc summary data = valid_score nway;
  class id;
  output out = valid_predict(drop = _type_ rename = (_freq_ = freq))
  mean(p_y1) = p_1 mean(p_y0) = p_0 mean(y) = y;
run;

%mend bagging;

%bagging;

data data.scored;
  set train_predict(in = a)
      valid_predict(in = b);
  
  if a then train = 1;
  else train = 0;
run;

proc freq data = data.scored;
  tables train * y / list missing;
run;

%include 'z:\XXXX\ks.sas';
%include 'z:\XXXX\roc.sas';

data valid train;
   set data.scored;
   if train = 1 then output train;
   else if train = 0  then output valid;
run;  

title 'KS statistics for trainning dataset';
%ks(data = train, score = p_1, y = y);

title 'KS statistics for validation dataset';
%ks(data = valid, score = p_1, y = y);

title 'ROC statistics for trainning dataset';
%roc(data = train, score = p_0, y = y);

title 'ROC statistics for validation dataset';
%roc(data = valid, score = p_0, y = y);

/* OUTPUT
KS FOR TRAINING DATA
                                 CU.    CU.                  MIN      MAX
DECILE  FREQ   BAD#      BAD%   FREQ   BAD#    CU. BAD%    SCORE    SCORE     %OF BAD       KS
   1     468    156   33.33%     468    156     33.33%      0.13     0.35     54.55%     47.44
   2     468     53   11.32%     936    209     22.33%      0.09     0.13     73.08%     56.53
   3     468     34    7.26%    1404    243     17.31%      0.07     0.09     84.97%     58.54
   4     468     26    5.56%    1872    269     14.37%      0.06     0.07     94.06%     57.57
   5     468      9    1.92%    2340    278     11.88%      0.05     0.06     97.20%     50.28
   6     468      7    1.50%    2808    285     10.15%      0.04     0.05     99.65%     42.23
   7     468      0    0.00%    3276    285      8.70%      0.03     0.04     99.65%     31.58
   8     468      0    0.00%    3744    285      7.61%      0.02     0.03     99.65%     20.93
   9     468      1    0.21%    4212    286      6.79%      0.02     0.02    100.00%     10.65
  10     468      0    0.00%    4680    286      6.11%      0.00     0.02    100.00%      0.00

KS FOR VALIDATION DATA
                                CU.    CU.                  MIN      MAX
DECILE  FREQ   BAD#     BAD%   FREQ   BAD#    CU. BAD%     SCORE    SCORE     %OF BAD       KS
   1     150    31    20.67%     150    31      20.67%      0.12     0.36     36.05%     27.63
   2     150    13     8.67%     300    44      14.67%      0.09     0.12     51.16%     33.06
   3     150    12     8.00%     450    56      12.44%      0.07     0.09     65.12%     37.25
   4     150     5     3.33%     600    61      10.17%      0.06     0.07     70.93%     32.81
   5     150     7     4.67%     750    68       9.07%      0.05     0.06     79.07%     30.84
   6     150     4     2.67%     900    72       8.00%      0.04     0.05     83.72%     25.16
   7     150     2     1.33%    1050    74       7.05%      0.03     0.04     86.05%     17.02
   8     150     6     4.00%    1200    80       6.67%      0.03     0.03     93.02%     13.82
   9     150     5     3.33%    1350    85       6.30%      0.02     0.03     98.84%      9.37
  10     150     1     0.67%    1500    86       5.73%      0.00     0.02    100.00%      0.00

AREA UNDER ROC FOR TRAINING DATA
--------
0.888135

AREA UNDER ROC FOR VALIDATION DATA
--------
0.726259
*/

ANOTHER FUNCTION FOR NUMERICAL INTEGRATION

function result = trape(f, a, b, N)
%***************************************************************
%* A FUNCTION OF NUMERICAL INTEGRATION WITH TRAPEZOIDAL RULE   *
%*=============================================================*
%* PARAMETERS:                                                 *
%*   f: A FUNCTION INPUT                                       * 
%*   a: LOWER BOUND OF THE FUNCTION INPUT                      *
%*   b: UPPER BOUND OF THE FUNCTION INPUT                      *
%*   N: # OF INTERVALS                                         *
%*=============================================================*
%* AN EXAMPLE:                                                 *
%*   y = trape('sin(x)', 0, 1, 1000);                          *
%*=============================================================*
%* AUTHOR: Wensui Liu (liuwensui@hotmail.com)                  *
%***************************************************************

dx = (b - a) / N;
func = inline(f);
result = 0;

for i = 1:N
    res = dx * (func(a + (i - 1) * dx) + func(a + i * dx)) / 2;
    result = result + res;
end
return

A PRACTICE IN MATLAB

function [result] = mpoint(f, a, b, N)
%***************************************************************
%* A FUNCTION OF NUMERICAL INTEGRATION WITH MIDPOINT RULE      *
%*=============================================================*
%* PARAMETERS:                                                 *
%*   f: A FUNCTION INPUT                                       *
%*   a: LOWER BOUND OF THE FUNCTION INPUT                      *
%*   b: UPPER BOUND OF THE FUNCTION INPUT                      *
%*   N: # OF INTERVALS                                         *
%*=============================================================*
%* AN EXAMPLE:                                                 *
%*   y = mpoint('sin(x)', 0, 1, 1000);                         *
%*=============================================================*
%* AUTHOR: Wensui Liu (liuwensui@hotmail.com)                  *
%***************************************************************

dx = (b - a) / N;
func = inline(f);
result = 0;

for i = 1:N
    res = dx * (func(a + (i - 1/2) * dx));
    result = result + res;
end
return

SAS MACRO TO CALCULATE AUC (AREA UNDER CURVE)

%macro roc(data = , score =, y = );
*******************************************;
* THIS MACRO IS TO CALCULATE AUC (AREA    *;
* UNDER CURVE) USING TRAPEZOIDAL RULE     *;
*-----------------------------------------*;
* INPUT PARAMETERS:                       *;
*   DATA : INPUT DATASET                  *;
*   SCORE: SCORE (HIGHER ==> BETTER)      *;
*   Y    : 0/1 TARGET (1 ==> BAD)         *;
*******************************************;
    
data _tmp1;
  set &data;
  where &y in (0, 1) and &score ~= .;
  random = ranuni(1);
  keep &y &score random;
run;

proc sql noprint;
  select sum(&y) into :bcnt from _tmp1;
  select count(*) - sum(&y) into :gcnt from _tmp1;
quit;

proc sort data = _tmp1 sortsize = max;
  by &score random;
run;  

data _tmp2;
  set _tmp1;
  by &score random;
  retain gcum bcum;
  gcum + (1 - &y);
  bcum + &y;
  gpct = gcum / &gcnt;
  bpct = bcum / &bcnt;
run;

proc sort data = _tmp2 sortsize = max;
  by gpct bpct;
run;

data _tmp3;
  set _tmp2;
  by gpct bpct;
  if last.gpct then output;
run;

title 'AREA UNDER ROC';
proc sql;
  create table
    _tmp4 as
  select
    a.gcum as gcum,
    (b.gpct - a.gpct) * (b.bpct + a.bpct) / 2 as dx
  from
    _tmp3 as a, _tmp3 as b
  where
    a.gcum + 1 = b.gcum;

  select sum(dx) from _tmp4;
quit;  
title;

proc datasets library = work nolist;
  delete _tmp:/memtype = data;
run;
quit;
      
%mend roc;    

data test;
  do i = 1 to 10000;
    x = ranuni(1);
    if x + rannor(1) * 0.2 > 0.7 then y = 1;
    else y = 0;
    output;
  end;
run;

proc logistic data = test;
  model y = x;
  score data = test out = predicted;
run;  

%roc(data = predicted, score = p_0, y = y);

/*
-- STANDARD AUC RESULT FROM PROC LOGISTIC:

Association of Predicted Probabilities and Observed Responses

Percent Concordant        91.1    Somers' D    0.824
Percent Discordant         8.8    Gamma        0.825
Percent Tied               0.1    Tau-a        0.352
Pairs                 21336604    c            0.912

-- CALCULATED AUC RESULT FROM MY MACRO:

AREA UNDER ROC          

0.912021
*/

SAS MACRO TO CALCULATE GAINS CHART WITH KS

%macro ks(data = , score = , y = );

options nocenter mprint nodate;

data _tmp1;
  set &data;
  where &score ~= . and y in (1, 0);
  random = ranuni(1);
  keep &score &y random;
run;

proc sort data = _tmp1 sortsize = max;
  by descending &score random;
run;

data _tmp2;
  set _tmp1;
  by descending &score random;
  i + 1;
run;

proc rank data = _tmp2 out = _tmp3 groups = 10;
  var i;
run;

proc sql noprint;
create table
  _tmp4 as
select
  i + 1       as decile,
  count(*)    as cnt,
  sum(&y)     as bad_cnt,
  min(&score) as min_scr format = 8.2,
  max(&score) as max_scr format = 8.2
from
  _tmp3
group by
  i;

select
  sum(cnt) into :cnt
from
  _tmp4;

select
  sum(bad_cnt) into :bad_cnt
from
  _tmp4;    
quit;

data _tmp5;
  set _tmp4;
  retain cum_cnt cum_bcnt cum_gcnt;
  cum_cnt  + cnt;
  cum_bcnt + bad_cnt;
  cum_gcnt + (cnt - bad_cnt);
  cum_pct  = cum_cnt  / &cnt;
  cum_bpct = cum_bcnt / &bad_cnt;
  cum_gpct = cum_gcnt / (&cnt - &bad_cnt);
  ks       = (max(cum_bpct, cum_gpct) - min(cum_bpct, cum_gpct)) * 100;

  format cum_bpct percent9.2 cum_gpct percent9.2
         ks       6.2;
  
  label decile    = 'DECILE'
        cnt       = '#FREQ'
        bad_cnt   = '#BAD'
        min_scr   = 'MIN SCORE'
        max_scr   = 'MAX SCORE'
        cum_gpct  = 'CUM GOOD%'
        cum_bpct  = 'CUM BAD%'
        ks        = 'KS';
run;

title "%upcase(&score) KS";
proc print data  = _tmp5 label noobs;
  var decile cnt bad_cnt min_scr max_scr cum_bpct cum_gpct ks;
run;    
title;

proc datasets library = work nolist;
  delete _: / memtype = data;
run;
quit;

%mend ks;    

data test;
  do i = 1 to 1000;
    score = ranuni(1);
    if score * 2 + rannor(1) * 0.3 > 1.5 then y = 1;
    else y = 0;
    output;
  end;
run;

%ks(data = test, score = score, y = y);

/*
SCORE KS               
                                MIN         MAX
DECILE    #FREQ    #BAD       SCORE       SCORE     CUM BAD%    CUM GOOD%        KS

   1       100      87         0.91        1.00      34.25%        1.74%      32.51
   2       100      78         0.80        0.91      64.96%        4.69%      60.27
   3       100      49         0.69        0.80      84.25%       11.53%      72.72
   4       100      25         0.61        0.69      94.09%       21.58%      72.51
   5       100      11         0.51        0.60      98.43%       33.51%      64.91
   6       100       3         0.40        0.51      99.61%       46.51%      53.09
   7       100       1         0.32        0.40     100.00%       59.79%      40.21
   8       100       0         0.20        0.31     100.00%       73.19%      26.81
   9       100       0         0.11        0.19     100.00%       86.60%      13.40
  10       100       0         0.00        0.10     100.00%      100.00%       0.00
*/

SAS MACRO TO ASSIGN RANDOM INTEGERS TO A GROUP OF MACRO VARIABLES

%macro seeds(n = , prefix = , seed = 1);

data _null_;
  do i = 1 to &n;
    random = put(ranuni(&seed) * (10 ** 10), 10.);
    name   = compress("prefix"||put(i, 4.), ' ');
    call symput(name, random);
  end;
run;    

%do i = 1 %to &n;
  %put SEED VALUE &i  : &prefix&i =  &&prefix&i.;
%end;  

%mend seeds;

%seeds(n = 50, prefix = random);

/*
SEED VALUE 1  : random1 =  1849625698
SEED VALUE 2  : random2 =  9700887157
 ... ...
SEED VALUE 49  : random49 =  5113178829
SEED VALUE 50  : random50 =  4332045635
*/

AN EXAMPLE OF USING DMDB PROCEDURE

/************************************************
* AN EXAMPLE TO USE DMDB PROCEDURE IN SAS       *
* ENTERPRISE MINER CREATING DATA MINING DB WITH *
* DMDB PROCEDURE                                *
************************************************/

data test;
  input X1 X2 X3 Y;
datalines;
1 -0.04926 -0.61599 0
2 0.31526 -1.65430 1
3 0.64517 -0.06878 0
1 0.02592 -0.71203 1
2 1.45284 -0.39064 1
3 0.22533 -0.44507 0
1 -1.24641 -0.73156 0
1 0.88542 -1.63595 1
2 -1.35613 -1.58209 0
3 -1.22333 1.98124 1
2 -0.36128 0.77962 0
1 -0.01054 -0.76720 0
2 1.43263 0.53820 0
3 -0.17737 0.25381 0
2 -1.64183 -0.34028 0
1 -0.83537 -2.00245 0
3 0.00284 -0.75016 0
2 -0.28847 -0.77437 0
3 -0.21167 -0.53833 0
1 -1.23276 0.11452 1
;
run;

*** CREATE OF DATA-MINING DATABASE WITH DMDB PROCEDURE ***;
proc dmdb batch
  data      = test      /* SPECIFY INPUT DATA                 */
  out       = out_data  /* SPECIFY OUTPUT DATA                */
  dmdbcat   = out_cat   /* SPECIFY OUTPUT CATALOG             */
  classout  = out_class /* OUTPUT SUMMARY OF CLASS VARIABLES  */
  varout    = out_var   /* OUTPUT SUMMARY OF NUMERIC VARIABLES*/;
  class x1(asc) y;
  var x2 x3;
  target y;
run;

title "SUMMARY OF CLASS VARIABLES";
proc print data = out_class noobs;
run;

/*
NAME    LEVEL    FREQUENCY    TYPE    CRAW    NRAW
 X1       1              7     N                1
 X1       2              7     N                2
 X1       3              6     N                3
 Y        0             14     N                0
 Y        1              6     N                1
*/

title "SUMMARY OF CONTINUOUS VARIABLES";
proc print data = out_var noobs;
run;

/*
NAME        NMISS       MIN        MAX        MEAN        STD      SKEWNESS    KURTOSIS
 X2             0    -1.64183    1.45284    -0.18245    0.88684     0.17469    -0.49032
 X3             0    -2.00245    1.98124    -0.46709    0.92896     0.72040     1.38541
*/

THE BEAUTY OF RECURSION IN F#

#light

// GAMMA FUNCTION: AN EXAMPLE USING RECURSION IN F#

let rec gamma x =
  match x with
  |1 -> 1
  |x -> (x - 1) * gamma (x - 1)
  
printfn "%i" (gamma 5)  

A EDA MACRO TO RANK BOTH NUMERIC AND CATEGORICAL VARIABLES

%macro eda(data = , vlist = , dep = , out = out, groups = 100);
    
/******************************************************************************
* EXPLORATORY DATA ANALYSIS MACRO FOR BINARY RESPONSE                         *
* --------------------------------------------------------------------------- *
* PURPOSE                                                                     *
*   rank all independent variables to be used in the model                    *
* --------------------------------------------------------------------------- *
* MACRO VARIABLES                                                             *
*   data   : input dataset;                                                   *
*   vlist  : list of independent variables;                                   *
*   dep    : dependent variables, must be either 0 or 1;                      *
*   out    : output dataset, default is out;                                  *
*   groups : number of bins, default is 100;                                  *
* --------------------------------------------------------------------------- *
* CALCULATED MEASURES                                                         *
*   1) DISTANCE MEASURE      : KS                                             *
*   2) INFORMATION MEASURE   : ENTROPY, INFORMATION VALUE, GINI INDEX         *
*   3) DEPENDENCE MEASURE    : LIKELIHOOD RATIO TEST                          *
*   4) CLASSIFY ERROR MEASURE: MISCLASSIFICATION RATE OF VALIDATION SET       *
* --------------------------------------------------------------------------- *
* AUTHOR: WENSUI LIU                                                          *
******************************************************************************/

/* read data in and keep the variables */    
data _tmp1;
  set &data;
  keep &vlist &dep validate;
  if &dep = 1 then do;
    validate = (ranuni(1) < 0.2);
  end;
  else do;
    validate = (ranuni(0) < 0.2);
  end;
run;

proc sql noprint;
/* count number of indep variables */
  select 
    count(*) - 2
  into
    :nvar
  from
    sashelp.vcolumn
  where
    upcase(libname) = 'WORK' and
    upcase(memname) = '_TMP1';

/* read the names of indep variables into a macro variable */  
  select 
    name
  into
    :vars separated by ' '
  from
    sashelp.vcolumn
  where
    upcase(libname) = 'WORK' and
    upcase(memname) = '_TMP1' and
    upcase(name) not in ("%upcase(&dep.)", "VALIDATE");

/* create an empty table to contain measures */
  create table &out
  (
    var char(50)          label = 'Predictors',
    bin num format = 5.0  label = '# of Bins',      
    /* DISTANCE MEASURES */
    ks  num format = 12.8 label = 'KS-statistics' ,
    /* INFORMATION MEASURES */
    en  num format = 12.8 label = 'Entropy',
    iv  num format = 12.8 label = 'Info Values',
    gn  num format = 12.8 label = 'Gini Index',    
    /* DEPENDENCE MEASURES */
    lr  num format = 12.8 label = 'Likelihood Ratio Test',
    /* MISCLASSIFICATION MEASURES */        
    pi  num format = 12.8 label = 'Predictive Index',
    cr  num format = 12.8 label = 'Misclassification Rate'
  );
quit;

%do i = 1 %to &nvar;
/* get the indep variable one by one */
  %let var = %scan(&vars, &i, ' ');

  %put ===========================;
  %put Variable &i: &var;
  %put ===========================;

/* sort the indep variable and create a sequential index */
  proc sort data = _tmp1 out = _tmp2(keep = &var &dep validate) sortsize = max;
    by &var;
  run;

  data _tmp3;
    set _tmp2;
    by &var;
    if first.%trim(&var) then index + 1;
  run;

/* count distinct level of index */  
  proc sql noprint;
    select
      count(distinct index)
    into
      :level
    from
      _tmp3;
  quit;

  %if &level > 2 %then %do;
/* bin the indep variable based on index */
    proc rank data = _tmp3 groups = &groups ties = low out = _tmp4;
      var index;
      ranks level;
    run;

/* get the # of binned levels */
    proc sql noprint;
      select
        count(distinct level)
      into
        :nbin
      from
        _tmp4;
    quit;
    
/* run a logistic regression with a random sample set */    
    ods listing close;
    ods output association = _auc1 (where = (label2 = 'c'));
    ods output FitStatistics = _lr1 (where = (index(Criterion, 'Log') > 0));
    proc logistic data = _tmp4 desc;
      where validate = 0;
      class level;
      model &dep = level;
      score data = _tmp4(where = (validate = 1)) out = _score1;
    run;
    ods listing;

    proc sql noprint;
      select
        sum(&dep)
      into
        :nvalidate
      from
        _score1;
    quit;

    proc sort data = _score1 sortsize = max;
      by p_0;
    run;

    data _score2;
      set _score1;
      by p_0;
      if _n_ <= &nvalidate then pred = 1;
      else pred = 0;
    run;
    
/* get the misclassification rate of validation data */    
    proc sql noprint;
      select
        sum(case when pred ~= &dep then 1 else 0 end) / count(*)
      into
        :cr
      from
        _score2;
    quit;
    
/* get the AUC */
    data _auc2;
      set _auc1;
      call symput('pi', put((nValue2 - 0.5) / 0.5, 8.6));
    run;
        
/* get the Likelihood Ratio test */  
    data _lr2;
      set _lr1;
      lr = 1 - probchi(InterceptOnly - InterceptAndCovariates, &nbin - 1);
      /* assign P-value of Likelihood Ratio test to a macro variable */
      call symput('lr', put(lr, 8.6));
    run;

/* get the ks-statistics of indep variable */    
    proc npar1way data = _tmp4 missing noprint edf;
      class level;
      var &dep;
      output out = _ks1 edf;
    run;
    
    proc sql noprint;
/* assign ks to a macro variable */        
      select
        _ks_
      into
        :ks
      from
        _ks1;
      
/* count numbers of goods and bads by binning levels */
      create table
        _iv1 as
      select
        level,
        sum(case when &dep = 0 then 1 else 0 end) as g_cnt,
        sum(case when &dep = 1 then 1 else 0 end) as b_cnt
      from
        _tmp4
      group by
        level;

/* calculate information value of indep variable */      
      create table
        _iv2 as
      select
        level,
        b_cnt,
        g_cnt,
        g_cnt / sum(g_cnt) as g_pct,
        b_cnt / sum(b_cnt) as b_pct,
        (calculated b_pct - calculated g_pct) *
        log(max(calculated b_pct, constant("small")) /
            max(calculated g_pct, constant("small")))
                           as iv_tmp
      from
        _iv1;

      select
        sum(iv_tmp)
      into
        :iv
      from
        _iv2;

/* calculate entropy of the indep variable */
      create table
        _en1 as
      select
        level,
        - g_cnt * log(max(constant("small"), g_cnt / (g_cnt + b_cnt))) as num,
        g_cnt + b_cnt                                                  as den
      from
        _iv1;

      /* assign entropy into macro variable */
      select
        -log(sum(num) / sum(den))
      into
        :en
      from
        _en1;
    quit;

    proc iml;
      sort _iv2 by descending b_pct;
      use _iv2;
      read all var {level b_cnt g_cnt} into x;
      close _iv2;

      n_event = x[, 2];
      n_noevent = x[, 3];
      gini = 0;
      do i = 2 to nrow(x);
        _a = 0;
        do j = 1 to i - 1;
          _a = sum(_a, n_noevent[j, 1]);
        end;
        gini = sum(gini, n_event[i, 1] * _a);
      end;
      gini = (1 - (gini * 2 + (n_event # n_noevent)[+, ])
             / (n_event[+, ] * n_noevent[+, ])) * 100;
      /* assign gini index into macro variable */
      call symput('gn', char(gini));
    quit;
    
/* combine all measures in one table */    
    data _measure ;
      var = "%trim(&var.)";
      bin = &nbin;
      ks  = &ks;
      en  = &en;
      iv  = &iv;
      gn  = &gn;
      lr  = &lr;
      pi  = &pi;
      cr  = &cr;
    run;

/* insert measures into output table */    
    proc sql;        
      insert into 
        &out
      select
        *
      from
        _measure;
    quit;
    
  %end;
  
%end;

/* delete unwanted tables */
proc datasets nolist library = work;
  delete _: / memtype = data;
run;
quit;
  
/* sort output table by ks-statistics */  
proc sort data = &out;
  by descending ks;
run;

/* print out meassures */  
proc print data = &out label;
run;

%mend eda;

当我老了 -给我们的父母

當我老了,不再是原來的我…… 請理解我,對我有一點耐心。
當我把菜湯灑到自己的衣服上時,當我忘記怎樣繫鞋帶時,請想一想當初我是如何手把手地教你。 
當我一遍又一遍地重複你早已聽膩的話語,請耐心地聽我說,不要打斷我。你小的時候,我不得不重複那個講過千百遍的故事,直到你進入夢鄉。 
當我需要你幫我洗澡時,請不要責備我。還記得小時候我千方百計哄你洗澡的情形嗎?
當我對新科技和新事物不知所措時,請不要嘲笑我。想一想當初我怎樣耐心地回答你的每一個「為什麼」。 
當我由於雙腿疲勞而無法行走時,請伸出你年輕有力的手攙扶我。就像你小時候學習走路時,我扶你那樣。 
當我忽然忘記我們談話的主題,請給我一些時間讓我回想。其實對我來說,談論什麼並不重要,只要你能在一旁聽我說,我就很滿足……… 
當你看著老去的我,請不要悲傷…… 理解我,支持我,就像你剛才開始學習如何生活時我對你那樣。 
當初我引導你走上人生路,如今請陪伴我走完最後的路。 
給我 ~ 你的愛和耐心,我會抱以感激的微笑,這微笑中 ~ 凝結著我對你無限的愛。

A MACRO TO CHECK MODEL INVERSION

data kyphosis;
  input Age StartVert NumVert Kyphosis @@;
  datalines;
71  5  3  0     158 14 3 0    128 5   4 1
2   1  5  0     1   15 4 0    1   16  2 0
61  17 2  0     37  16 3 0    113 16  2 0
59  12 6  1     82  14 5 1    148 16  3 0
18  2  5  0     1   12 4 0    243 8   8 0
168 18 3  0     1   16 3 0    78  15  6 0
175 13 5  0     80  16 5 0    27  9   4 0
22  16 2  0     105 5  6 1    96  12  3 1
131 3  2  0     15  2  7 1    9   13  5 0
12  2 14  1     8   6  3 0    100 14  3 0
4   16 3  0     151 16 2 0    31  16  3 0
125 11 2  0     130 13 5 0    112 16  3 0
140 11 5  0     93  16 3 0    1   9   3 0
52  6  5  1     20  9  6 0    91  12  5 1
73  1  5  1     35  13 3 0    143 3   9 0
61  1  4  0     97  16 3 0    139 10  3 1
136 15 4  0     131 13 5 0    121 3   3 1
177 14 2  0     68  10 5 0    9   17  2 0
139 6  10 1     2   17 2 0    140 15  4 0
72  15 5  0     2   13 3 0    120 8   5 1
51  9  7  0     102 13 3 0    130 1   4 1
114 8  7  1     81  1  4 0    118 16  3 0
118 16 4  0     17  10 4 0    195 17  2 0
159 13 4  0     18  11 4 0    15  16  5 0
158 15 4  0     127 12 4 0    87  16  4 0
206 10 4  0     11  15 3 0    178 15  4 0
157 13 3  1     26  13 7 0    120 13  2 0
42  6  7  1     36  13 4 0
;
run;

proc logistic data = kyphosis desc;
  model  kyphosis = Age StartVert NumVert / lackfit;
  score data = kyphosis out = prediction;
run;

%macro inversion(data = , score = score, x = );
/*********************************************************
* PURPOSE OF THE MACRO                                   *
*   CHECK MODEL INVERSION FOR EACH NUMERIC PREDICTOR     *
* ------------------------------------------------------ *
* MACRO INPUTS                                           *
* DATA : input dataset                                   *
* SCORE: predicted probability                           *
* X    : list of numeric predictors                      *
* ------------------------------------------------------ *
* AUTHOR: WENSUI LIU, RISK MODELER AT CHASE              *
*********************************************************/

options orientation = landscape center pageno = 1 nodate ls = 100 ps = 80;

data _tmp1;
  retain &score &x;
  set &data;
  keep &score &x;
run;

ods listing close;
ods output position = _pos;
proc contents data = _tmp1 varnum;
run;
ods listing;

proc sql noprint;
select
  count(*) - 1
into
  :nx
from
  _pos;
quit;

%do i = 1 %to &nx;
  %let var = %scan(&x, &i);
  
  proc rank data = _tmp1 groups = 10 ties = low
    out = _tmp2(keep = &score rank) ;
    var &var;
    ranks rank;
  run;
  
  proc summary nway data = _tmp2;
    class rank;
    output out = _tmp3 (drop = _freq_ _type_)
    mean(&score) = prob;
  run;
  
  data _tmp4;
    set _tmp3;
    decile = put(rank + 1, z2.);
    attrib decile label = "Decile of &var";
  run;
  
  title "Inversion Check for &var";
  proc chart data = _tmp4;
    vbar decile / sumvar = prob type = mean width = 8
                  symbol = "o" noheader space = 0 nospace
                  axis = 0 to 1 by 0.2 ref = 0 to 1 by 0.2;
  run;
  title;
%end;

proc datasets library = work;
  delete _:;
run;
quit;

%mend;

%inversion(data = prediction, score = p_1, x = Age StartVert NumVert);

/* OUTPUT:
Inversion Check for Age                      
Predicted Probability: Kyphosis=1 Mean
  1 +--------------------------------------------------------------------------------
    |
    |
    |
0.8 +--------------------------------------------------------------------------------
    |
    |
    |
0.6 +--------------------------------------------------------------------------------
    |
    |
    |
0.4 +--------------------------------------------------------------------------------
    |                                                        oooooooo
    |                                                        oooooooo
    |        oooooooo        oooooooooooooooo                oooooooo
0.2 +--------oooooooo--------oooooooooooooooo--------oooooooooooooooooooooooooooooooo
    |        oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
    |oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
    |oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
    +--------------------------------------------------------------------------------
        01      02      03      04      05      06      07      08      09      10
                                      Decile of Age

Inversion Check for StartVert              
Predicted Probability: Kyphosis=1 Mean
  1 +------------------------------------------------------------------------
    |
    |
    |
0.8 +------------------------------------------------------------------------
    |
    |
    |
0.6 +oooooooo----------------------------------------------------------------
    |oooooooo
    |oooooooooooooooo
    |oooooooooooooooo
0.4 +oooooooooooooooo--------------------------------------------------------
    |oooooooooooooooooooooooo
    |oooooooooooooooooooooooo
    |oooooooooooooooooooooooo
0.2 +oooooooooooooooooooooooo------------------------------------------------
    |oooooooooooooooooooooooooooooooooooooooo
    |oooooooooooooooooooooooooooooooooooooooooooooooooooooooo
    |oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
    +------------------------------------------------------------------------
        01      02      03      04      05      06      07      08      10
                               Decile of StartVert

Inversion Check for NumVert  
Predicted Probability: Kyphosis=1 Mean
  1 +------------------------------------------------
    |
    |
    |                                        oooooooo
0.8 +----------------------------------------oooooooo
    |                                        oooooooo
    |                                        oooooooo
    |                                        oooooooo
0.6 +----------------------------------------oooooooo
    |                                        oooooooo
    |                                        oooooooo
    |                                        oooooooo
0.4 +----------------------------------------oooooooo
    |                                oooooooooooooooo
    |                                oooooooooooooooo
    |                        oooooooooooooooooooooooo
0.2 +----------------oooooooooooooooooooooooooooooooo
    |                oooooooooooooooooooooooooooooooo
    |oooooooooooooooooooooooooooooooooooooooooooooooo
    |oooooooooooooooooooooooooooooooooooooooooooooooo
    +------------------------------------------------
        01      02      05      07      09      10
                    Decile of NumVert
*/

A MACRO FOR MEDIAN IMPUTATION

data tmp1;
  do i = 1 to 1000;
    x1   = rannor(1);
    x2   = rannor(2);
    if i < 10 then do;
      x1 = .;
      x2 = .;
    end;
    output;
  end;
run;

proc means data = tmp1 n nmiss median;
  var x1 x2;
run;

%macro impute(in = , out = output, varlist = );
/**********************************************************
* PURPOSE OF THE MACRO:                                   *
*   PERFORM MEDIAN IMPUTATION FOR NUMERIC VARIABLES       *
* ------------------------------------------------------- *
* PARAMETERS:                                             *
* IN     : INPUT DATASET                                  *
* OUT    : OUTPUT DATASET                                 *
* VARLIST: A LIST OF VARIABLES TO BE IMPUTED              *
* ------------------------------------------------------- *
* AUTHOR: WENSUI LIU, RISK MODELER AT CHASE               *
**********************************************************/

options mprint;

  /* CREATE TEMPORARY DATASETS FOR THE MACRO */
data _origin _tmp1(keep = _id &varlist);
  set &in;
  _id + 1;
run;

proc sort data = _origin sortsize = max;
  _id;
run;

  /* CALCULATE MEDIANS */
proc summary data = _tmp1 nway;
  output out = _med1(drop = _freq_ _type_)
  median(&varlist) =;
run;

ods listing close;
ods output position = _pos;
proc contents data = _med1 varnum;
run;
ods listing;

  /* REPLACE MISSINGS BY MEDIAN */
proc sql noprint;
select
  "case when a."||compress(variable, ' ')||" = . then b."||
  compress(variable, ' ')||" else a."||compress(variable, ' ')||
  " end as imp_"||compress(variable, ' ')
into 
  :replace separated by ", "
from
  _pos;

create table
  _tmp2 as
select
  a._id,
  &replace
from
  _tmp1 as a, _med1 as b;
quit;

proc sort data = _tmp2 sortsize = max;
  by _id;
run;

  /* MERGE BACK WITH THE ORIGIN DATA */
data &out(drop = _id);
  merge _origin _tmp2;
  by _id;
run;

%mend impute;

%impute(in = , varlist = x1 x2);

proc means data = output n nmiss median;
  var imp_:;
run;

A MACRO PERFORMING 1-SIDE CAPPING OF NUMERIC VARIABLES

data tmp1;
  do i = 1 to 1000;
    x1   = rannor(1);
    x2   = rannor(2);
    crap = 1;
    output;
  end;
run;

proc means data = tmp1 p99 min median max n;
  var x1 x2;
run;

%macro cap(in = , out = output, varlist = , p = 99);
/**********************************************************
* PURPOSE OF THE MACRO:                                   *
*   PERFORM 1-SIDE CAPPING OF NUMERIC VARIABLES           *
* ------------------------------------------------------- *
* PARAMETERS:                                             *
* IN     : INPUT DATASET                                  *
* OUT    : OUTPUT DATASET                                 *
* VARLIST: A LIST OF VARIABLES TO BE CAPPED               *
* P      : CAPPING PERCENTAGE                             *
* ------------------------------------------------------- *
* AUTHOR: WENSUI LIU, RISK MODELER AT CHASE               *
**********************************************************/

options mprint;

  /* CREATE TEMPORARY DATASETS FOR THE MACRO */
data _origin _tmp1(keep = _id &varlist);
  set &in;
  _id + 1;
run;

proc sort data = _origin sortsize = max;
  _id;
run;

  /* CALCULATE CAPPING LIMITS */
proc summary data = _tmp1 nway;
  output out = _sum1(drop = _freq_ _type_)
  P%trim(&p)(&varlist) =;
run;

ods listing close;
ods output position = _pos;
proc contents data = _sum1 varnum;
run;
ods listing;

  /* REPLACE EXTREMES BY CAPPED LIMITS */
proc sql noprint;
select
  "min(a."||compress(variable, ' ')||", b."||
  compress(variable, ' ')||") as cap_"||compress(variable, ' ')
into 
  :replace separated by ", "
from
  _pos;

create table
  _tmp2 as
select
  a._id,
  &replace
from
  _tmp1 as a, _sum1 as b;
quit;

proc sort data = _tmp2 sortsize = max;
  by _id;
run;

  /* MERGE BACK WITH THE ORIGIN DATA */
data &out(drop = _id);
  merge _origin _tmp2;
  by _id;
run;

%mend cap;

%cap(in = tmp1, varlist = x1 x2);

proc means data = output min median max n;
  var cap_:;
run;
 

Custom HTML

Creative Commons License
WENSUI'S BLOG IN STATISTICAL COMPUTING by WenSui Liu is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

wensui liu

Interests
Thanks for visiting!
Please wait...
Sorry, the comment you entered is too long. Please shorten it.
You didn't enter anything. Please try again.
Sorry, we can't add your comment right now. Please try again later.
To add a comment, you need permission from your parent. Ask for permission
Your parent has turned off comments.
Sorry, we can't delete your comment right now. Please try again later.
You've exceeded the maximum number of comments that can be left in one day. Please try again in 24 hours.
Your account has had the ability to leave comments disabled because our systems indicate that you may be spamming other users. If you believe that your account has been disabled in error please contact Windows Live support.
Complete the security check below to finish leaving your comment.
The characters you type in the security check must match the characters in the picture or audio.
Ajaywrote:
Lovely helpful
 keep it simple site

Needed to know how to use SQL server for storing data and R for building scoring model do you know any stuff on that ?
Ajay,India
ohri2007@gmail.com

www.decisionstats.com
July 7
(no name) wrote:
Hello WenSui. Thank you so much for your blog. I've added your site to my reference for statistics. Your posts on count data models in SAS is fantastic. Especially since I have an older version without proc countreg. Your blog is like an ideal statistic textbook which succintly explains the theory behind the text and then immediately provides the code and output. I will definitely recommend your site to my colleagues.
 
Have a wonderful day
 
Apr. 17
慧闽 刘wrote:
It is my first time visiting your  stattistical blog. I need to say you are amazing. You must be a very smart guy and also hard working. Can not imagine how much you have done from March 2006.
I am very interested in your paper 'Improving credit scoring by GAM' and also classification using GAM.
Thanks for your great job. I am sure this will help me a lot with my GAM project after I struggle with reading your paper and codes for a few days.
Thanks again. Have a great day.
Dec. 5
No namewrote:
老大,
从你的博客中学到好多东西, 而且老大是个热心人, 从你的代码里
得到不少启发。
能不能再写个关于评分卡的代码, 比如你作出了模型, 想再进一步作出评分卡,
如以age 为变量之一,   不同年龄段的分数       这样的代码是找不到的。
 
好期待啊
June 7