wensui's profileWENSUI'S BLOG IN STATIST...PhotosBlogListsMore Tools Help

Blog


    RealMatrix++: A Real-Valued Matrix Class and Library

    RealMatrix++
    A Real-Valued Matrix Class and Library

    Daniel Hanson
    Email: HoserAnalytics (at) gmail (dot) com
    June 1, 2009

    Introduction and Purpose

    A frequent frustration for quant development in finance is the lack of a matrix class and matrix algebra library in the C++ Standard Library.  Third party libraries do exist, both open source and commercial, but open source libraries are often time consuming to learn and implement, and commercial libraries can carry a heavy price tag.  Furthermore, for quant finance, as in other domains such as predictive modeling, the vast majority of linear algebra calculations are real-valued.

    The intent of the RealMatrix++ Library is to provide an easy-to-use, non-templated matrix class and library for real-valued matrices and vectors, with the ultimate goal of providing functionality on par with that of the JAMA library for Java [1].  The code is not optimized for performance to the extent of, say, the Matrix Template Library [2], or Blitz [3], but it will likely offer more rapid development, and performance sufficient at least for prototyping models.

    Current Release Features and Design

    This initial release contains a real matrix class, and matrix operations +, -, and *, where (*) is overloaded for standard matrix multiplication, vector by matrix and matrix by vector multiplication, and scalar multiplication (see MatrixOperations.h for details).

    There is also a set of validation conditions outlined in MatrixConditions.h, which will check if two matrices are of the same dimensions (necessary for matrix addition and subtraction), if a matrix is square, or if a matrix is symmetric.

    The Matrix class itself is built as a wrapper around the valarray container in the C++ standard library.  As Stroustrup[4] says, "The standard library does not provide a matrix class.  Instead, the intent is for valarray to provide the tools for building matrices optimized for a variety of needs."

    Also, as Josuttis [5] points out, "a valarray can be used as a base both for vector and matrix operations.with good performance."  On the other hand, he notes that "you often need some inconvenient and time-consuming type conversions." This relates primarily to the need to create a separate valarray instance to use the valarray slice function.  For this reason, its use is avoided in the Matrix class here, although it may be used minimally going forward, in cases where the benefit outweighs the inconvenience.

    The current release has been unit tested, but not completely stress tested.  It should work as designed, but at this stage it should probably be best considered a beta release. Your comments are invited (email address given at the top of this document).

    Future Additions/Enhancements

    The following functionality is being considered for future
    releases (not necessarily an exhaustive list):
            Matrix Decompositions
            o    Cholesky Decomposition of square and symmetric matrices
            o    Singular Value Decomposition
            o    LU Decomposition
            Eigenvalue calculations
            Principal component analysis
            Solutions of linear systems
            Matrix inverse

    Licensing

    This software is released under the terms of the BSD License [6]:

    Copyright (c) 2009, Daniel Hanson
    All rights reserved.

    Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

    Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.  Neither the name of the copyright holder nor the names of any contributors may be used to endorse or promote products derived from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT  SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

    References

    [1]
    http://math.nist.gov/javanumerics/jama/
    [2] http://www.osl.iu.edu/research/mtl/
    [3] http://www.oonumerics.org/blitz/
    [4] Bjarne Stroustrup, The C++ Programming Language, 3E (Addison Wesley)
    [5] Nicolai Josuttis, The C++ Standard Library: A Tutorial and Reference  (Addison Wesley)
    [6]
    http://www.opensource.org/licenses/bsd-license.php

    A MACRO TO IMPUTE NUMERIC VARIABLES BASED UPON BAD RATES

    libname data 'E:\sas';

    data test;
      set data.credit;
      if ranuni(2) < 0.1 then x2 = .;
      if ranuni(3) < 0.3 then x3 = .;
      if ranuni(4) < 0.2 then x4 = .;
    run;

    proc means data = test n nmiss nonobs;
      class y;
      var x2 - x4;
    run;  

    /*                                       N
         Y    Variable    Label       N    Miss
         0    X2          X2       5217     591
              X3          X3       4011    1797
              X4          X4       4675    1133
         1    X2          X2        343      29
              X3          X3        268     104
              X4          X4        311      61
    */
        
    %macro impute(data = , y = , x = , outfile = );
    *********************************************;
    * THIS MACRO IS TO IMPUTE NUMERIC VARIABLES *;
    * BASED UPON BAD RATES                      *;
    * ----------------------------------------- *;
    * PARAMETERS:                               *;
    *  DATA   : INPUT DATA SET                  *;
    *  Y      : RESPONSE WITH 0 / 1             *;
    *  X      : A LIST OF NUMERIC PREDICTORS    *;
    *           WITH MISSING VALUES             *;    
    *  OUTFILE: FILEREF FOR IMPUTING RECODE     *;
    *********************************************;
        
    %if %sysfunc(fexist(&outfile)) %then %do;
      data _null_;
        rc = fdelete("&outfile");
        if rc = 0 then put "THE FILE HAS BEEN DELETE.";
      run;
    %end;

    ods output position = _pos1;
    ods listing close;
    proc contents data = &data varnum;
    run;
    ods listing;

    proc sql noprint;
      select
        upcase(variable) into :x2 separated by ' '
      from
        _pos1
      where
        compress(upcase(type), ' ') = 'NUM' and
        index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;

      select
        count(variable) into :xcnt
      from
        _pos1
      where
        compress(upcase(type), ' ') = 'NUM' and
        index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;    
    quit;

    data _tmp1;
      retain &x2 &y;
      set &data;
      where &y in (0, 1);
      keep &x2 &y;
    run;  

    ods output position = _pos2;
    ods listing close;
    proc contents data = _tmp1 varnum;
    run;
    ods listing;

    %let loop = 0;

    %do i = 1 %to &xcnt;
      proc sql noprint;
        select
          upcase(variable) into :var
        from
          _pos2
        where
          num = &i;

        select
          count(distinct &var) into :vcnt
        from
          _tmp1
        where
          &var ~= . and
          &y in (1, 0);

        select
          nmiss(&var) / count(*) * 100 into :mpct
        from
          _tmp1
        where
          &y in (1, 0);

        select
          count(distinct &y) into :yflg
        from
          _tmp1
        where
          &y in (1, 0);
      quit;
      
      %if &vcnt > 1 & &mpct > 0 & &mpct < 99 & &yflg = 2 %then %do;        
        %let loop = %eval(&loop + 1);
        
        proc sql noprint;
          select
            mean(&y) format = 10.6 into :mis_rate
          from
            _tmp1
          where
            &var = .     and
            &y in (0, 1);

          select
            count(*) into :mcnt
          from
            _tmp1
          where
            &var = .     and
            &y in (1, 0);

          create table
            _tmp2 as
          select distinct
            "%upcase(%trim(&var))"    as variable,
            &var                      as value,
            mean(&y)                  as bad_rate,
            &mis_rate                 as mis_rate,
            count(*)                  as freq,
            &mcnt                     as mis_cnt,
            abs(mean(&y) - &mis_rate) as dif
          from
            _tmp1
          where
            &var ~= .    and
            &y in (0, 1)
          group by
            &var;
        quit;

        proc sort data = _tmp2 sortsize = max;
          by dif descending freq;
        run;

        data _tmp3;
          set _tmp2;
          by dif descending freq;
          if _n_ = 1 then output;
        run;

        %if &loop = 1 %then %do;
          data _result;
            set _tmp3;
          run;
        %end;
        %else %then %do;
          data _result;
            set _result _tmp3;
          run;
        %end;    
      %end;        
    %end;    

    data _null_;
      set _result end = eof;
      file impute mod;
      if _n_ = 1 then do;
        put @1 93 * "*" ";";
        put @1 "* IMPUTED VARIABLE" @32 "| IMPUTED VALUE" @52 "| BAD RATE"
            @65 "| MIS. COUNT" @80 "| MIS. RATE" @93 "*;";
      end;
      put @1 "*" 91 * "-" @93 "*;";
      put @1 "*"  @3 variable @32 "|" @34 value @52 "|" @54 bad_rate percent8.2
          @65 "|" @67 mis_cnt @80 "|" @82 mis_rate percent8.2 @93 "*;";
      if eof then do;
        put @1 93 * "*" ";";
      end;  
    run;

    data _null_;
      set _result;
      file impute mod;
      if _n_ = 1 then do;
        put " ";
        put @10 3 * "*" " RECODING OF IMPUTATION " 3 * "*" ";";
      end;
      put @3  "if " @6 variable " = . " @40 "then " variable
          " = " value ";";
    run;

    %mend impute;

    filename impute 'E:\sas\impute.sas';

    %impute(data = test, y = y, x = x2 x3 x4, outfile = impute);

    /*
    $ more impute.sas
    **************************************************************;
    *IMPUTED VARIABLE|IMPUTED VALUE|BAD RATE|MIS. COUNT|MIS. RATE*;
    *------------------------------------------------------------*;
    * X2             | -0.68       |   4.55%| 620      |   4.68% *;
    *------------------------------------------------------------*;
    * X3             | -0.685      |   5.67%| 1901     |   5.47% *;
    *------------------------------------------------------------*;
    * X4             | 1.728       |   5.00%| 1194     |   5.11% *;
    **************************************************************;
     
             *** RECODING OF IMPUTATION ***;
      if X2  = .                           then X2  = -0.683 ;
      if X3  = .                           then X3  = -0.685 ;
      if X4  = .                           then X4  = 1.728 ;
    */

    *** SHOW HOW TO USE THE IMPUTING RECODE ***;
    data test2;
      set test;
      %inc impute / source2;
    run;

    proc means data = test2 n nmiss nonobs;
      class y;
      var x2 - x4;
    run;  

    /*                                       N
         Y    Variable    Label       N    Miss
         0    X2          X2       5808       0
              X3          X3       5808       0
              X4          X4       5808       0
         1    X2          X2        372       0
              X3          X3        372       0
              X4          X4        372       0
    */

    READ FIXED-WIDTH DATA FILE WITH read.fwf()

    ##############################################
    # READ FIXED-WIDTH DATA FILE WITH read.fwf() #
    # ------------------------------------------ #
    # EQUIVALENT SAS CODE:                       #
    # filename data 'E:\sas\fixed.txt';          #
    # data test;                                 #
    #   infile data truncover;                   #
    #   input @1 city $ 1 - 22 @23 population;   #
    # run;                                       #
    ##############################################

    # OPEN A CONNECTION TO THE DATA FILE
    data <- file(description = "e:\\sas\\fixed.txt", open = "r")

    # width = c(...)      ==> SPECIFIES COLUMN WIDTHS
    # col.names = c(...)  ==> GIVES COLUMN NAMES
    # colClasses = c(...) ==> DEFINES COLUMN CLASSES
    test <- read.fwf(data, header = FALSE, width = c(22, 10), 
                     col.names = c("city", "population"),
                     colClasses = c("character", "numeric"))

    close(data)

    READ DATA FROM A FILE CONNECTION WITH SCAN()

    #################################################
    # READ DATA FROM A FILE CONNECTION WITH SCAN()  #
    #-----------------------------------------------#
    # EQUIVALENT SAS CODE:                          #
    # filename data 'E:\sas\Kyphosis';              #
    # data test;                                    #
    #   infile data;                                #
    #   input x1 x2 x3 y @@;                        #
    # run;                                          #
    #################################################

    # OPEN A CONNECTION TO THE DATA FILE
    data <- file(description = "e:\\sas\\kyphosis", open = "r")

    # READ DATA FROM THE CONNECTION
    test <- data.frame(scan(data, what = list(x1 = 0, x2 = 0, x3 = 0, y = ""),
                            quiet = TRUE))

    # CLOSE THE CONNECTION
    close(data) 

    EVALUATION OF Logistic() FUNCTION IN RWeka

    ##############################################
    # EVALUATION OF Logistic() FUNCTION IN RWeka #
    ##############################################

    library(rpart)
    data(kyphosis)

    # MODEL FROM glm() FUNCTION
    model1 <- glm(Kyphosis ~., data = kyphosis, family = binomial())

    # PRINT OUT MODEL SUMMARY
    summary(model1)

    #               Estimate Std. Error z value Pr(>|z|)  
    #  (Intercept) -2.036934   1.449575  -1.405  0.15996  
    #  Age          0.010930   0.006446   1.696  0.08996 .
    #  Number       0.410601   0.224861   1.826  0.06785 .
    #  Start       -0.206510   0.067699  -3.050  0.00229 **

    # MODEL FROM Logistic() FUNCTION IN RWeka
    library(RWeka)
    model2 <- Logistic(Kyphosis ~., data = kyphosis)

    # PRINT OUT ESTIMATES
    print(model2)

    #                Class
    # Variable      absent
    # ====================
    # Age          -0.0109
    # Number       -0.4106
    # Start         0.2065
    # Intercept     2.0369

    # MODEL EVALUATION BASED ON 5-FOLD CROSS-VALIDATION
    summary(logit, newdata = kyphosis, class = TRUE, numFolds = 5, seed = 2009)

    # === 5 Fold Cross Validation ===
    # === Summary ===
    # Correctly Classified Instances          65               80.2469 %
    # Incorrectly Classified Instances        16               19.7531 %
    # Kappa statistic                          0.3481
    # Mean absolute error                      0.2503
    # Root mean squared error                  0.3624
    # Relative absolute error                 74.2609 %
    # Root relative squared error             88.868  %
    # Total Number of Instances               81    
    # === Detailed Accuracy By Class ===
    #                TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
    #                  0.906     0.588      0.853     0.906     0.879      0.835    absent
    #                  0.412     0.094      0.538     0.412     0.467      0.835    present
    # Weighted Avg.    0.802     0.484      0.787     0.802     0.792      0.835
    # === Confusion Matrix ===
    #   a  b   <-- classified as
    #  58  6 |  a = absent
    #  10  7 |  b = present

    DO DATA INPUT IN R CONSOLE WITH SCAN()

    ############################################
    # DO DATA INPUT IN R CONSOLE WITH SCAN()   #
    #------------------------------------------#
    # COMPARABLE SAS CODE:                     #
    #------------------------------------------#
    # data test;                               #
    #   input x1 x2 x3 y $ @@;                 #
    # cards;                                   #
    # 71 . 3 0 158 14 3 0                      #
    # 128 5 4 1                                #
    # ;                                        #
    # run;                                     #
    ############################################

    test <- data.frame(scan(file = "", 
                            what = list(x1 = 0, x2 = 0, x3 = 0, y = "")))
    71 NA 3 0 158 14 3 0
    128 5 4 1

    A MACRO TO DO MONOTONIC WOE TRANSFORMATION FOR THE NUMERIC PREDICTOR

    libname data 'E:\sas';

    data test;
      set data.credit;
      keep y x2 x3 x8 x9;
    run;  

    %macro monowoe(data = , y = , x = , maxbin = );
    ******************************************************;
    * THIS MACRO IS TO DO A MONOTONIC WEIGHT OF EVIDENCE *;
    * (WOE) TRANSFORMATION FOR THE NUMERIC PREDICTOR.    *;
    *----------------------------------------------------*;
    * PARAMETERS:                                        *;
    *  DATA  : INPUT DATA                                *;
    *  Y     : GOOD/BAD RESPONSE (BAD ==> 1)             *;
    *  X     : A NUMERIC PREDICTOR                       *;
    *  MAXBIN: MAX. NUMBER OF BINS USED TO CALCULATE WOE *;
    ******************************************************;    
        
    %let minbad = 10;

    data _tmp1(keep = &y &x);
      set &data;
      where &y in (0, 1) and &x ~= .;
    run;  

    proc sql noprint;
      select min(&maxbin, count(distinct &x)) into :nbin from _tmp1;

      select count(*) into :cnt from _tmp1;

      select sum(&y) into :bad from _tmp1;
    quit;

    %do i = &nbin %to 2 %by -1;

      proc rank data = _tmp1 out = _tmp2 ties = low groups = &i;
        var &x;
        ranks rank;
      run;

      proc summary data = _tmp2 nway;
        class rank;
        output out = _tmp3(drop = _type_ rename = (_freq_ = freq))
        sum(&y)  = bad mean(&y) = bad_rate
        mean(&x) = x min(&x) = minx max(&x) = maxx;
      run;
      
      proc sql noprint;
        select min(bad) into :minbad2 from _tmp3;
        
        select min(bad_rate) into :minrate from _tmp3;

        select max(bad_rate) into :maxrate from _tmp3;    
      quit;

      %if &minbad2 >= &minbad & &minrate > 0 & &maxrate < 1 %then %do;
        ods output SpearmanCorr = _corr(rename = (x = cor));
        ods listing close;
        proc corr data = _tmp3 spearman;
          var x;
          with bad_rate;
        run;
        ods listing;

        proc sql noprint;
          select abs(cor) into :cor from _corr;
        quit;
        
        %if &cor = 1 %then %goto outside;
      %end;
      
    %end;    

    %outside:

    proc sort data = _tmp3;
      by rank;
    run;

    data _tmp4;
      retain bin minx maxx bad freq bad_rate
             gpct bpct woe iv;
      set _tmp3;  
      by rank;

      bin + 1;
      bpct = bad / &bad;
      gpct = (freq - bad) / (&cnt - &bad);
      woe  = log(gpct / bpct);
      iv   = (gpct - bpct) * log(gpct / bpct);
      keep bin minx maxx bad freq bad_rate
           gpct bpct woe iv;  
    run;

    options nocenter nodate ls = 100;
    title;
    proc report data = _tmp4 box spacing = 1 split = "*";
      column("MOTONOTIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(&x)"
             bin minx maxx bad freq woe iv);
      define bin      / "BIN*LEVEL"    width = 5  format = z2. order order = data;
      define minx     / "LOWER*BOUND"  width = 10 format = 9.2;
      define maxx     / "UPPER*BOUND"  width = 10 format = 9.2;
      define bad      / "#BADS"        width = 10 format = 9.;
      define freq     / "#FREQ"        width = 10 format = 9.;
      define woe      / "WOE"          width = 10 format = 10.4;
      define iv       / "INFO.*VALUE"  width = 8  format = 8.4;
    run;

    %mend monowoe;

    options mprint mlogic nocenter;
    %monowoe(data = test, y = y, x = x3, maxbin = 20);

    /*
     +---------------------------------------------------------------------+
     |         MOTONOTIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X3          |
     |  BIN      LOWER      UPPER                                     INFO.|
     |LEVEL      BOUND      BOUND      #BADS      #FREQ        WOE    VALUE|
     |---------------------------------------------------------------------|
     |   01|     -1.19|     -0.81|       134|      1576|   -0.3722|  0.0417|
     |---------------------------------------------------------------------|
     |   02|     -0.69|     -0.31|       111|      1693|   -0.0912|  0.0024|
     |---------------------------------------------------------------------|
     |   03|     -0.18|      0.32|        67|      1419|    0.2565|  0.0135|
     |---------------------------------------------------------------------|
     |   04|      0.45|      2.97|        60|      1492|    0.4244|  0.0362|
     +---------------------------------------------------------------------+
    */ 

    THE EQUIVALENT FUNCTION OF PROC RANK WITH WEIGHT VARIABLE

    ***********************************************;
    * A DEMO ON HOW TO DO THE EQUIVALENT FUNCTION *;
    * OF PROC RANK WITH WEIGHT VARIABLE           *;
    * ------------------------------------------- *;
    * SPECIAL THANKS TO MY FRIEND AND COWORKER    *;
    * Mr. Jonas Bilenas                           *;
    ***********************************************;

    data one;
      do i = 1 to 1000;
        x  = rannor(1);
        if x < 0 then wt = 1;
        else wt = 10;
        output;
      end;
    run;

    %macro rank(data = , var = , weight = , groups = , output = );

    data _tmp1(keep = &var &weight);
      set &data;
    run;

    proc univariate data = &data noprint;
      var &var;
      weight &weight;
      output out = _tmp2
      pctlpre = decile_
      pctlpts = 0 to 100 by %sysevalf(100 / &groups);
    run;

    data &output(keep = &var lower upper rank &weight);
      set _tmp1;

      if _n_ = 1 then do;
        set _tmp2;
      end;
      array d[*] decile_:;
      rank = 999;
      do i = 2 to dim(d);
        if &var <= d[i] and rank = 999 then do;
          lower  = round(d[i - 1], 0.0001);
          upper  = round(d[i], 0.0001);
          rank = i - 1;
        end;
      end;        
    run;

    %mend rank;

    %rank(data = one, var = x, weight = wt, groups = 4, output = two);

    proc means data = two nonobs n min max missing sumwgt;
      class rank;
      var x;
      weight wt;
    run;

    /* OUTPUT:

     rank       N         Minimum         Maximum        Sum Wgts
    -------------------------------------------------------------    
        1     591      -3.3158398       0.2201053         1365.00
        2     137       0.2276076       0.6209279         1370.00
        3     136       0.6211922       1.0996207         1360.00
        4     136       1.1021029       3.2444009         1360.00
    */

    A SAS MACRO TO CREATE FREQUENCY SUMMARY TABLE

    *************************************************;
    * A MACRO TO CREATE FREQUENCY SUMMARY TABLE FOR *;
    * ALL INPUT CATEGORICAL VARIABLES               *;
    *************************************************;

    libname data 'z:\XXX';

    %let cat = x10 - x11;

    data tmp1;
      set data.credit_train;
    run;

    %macro freq(data = , y = , cat_x = );

    data _tmp1;
      retain &cat_x &y;
      set &data;
      where &y in (0, 1);
      keep &cat_x &y;
    run;

    ods listing close;
    ods output position = _pos;
    proc contents data = _tmp1 varnum;
    run;
    ods listing;

    proc sql noprint;
      select count(*) - 1 into :nx from _pos;

      select variable into :xlist separated by ' ' from _pos where num <= &nx;

      select count(*) into :obs from _tmp1;
    run;

    %do i = 1 %to &nx;
      %let x2 = %scan(&xlist, &i, ' ');

      data _tmp2(keep = i var level &y);
        set _tmp1;
        i   = &i;
        var = upcase("&x2");
        level = input(&x2, $50.);
      run;
      
      proc summary data = _tmp2 nway missing;
        class i var level;
        output out = _sum1(drop = _type_ rename = (_freq_ = freq))
        sum(&y) = bads mean(&y) = bad_rate;
      run;

      %if &i = 1 %then %do;
        data _sum2;
          set _sum1;
        run;
      %end;
      %else %do;
        data _sum2;
          set _sum2 _sum1;
        run;
      %end;    
    %end;    

    data _final;
      set _sum2;
      if compress(level, ' ') = '' then level = '999. MISSING';
      pct = freq / &obs;
    run;  

    options nocenter nodate;
    title;
    proc report data = _final box spacing = 2 headskep split = "*";
      column("FREQUENCY SUMMARY OF CATEGORICAL VARIABLES* "
             var level freq pct bads bad_rate);  
      define var      / "ATTRIBUTE" order order = data width = 20 format = $20.;
      define level    / "LEVEL"     width = 20 format = $20.;
      define freq     / "FREQUENCY" width = 10 format = 10.;
      define pct      / "PERCENT"   width = 10 format = percent10.4;
      define bads     / "# BADS"    width = 10 format = 9.;
      define bad_rate / "BAD RATE"  width = 10 format = percent10.4;
    run;
      
    %mend freq;

    %freq(data = tmp1, y = y, cat_x = &cat);

    /*                                                                                                  
    -----------------------------------------------------------------------
    |          FREQUENCY SUMMARY OF CATEGORICAL VARIABLES                 |
    |                                                                     |
    |ATTRIBUTE     LEVEL     FREQUENCY     PERCENT      # BADS    BAD RATE|
    |                                                                     |
    |---------------------------------------------------------------------|
    |X10       | 1        |       1352|  28.8889% |        115|   8.5059% |  
    |          |----------+-----------+-----------+-----------+-----------|  
    |          | 2        |       3328|  71.1111% |        171|   5.1382% |  
    |----------+----------+-----------+-----------+-----------+-----------|  
    |X11       | 1        |        825|  17.6282% |        101|  12.2424% |  
    |          |----------+-----------+-----------+-----------+-----------|  
    |          | 2        |       3855|  82.3718% |        185|   4.7990% |  
    -----------------------------------------------------------------------  
    */                                                                                                

    LOGISTIC MODEL IN ENTERPRISE MINER WITH DMREG PROCEDURE

    *************************************************;
    * LOGISTIC MODEL IN ENTERPRISE MINER WITH       *;
    * DMREG PROCEDURE                               *;
    *************************************************;

    libname data 'z:\XXX';

    data tmp1;
      set data.credit_train;
    run;

    *** REGULAR WAY TO MODEL LOGISTIC REGRESSION ***;
    proc logistic data = tmp1 desc;
      class x10 x11 x13 x17 x18 X19 X22 / param = glm;
      model y = X2 X3 X8 X9 x10 x11 x13 x17 x18 X19 X22;
    run;  
    /*      Type 3 Analysis of Effects
                            Wald
    Effect      DF    Chi-Square    Pr > ChiSq
    X2           1        5.2833        0.0215
    X3           1       26.7384        <.0001
     ... ...
                   Analysis of Maximum Likelihood Estimates
                                      Standard          Wald
    Parameter       DF    Estimate       Error    Chi-Square    Pr > ChiSq
    Intercept        1     -2.9224      0.4685       38.9145        <.0001
    X2               1      0.2162      0.0941        5.2833        0.0215
    X3               1     -0.4701      0.0909       26.7384        <.0001
     ... ...
    */

    *** E-MINER WAY TO MODEL LOGISTIC REGRESSION ***;
    proc dmdb data = tmp1 out = tmp_db dmdbcat = tmp_cl;
      class x10 x11 x13 x17 x18 X19 X22 y(desc);
      var X2 X3 X8 X9;
      target y;
    run;

    proc dmreg data = tmp_db dmdbcat = tmp_cl;
      class x10 x11 x13 x17 x18 X19 X22 y;
      model y = X2 X3 X8 X9 x10 x11 x13 x17 x18 X19 X22 / coding = glm;
      code file = 'z:\XXX\score_code.sas';
    run;
    /*         Type 3 Analysis of Effects
                              Wald
    Effect        DF    Chi-Square    Pr > ChiSq
    X2             1        5.2832        0.0215
    X3             1       26.7383        <.0001
     ... ...
                             Analysis of Maximum Likelihood Estimates
                                      Standard          Wald              
    Parameter       DF    Estimate       Error    Chi-Square    Pr > ChiSq
    Intercept        1     -2.9224      0.4685         38.92        <.0001
    X2               1      0.2162      0.0941          5.28        0.0215
    X3               1     -0.4701      0.0909         26.74        <.0001
     ... ...
    */

    MODEL SEARCH BY BUMPING

    *************************************************;
    * REFERENCE:                                    *;
    *   model search and inference by bootstrap     *;
    *   -- Tibshirani and Knight (1997)             *;
    *************************************************;

    libname data 'z:\XXXX';

    %include 'z:\XXXX\ks.sas';
    %include 'z:\XXXX\roc.sas';

    data train;
      set data.credit_train;
    run;

    proc sql noprint;
      select
        count(*) into :size
      from
        train;    
    quit;

    %macro bumping;

    %let seed = 2009;

    %let n = 500;

    data _null_;
      do i = 1 to &n;
        random = put(ranuni(&seed) * (10 ** 8), 8.);
        name   = compress("random"||put(i, 3.), ' ');
        call symput(name, random);
      end;
    run;    

    proc datasets library = data nolist;
      delete catalog / memtype = catalog;
    run;
    quit;

    %do i = 1 %to &n;
        
      %put &&random&i;

      proc surveyselect data = train method = urs n = &size seed = &&random&i
        out = sample&i(rename = (NumberHits = _hits)) noprint;
      run;
      
      proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i;
        class x10 - x24 y;
        var x2 - x9;
        target y;
        freq _hits;
      run;

      filename out_tree catalog "data.catalog.out_tree.source";
      
      proc split data = db_sample&i dmdbcat = cl_sample&i
        criterion    = gini
        assess       = impurity
        maxbranch    = 2
        splitsize    = 100
        subtree      = 30
        nsurrs       = 0;
        code file    = out_tree;
        input x2 - x9   / level = interval;
        input x10 - x24 / level = nominal;
        target y        / level = binary;
        freq _hits;
      run;  

      filename in_tree catalog "data.catalog.tree&i..source";

      data _null_;
        infile out_tree;
        input;
        file in_tree;
        if _n_ > 3 then put _infile_;
      run;

      data _tmp1(keep = p_y1 p_y0 y random);
        set data.credit_train;
        random = ranuni(1);
        %include in_tree;
      run;

      proc sort data = _tmp1 sortsize = max;
        by descending p_y1 random;
      run;

      data _tmp2;
        set _tmp1;
        where p_y1 ~= . and y in (0, 1);
        by descending p_y1 random;
        i + 1;
      run;

      proc rank data = _tmp2 out = _tmp3 groups = 10;
        var i;
      run;

      proc sql noprint;
        create table
          _tmp4 as
        select
          i + 1       as decile,
          count(*)    as cnt,
          sum(y)      as bad_cnt
        from
          _tmp3
        group by
          i;

        select sum(cnt) into :cnt from  _tmp4;

        select sum(bad_cnt) into :bad_cnt from _tmp4;    
      quit;
      
      data _tmp5;
        set _tmp4;
        retain cum_cnt cum_bcnt cum_gcnt;
        bpct     = bad_cnt / cnt;
        cum_cnt  + cnt;
        cum_bcnt + bad_cnt;
        cum_gcnt + (cnt - bad_cnt);
        cum_pct  = cum_cnt  / &cnt;
        cum_bpct = cum_bcnt / &bad_cnt;
        cum_gpct = cum_gcnt / (&cnt - &bad_cnt);
        ks       = (max(cum_bpct, cum_gpct) - min(cum_bpct, cum_gpct)) * 100;
      run;

      proc sql noprint;
        select max(ks) into :ks from _tmp5;
      quit;
      
      data _kstmp;
        tree = &i;
        ks   = &ks;
      run;

      %if &i = 1 %then %do;
        data _ks;
          set _kstmp;
        run;
      %end;    
      %else %do;
        data _ks;
          set _ks _kstmp;
        run;
      %end;        
      
    %end;

    proc sql noprint;
      select max(ks) into :ks from _ks;
      
      select tree into :best from _ks where round(ks, 0.01) = round(&ks., 0.01);
    quit;

    filename best catalog "data.catalog.tree%trim(&best).source";
    filename output "z:\XXXX\BestTree.txt";

    data _null_;
      infile best;
      input;
      file output;
      if _n_ = 1 then do;
        put " ******************************************************; ";
        put " ***** BEST TREE: TREE %trim(&best) WITH KS = &KS *****; ";
        put " ******************************************************; ";    
      end;
      put _infile_;
    run;

    %mend bumping;

    %bumping;

    data valid(keep = p_y1 p_y0 y);
      set data.credit_valid;
      %include output;
    run;

    %ks(data = valid, score = p_y1, y = y);
    %roc(data = valid, score = p_y0, y = y);

    HERE COMES THE SIMPSON RULE

    function [result] = simpson(f, a, b, N)
    %***************************************************************
    %* A FUNCTION OF NUMERICAL INTEGRATION WITH SIMPSON RULE       *
    %*=============================================================*
    %* PARAMETERS:                                                 *
    %*   f: A FUNCTION INPUT                                       *                  
    %*   a: LOWER BOUND OF THE FUNCTION INPUT                      *
    %*   b: UPPER BOUND OF THE FUNCTION INPUT                      *
    %*   N: # OF INTERVALS                                         *
    %*=============================================================*
    %* AN EXAMPLE:                                                 *
    %*   y = simpson('sin(x)', 0, 1, 1000);                        *
    %*=============================================================*
    %* AUTHOR: Wensui Liu (liuwensui@hotmail.com)                  *
    %***************************************************************

    dx = (b - a) / N;
    func = inline(f);
    result = 0;

    for i = 1:N
        min = a + (i - 1) * dx;
        med = a + (i - 1 / 2) * dx;
        max = a + i * dx;
        res = dx / 6 * (func(min) + 4 * func(med) + func(max));
        result = result + res;
    end
    return

    BAGGING CART (CLASSIFICATION AND REGRESSION TREE)

    libname data 'z:\XXX';

    data train;
      set data.credit_train;
    run;

    proc sql noprint;
      select
        count(*) into :size
      from
        train;    
    quit;

    %macro bagging;

    %let seed = 2009;

    %let n = 100;

    data _null_;
      do i = 1 to &n;
        random = put(ranuni(&seed) * (10 ** 8), 8.);
        name   = compress("random"||put(i, 3.), ' ');
        call symput(name, random);
      end;
    run;    

    %do i = 1 %to &n;
        
      %put &&random&i;

      proc surveyselect data = train method = urs n = &size seed = &&random&i
        out = sample&i(rename = (NumberHits = _hits)) noprint;
      run;
      
      proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i;
        class x10 - x24 y;
        var x2 - x9;
        target y;
        freq _hits;
      run;

      filename out_tree catalog "data.catalog.out_tree.source";
      
      proc split data = db_sample&i dmdbcat = cl_sample&i
        criterion    = gini
        assess       = impurity
        maxbranch    = 2
        splitsize    = 100
        nsurrs       = 0;
        code file    = out_tree noleafid;
        input x2 - x9   / level = interval;
        input x10 - x24 / level = nominal;
        target y        / level = binary;
        freq _hits;
      run;  

      filename in_tree catalog "data.catalog.tree&i..source";

      data _null_;
        infile out_tree;
            input;
            file in_tree;
            if _n_ > 3 then put _infile_;
      run;

      data train_score&i(keep = id p_y1 p_y0 y);
        set data.credit_train;
        %include in_tree;
      run;

      data valid_score&i(keep = id p_y1 p_y0 y);
        set data.credit_valid;
        %include in_tree;
      run;

      %if &i = 1 %then %do;
        data train_score;
          set train_score&i;
        run;

        data valid_score;
          set valid_score&i;
        run;
      %end;    
      %else %do;
        data train_score;
          set train_score train_score&i;
        run;

        data valid_score;
          set valid_score valid_score&i;
        run;
      %end;    

    %end;

    proc summary data = train_score nway;
      class id;
      output out = train_predict(drop = _type_ rename = (_freq_ = freq))
      mean(p_y1) = p_1 mean(p_y0) = p_0 mean(y) = y;
    run;

    proc summary data = valid_score nway;
      class id;
      output out = valid_predict(drop = _type_ rename = (_freq_ = freq))
      mean(p_y1) = p_1 mean(p_y0) = p_0 mean(y) = y;
    run;

    %mend bagging;

    %bagging;

    data data.scored;
      set train_predict(in = a)
          valid_predict(in = b);
      
      if a then train = 1;
      else train = 0;
    run;

    proc freq data = data.scored;
      tables train * y / list missing;
    run;

    %include 'z:\XXXX\ks.sas';
    %include 'z:\XXXX\roc.sas';

    data valid train;
       set data.scored;
       if train = 1 then output train;
       else if train = 0  then output valid;
    run;  

    title 'KS statistics for trainning dataset';
    %ks(data = train, score = p_1, y = y);

    title 'KS statistics for validation dataset';
    %ks(data = valid, score = p_1, y = y);

    title 'ROC statistics for trainning dataset';
    %roc(data = train, score = p_0, y = y);

    title 'ROC statistics for validation dataset';
    %roc(data = valid, score = p_0, y = y);

    /* OUTPUT
    KS FOR TRAINING DATA
                                     CU.    CU.                  MIN      MAX
    DECILE  FREQ   BAD#      BAD%   FREQ   BAD#    CU. BAD%    SCORE    SCORE     %OF BAD       KS
       1     468    156   33.33%     468    156     33.33%      0.13     0.35     54.55%     47.44
       2     468     53   11.32%     936    209     22.33%      0.09     0.13     73.08%     56.53
       3     468     34    7.26%    1404    243     17.31%      0.07     0.09     84.97%     58.54
       4     468     26    5.56%    1872    269     14.37%      0.06     0.07     94.06%     57.57
       5     468      9    1.92%    2340    278     11.88%      0.05     0.06     97.20%     50.28
       6     468      7    1.50%    2808    285     10.15%      0.04     0.05     99.65%     42.23
       7     468      0    0.00%    3276    285      8.70%      0.03     0.04     99.65%     31.58
       8     468      0    0.00%    3744    285      7.61%      0.02     0.03     99.65%     20.93
       9     468      1    0.21%    4212    286      6.79%      0.02     0.02    100.00%     10.65
      10     468      0    0.00%    4680    286      6.11%      0.00     0.02    100.00%      0.00

    KS FOR VALIDATION DATA
                                    CU.    CU.                  MIN      MAX
    DECILE  FREQ   BAD#     BAD%   FREQ   BAD#    CU. BAD%     SCORE    SCORE     %OF BAD       KS
       1     150    31    20.67%     150    31      20.67%      0.12     0.36     36.05%     27.63
       2     150    13     8.67%     300    44      14.67%      0.09     0.12     51.16%     33.06
       3     150    12     8.00%     450    56      12.44%      0.07     0.09     65.12%     37.25
       4     150     5     3.33%     600    61      10.17%      0.06     0.07     70.93%     32.81
       5     150     7     4.67%     750    68       9.07%      0.05     0.06     79.07%     30.84
       6     150     4     2.67%     900    72       8.00%      0.04     0.05     83.72%     25.16
       7     150     2     1.33%    1050    74       7.05%      0.03     0.04     86.05%     17.02
       8     150     6     4.00%    1200    80       6.67%      0.03     0.03     93.02%     13.82
       9     150     5     3.33%    1350    85       6.30%      0.02     0.03     98.84%      9.37
      10     150     1     0.67%    1500    86       5.73%      0.00     0.02    100.00%      0.00

    AREA UNDER ROC FOR TRAINING DATA
    --------
    0.888135

    AREA UNDER ROC FOR VALIDATION DATA
    --------
    0.726259
    */

    ANOTHER FUNCTION FOR NUMERICAL INTEGRATION

    function result = trape(f, a, b, N)
    %***************************************************************
    %* A FUNCTION OF NUMERICAL INTEGRATION WITH TRAPEZOIDAL RULE   *
    %*=============================================================*
    %* PARAMETERS:                                                 *
    %*   f: A FUNCTION INPUT                                       * 
    %*   a: LOWER BOUND OF THE FUNCTION INPUT                      *
    %*   b: UPPER BOUND OF THE FUNCTION INPUT                      *
    %*   N: # OF INTERVALS                                         *
    %*=============================================================*
    %* AN EXAMPLE:                                                 *
    %*   y = trape('sin(x)', 0, 1, 1000);                          *
    %*=============================================================*
    %* AUTHOR: Wensui Liu (liuwensui@hotmail.com)                  *
    %***************************************************************

    dx = (b - a) / N;
    func = inline(f);
    result = 0;

    for i = 1:N
        res = dx * (func(a + (i - 1) * dx) + func(a + i * dx)) / 2;
        result = result + res;
    end
    return

    A PRACTICE IN MATLAB

    function [result] = mpoint(f, a, b, N)
    %***************************************************************
    %* A FUNCTION OF NUMERICAL INTEGRATION WITH MIDPOINT RULE      *
    %*=============================================================*
    %* PARAMETERS:                                                 *
    %*   f: A FUNCTION INPUT                                       *
    %*   a: LOWER BOUND OF THE FUNCTION INPUT                      *
    %*   b: UPPER BOUND OF THE FUNCTION INPUT                      *
    %*   N: # OF INTERVALS                                         *
    %*=============================================================*
    %* AN EXAMPLE:                                                 *
    %*   y = mpoint('sin(x)', 0, 1, 1000);                         *
    %*=============================================================*
    %* AUTHOR: Wensui Liu (liuwensui@hotmail.com)                  *
    %***************************************************************

    dx = (b - a) / N;
    func = inline(f);
    result = 0;

    for i = 1:N
        res = dx * (func(a + (i - 1/2) * dx));
        result = result + res;
    end
    return

    SAS MACRO TO CALCULATE AUC (AREA UNDER CURVE)

    %macro roc(data = , score =, y = );
    *******************************************;
    * THIS MACRO IS TO CALCULATE AUC (AREA    *;
    * UNDER CURVE) USING TRAPEZOIDAL RULE     *;
    *-----------------------------------------*;
    * INPUT PARAMETERS:                       *;
    *   DATA : INPUT DATASET                  *;
    *   SCORE: SCORE (HIGHER ==> BETTER)      *;
    *   Y    : 0/1 TARGET (1 ==> BAD)         *;
    *******************************************;
        
    data _tmp1;
      set &data;
      where &y in (0, 1) and &score ~= .;
      random = ranuni(1);
      keep &y &score random;
    run;

    proc sql noprint;
      select sum(&y) into :bcnt from _tmp1;
      select count(*) - sum(&y) into :gcnt from _tmp1;
    quit;

    proc sort data = _tmp1 sortsize = max;
      by &score random;
    run;  

    data _tmp2;
      set _tmp1;
      by &score random;
      retain gcum bcum;
      gcum + (1 - &y);
      bcum + &y;
      gpct = gcum / &gcnt;
      bpct = bcum / &bcnt;
    run;

    proc sort data = _tmp2 sortsize = max;
      by gpct bpct;
    run;

    data _tmp3;
      set _tmp2;
      by gpct bpct;
      if last.gpct then output;
    run;

    title 'AREA UNDER ROC';
    proc sql;
      create table
        _tmp4 as
      select
        a.gcum as gcum,
        (b.gpct - a.gpct) * (b.bpct + a.bpct) / 2 as dx
      from
        _tmp3 as a, _tmp3 as b
      where
        a.gcum + 1 = b.gcum;

      select sum(dx) from _tmp4;
    quit;  
    title;

    proc datasets library = work nolist;
      delete _tmp:/memtype = data;
    run;
    quit;
          
    %mend roc;    

    data test;
      do i = 1 to 10000;
        x = ranuni(1);
        if x + rannor(1) * 0.2 > 0.7 then y = 1;
        else y = 0;
        output;
      end;
    run;

    proc logistic data = test;
      model y = x;
      score data = test out = predicted;
    run;  

    %roc(data = predicted, score = p_0, y = y);

    /*
    -- STANDARD AUC RESULT FROM PROC LOGISTIC:

    Association of Predicted Probabilities and Observed Responses

    Percent Concordant        91.1    Somers' D    0.824
    Percent Discordant         8.8    Gamma        0.825
    Percent Tied               0.1    Tau-a        0.352
    Pairs                 21336604    c            0.912

    -- CALCULATED AUC RESULT FROM MY MACRO:

    AREA UNDER ROC          

    0.912021
    */

    SAS MACRO TO CALCULATE GAINS CHART WITH KS

    %macro ks(data = , score = , y = );

    options nocenter mprint nodate;

    data _tmp1;
      set &data;
      where &score ~= . and y in (1, 0);
      random = ranuni(1);
      keep &score &y random;
    run;

    proc sort data = _tmp1 sortsize = max;
      by descending &score random;
    run;

    data _tmp2;
      set _tmp1;
      by descending &score random;
      i + 1;
    run;

    proc rank data = _tmp2 out = _tmp3 groups = 10;
      var i;
    run;

    proc sql noprint;
    create table
      _tmp4 as
    select
      i + 1       as decile,
      count(*)    as cnt,
      sum(&y)     as bad_cnt,
      min(&score) as min_scr format = 8.2,
      max(&score) as max_scr format = 8.2
    from
      _tmp3
    group by
      i;

    select
      sum(cnt) into :cnt
    from
      _tmp4;

    select
      sum(bad_cnt) into :bad_cnt
    from
      _tmp4;    
    quit;

    data _tmp5;
      set _tmp4;
      retain cum_cnt cum_bcnt cum_gcnt;
      cum_cnt  + cnt;
      cum_bcnt + bad_cnt;
      cum_gcnt + (cnt - bad_cnt);
      cum_pct  = cum_cnt  / &cnt;
      cum_bpct = cum_bcnt / &bad_cnt;
      cum_gpct = cum_gcnt / (&cnt - &bad_cnt);
      ks       = (max(cum_bpct, cum_gpct) - min(cum_bpct, cum_gpct)) * 100;

      format cum_bpct percent9.2 cum_gpct percent9.2
             ks       6.2;
      
      label decile    = 'DECILE'
            cnt       = '#FREQ'
            bad_cnt   = '#BAD'
            min_scr   = 'MIN SCORE'
            max_scr   = 'MAX SCORE'
            cum_gpct  = 'CUM GOOD%'
            cum_bpct  = 'CUM BAD%'
            ks        = 'KS';
    run;

    title "%upcase(&score) KS";
    proc print data  = _tmp5 label noobs;
      var decile cnt bad_cnt min_scr max_scr cum_bpct cum_gpct ks;
    run;    
    title;

    proc datasets library = work nolist;
      delete _: / memtype = data;
    run;
    quit;

    %mend ks;    

    data test;
      do i = 1 to 1000;
        score = ranuni(1);
        if score * 2 + rannor(1) * 0.3 > 1.5 then y = 1;
        else y = 0;
        output;
      end;
    run;

    %ks(data = test, score = score, y = y);

    /*
    SCORE KS               
                                    MIN         MAX
    DECILE    #FREQ    #BAD       SCORE       SCORE     CUM BAD%    CUM GOOD%        KS

       1       100      87         0.91        1.00      34.25%        1.74%      32.51
       2       100      78         0.80        0.91      64.96%        4.69%      60.27
       3       100      49         0.69        0.80      84.25%       11.53%      72.72
       4       100      25         0.61        0.69      94.09%       21.58%      72.51
       5       100      11         0.51        0.60      98.43%       33.51%      64.91
       6       100       3         0.40        0.51      99.61%       46.51%      53.09
       7       100       1         0.32        0.40     100.00%       59.79%      40.21
       8       100       0         0.20        0.31     100.00%       73.19%      26.81
       9       100       0         0.11        0.19     100.00%       86.60%      13.40
      10       100       0         0.00        0.10     100.00%      100.00%       0.00
    */

    SAS MACRO TO ASSIGN RANDOM INTEGERS TO A GROUP OF MACRO VARIABLES

    %macro seeds(n = , prefix = , seed = 1);

    data _null_;
      do i = 1 to &n;
        random = put(ranuni(&seed) * (10 ** 10), 10.);
        name   = compress("prefix"||put(i, 4.), ' ');
        call symput(name, random);
      end;
    run;    

    %do i = 1 %to &n;
      %put SEED VALUE &i  : &prefix&i =  &&prefix&i.;
    %end;  

    %mend seeds;

    %seeds(n = 50, prefix = random);

    /*
    SEED VALUE 1  : random1 =  1849625698
    SEED VALUE 2  : random2 =  9700887157
     ... ...
    SEED VALUE 49  : random49 =  5113178829
    SEED VALUE 50  : random50 =  4332045635
    */

    AN EXAMPLE OF USING DMDB PROCEDURE

    /************************************************
    * AN EXAMPLE TO USE DMDB PROCEDURE IN SAS       *
    * ENTERPRISE MINER CREATING DATA MINING DB WITH *
    * DMDB PROCEDURE                                *
    ************************************************/

    data test;
      input X1 X2 X3 Y;
    datalines;
    1 -0.04926 -0.61599 0
    2 0.31526 -1.65430 1
    3 0.64517 -0.06878 0
    1 0.02592 -0.71203 1
    2 1.45284 -0.39064 1
    3 0.22533 -0.44507 0
    1 -1.24641 -0.73156 0
    1 0.88542 -1.63595 1
    2 -1.35613 -1.58209 0
    3 -1.22333 1.98124 1
    2 -0.36128 0.77962 0
    1 -0.01054 -0.76720 0
    2 1.43263 0.53820 0
    3 -0.17737 0.25381 0
    2 -1.64183 -0.34028 0
    1 -0.83537 -2.00245 0
    3 0.00284 -0.75016 0
    2 -0.28847 -0.77437 0
    3 -0.21167 -0.53833 0
    1 -1.23276 0.11452 1
    ;
    run;

    *** CREATE OF DATA-MINING DATABASE WITH DMDB PROCEDURE ***;
    proc dmdb batch
      data      = test      /* SPECIFY INPUT DATA                 */
      out       = out_data  /* SPECIFY OUTPUT DATA                */
      dmdbcat   = out_cat   /* SPECIFY OUTPUT CATALOG             */
      classout  = out_class /* OUTPUT SUMMARY OF CLASS VARIABLES  */
      varout    = out_var   /* OUTPUT SUMMARY OF NUMERIC VARIABLES*/;
      class x1(asc) y;
      var x2 x3;
      target y;
    run;

    title "SUMMARY OF CLASS VARIABLES";
    proc print data = out_class noobs;
    run;

    /*
    NAME    LEVEL    FREQUENCY    TYPE    CRAW    NRAW
     X1       1              7     N                1
     X1       2              7     N                2
     X1       3              6     N                3
     Y        0             14     N                0
     Y        1              6     N                1
    */

    title "SUMMARY OF CONTINUOUS VARIABLES";
    proc print data = out_var noobs;
    run;

    /*
    NAME        NMISS       MIN        MAX        MEAN        STD      SKEWNESS    KURTOSIS
     X2             0    -1.64183    1.45284    -0.18245    0.88684     0.17469    -0.49032
     X3             0    -2.00245    1.98124    -0.46709    0.92896     0.72040     1.38541
    */

    THE BEAUTY OF RECURSION IN F#

    #light

    // GAMMA FUNCTION: AN EXAMPLE USING RECURSION IN F#

    let rec gamma x =
      match x with
      |1 -> 1
      |x -> (x - 1) * gamma (x - 1)
      
    printfn "%i" (gamma 5)