we generate a binary outcome variable for each transaction $i$ that takes the value one if there is a successful match and zero otherwise:

$$
y_i = \begin{cases} 1 & \text{if } \Delta_i < \delta^*, \\ 0 & \text{otherwise,} \end{cases}
$$

where $\Delta_i$ is the absolute difference between the values reported for transaction $i$ in the two datasets and $\delta^*$ is the threshold for a match. By altering $\delta^*$, we can adjust the deviation that we accept for a transaction that we consider a match.³
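A minimal sketch of this indicator in Python, assuming the merged data sit in a pandas DataFrame whose hypothetical columns `value_shs` and `value_other` hold the transaction values reported in the two datasets:

```python
import pandas as pd

# Threshold delta* for a match; footnote 3 sets it to EUR 10,000.
THRESHOLD_EUR = 10_000

def match_indicator(df: pd.DataFrame) -> pd.Series:
    """Return y_i = 1 if the absolute deviation between the two
    reported values is below the threshold, and 0 otherwise."""
    delta = (df["value_shs"] - df["value_other"]).abs()
    return (delta < THRESHOLD_EUR).astype(int)
```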
    To learn rules with a decision tree, we use $y_i$ as the outcome. We further provide the tree with a number of features in the form of a dummy variable for data points that do not appear in the SHS data and dummies for the currency of the stock and the issuer country. To allow the tree to consider information on the investors, we include an indicator variable for the cluster to which a k-means (k=3) clustering algorithm assigns the investor on the basis of her aggregated volume of transactions (bank cluster). On the basis of these features, the tree successively splits the data into subsamples by selecting the variable and cutoff rule that maximize the homogeneity of the subsamples in terms of the outcome. This way, the tree provides sample splits that produce groups of transactions $g$ for which an accurate match is possible ($\frac{1}{n_g}\sum_{i \in g} y_i \to 1$, with $n_g$ the number of transactions in group $g$) and groups of transactions that do not allow for an accurate matching of transactions ($\frac{1}{n_g}\sum_{i \in g} y_i \to 0$).
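A sketch of the clustering and tree steps under stated assumptions: a transaction-level DataFrame `df` with hypothetical columns `y` (match indicator), `in_shs`, `currency`, `issuer_country`, `investor_id`, and `volume`, plus an illustrative tree depth that is not taken from the paper:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

def learn_split_rules(df: pd.DataFrame) -> str:
    # Bank cluster: k-means (k=3) on each investor's aggregated
    # transaction volume, mapped back to the transaction level.
    agg = df.groupby("investor_id")["volume"].sum().to_frame("total_volume")
    agg["bank_cluster"] = KMeans(n_clusters=3, random_state=0).fit_predict(agg)
    df = df.join(agg["bank_cluster"], on="investor_id")

    # Features: SHS-appearance dummy plus one-hot dummies for the
    # currency, the issuer country, and the bank cluster.
    X = pd.get_dummies(
        df[["in_shs", "currency", "issuer_country", "bank_cluster"]],
        columns=["currency", "issuer_country", "bank_cluster"],
    )

    # A shallow tree keeps the splits readable as candidate rules.
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, df["y"])

    # Leaves with a match share near one mark groups where accurate
    # matching is possible; shares near zero mark groups where it is not.
    return export_text(tree, feature_names=list(X.columns))
```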
    Unsupervised learning of association rules complements the supervised approach. The goal of association rule learning is to find groups of feature values that are common (support) in the data and indicate a high propensity for a successful matching of the datasets (confidence).⁴ Because the algorithm provides us with a broad set of association rules and does not specifically isolate rules that allow for the integration of the datasets, we have to proceed in two steps. In the first step, we mine association rules. In the second step, we filter rules that associate feature values with the indicator variable for a successful matching of the data ($y_i$).
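The two steps might be sketched with the apriori implementation in the mlxtend package; the one-hot encoded frame `items`, the support and confidence thresholds, and the indicator column name `y` are illustrative assumptions rather than the paper's settings:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

def mine_match_rules(items: pd.DataFrame) -> pd.DataFrame:
    # Step 1: mine frequent feature-value combinations (support) and
    # derive association rules together with their confidence.
    frequent = apriori(items, min_support=0.05, use_colnames=True)
    rules = association_rules(frequent, metric="confidence",
                              min_threshold=0.8)

    # Step 2: keep only rules whose consequent is the match indicator y;
    # per footnote 4, their confidence equals the share of matches in
    # the subsample that satisfies the rule.
    is_match_rule = rules["consequents"].apply(lambda c: c == {"y"})
    return rules.loc[is_match_rule,
                     ["antecedents", "support", "confidence"]]
```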
    To derive the final heuristic ruleset, we combine the results from the decision tree with the association rules. We then exclude splits (rules) that do not make intuitive sense. This way, we ensure that the heuristic ruleset has a foundation in the experience of domain experts.
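The combination step rests mainly on expert judgment, but its mechanical part can be sketched as below, where both rulesets are assumed to be reduced to comparable condition strings and `expert_ok` is a hypothetical review callback:

```python
from typing import Callable, Iterable

def combine_rulesets(tree_rules: Iterable[str],
                     assoc_rules: Iterable[str],
                     expert_ok: Callable[[str], bool]) -> list[str]:
    """Union of the candidate rules from both learners, keeping only
    those that a domain expert signs off on."""
    candidates = set(tree_rules) | set(assoc_rules)
    return sorted(r for r in candidates if expert_ok(r))
```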




³ We set the threshold to EUR 10,000.
⁴ Because we filter the resulting set of rules for association rules that have the matching indicator $y$ as their consequent, the share of matches in the subsample that satisfies the rule equals the confidence of the rule.