Page 306 - Special Topic Session (STS) - Volume 1

STS441 David B. et al.
                  matched. Figure 4 provides an example of a structural mismatch between the
                  datasets. The figure shows the relationship between the (normalized)
                  transactions in both datasets for securities that were issued by Canadian
                  issuers. The darkness of the hexagons is proportional to the number of
                  observations. We find that a large fraction of the differences is due to
                  transactions showing up in only one dataset (data points on the horizontal
                  and vertical axes). For securities of Canadian issuers, this absence of
                  transactions is more prevalent in the SHS data. Because the structure of the
                  mismatch correlates with observable features of the data, namely the issuer
                  country of the security, there is a chance that a learning algorithm can
                  successfully isolate groups of transactions that can be matched accurately.

                        Figure 4: Transactions in MiFID and SHS data for Securities with Canadian Issuers

                  Note: The figure provides an illustrative example of a structural mismatch between the datasets.
                  The figure shows the relationship between the (normalized) transactions in both datasets for
                  securities that were issued by Canadian issuers.
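
                  The kind of mismatch described for Figure 4 can be tabulated directly. The
                  following is a minimal sketch, not the authors' procedure: it labels each
                  observation by whether its normalized transaction value is non-zero in both
                  datasets or in only one of them, and counts the labels per issuer country.
                  The record layout, the function names and all values are hypothetical and
                  chosen only for illustration.

```python
from collections import Counter

# Hypothetical records: (issuer_country, MiFID value, SHS value), normalized.
# The numbers are invented for illustration and are not taken from the paper.
records = [
    ("CA", 0.40, 0.00),
    ("CA", 0.25, 0.00),
    ("CA", 0.30, 0.28),
    ("CA", 0.00, 0.15),
    ("DE", 0.50, 0.49),
    ("DE", 0.20, 0.00),
]


def classify(mifid, shs, tol=1e-9):
    """Label one observation by where it would fall in the scatter plot."""
    in_mifid = abs(mifid) > tol
    in_shs = abs(shs) > tol
    if in_mifid and in_shs:
        return "both"        # interior of the plot
    if in_mifid:
        return "mifid_only"  # lies on an axis: no SHS transaction
    if in_shs:
        return "shs_only"    # lies on an axis: no MiFID transaction
    return "neither"


def mismatch_profile(rows):
    """Count the labels per issuer country."""
    counts = {}
    for country, mifid, shs in rows:
        counts.setdefault(country, Counter())[classify(mifid, shs)] += 1
    return counts


profile = mismatch_profile(records)
for country, label_counts in sorted(profile.items()):
    print(country, dict(label_counts))
```

                  In this toy data, most Canadian observations are missing from the SHS side,
                  which mirrors the pattern reported above. A per-country profile of this kind
                  is one simple way to check whether a mismatch correlates with an observable
                  feature.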

                  3.  Methodology
                      We proceed in two steps. First, we use a two-tier approach to derive rules
                  for matching the datasets that combines supervised learning and
                  unsupervised association rule discovery. Second, we develop a set of heuristic
                  rules based on the first-step results.
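
                  As a rough illustration of the unsupervised part of the first step, the
                  sketch below computes the two standard association-rule measures, support
                  and confidence, for rules of the form "feature values imply matched (or
                  unmatched)". It is a sketch under stated assumptions, not the authors'
                  implementation: the security-type feature, the matched flag, the function
                  name rule_stats and all observations are hypothetical.

```python
# Hypothetical observations: (issuer_country, security_type, matched).
# "matched" marks whether the two datasets agree for that observation.
# All values are invented for illustration.
obs = [
    ("CA", "bond", False),
    ("CA", "bond", False),
    ("CA", "equity", True),
    ("DE", "bond", True),
    ("DE", "equity", True),
    ("DE", "bond", True),
    ("FR", "equity", False),
    ("FR", "bond", True),
]


def rule_stats(rows, antecedent, matched=True):
    """Support and confidence of the rule: antecedent -> (matched flag).

    antecedent maps a column index to the value that column must take.
    """
    n = len(rows)
    # Rows that satisfy every condition of the antecedent.
    covered = [r for r in rows if all(r[i] == v for i, v in antecedent.items())]
    # Covered rows that also satisfy the consequent.
    hits = [r for r in covered if r[2] == matched]
    support = len(hits) / n
    confidence = len(hits) / len(covered) if covered else 0.0
    return support, confidence


# Rule: issuer country is DE -> matched
print(rule_stats(obs, {0: "DE"}))
# Rule: issuer country is CA and security type is bond -> not matched
print(rule_stats(obs, {0: "CA", 1: "bond"}, matched=False))
```

                  Rules with high confidence and sufficient support would be candidates for
                  the heuristic matching rules of the second step. The thresholds for both
                  measures are a design choice that this sketch leaves open.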
                      For the discovery of rules that allow us to integrate the datasets, we use
                  decision trees (supervised) and association rules (unsupervised). Our goal is to
                  find subsamples for which an integration of the MiFID* data with the SHS* data
                  does not result in a mismatch of transaction volumes. To train the algorithms,

                                                                     295 | ISI WSC 2019