we generate a binary outcome variable for each transaction $i$ that takes the value one if there is a successful match and zero otherwise:

$$
y_i = \begin{cases} 1 & \text{if } \Delta_i < \delta^*, \\ 0 & \text{otherwise,} \end{cases}
$$

where $\Delta_i$ is the absolute difference between the values reported for transaction $i$ in the two datasets and $\delta^*$ is the threshold for a match. By altering $\delta^*$, we can adjust the deviation that we accept for a transaction that we consider a match.³
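A minimal sketch of this indicator in Python, assuming the merged data sit in a pandas DataFrame whose hypothetical columns `value_shs` and `value_other` hold the transaction values reported in the two datasets:

```python
import pandas as pd

# Threshold delta* for a match; footnote 3 sets it to EUR 10,000.
THRESHOLD_EUR = 10_000

def match_indicator(df: pd.DataFrame) -> pd.Series:
    """Return y_i = 1 if the absolute deviation between the two
    reported values is below the threshold, and 0 otherwise."""
    delta = (df["value_shs"] - df["value_other"]).abs()
    return (delta < THRESHOLD_EUR).astype(int)
```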
    To learn rules with a decision tree, we use $y_i$ as the outcome. We further provide the tree with a number of features in the form of a dummy variable for data points that do not appear in the SHS data and dummies for the currency of the stock and the issuer country. To allow the tree to consider information on the investors, we include an indicator variable for the cluster to which a k-means (k=3) clustering algorithm assigns the investor on the basis of her aggregated volume of transactions (bank cluster). On the basis of these features, the tree successively splits the data into subsamples by selecting the variable and cutoff rule that maximize the homogeneity of the subsamples in terms of the outcome. This way, the tree provides sample splits that produce groups of transactions $g$ for which an accurate match is possible ($\frac{1}{n_g}\sum_{i \in g} y_i \to 1$, with $n_g$ the number of transactions in group $g$) and groups of transactions that do not allow for an accurate matching of transactions ($\frac{1}{n_g}\sum_{i \in g} y_i \to 0$).
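A sketch of the clustering and tree steps under stated assumptions: a transaction-level DataFrame `df` with hypothetical columns `y` (match indicator), `in_shs`, `currency`, `issuer_country`, `investor_id`, and `volume`, plus an illustrative tree depth that is not taken from the paper:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

def learn_split_rules(df: pd.DataFrame) -> str:
    # Bank cluster: k-means (k=3) on each investor's aggregated
    # transaction volume, mapped back to the transaction level.
    agg = df.groupby("investor_id")["volume"].sum().to_frame("total_volume")
    agg["bank_cluster"] = KMeans(n_clusters=3, random_state=0).fit_predict(agg)
    df = df.join(agg["bank_cluster"], on="investor_id")

    # Features: SHS-appearance dummy plus one-hot dummies for the
    # currency, the issuer country, and the bank cluster.
    X = pd.get_dummies(
        df[["in_shs", "currency", "issuer_country", "bank_cluster"]],
        columns=["currency", "issuer_country", "bank_cluster"],
    )

    # A shallow tree keeps the splits readable as candidate rules.
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, df["y"])

    # Leaves with a match share near one mark groups where accurate
    # matching is possible; shares near zero mark groups where it is not.
    return export_text(tree, feature_names=list(X.columns))
```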
    Unsupervised learning of association rules complements the supervised approach. The goal of association rule learning is to find groups of feature values that are common (support) in the data and indicate a high propensity for a successful matching of the datasets (confidence).⁴ Because the algorithm provides us with a broad set of association rules and does not specifically isolate rules that allow for the integration of the datasets, we have to proceed in two steps. In the first step, we mine association rules. In the second step, we filter rules that associate feature values with the indicator variable for a successful matching of the data ($y_i$).
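The two steps might be sketched with the apriori implementation in the mlxtend package; the one-hot encoded frame `items`, the support and confidence thresholds, and the indicator column name `y` are illustrative assumptions rather than the paper's settings:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

def mine_match_rules(items: pd.DataFrame) -> pd.DataFrame:
    # Step 1: mine frequent feature-value combinations (support) and
    # derive association rules together with their confidence.
    frequent = apriori(items, min_support=0.05, use_colnames=True)
    rules = association_rules(frequent, metric="confidence",
                              min_threshold=0.8)

    # Step 2: keep only rules whose consequent is the match indicator y;
    # per footnote 4, their confidence equals the share of matches in
    # the subsample that satisfies the rule.
    is_match_rule = rules["consequents"].apply(lambda c: c == {"y"})
    return rules.loc[is_match_rule,
                     ["antecedents", "support", "confidence"]]
```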
    To derive the final heuristic ruleset, we combine the results from the decision tree with the association rules. We then exclude splits (rules) that do not make intuitive sense. This way, we ensure that the heuristic ruleset has a foundation in the experience of domain experts.
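The combination step rests mainly on expert judgment, but its mechanical part can be sketched as below, where both rulesets are assumed to be reduced to comparable condition strings and `expert_ok` is a hypothetical review callback:

```python
from typing import Callable, Iterable

def combine_rulesets(tree_rules: Iterable[str],
                     assoc_rules: Iterable[str],
                     expert_ok: Callable[[str], bool]) -> list[str]:
    """Union of the candidate rules from both learners, keeping only
    those that a domain expert signs off on."""
    candidates = set(tree_rules) | set(assoc_rules)
    return sorted(r for r in candidates if expert_ok(r))
```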




³ We set the threshold to EUR 10,000.
⁴ Because we filter the resulting set of rules for association rules that have the matching indicator $y$ as their consequent, the share of matches in the subsample that satisfies the rule equals the confidence of the rule.