Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering

Craig S. Greenberg*,1,2, Sebastian Macaluso*,3, Nicholas Monath1, Ji Ah Lee1, Patrick Flaherty1, Kyle Cranmer3, Andrew McGregor1, Andrew McCallum1
1 University of Massachusetts Amherst   2 National Institute of Standards and Technology   3 New York University
*The first two authors contributed equally.

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

Abstract

Hierarchical clustering is a fundamental task often used to discover meaningful structures in data. Due to the combinatorial number of possible hierarchical clusterings, approximate algorithms are typically used for inference. In contrast to existing methods, we present novel dynamic-programming algorithms for exact inference in hierarchical clustering based on a novel trellis data structure, and we prove that we can exactly compute the partition function, maximum likelihood hierarchy, and marginal probabilities of sub-hierarchies and clusters. Our algorithms scale in time and space proportional to the powerset of N elements, which is super-exponentially more efficient than explicitly considering each of the (2N−3)!! possible hierarchies. Also, for larger datasets where our exact algorithms become infeasible, we introduce an approximate algorithm based on a sparse trellis that outperforms greedy and beam search baselines.

Figure 1: Schematic representation of a hierarchical clustering. H denotes the hierarchical clustering and X the dataset.

1 Introduction

Hierarchical clustering is often used to discover meaningful structures, such as phylogenetic trees of organisms (Kraskov et al., 2005), taxonomies of concepts (Cimiano and Staab, 2005), subtypes of cancer (Sørlie et al., 2001), and jets in particle physics (Cacciari et al., 2008). Among the reasons that hierarchical clustering has been found to be broadly useful is that it forms a natural representation of data generated by a Markov tree, i.e., a tree-shaped model where the state variables depend only on their parent or children.

We define a hierarchical clustering as a recursive splitting of a dataset of N elements, X = {x_i}_{i=1}^N, into subsets until reaching singletons. This can equivalently be viewed as starting with the set of singletons and repeatedly taking the union of sets until reaching the entire dataset. We show a schematic representation in Figure 1, where we identify each x_i with a leaf of the tree and the hierarchical clustering with H. Formally,

Definition 1 (Hierarchical Clustering¹). Given a dataset of elements X = {x_i}_{i=1}^N, a hierarchical clustering H is a set of nested subsets of X, s.t. X ∈ H, {{x_i}}_{i=1}^N ⊂ H, and ∀ X_i, X_j ∈ H, either X_i ⊂ X_j, X_j ⊂ X_i, or X_i ∩ X_j = ∅. Further, ∀ X_i ∈ H, if ∃ X_j ∈ H s.t. X_j ⊂ X_i, then ∃ X_k ∈ H s.t. X_j ∪ X_k = X_i.

¹We limit our exposition to binary hierarchical clustering. Binary structures encode more tree-consistent clusterings than k-ary (Blundell et al., 2010). Natural extensions may exist for k-ary clustering, which are left for future work.

Given a subset X_i ∈ H, X_i is referred to as a cluster in H. When X_P, X_L, X_R ∈ H and X_L ∪ X_R = X_P, we refer to X_L and X_R as children of X_P, and to X_P as the parent of X_L and X_R; if X_L ⊂ X_P we refer to X_P as an ancestor of X_L and to X_L as a descendant of X_P. (We also denote the sibling of X_L as X_R = X_P \ X_L.)

For binary trees, the total number of possible pairs of siblings (X_L, X_R) for a parent with N elements is given by the Stirling number of the second kind, S(N, 2) = 2^{N−1} − 1.
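As a concrete check of this count (an illustrative sketch of ours, not code from the paper), the following Python snippet enumerates the distinct sibling pairs of a four-element parent cluster by fixing one anchor element in X_L, and prints 2^{4−1} − 1 = 7 splits.

```python
from itertools import combinations

def sibling_splits(parent):
    """Enumerate unordered pairs (X_L, X_R) with X_L ∪ X_R = parent and both parts non-empty."""
    parent = frozenset(parent)
    anchor, *rest = sorted(parent)  # fixing an anchor in X_L counts each unordered pair once
    for r in range(len(rest)):      # X_L must be a proper subset of the parent
        for extra in combinations(rest, r):
            left = frozenset((anchor,) + extra)
            yield left, parent - left

print(len(list(sibling_splits({"a", "b", "c", "d"}))))  # 2**(4 - 1) - 1 = 7
```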
In our work, we consider an energy-based probabilistic model for hierarchical clustering. We provide a general (and flexible) definition of the probabilistic model and then give three specific examples of the distribution in section 4. Our model is based on measuring the compatibility of all pairs of sibling nodes in a binary tree structure. Formally,

Definition 2 (Energy-based Hierarchical Clustering). Let X be a dataset, let H be a hierarchical clustering of X, let ψ : 2^X × 2^X → R^+ be a potential function describing the compatibility of a pair of sibling nodes in H, and let φ(X|H) be a potential function for the structure H. Then the probability of H for the dataset X, P(H|X), is equal to the unnormalized potential of H normalized by the partition function Z(X):

    P(H|X) = φ(X|H) / Z(X),   with   φ(X|H) = ∏_{(X_L, X_R) ∈ sibs(H)} ψ(X_L, X_R),        (1)

where sibs(H) = {(X_L, X_R) | X_L ∈ H, X_R ∈ H, X_L ∩ X_R = ∅, X_L ∪ X_R ∈ H}. The partition function Z(X) is given by:

    Z(X) = ∑_{H ∈ H(X)} φ(X|H),        (2)

where H(X) represents all binary hierarchical clusterings of the elements X.

We refer to our model as an energy-based model given that ψ(·,·) is often defined by the unnormalized Gibbs distribution, i.e., ψ(X_L, X_R) = exp(−β E(X_L, X_R)), where β is the inverse temperature and E(·,·) is the energy. This probabilistic model allows us to express many familiar distributions over tree structures. It also has a connection to the classic algorithmic hierarchical clustering technique, agglomerative clustering, in that ψ(·,·) has the same signature as a "linkage function" (i.e., single, average, complete linkage). We note that in this work we do not use informative prior distributions over trees P(H) and instead assume a uniform prior.
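To make Definition 2 concrete, here is a minimal Python sketch (ours, not the authors' code; the energy function is a placeholder) that represents a hierarchy H as a set of frozensets, recovers its sibling pairs, and evaluates φ(X|H) as a product of Gibbs potentials ψ(X_L, X_R) = exp(−β E(X_L, X_R)).

```python
import math
from itertools import combinations

def sibs(H):
    """Sibling pairs of H: disjoint clusters whose union is also a cluster in H."""
    return [(a, b) for a, b in combinations(H, 2) if not (a & b) and (a | b) in H]

def phi(H, energy, beta=1.0):
    """Unnormalized potential φ(X|H) = ∏ over sibling pairs of exp(−β·E(X_L, X_R))."""
    return math.prod(math.exp(-beta * energy(l, r)) for l, r in sibs(H))

# Toy example with a placeholder energy; any function of two clusters would do.
toy_energy = lambda l, r: len(l) * len(r)
H = {frozenset("abc"), frozenset("ab"), frozenset("a"), frozenset("b"), frozenset("c")}
print(phi(H, toy_energy))  # exp(-1) * exp(-2): one potential per internal split
```

Normalizing this potential by the partition function Z(X) of Equation (2), computed in Section 2.2, yields P(H|X).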
Often, probabilistic approaches, such as coalescent models (Teh et al., 2008; Boyles and Welling, 2012; Hu et al., 2013) and diffusion trees (Neal, 2003; Knowles and Ghahramani, 2011), model which tree structures are likely for a given dataset. For instance, in particle physics generative models of trees are used to model jets (Cacciari et al., 2008), and similarly coalescent models have been used in phylogenetics (Suchard et al., 2018). Inference in these approaches is done by approximate, rather than exact, methods that lead to local optima, such as greedy best-first, beam search, sequential Monte Carlo (Wang et al., 2015), and MCMC (Neal, 2003). Also, these methods do not have efficient ways to compute an exact normalized distribution over all tree structures.

Exactly performing MAP inference and finding the partition function by enumerating all hierarchical clusterings over N elements is exceptionally difficult because the number of hierarchies grows extremely rapidly, namely (2N−3)!! (see Callan (2009) and Dale and Moon (1993) for more details and proof), where !! is the double factorial. To overcome this computational burden, in this paper we introduce a cluster trellis data structure for hierarchical clustering. The cluster trellis, inspired by Greenberg et al. (2018), enables us to use dynamic programming algorithms to exactly compute MAP structures and the partition function, as well as compute marginal distributions, including the probability of any sub-hierarchy or cluster. We further show how to sample exactly from the posterior distribution over hierarchical clusterings (i.e., the probability of sampling a given hierarchy is equal to the probability of that hierarchy). Our algorithms compute these quantities in O(3^N) time, without having to iterate over each possible hierarchy, which is super-exponentially more efficient than explicitly considering each of the (2N−3)!! possible hierarchies (see Corollary 2 for more details). Thus, while still exponential, this is feasible in regimes where enumerating all possible trees would be infeasible, and is to our knowledge the fastest exact MAP/partition function result (see §A.5 and §A.7 for proofs), making exact inference practical for datasets on the order of 20 points (∼3 × 10^9 operations vs. ∼10^22 trees) or fewer. For larger datasets, we introduce an approximate algorithm based on a sparse hierarchical cluster trellis, and we outline different strategies for building this sparse trellis. We demonstrate our methods' capabilities for exact inference in discovering cascades of particle decays in jet physics and subtype hierarchies in cancer genomics, two applications where there is a need for exact inference on datasets made feasible by our methods. We find that greedy and beam search methods frequently return estimates that are sub-optimal compared to the exact MAP clustering.

Contributions of this Paper. We achieve exact, not approximate, solutions to the following:
• Compute the Partition Function Z(X) (§2.2).
• MAP Inference, i.e., find the maximum likelihood tree structure argmax_{H ∈ H(X)} P(H|X) (§2.3).
• Sample Hierarchies from the Posterior Distribution, i.e., weighted by their probability P(H|X) (§2.5).

2 Hierarchical Cluster Trellis

Exactly performing MAP inference and finding the partition function by enumerating all hierarchical clusterings over N elements is intractable since the number of hierarchies grows extremely rapidly, namely (2N−3)!!. To address this challenge, we introduce a cluster trellis data structure for hierarchical clustering. We describe how this data structure enables us to use dynamic programming algorithms to exactly compute the partition function, MAP hierarchical clusterings, and marginals, as well as how to sample from the exact distribution over hierarchical clusterings.

2.1 Trellis Data Structure

The trellis data structure is a directed acyclic graph that encodes a set of hierarchical clusterings. Each vertex in the trellis corresponds to a node in a hierarchical clustering, and edges between vertices in the trellis correspond to a parent/child relationship in a hierarchical clustering. As in a hierarchical clustering, the trellis has a root node that corresponds to the entire dataset, and leaf nodes that correspond to the individual elements of the dataset. The dataset associated with a trellis vertex V is denoted X(V), and the trellis vertex associated with a dataset X is denoted V(X). Each vertex in the trellis stores a memoized value of Z(V) for computing the partition function, as well as the value φ(H*[V]) and the backpointer Ξ(H*[V]) for computing the MAP tree. We denote by C(X) the children of V(X). We refer to a full trellis as the data structure in which every possible hierarchical clustering of a dataset X can be realised, i.e., there is a bijection between the set of trellis vertices and P(X) \ ∅, where P denotes the power set, and there is an edge between V_i and V_j if X(V_i) ⊂ X(V_j). In contrast, a sparse trellis contains only a subset of all possible hierarchies, obtained by omitting some of the vertices and edges of a full trellis.
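As a rough illustration of the full trellis (a sketch of ours under a dictionary-based representation; vertex and field names are hypothetical, not the paper's), each vertex can be keyed by the frozenset of elements it covers, with slots for the memoized quantities described above:

```python
from itertools import combinations

def build_full_trellis(X):
    """Full trellis over X: one vertex per non-empty subset, with memo slots for
    Z (partition function), map_val (φ of the MAP sub-hierarchy), and a backpointer."""
    elems = list(X)
    return {frozenset(s): {"Z": None, "map_val": None, "back": None}
            for r in range(1, len(elems) + 1)
            for s in combinations(elems, r)}

def children(trellis, parent):
    """Child vertices: vertices whose element set is a proper, non-empty subset of `parent`."""
    return [v for v in trellis if v < parent]

trellis = build_full_trellis({"a", "b", "c", "d"})
print(len(trellis))  # |P(X) \ ∅| = 2**4 - 1 = 15 vertices
```

A sparse trellis would correspond to simply dropping some of these keys and the edges they induce.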
By computing the partial tex in the trellis corresponds to a node in a hierarchical partition functions in this order, whenever computing clustering, and edges between vertices in the trellis cor- the partition function of a given node in the trellis, the respond to a parent/child relationship in a hierarchical corresponding ones of all of the descendent nodes will clustering. As in a hierarchical clustering, the trellis has have already been computed and memoized. In Figure a root node, that corresponds to the entire dataset, and 2, we show a visualization comparing the computation leaf nodes that correspond to the individual elements of of the partition function with the trellis to the brute the dataset. The dataset associated with a trellis vertex force method for a dataset of four elements. Next, we V is denoted X(V) and the trellis vertex associated present the complexity result for Algorithm 1: with a dataset X is denoted V(X). Each vertex in the Algorithm 1 PartitionFunction(X) trellis stores memoized values of Z(V) for computing the partition function, as well as the value φ(H∗[V]) Pick xi ∈ X and set Z(X) ← 0 and the backpointer Ξ(H∗[V]) for computing the MAP for X in C(X) do i x i tree. We denote C(X) as the children of V(X). We if Z(X ) not set then i refer to a full trellis as the data structure where every Z(X ) ←PartitionFunction(X ) i i possible hierarchical clustering given a dataset X can if Z(X \X ) not set then i be realised, i.e., there is a bijection between the set Z(X\X)←PartitionFunction(X\X ) i i Z(X)←Z(X)+ψ(X,X\X)·Z(X)·Z(X\X) of trellis vertices and P(X)\∅, where P indicates the i i i i power set, and there is an edge between Vi and Vj if return Z(X) X(Vi) ⊂ X(Vj). In contrast, a sparse trellis will only contain a subset of all possible hierarchies by omitting some of the vertices and edges in a full trellis. Theorem 1. For a given dataset X of N elements, Algorithm 1 computes Z(X) in O(3N) time. 2.2 Computing the Partition Function The time-complexity of the algorithm is O(3N), which N Given a dataset of elements, X = {xi} , the partition i=1 is is significantly smaller than the (2N − 3)!! possible function, Z(X), for the set of hierarchical clusterings hierarchies. over X, H(X), is given by Equation 2. The trellis im- Corollary 2. For a given dataset X of N elements, plements a memoized dynamic program to compute the partition function and the MAP. To achieve this, Algorithm 1 is super-exponentially more efficient than we need to re-write the partition function in the corre- brute force methods that consider every possible hier- sponding recursive way. In particular, archy. In particular the ratio is O((2)N Γ(N −1/2)). 3 Proposition 1. For any x ∈ X, the hierar- The proofs of Algorithm 1 and Corollary 2 are given chical partition function can be written re- in § A.7 of the Appendix. cursively, as Z(X) = P φ(X|H) = P H∈H(X) 2.3 Computing the MAP Hierarchical ψ(X ,X \X )·Z(X )·Z(X \X ) where Xi∈C(X)x i i i i C(X) is the set of all children of X containing the Clustering x element x, i.e,. C(X) = {X : X ∈ C(X)∧x ∈ X }. Similar to other dynamic programming algorithms, x j j j In the particular case of a full trellis, then such as Viterbi, we can adapt Algorithm 1 in order to X find the MAP hierarchical clustering. C(X) ={X :X ∈2 \X∧x∈X }. x j j j The MAP clustering for dataset X, is H⋆(X) = The proof is given in § A.1 in the Appendix. Algorithm argmax φ(H). 
Here we can also use a recursive 1 describes in a recursive way how to efficiently com- H∈H(X) pute the partition function using the trellis based on memoized technique, where each node will store a value for the MAP, denoted by φ(H⋆(X)) and a backpointer Proposition 1. We first set the partition function of Ξ(H⋆(X)). Specifically, the leaf nodes in the trellis to 1. Then, we start by selecting any element in the dataset, x , and consider Proposition 2. For any x ∈ C(X), let C(X) = i x all clusters X ∈ C(X) such that x ∈ X . Next, the {X : X ∈ C(X) ∧ x ∈ X }, then φ(H⋆(X)) = i i i j j j partition function is computed (memoized, recursively) max ψ(X ,X \X )·φ(H⋆(X ))·φ(H⋆(X\X )). X∈C(X) i i i i i x Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering Exhaustive Computation of the N Partition Function - O((2N-3)!!) Computation using Trellis - O(3 ) << O((2N-3)!!) Z({a,b,c,d})=!({a,b,c},{d})á!({a,b},{c})á!({a},{b}) Z({a,b,c,d})=!({a,b,c},{d})áZ({a,b,c})áZ({d})+!({a,b,d},{c})áZ({a,b,d})áZ({c}) Recursive Computation +!({a,b,c},{d})á!({a,c},{b})á!({a},{c}) +!({a,c,d},{b})áZ({a,c,d})áZ({b})+!({b,c,d},{a})áZ({b,c,d})áZ({a}) of Z using Memoization +!({a,b,c},{d})á!({b,c},{a})á!({b},{c}) +!({a,b},{c,d})áZ({a,b})áZ({c,d})+!({a,c},{b,d})áZ({a,c})áZ({b,d}) +!({a,c,d},{b})á!({a,c},{d})á!({a},{c}) +!({a,d},{b,c})áZ({a,d})áZ({b,c}) +!({a,c,d},{b})á!({a,d},{c})á!({a},{d}) Tree Potential is {a,b,c,d} +!({a,c,d},{b})á!({c,d},{a})á!({c},{d}) Partition Function is +!({a,b,d},{c})á!({a,b},{d})á!({a},{b}) Product of Sum of Tree Potentials +!({a,b,d},{c})á!({a,d},{b})á!({a},{d}) Sibling Potentials {a,b,c} {a,b,d} {a,c,d} {b,c,d} +!({a,b,d},{c})á!({b,d},{a})á!({b},{d}) ({a,b,c} ) + !({a,b,c} ) + !({a,b,c}) {a,b,c} ! +!({b,c,d},{a})á!({b,c},{d})á!({b},{c}) +!({b,c,d},{a})á!({b,d},{c})á!({b},{d}) !({b,c} , {a} ) {a,b} {c} {a,c} {b} {b,c} {a} {a,b} {a,c} {a,d} {b,c} {b,d} {c,d} +!({b,c,d},{a})á!({c,d},{d})á!({c},{d}) Z({a,b,c})=!({a,b},{c})áZ({a,b})áZ({c}) +!({a,b},{c,d})á!({a},{b})á!({c},{d}) ( {b} {c} ) +!({a,c},{b})áZ({a,c})áZ({b}) +!({a,c},{b,d})á!({a},{c})á!({b},{d}) ! , +!({b,c},{a})áZ({b,c})áZ({a}) {a} {b} {c} {d} +!({a,d},{b,c})á!({a},{d})á!({b},{c}) Figure 2: Computing the partition function for the dataset {a,b,c,d}. Left: exhaustive computation, consisting of the summation of (2·4−3)!! = 15 energy equations. Right: computation using the trellis. The sum for the partition function is over 24−1 − 1 = 7 equations, each making use of a memoized Z value. Colors indicate corresponding computations over siblings in the trellis. Algorithm 2 MAP(X) X is contained in H. In this case, we marginalize over i if φ(X) set then every possible sub-hierarchy that contains the cluster X while keeping the rest of the hierarchy H fixed. The return φ(X),Ξ(X) i Pick xi ∈ X value of P(Hi|X) can be computed using the same al- φ(X)←−∞ gorithm used for the partition function, except that we first merge H into a single leaf node and use φ(H (X )) Ξ(X)←null {Backpointer to give MAP tree struc- i i i ture.} for the energy of the newly merged leaf. The same is true for computing the value of P(X |X), except for X in C(X) do i i x i that after merging X into a single leaf node, the value t ← ψ(X ,X \X )·φ(V(X ))·φ(V(X \X )) i i i i i Z(X ) should be used. See Appendix § A.4 for proofs. 
2.4 Computing Marginals

In this section, we describe how to compute two types of marginal probabilities. The first is for a given sub-hierarchy rooted at X_i, i.e., H_i ∈ H(X_i), defined as P(H_i|X) = ∑_{H ∈ A(H_i)} P(H|X), where A(H_i) = {H : H ∈ H(X) ∧ H_i ⊂ H}, and H_i ⊂ H indicates that H_i is a subtree of H. Thus, we marginalize over every possible hierarchy while keeping the sub-hierarchy H_i fixed. The second is for a given cluster, X_i, defined as P(X_i|X) = ∑_{H ∈ A(X_i)} P(H|X), where A(X_i) = {H : H ∈ H(X) ∧ X_i ∈ H}, and X_i ∈ H indicates that the cluster X_i is contained in H. In this case, we marginalize over every possible sub-hierarchy that contains the cluster X_i while keeping the rest of the hierarchy H fixed. The value of P(H_i|X) can be computed using the same algorithm used for the partition function, except that we first merge H_i into a single leaf node and use φ(H_i(X_i)) for the energy of the newly merged leaf. The same is true for computing the value of P(X_i|X), except that after merging X_i into a single leaf node, the value Z(X_i) should be used. See Appendix §A.4 for proofs.

2.5 Sampling from the Posterior Distribution

Drawing samples from the true posterior distribution P(H|X) is also difficult because of the extremely large number of trees. In this section, we introduce a sampling procedure for hierarchical clusterings H_i, implemented using the trellis, which gives samples from the exact true posterior without enumerating all possible hierarchies. The sampling procedure builds a tree structure in a top-down way. We start with the cluster of all the elements, X, then sample one child of that cluster, X_L ⊂ X (Eq. 3), and set the other one to be the complement of X_L, i.e., X \ X_L. This is repeated recursively from each of the children and terminates when a cluster contains a single element. A child X_L of parent X_p, i.e., X_L ⊂ X_p, is sampled according to:

    p(X_L | X_p) = (1 / Z(X_p)) · ψ(X_L, X_p \ X_L) · Z(X_L) · Z(X_p \ X_L).        (3)

Pseudocode for this algorithm is given in Algorithm 3.

Theorem 3. Sample(X) (Alg. 3) gives samples from P(H|X).

The proof is given in Appendix §A.2. This algorithm is notable in that it does not require computing a categorical distribution over all trees and samples exactly according to P(H|X).
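To illustrate the top-down sampler (Algorithm 3 itself is only referenced above; the sketch below is ours), each cluster draws one child with probability proportional to ψ(X_L, X_p \ X_L) · Z(X_L) · Z(X_p \ X_L), reusing Z values memoized by the partition-function sketch on the same root:

```python
import math
import random
from itertools import combinations

def sample_hierarchy(root, psi, z_memo, rng=random):
    """Draw one hierarchy from P(H|X) top-down, following Eq. (3).

    z_memo must already contain Z for root and for every cluster reachable by splitting it
    (e.g., as filled by the partition-function sketch above).
    """
    H, stack = {root}, [root]
    while stack:
        node = stack.pop()
        if len(node) == 1:
            continue                      # singletons terminate the recursion
        anchor, *rest = sorted(node)
        splits, weights = [], []
        for r in range(len(rest)):
            for extra in combinations(rest, r):
                left = frozenset((anchor,) + extra)
                right = node - left
                splits.append((left, right))
                weights.append(psi(left, right) * z_memo[left] * z_memo[right])
        left, right = rng.choices(splits, weights=weights, k=1)[0]
        H.update((left, right))
        stack.extend((left, right))
    return H

psi = lambda l, r: math.exp(-len(l) * len(r))   # placeholder potential, as above
# z_memo = {}; partition_function(frozenset("abcd"), psi, z_memo)
# print(sample_hierarchy(frozenset("abcd"), psi, z_memo))
```

The 1/Z(X_p) factor in Eq. (3) is handled implicitly here, since random.choices normalizes the supplied weights.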