180x Filetype PDF File size 2.79 MB Source: proceedings.mlr.press
Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering ∗,1,2 ∗,3 1 1 1 Craig S. Greenberg , Sebastian Macaluso , Nicholas Monath , Ji Ah Lee , Patrick Flaherty , 3 1 1 Kyle Cranmer , Andrew McGregor , Andrew McCallum 1 University of Massachusetts Amherst 2National Institute of Standards and Technology 3New York University Abstract Hierarchical clustering is a fundamental task often used to discover meaningful structures in data. Due to the combinatorial number of Figure 1: Schematic representation of a hierarchical possible hierarchical clusterings, approximate clustering. H denotes the hierarchical clustering and X algorithms are typically used for inference. In the dataset. contrast to existing methods, we present novel a natural data representation of data generated by a dynamic-programming algorithms for exact Markov tree, i.e., a tree-shaped model where the state inference in hierarchical clustering based on variables are dependent only on their parent or children. a novel trellis data structure, and we prove Wedefine a hierarchical clustering as a recursive split- that we can exactly compute the partition N ting of a dataset of N elements, X = {xi} into function, maximum likelihood hierarchy, and i=1 marginal probabilities of sub-hierarchies and subsets until reaching singletons. This can equivalently clusters. Our algorithms scale in time and be viewed as starting with the set of singletons and space proportional to the powerset of N el- repeatedly taking the union of sets until reaching the ements, which is super-exponentially more entire dataset. We show a schematic representation in efficient than explicitly considering each of Figure 1, where we identify each xi with a leaf of the the (2N −3)!! possible hierarchies. Also, for tree and the hierarchical clustering as H. Formally, 1 larger datasets where our exact algorithms be- Definition 1. (Hierarchical Clustering ) Given a come infeasible, we introduce an approximate dataset of elements, X = {x }N , a hierarchical clus- i i=1 algorithm based on a sparse trellis that out- tering, H, is a set of nested subsets of X, s.t. X ∈ H, performs greedy and beam search baselines. {{x }}N ⊂ H, and ∀X ,X ∈ H, either X ⊂ X , i i=1 T i j i j X ⊂X,orX X =∅.Further,∀X ∈H,if∃X ∈H j i i j Si j s.t. X ⊂ X , then ∃X ∈ H s.t. X X =X. j i k j k i 1 Introduction Given a subset X ∈ H, then X is referred to as a L L S cluster in H. When X ,X ,X ∈ H and X X = Hierarchical clustering is often used to discover mean- P L R L R X ,werefer to X and X as children of X , and X ingful structures, such as phylogenetic trees of organ- P L R P P the parent of X and X ; if X ⊂ X we refer to X isms (Kraskov et al., 2005), taxonomies of concepts L R L P P as an ancestor of X and X a descendent of X .(We (Cimiano and Staab, 2005), subtypes of cancer (Sørlie L L P also denote the sibling of X , as X = X \X .) et al., 2001), and jets in particle physics (Cacciari et al., L R P L 2008). Among the reasons that hierarchical clustering For binary trees, the total number of possible pairs of siblings (X ,X ) for a parent with N elements has been found to be broadly useful is that it forms L R is given by the Stirling number of the second kind *The first two authors contributed equally. N−1 S(N,2) = 2 −1. th Proceedings of the 24 International Conference on Artifi- 1Welimitourexpositiontobinaryhierarchicalclustering. cial Intelligence and Statistics (AISTATS) 2021, San Diego, Binary structures encode more tree-consistent clusterings California, USA. PMLR: Volume 130. Copyright 2021 by than k-ary (Blundell et al., 2010). Natural extensions may the author(s). exist for k-ary clustering, which are left for future work. Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering In our work, we consider an energy-based probabilistic the number of hierarchies grows extremely rapidly, model for hierarchical clustering. We provide a general namely (2N −3)!! (see (Callan, 2009; Dale and Moon, (and flexible) definition of the probabilistic model and 1993) for more details and proof), where !! is double fac- then give three specific examples of the distribution torial. To overcome the computational burden, in this in section 4. Our model is based on measuring the paper we introduce a cluster trellis data structure for compatibility of all pairs of sibling nodes in a binary hierarchical clustering. The cluster trellis, inspired by tree structure. Formally, (Greenberg et al., 2018), enables us to use dynamic pro- Definition 2. (Energy-based Hierarchical Clus- gramming algorithms to exactly compute MAP struc- tering) Let X be a dataset, H be a hierarchical cluster- tures and the partition function, as well as compute X X + marginal distributions, including the probability of any ing of X, let ψ : 2 ×2 →R beapotentialfunction sub-hierarchy or cluster. We further show how to sample describing the compatibility of a pair of sibling nodes in H, and let φ(X|H) be a potential function for the H exactly from the posterior distribution over hierarchical structure. Then, the probability of H for the dataset clusterings (i.e., the probability of sampling a given X, P(H|X), is equal to the unnormalized potential of hierarchy is equal to the probability of that hierarchy). H normalized by the partition function, Z(X): Our algorithms compute these quantities without hav- ing to iterate over each possible hierarchy in the O(3N) P(H|X) = φ(X|H) with φ(X|H) = Y ψ(XL,XR) time, which is super-exponentially more efficient than Z(X) X ,X ∈sibs(H) explicitly considering each of the (2N − 3)!! possible L R (1) hierarchies (see Corollary 2 for more details). Thus, where sibs(H) = {(XL,XR)|XL ∈ H,XR ∈ H,XL ∩XR = while still exponential, this is feasible in regimes where ∅,XL∪XR∈H}.Thepartition function Z(X) is given by: enumerating all possible trees would be infeasible, and Z(X)= X φ(X|H). (2) is to our knowledge the fastest exact MAP/partition H∈H(X) function result(See §A.5 and §A.7 for proofs), making where H(X) represents all binary hierarchical clusterings practical exact inference for datasets on the order of 9 22 of the elements X. 20 points (∼ 3 × 10 operations vs ∼ 10 trees) or fewer. For larger datasets, we introduce an approxi- Werefer to our model as an energy-based model given mate algorithm based on a sparse hierarchical cluster that ψ(·,·) is often defined by the unnormalized Gibbs trellis and we outline different strategies for building distribution, i.e., ψ(X ,X ) = exp(−βE(X ,X )), this sparse trellis. We demonstrate our methods’ capa- L R L R where β is the inverse temperature and E(·,·) is the bilities for exact inference in discovering cascades of energy. This probabilistic model allows us to express particle decays in jet physics and subtype hierarchies in many familiar distributions over tree structures. It also cancer genomics, two applications where there is a need has a connection to the classic algorithmic hierarchical for exact inference on datasets made feasible by our clustering technique, agglomerative clustering, in that methods. We find that greedy and beam search meth- ψ(·,·) has the same signature as a “linkage function” ods frequently return estimates that are sub-optimal (i.e., single, average, complete linkage). We note that in compared to the exact MAP clustering. this work we do not use informative prior distributions over trees P(H) and instead assume a uniform prior. Contributions of this Paper. We achieve exact, not approximate, solutions to the following: Often, probabilistic approaches, such as coalescent mod- • Compute the Partition Function Z(X) (§2.2). els (Teh et al., 2008; Boyles and Welling, 2012; Hu et al., 2013) and diffusion trees (Neal, 2003; Knowles • MAP Inference, i.e. find the maximum likeli- and Ghahramani, 2011), model which tree structures hood tree structure argmax P(H|X) (§2.3). are likely for a given dataset. For instance, in particle H∈H physics generative models of trees are used to model • Sample Hierarchies from the Posterior Dis- jets (Cacciari et al., 2008), and similarly coalescent tribution, i.e. weighted by their probability, models have been used in phylogenetics (Suchard et al., P(H|X) (§2.5). 2018). Inference in these approaches is done by ap- proximate, rather than exact, methods that lead to 2 Hierarchical Cluster Trellis local optima, such as greedy best-first, beam-search, se- Exactly performing MAP inference and finding the par- quential Monte Carlo (Wang et al., 2015), and MCMC tition function by enumerating all hierarchical cluster- (Neal, 2003). Also, these methods do not have efficient ings over N elements is intractable since the number of ways to compute an exact normalized distribution over hierarchies grows extremely rapidly, namely (2N −3)!! all tree structures. (see (Callan, 2009; Dale and Moon, 1993) for more de- Exactly performing MAP inference and finding the par- tails and proof), where !! is double factorial. To address tition function by enumerating all hierarchical cluster- this challenge, we introduce a cluster trellis data struc- ings over N elements is exceptionally difficult because ture for hierarchical clustering. We describe how this C. Greenberg, S. Macaluso, N. Monath, J. Lee, P. Flaherty, K. Cranmer, A. McGregor, A. McCallum data structure enables us to use dynamic programming for X and its complement X \X , thus enabling the i i algorithms to exactly compute the partition function, application of Proposition 1 to get Z(X). For a full trel- MAPhierarchical clusterings, and marginals, as well lis, the algorithm can straightforwardly be written in a as how to sample from the exact distribution over hier- bottom-up, non-recursive way. In this case, the parti- archical clusterings. tion function for every node in the trellis is computed in order (in a bottom-up approach), from the nodes with 2.1 Trellis Data Structure the smallest number of elements to the nodes with the The trellis data structure is a directed acyclic graph largest number of elements, memoizing the partition that encodes a set of hierarchical clusterings. Each ver- function value at each node. By computing the partial tex in the trellis corresponds to a node in a hierarchical partition functions in this order, whenever computing clustering, and edges between vertices in the trellis cor- the partition function of a given node in the trellis, the respond to a parent/child relationship in a hierarchical corresponding ones of all of the descendent nodes will clustering. As in a hierarchical clustering, the trellis has have already been computed and memoized. In Figure a root node, that corresponds to the entire dataset, and 2, we show a visualization comparing the computation leaf nodes that correspond to the individual elements of of the partition function with the trellis to the brute the dataset. The dataset associated with a trellis vertex force method for a dataset of four elements. Next, we V is denoted X(V) and the trellis vertex associated present the complexity result for Algorithm 1: with a dataset X is denoted V(X). Each vertex in the Algorithm 1 PartitionFunction(X) trellis stores memoized values of Z(V) for computing the partition function, as well as the value φ(H∗[V]) Pick xi ∈ X and set Z(X) ← 0 and the backpointer Ξ(H∗[V]) for computing the MAP for X in C(X) do i x i tree. We denote C(X) as the children of V(X). We if Z(X ) not set then i refer to a full trellis as the data structure where every Z(X ) ←PartitionFunction(X ) i i possible hierarchical clustering given a dataset X can if Z(X \X ) not set then i be realised, i.e., there is a bijection between the set Z(X\X)←PartitionFunction(X\X ) i i Z(X)←Z(X)+ψ(X,X\X)·Z(X)·Z(X\X) of trellis vertices and P(X)\∅, where P indicates the i i i i power set, and there is an edge between Vi and Vj if return Z(X) X(Vi) ⊂ X(Vj). In contrast, a sparse trellis will only contain a subset of all possible hierarchies by omitting some of the vertices and edges in a full trellis. Theorem 1. For a given dataset X of N elements, Algorithm 1 computes Z(X) in O(3N) time. 2.2 Computing the Partition Function The time-complexity of the algorithm is O(3N), which N Given a dataset of elements, X = {xi} , the partition i=1 is is significantly smaller than the (2N − 3)!! possible function, Z(X), for the set of hierarchical clusterings hierarchies. over X, H(X), is given by Equation 2. The trellis im- Corollary 2. For a given dataset X of N elements, plements a memoized dynamic program to compute the partition function and the MAP. To achieve this, Algorithm 1 is super-exponentially more efficient than we need to re-write the partition function in the corre- brute force methods that consider every possible hier- sponding recursive way. In particular, archy. In particular the ratio is O((2)N Γ(N −1/2)). 3 Proposition 1. For any x ∈ X, the hierar- The proofs of Algorithm 1 and Corollary 2 are given chical partition function can be written re- in § A.7 of the Appendix. cursively, as Z(X) = P φ(X|H) = P H∈H(X) 2.3 Computing the MAP Hierarchical ψ(X ,X \X )·Z(X )·Z(X \X ) where Xi∈C(X)x i i i i C(X) is the set of all children of X containing the Clustering x element x, i.e,. C(X) = {X : X ∈ C(X)∧x ∈ X }. Similar to other dynamic programming algorithms, x j j j In the particular case of a full trellis, then such as Viterbi, we can adapt Algorithm 1 in order to X find the MAP hierarchical clustering. C(X) ={X :X ∈2 \X∧x∈X }. x j j j The MAP clustering for dataset X, is H⋆(X) = The proof is given in § A.1 in the Appendix. Algorithm argmax φ(H). Here we can also use a recursive 1 describes in a recursive way how to efficiently com- H∈H(X) pute the partition function using the trellis based on memoized technique, where each node will store a value for the MAP, denoted by φ(H⋆(X)) and a backpointer Proposition 1. We first set the partition function of Ξ(H⋆(X)). Specifically, the leaf nodes in the trellis to 1. Then, we start by selecting any element in the dataset, x , and consider Proposition 2. For any x ∈ C(X), let C(X) = i x all clusters X ∈ C(X) such that x ∈ X . Next, the {X : X ∈ C(X) ∧ x ∈ X }, then φ(H⋆(X)) = i i i j j j partition function is computed (memoized, recursively) max ψ(X ,X \X )·φ(H⋆(X ))·φ(H⋆(X\X )). X∈C(X) i i i i i x Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering Exhaustive Computation of the N Partition Function - O((2N-3)!!) Computation using Trellis - O(3 ) << O((2N-3)!!) Z({a,b,c,d})=!({a,b,c},{d})á!({a,b},{c})á!({a},{b}) Z({a,b,c,d})=!({a,b,c},{d})áZ({a,b,c})áZ({d})+!({a,b,d},{c})áZ({a,b,d})áZ({c}) Recursive Computation +!({a,b,c},{d})á!({a,c},{b})á!({a},{c}) +!({a,c,d},{b})áZ({a,c,d})áZ({b})+!({b,c,d},{a})áZ({b,c,d})áZ({a}) of Z using Memoization +!({a,b,c},{d})á!({b,c},{a})á!({b},{c}) +!({a,b},{c,d})áZ({a,b})áZ({c,d})+!({a,c},{b,d})áZ({a,c})áZ({b,d}) +!({a,c,d},{b})á!({a,c},{d})á!({a},{c}) +!({a,d},{b,c})áZ({a,d})áZ({b,c}) +!({a,c,d},{b})á!({a,d},{c})á!({a},{d}) Tree Potential is {a,b,c,d} +!({a,c,d},{b})á!({c,d},{a})á!({c},{d}) Partition Function is +!({a,b,d},{c})á!({a,b},{d})á!({a},{b}) Product of Sum of Tree Potentials +!({a,b,d},{c})á!({a,d},{b})á!({a},{d}) Sibling Potentials {a,b,c} {a,b,d} {a,c,d} {b,c,d} +!({a,b,d},{c})á!({b,d},{a})á!({b},{d}) ({a,b,c} ) + !({a,b,c} ) + !({a,b,c}) {a,b,c} ! +!({b,c,d},{a})á!({b,c},{d})á!({b},{c}) +!({b,c,d},{a})á!({b,d},{c})á!({b},{d}) !({b,c} , {a} ) {a,b} {c} {a,c} {b} {b,c} {a} {a,b} {a,c} {a,d} {b,c} {b,d} {c,d} +!({b,c,d},{a})á!({c,d},{d})á!({c},{d}) Z({a,b,c})=!({a,b},{c})áZ({a,b})áZ({c}) +!({a,b},{c,d})á!({a},{b})á!({c},{d}) ( {b} {c} ) +!({a,c},{b})áZ({a,c})áZ({b}) +!({a,c},{b,d})á!({a},{c})á!({b},{d}) ! , +!({b,c},{a})áZ({b,c})áZ({a}) {a} {b} {c} {d} +!({a,d},{b,c})á!({a},{d})á!({b},{c}) Figure 2: Computing the partition function for the dataset {a,b,c,d}. Left: exhaustive computation, consisting of the summation of (2·4−3)!! = 15 energy equations. Right: computation using the trellis. The sum for the partition function is over 24−1 − 1 = 7 equations, each making use of a memoized Z value. Colors indicate corresponding computations over siblings in the trellis. Algorithm 2 MAP(X) X is contained in H. In this case, we marginalize over i if φ(X) set then every possible sub-hierarchy that contains the cluster X while keeping the rest of the hierarchy H fixed. The return φ(X),Ξ(X) i Pick xi ∈ X value of P(Hi|X) can be computed using the same al- φ(X)←−∞ gorithm used for the partition function, except that we first merge H into a single leaf node and use φ(H (X )) Ξ(X)←null {Backpointer to give MAP tree struc- i i i ture.} for the energy of the newly merged leaf. The same is true for computing the value of P(X |X), except for X in C(X) do i i x i that after merging X into a single leaf node, the value t ← ψ(X ,X \X )·φ(V(X ))·φ(V(X \X )) i i i i i Z(X ) should be used. See Appendix § A.4 for proofs. if φ(X) < t then i φ(X)←t Ξ(X)←{X,X\X}∪Ξ(X)∪Ξ(X\X) 2.5 Sampling from the Posterior Distribution i i i i return φ(X),Ξ(X) Drawing samples from the true posterior distribution P(H|X) is also difficult because of the extremely large numberoftrees.Inthissection,weintroduceasampling See §A.6 in the Appendix for the proof. As in the procedure for hierarchical clusterings Hi implemented partition function algorithm described in Section 2.2, using the trellis which gives samples from the exact true the time complexity for finding the MAP clustering is posterior without enumerating all possible hierarchies. N also O(3 ). The main difference is that to compute the The sampling procedure will build a tree structure in maximallikelihood hierarchical clustering, the maximal energy of the sub-hierarchy rooted at each node is a top-down way. We start with the cluster of all the computed, instead of the partition function. Pointers elements, X, then sample one child of that cluster, X ⊂X,(Eq.3)and set the other one to be the com- to the children of the maximal sub-hierarchy rooted L plement of X , i.e., X\X . This is repeated recursively at each node are stored at that node. A proof of the L L time complexity, analogous to the one for the partition from each of the children and terminates when a cluster contains a single element. A child X of parent X , function, can be found in §A.5 of the Appendix. L p i.e., X ⊂X issampled according to: L p 2.4 Computing Marginals 1 p(X |X ) = · ψ(X ,X \X ) ·Z(X )·Z(X \X ). In this section, we describe how to compute two types L p Z(X ) L p L L p L p of marginal probabilities. The first is for a given sub- (3) hierarchy rooted at X , i.e., H ∈ H(X ), defined as Pseudocode for this algorithm is given in Algorithm 3. P i i i P(Hi|X) = H∈A(Hi)P(H|X), where A(Hi) = {H : H ∈ Theorem 3. Sample(X) (Alg. 3) gives samples from H(X)∧Hi ⊂ H}, and Hi ⊂ H indicates that Hi is a P(H|X). subtree of H. Thus, we marginalize over every possi- ble hierarchy while keeping fixed the sub-hierarchy The proof is given in Appendix § A.2. This algorithm H . The second is for a given cluster, X , defined as is notable in that it does not require computing a cate- i P i P(X |X)= P(H|X), where A(X ) = {H : H ∈ gorical distribution over all trees and samples exactly i H∈A(Xi) i H(X)∧X ⊂ H}, and X ⊂ H indicates that cluster according to P(H|X). i i
no reviews yet
Please Login to review.