- 1Department of Mathematics and Statistics, University of Otago, Dunedin, New Zealand
- 2Algorithms in Bioinformatics, University of Tübingen, Tübingen, Germany
- 3Cluster of Excellence: Controlling Microbes to Fight Infection, University of Tübingen, Tübingen, Germany
NeighborNet constructs phylogenetic networks to visualize distance data. It is a popular method used in a wide range of applications. While several studies have investigated its mathematical features, here we focus on computational aspects. The algorithm operates in three steps. We present a new simplified formulation of the first step, which aims at computing a circular ordering. We provide the first technical description of the second step, the estimation of split weights. We review the third step by constructing and drawing the network. Finally, we discuss how the networks might best be interpreted, review related approaches, and present some open questions.
1 Introduction
Evolutionary relationships between species are usually visualized using a phylogenetic tree. When reticulate events are suspected to play an important role, a phylogenetic network is sometimes considered a more suitable representation. Even when reticulation is not present, networks can be useful for detecting problems with the data or ambiguities in the inferred phylogeny. One of the most widely used methods for computing such networks is NeighborNet, which was published over 20 years ago (Bryant and Moulton, 2002; Bryant and Moulton, 2004). It takes a distance matrix as input and produces a planar network as output, aiming to show both evolutionary relationships and conflicts in the data (see Figure 1).
FIGURE 1. Split network of 100 buttercup and related species, computed using the NeighborNet method, based on distances inferred from a multiple sequence alignment of protein sequences of the cytochrome c gene, downloaded from NCBI. This analysis is available in the examples directory of SplitsTreeCE.
The NeighborNet method has been applied within a wide range of contexts. A cursory survey of recent citations reveals applications to sea slugs, monkey pox, angelica, daisies, butterfly parasites, linguistics, entomopathogenic fungi, mussels, and Wenchang chickens. The main appeal of the method is that it does not force the given data onto a single phylogenetic tree, but instead can display incompatibilities in the data.
The computation of a phylogenetic network using the NeighborNet is performed in three main steps:
1. An agglomerative algorithm is used to compute a circular ordering of the taxa.
2. Non-negative least squares (NNLS) are used to estimate the split weights compatible with the given ordering.
3. A split network construction method is used to calculate the final network.
In this article, we first provide a new, simplified description of the agglomerative algorithm, focusing directly on the task of computing a cycle. We then describe (for the first time) the best-performing methods used to estimate split weights. This includes novel, low-level matrix multiplication algorithms and a recent survey of relevant NNLS optimization algorithms, followed by a review of techniques for constructing the network from a set of splits. We provide some thoughts on the interpretation and misinterpretation of NeighborNet and conclude with a list of open questions.
2 Splits, compatibility, and circularity
Throughout this article, we use X to denote a set of taxa of size n. We use
For a set of splits
where the sum is over all splits that separate x and y, that is, contain x and y in separate parts.
Let T be a phylogenetic tree on X, that is, a tree with no nodes of degree 2 and leaves labeled by elements of X one-to-one. Any edge (or branch) e in T defines a unique split S = A|B on X, with A and B being the set of taxa reachable from one end of e or the other, respectively, without crossing e. The set
A distance matrix d on X is called additive if there exists a phylogenetic tree T such that, for any two taxa x and y, the sum of edge weights along the path that connects them in T equals d(x, y). Equivalently, formulated in terms of splits, d is additive if and only if there exists a compatible set of splits
A set of splits
A distance matrix d is called circular if there exists a circular set of splits
3 First step: calculation of a circular ordering
Linkage cluster algorithms (Johnson, 1967) and distance-based phylogenetic tree algorithms such as the neighbor-joining algorithm (Saitou and Nei, 1987) use an agglomerative approach to constructing a tree. Initially, all n taxa are placed on isolated nodes. The methods then choose two connected components of the graph and link them via a new parent node. This is repeated until the graph is connected, and the result is a tree. Algorithms differ by how they select which components to link and how they set the edge lengths.
Here, we use an agglomerative approach to create a cycle rather than a tree. The cycle defines a circular ordering θ of the taxa. The general outline of the method is presented in Algorithm 1, where we use G = (V, E) to denote a graph with node set V and edge set E.
Algorithm 1.NeighborNet: Agglomerative Cycle Calculation
1: set G ← (X, ∅) ⊳ Initially, n isolated nodes
2: set A ← X ⊳ set of “active” nodes
3: while G has
4: select two different connected components P and Q in G, with |P ∩ A| ≤ |Q ∩ A|
5: if P and Q are both singletons then
6: we have P = {p} and Q = {q} with p, q ∈ A
7: create a new edge (p, q) ∈ E
8: else if P is a singleton and Q is a chain then
9: we have P = {p} with p ∈ A and |Q ∩ A| = 2
10: select q ∈ Q ∩ A
11: create a new edge (p, q) ∈ E and remove q from A
12: else⊳ P and Q both chains
13: we have |P ∩ A| = 2 and |Q ∩ A| = 2
14: select p ∈ P ∩ A and q ∈ Q ∩ A
15: create a new edge (p, q) ∈ E and remove p, q from A
16: end if
17: end while
18: create a new edge (p, q) ∈ E between the two remaining active nodes p, q ∈ A
19: return graph G = (X, E), consisting of a single cycle
Proposition 1. Algorithm 1 returns a cycle.
To see that Proposition 1 holds, note that if the first conditional expression is true, then two isolated nodes are connected by an edge, forming a chain of length 1 between two active nodes. If the second condition holds, then a chain Q is extended by a node p and the set of active nodes is updated to ensure that precisely the two ends of the extended chain are active. Otherwise, two chains P and Q are concatenated into a single chain and the set of active nodes is updated to ensure that the chain contains precisely two active nodes, one at each end. Each iteration reduces the number of connected components by one and so the while loop will terminate after n − 1 iterations.
Assume that we are given a distance matrix d on X as input. To complete the definition of the agglomerative part of the NeighborNet, we have to specify how the three different selections are respectively made in lines 4, 10, and 14 on the basis of d. Throughout Algorithm 1, we maintain and update the matrix d on A, the set of active nodes or taxa.
For any two connected components P and Q, we define the distance between P and Q as the average distance between the active nodes contained in P and the active nodes contained in Q, that is,
In line 4, we select a pair of components P and Q that minimizes the adjusted distance
summing over all components S ≠ P, Q, with m as the total number of components.
In line 10, we select the node q = Q ∩ A for which the adjusted distance between p and q is minimized. In more detail, for q ∈ {q1, q2} = Q ∩ A, we define
summing over all components S ≠ P, Q. We select q = q1 if d′(q1, p) ≤ d′(q2, p), else select q = q2.
Similarly, in line 14 we select the pair of nodes p and q that have minimal adjusted distance. In more detail, for p ∈ {p1, p2} = P ∩ A and q ∈ {q1, q2} = Q ∩ A, we define
summing over all components S ≠ P, Q. We select p and q that minimize d′(p, q).
We now describe how to update d. In the first conditional statement, the set of active nodes A is not changed and so d does not require updating.
In the second conditional statement, the selected node q is removed from the active set A and we update
where
In the third conditional statement, the selected nodes p and q are removed from the active set A and we update
where
In previous descriptions of the algorithm, the update was performed by applying the update formulas of the second conditional statement twice, first to
These calculations for selecting components and nodes, and for updating d, may seem quite complicated; however, they ensure that, if the input distance matrix is circular, then the computed circular ordering belongs to the associated set of circular splits. This is based on the following result.
Theorem 2. (Bryant et al., 2007; Levy and Pachter, 2011). Let d be a circular metric on X and let n = |X|. The pair x, y minimizing
is adjacent in some circular ordering compatible with d.
Note that Eq. 2 is the criterion used to select components to agglomerate in the neighbor-joining algorithm. If the input distances d are additive, then the pair x, y minimizing d′(x, y) corresponds to a cherry (leaves adjacent to the same internal node); for a simple proof of this fact, see Bryant (2005). It is remarkable that this result extends to circular metrics.
In the next section, we will discuss how to compute a set of weighted splits that are compatible with the calculated ordering, and we have the following result (Bryant et al., 2007).
Theorem 3. NeighborNet is consistent on circular distance matrices. In more detail, let d be a distance matrix on X and let
4 Second step: estimation of split weights
The first step of the NeighborNet method computes a circular ordering θ. We now describe the second step, in which we set up all splits that are compatible with the given ordering and then use least squares to determine their weights. This is a difficult problem to tackle in practice, and we discuss multiple algorithms to address it.
4.1 Setting up the problem
4.1.1 Linear algebra
Suppose we have a distance matrix d and a circular ordering θ = (x1, x2, …, xn) of the taxa. There is a set of O(n2) splits that are compatible with any such ordering, given by
Any choice of non-negative weights
where δA|B denotes the semi-metric defined as
Note that
This is an example of a NNLS problem.
The first step is to rewrite the problem using linear algebra.
Definition 4. Let θ = (x1, x2, …, xn) be an ordering of the taxa X. For each k < ℓ, let σ(kℓ) denote the split
Let A denote the
The matrix A has determinant
We let d and λ denote vectors of observed distances and split weights, so
for all ij and kℓ.
The NNLS problem to be solved is to minimize
subject to the constraint that λkℓ ≥ 0 for all k < ℓ. The function f has gradient
and a (non-negatively constrained) optimum given by the unique vector λ satisfying (∇f(λ))kℓ ≥ 0 for all kℓ and (∇f(λ))kℓ = 0 for all kℓ such that λkℓ > 0.
4.1.2 Fast matrix multiplication
In practical applications, the matrix A can be quite large. For example, if n = 1,000, then A has approximately 500,000 rows and columns and over 250 billion entries. However, we do not construct A explicitly in memory. Instead, we use the matrix implicitly, taking advantage of the structure in the matrix to derive efficient algorithms for computing Ax, ATx, and A−1x for a vector x. As we shall see, each of these can be computed in O(n2) time, which is linear in the number of entries of x.
1. If y = Ax, then
2. If y = ATx, then
3. If y = A−1x, then
Eq. 3 is obtained by summing over all splits separating two taxa adjacent in the order, while (6) involves a sum over all n − 1 pairs separated by a split {xi}|X − {xi}. All other identities are consequences of the observation that if y = Ax, then
This is essentially the combinatorial Crofton formula given by Chepoi and Fichet (1998), though with different indexing.
4.1.3 Numerical error
As the number of taxa increases, the runtime complexity of the algorithms clearly becomes critical. In our experience, the control of numerical errors is of equal, or possibly greater, importance. These issues are not new; there is a vast literature on numerical errors and their impact on least squares problems; for comprehensive introductions, see Dahlquist and Björck (2003) and Golub and Van Loan (2013).
As an illustration, consider the algorithms for computing Ax and A−1x, outlined in the previous section. Suppose that the number of taxa n equals 500, and we simulate an n (n − 1)/2-dimensional vector x by drawing each entry independently from a standard uniform distribution. If we compute y = Ax and then compute z = A−1y, then we might expect
In practice, we have found that, on average,
while
The exact figure will depend on the choice of the norm, on n and also on the architecture of the computer and software. Independent of the details, we should not expect the calculation of gradients, function values, and estimates of split weights to be exact.
By itself, this level of imprecision will not necessarily create problems, as it is hard to envisage a data set where differences to the order of 10–7 would have a noticeable impact on the analysis. Unfortunately, the NNLS problem that we have to solve is ill-conditioned, that is, small changes in the data or small errors in the computation can get amplified and cause serious difficulties. The condition number of a matrix with respect to the standard Euclidean norm ‖ ⋅‖ is defined (Golub and Van Loan, 2013) as
The larger the condition number of A, the more sensitive the solution of the linear equation Ax = y is to changes in y.
We do not have an exact analytical formula for the condition number κ2(A) of A, but calculations for n ≤ 50 suggest that κ2(A) grows faster than
These numerical issues have practical ramifications. Because of the size of the problems that we consider, we will typically use iterative algorithms to solve the various linear systems which arise. Numerical issues can lead to a failure of convergence for these methods. Even when the methods do converge, we still have to specify some kind of stopping condition. Any stopping condition needs to be realistic with respect to the level of accuracy which could possibly be achieved. Even deciding whether a solution is approximately optimal becomes challenging.
4.2 Methods
NNLS is a classical problem of numerical optimization and several strategies are available. We review three separate approaches and describe the modifications that we have developed to adapt them to our problems.
To facilitate comparisons between the methods, we use the same initial split weights and the same criterion for convergence each time. The initial split weights are determined by
using the aforementioned methods and replacing the negative entries with zeroes. If all entries of A−1d are initially non-negative, then this will be the optimal NNLS solution and no iterations are necessary. As a consequence, if d is already a circular metric, then split weights are determined optimally in O(n2) time.
If the initial conditions are not already optimal, the iterative algorithms are called until a convergence criterion is satisfied. For any putative solution λ, we compute the projected gradient g defined by
Then, ‖g‖2 = 0 if and only if λ solves the NNLS problem. To account for numerical imprecision, we consider that the method has converged when ‖g‖2 < δ. The default value that we use for δ is 10–8‖ATd‖2, which scales with n and is similar to the stopping criteria used in the SplitsTree4 method.
4.2.1 Active-set method
The active-set method (Lawson and Hanson, 1995; Nocedal and Wright, 2006) is one of the most widely used algorithms for solving NNLS problems. It is the method used for computing split weights in the SplitsTree4 (Huson and Bryant, 2006). The active set is a set of indices
subject to the constraint that λij = 0 for all
1. The solution λ* to the equality-constrained problem is non-negative and
2.
During each iteration, we update the active set
Algorithm 2.Active-Set Method
1: λ ← any feasible solution
2:
3: loop
4: repeat
5: let λ* minimize ‖Aλ − d‖ such that λij = 0 for all
6: if λ* is infeasible then
7: let λ be the feasible point on the line from λ to λ* which is closest to λ*
8:
9: end if
10: until λ* is feasible
11: g ←AT(Aλ* − d)
12: if gij ≥ 0 for all
13: Return λ*
14: end if
15: Remove the pair ij from
16: λ ←λ*
17: end loop
A key component of the active-set method is the algorithm used to minimize ‖Aλ − d‖ over all λ such that λij = 0 for all
Two standard algorithms for solving this kind of linear system are QR decomposition and Cholesky decomposition (Golub and Van Loan, 2013). Neither is practical for the size of problem that we are dealing with here due to both running time and memory requirements. In SplitstreeCE, we use CGNR (Saad, 2003), a version of the conjugate gradient algorithm designed for solving normal equations. The algorithms of Section 4.1.2 can be used to efficiently multiply vectors by B or BT without having to construct either matrix in memory.
The classical implementation of the active-set method only allows the active set
Under exact arithmetic, CGNR typically converges to a solution in finite time much more quickly than exact linear equation solvers. In practice, for large problems, numerical problems can cause the method to break down and either converge very slowly or not converge at all. We found that for large problems, the algorithm often took too long to converge. Figure 2A gives a plot of the (log) residual versus iteration for a typical call to CGNR on the Streptococcus agalactiae data of Morach et al. (2018). The graph shows the initial rapid convergence followed by a long tail of linear convergence. The residual reduces each iteration, but too slowly. We tried implementing periodic restarts, but this had little effect. We also designed a number of preconditioners (Saad, 2003) but were unable to find one which reliably improved the performance.
FIGURE 2. (A) Convergence (in the norm of the gradient) of the CGNR method and (B) convergence (in the norm of the projected gradient) of the active-set, gradient projection, and APGD methods, as a function of wall-clock time.
Our strategy is to not run CGNR to convergence. This is similar to a standard restart, the difference being that we update the active set between runs. We bound the number of iterations of the CGNR by 50 or the number of taxa, whichever was larger.
Figure 2B shows the rate of convergence plot for the active-set method as applied to the S. agalactiae data, both with and without running CGNR to convergence, as a function of wall-clock time (on a MacBook Air 2021). For both curves, the error initially remains high, before dropping rapidly. This is to be expected, and reflects the fact that once a good active set is identified, the convergence is extremely rapid. Also note that restricting the number of iterations of CGNR gives a two-fold increase in speed.
4.2.2 Gradient projection method
In each iteration of the active set method, we start at a feasible point λ and move as far as we can toward the (approximate) solution λ* of the subproblem, while still remaining feasible. The gradient projection method takes a different approach. Instead of moving along the line from λ to λ* and stopping as soon as the point becomes infeasible, it moves along the line from λ to λ* and projects any infeasible points back into the feasible region (Nocedal and Wright, 2006).
Given a point λ, let π(λ) denote the feasible point which is closest to λ, that is, π(λ) is a vector with
More compactly, π(λ)ij = max(λij, 0). For the gradient projection algorithm, we select a search direction (in this case, the search direction p = −∇f(λ) of the steepest descent) and carry out a one-dimensional line search with respect to the function
Each pair ij of indices with pij < 0 is associated to a breakpoint tij such that
Nocedal and Wright (2006) propose a line search strategy which examines breakpoints in the order of increasing magnitude, stopping when q(t) reaches a local minimum. This strategy works well for small numbers of taxa, however, when n is large, the number of breakpoints becomes large, the gaps between them become tiny, and the algorithm grinds to a halt.
Instead, we implemented a version of the gradient projection method due to Cartis et al. (2012), though simplified, as we do not require trust regions and do not incorporate regularization. In this approach, the line search can terminate before reaching a local optimum, provided the technical conditions are satisfied (see Conn et al., 2000, Section 12.1; Cartis et al., 2012), conditions which guarantee convergence of the method.
Algorithm 3.Gradient Projection Method
1: λ ← any feasible solution
2: while λ is not optimal, do
3: p ← − AT(Aλ − d)
4: Let t* be the first local optimum of q(t) = ‖A(π(λ + tp)) − d‖
5: λc ← π(λ + t*p)
6:
7: end while
We carry out several iterations of the conjugate gradient algorithm to find λ* such that f(λ*) ≤ f(λc) and
and let λ = π(λc + t(λ* − λc)).
The rate of the convergence plot for the S. agalactiae data set is shown in Figure 2B. Initially, the error drops quickly, more quickly than for the active-set method. The rate of convergence then slows, and the projected gradient norm fluctuates significantly from one iteration to the next.
4.2.3 Accelerated projected gradient descent method
The accelerated projected gradient descent (APGD) method (Nesterov, 2003; Mazhar et al., 2015) is a first-order method (using only first derivatives) and yet exhibits a guaranteed rate of convergence. The basis of the method is a projected version of the steepest descent
where (as previously mentioned) π(λ) is the vector formed by replacing the negative entries of λ with 0 and α is a carefully chosen step length. Nesterov devised several acceleration modifications. These are often described as including “momentum” in the optimization algorithm. We implement an adaptive scheme which resets the momentum term if the objective function increases during an iteration.
Algorithm 4.Accelerated Projected Gradient Descent Method
1: λ(0) ← any feasible solution, y(0) ←λ(0)
2: Choose θ0 ∈ (0, 1)
3: for k = 0, 1, 2, … until convergence do
4: g ←AT(Ay(k) − d))
5:
6:
7:
8: y(k+1) ←λ(k+1) + βk+1(λ(k+1) − λ(k))
9: if gT(λ(k+1) − λ(k)) > 0 then
10: y(k+1) ←λ(k+1)
11: θk+1 ←θ0
12: end if
13: end for
In our experience, there was essentially no difference in performance for α0 = 0.1, 0.5, 0.9, or 1.0. The plot in Figure 2B shows a steady rate of convergence, initially fast and then leveling off.
4.3 Performance comparison
To compare the different approaches described above, we selected 1,200 prokaryotic genomes that have a mash distance of <0.3 from Escherichia coli K12, using a sketch size of 10,000 and k-mer size of 21 (Ondov et al., 2016) and computed all pair-wise mash distances between them. From this distance matrix, we randomly subsampled 20 replicates of smaller distance matrices of sizes n = 50, 100, 140, …, 1,000. For each such replicate, we computed a circular ordering and then applied the active-set method, gradient projection method, or APGD method, as well as the “SplitsTree4” method that is the implementation of the active-set method that uses SplitsTree4 (Huson and Bryant, 2006).
The results of this study, summarized in Figure 3, suggest that the APGD is the fastest method, while the active-set method is the second fastest, providing the best fit (equal to the fit of the old implementation of the same method in our program SplitsTree4), and the smallest number of splits. The gradient projection method runs the slowest, producing many more splits, with a much poorer fit. Times reported are the wall-clock times, running all four methods in parallel on a Mac Pro 2020 workstation. Based on these observations (which we also confirmed on two other data sets that are not shown here), in our program the SplitsTreeCE, we made the (modified) active-set method the default method.
FIGURE 3. For each data set of size 50–1,000, averaged over 20 replicates per size, and for each of four constrained least-squares approaches, we report (A) wall-clock time in seconds, (B) number of internal splits, and (C) percent fit (as defined in Section 6.2.1).
5 Third step: construction of a network
The third main step of the NeighborNet is to construct and draw a network that represents the set of circular splits calculated in the first two parts of the method. This step is described in Dress and Huson (2004) and a simplified visualization is provided in Bagci et al. (2021).
5.1 Split networks
As discussed above, if we are given a set of splits
More generally, any set of splits
The first two steps of the NeighborNet compute a circular ordering θ = (x1, x2, …, xn) and a set of splits
FIGURE 4. For 11 listed splits on X = {a, b, c, d, e, f} and weights, we show their representation as a split network, with edges labeled 1 − 11 to indicate the associated splits.
Here, we summarize the main properties of a split network.
(N1) Each edge is associated with a single split, and each split is associated with at least one edge.
(N2) Removing all the edges associated with some split divides the network into two connected components. Each component contains the taxa on one side of the split.
Both of these properties also hold trivially for unrooted phylogenetic trees. They imply that, for any two taxa x and y, a path from x to y in the network will cross at least one edge associated to each split that separates x and y. In fact, a stronger property holds.
(N3) The edges along any geodesic (shortest path) in the graph are associated with different splits.
Hence, for any taxa x and y, any shortest path from x to y contains exactly one edge associated with each split separating x and y. Alternatively, we can replace (N3) by the following convexity property.
(N3’) For any split, the two associated connected components are convex, that is, each contains all the shortest paths between any two nodes.
Properties (N1) to (N3) guarantee that the edges along any shortest path between the taxa correspond exactly to the splits separating those taxa. As a consequence, the total length of the shortest path between the two taxa x and y is exactly
where the sum is over all splits that separate x and y. This implies that the split network is a faithful representation of the decomposition into split metrics.
Split networks are known in other branches of mathematics as partial cubes, which mean that there is a map from the graph to a hypercube that preserves distances. It follows from this that we can assume the following property for any drawing of a split network.
(N4) The edges associated with a split A|B are parallel line segments with the length equal to the weight of the split A|B.
5.2 Planar split networks
To draw a split network, we have to assign coordinates to all nodes. We will discuss this for circular splits. The NeighborNet is an attractive visualization technique because of the following result.
Theorem 5. A set of splits
One way to show this is using de Bruijn’s dualization (de Bruijn, 1986). We place the taxa on the unit circular in the order θ and then represent each split A∣B by a line that separates those two parts of the split. This is known as a line arrangement. The dual is the graph N obtained by placing a node on each taxon and in each bounded region of the arrangement. The edges are placed between any two nodes whose regions intersect along a line segment. This construction is demonstrated in Figure 5.
A general characterization of when a collection of splits
In both the equal-angle algorithm and outline algorithm (Bagci et al., 2021), we first assign an angle to each taxon based on its position in the ordering θ = (x1, …, xn), setting
FIGURE 5. (A) Splits from Figure 4 drawn as lines separating points on a circle and (B) the corresponding dual graph.
6 Interpretation of NeighborNet output
Split networks produced using the NeighborNet are a generalization of phylogenetic trees and must be interpreted accordingly (see Figure 1). In this section, we give some general guidelines to help this process.
6.1 The networks do not explicitly depict evolutionary scenarios
The most important fact to take into account is that a split network does not provide an explicit evolutionary scenario (Huson et al., 2010); internal nodes usually do not correspond to putative ancestors and edges do not always represent different lineages or reticulation events. Such a network provides an implicit representation of evolution in which the key features are the splits and their weights (or lengths).
6.2 The networks represent distances
In a phylogenetic tree, the length of the path between two taxa represents the inferred evolutionary (patristic) distance between the two taxa. Distance-based methods typically work by first estimating pair-wise distances between sequences and then finding an evolutionary tree so that the distances in the tree approximate the distances used as input.
In a split network, the length of the shortest path between two taxa represents the inferred evolutionary (patristic) distance. Because split networks are a generalization of phylogenetic trees, that often allow a better approximation of the input distances than can be realized using a tree.
In SplitsTree, the closeness of approximation is measured using a fit statistic. Let dij denote the input distances and pij denote the distances in the network. The fit statistic is defined as
reported as a percentage. If the network distances exactly match the input distances, then the fit is 100%.
In our experience, the fit statistic for biological sequence data is usually above 90%, indicating that the distances in the network provide a good approximation of the input distances. However, it is easy to construct the input data that give rise to a poor fit. In particular, Euclidean distances are not well suited as input, although they fit well with multi-dimensional scaling techniques. A low-fit statistic indicates that the network provides only a poor approximation of the input distances and so inferences should be made from the network with caution.
6.3 The networks are not based on generative model
A widespread trend in statistical phylogenetics is to carry out inference using complex stochastic and generative models. The models are constructed so as to mimic as many different evolutionary processes as possible. These analyses prioritize statistical power, which makes good sense if the model is reasonable and there are no surprises in the data.
The NeighborNet algorithm is not based on a generative model. There are no model parameters or prior distributions controlling where and how frequently reticulations occur. The only model of data variability is the assumption of noise in the distance data implicit in the least squares approach.
Because the method is statistically consistent on circular metrics, we can say something about what the NeighborNet will do when applied to data generated according to a corresponding phylogenetic model. When the input distances are additive, the NeighborNet will return the corresponding tree. When the distances are almost additive, the NeighborNet will return a split network that is close to a tree. If a distance is generated from a mixture of trees and the combined splits of those trees are circular, then the NeighborNet will represent the averaged splits of the trees. This applies, for example, to a pair of trees which differ by a single subtree transfer operation.
It is possible to bind the expected error that is introduced by applying distance corrections for alignments composed of multiple heterogeneous blocks (Bryant et al., 2003).
6.4 The networks are akin to phylogenetic scatter plots
A useful analogy to use when interpreting the output of the NeighborNet is a scatter plot. Suppose that we have a collection of pairs of values (xi, yi), with 1 ≤ i ≤ n, and assume that we suspect that the values are generated using a simple model
where the two variables α and β are the parameters being inferred and the ϵi values are independent random variables. The true model is essentially a line and so a model-based inference would focus on the line inferred or the corresponding parameters, perhaps with their uncertainties. When we produce the scatter plot, we are not making assumptions about the underlying model, nor are we necessarily making concrete progress toward inferring the true values of the parameters. Instead, we are learning more about the data and their suitability for the model-based analysis that we might have planned.
Just as a scatter plot might reveal outliers or strange patterns in the data, a NeighborNet might reveal errors in sequencing or labeling, or perhaps indicate the potential of conflicting signals in the data which might make use wary of assuming the suitability of an analysis based on a single tree.
In this sense, the process of going from distance data to a split network in the NeighborNet is closer to a data transform than a model-based inference. The method shares similarities to Hadamard conjugation (Hendy and Penny, 1993), which also produces a set of splits with weights. In the case that the splits are circular, the NeighborNet applied to correct distances provides an approximation of the spectrum produced by Hadamard conjugation; an approximation which gets more and more accurate as the sequence length increases.
7 Open problems and related work
Despite 20 years of work examining and improving the NeighborNet, there are several problems that remain open.
7.1 Simplifying the NeighborNet ordering algorithm
At present, the NeighborNet algorithm uses a two-stage selection process when choosing how to join chains: first the two chains are chosen and next the ends to be joined are selected for either chain. We wish to determine whether this step can be reduced to a single-stage criterion or whether such a simplified algorithm is impossible is determined, see Bryant et al. (2007).
7.2 Searching through circular orderings
A standard strategy for inferring a phylogeny is to start with a tree that is determined using a heuristic such as the neighbor-joining and then carry out local search to optimize some criteria. The same strategy could be implemented for inferring circular networks; however, the question is which criterion to use.
Bandelt and Dress (1992b) observed that if d is a circular metric, then a permutation σ corresponds to a circular ordering (xσ(1), xσ(2), …, xσ(n)) compatible with d if and only if the tour length
is minimal. Hence, we can infer a circular split network consistently by solving the travelling salesman problem (TSP). This approach was explored by Eslahchi et al. (2010), who proposed a simple insertion scheme followed by randomized local search to find an ordering with small total length.
One problem in using the TSP to infer the ordering is that it appears highly vulnerable to noise in the distance estimates. With such a large number of different pairs, there is a reasonable chance that one distance might be substantially underestimated, with significant, random, impact on the minimal tour.
A potential solution for this problem is to use a related criterion which involves averages of larger numbers of distances, and thereby reduces the impact of the outliers. Suppose that X is the set of taxa and Y ⊆ X. The restriction d|Y of the distance matrix to elements in Y will also be a circular metric. Furthermore, if σ corresponds to a circular ordering compatible with d, then the restriction σ|Y of σ to elements in Y will be a circular ordering compatible with d|Y. This suggests a criterion
where w is a set of non-negative weights such that w|X| > 0. Then, σ is compatible with a circular metric d if and only if ℓw(σ) is minimized. This criterion is consistent, involves averages of many estimates and can be computed efficiently by carefully determining the contribution of each distance pair d(xσ(i), xσ(j)).
7.3 Faster estimation of split weights
Say that practical implementations of the NeighborNet can take a prohibitive amount of time to run on a data set of several thousand taxa. Then, the computational bottleneck lies in the NNLS estimation of the split weights. The experimental results that we report represent only a small fraction of the total strategies attempted to make the NNLS algorithm run more quickly. As mentioned, the running time is only one factor, and the increase in numerical errors with more taxa is of comparable significance.
In the past (SplitsTree4), we used the active-set method; now, we use an improved implementation of such a method (SplitsTreeCE). Both implementations make repeated calls to CGNR, the conjugate gradient algorithm, so speeding up CGNR would make a direct and substantial impact on the running time of our implementation of the NeighborNet. The standard technique for making conjugate gradients run better is to use preconditioning (Saad, 2003); however, we have been unable to design a preconditioner that gives a reliable improvement in running time. Such a preconditioner would have to take advantage of the special structure of the matrix A, restricted to a subset of columns.
It may make more sense to avoid NNLS completely. The least squares method is familiar and mathematically attractive, but does not best capture the error in the observed distances. It may be possible to adopt a different criterion that retains some of the regularization ability of NNLS but can be computed far more efficiently.
7.4 NeighborNet for non-distance data
Usually, reducing a data set down to a distance matrix entails a significant loss in information. There have been several investigations into adapting NeighborNet for other types of data.
A quartet is an unrooted, binary (fully resolved) phylogeny on four taxa. The quartet with taxa a and b separated from taxa c and d by the internal branch is denoted by ab|cd. One persistent paradigm for constructing phylogenetic trees is to first infer a collection of quartets on different subsets of taxa and use combinatorial algorithms to assemble these quartets into a full phylogeny.
The problem of constructing circular split networks adapts well to quartet data. Grünewald et al. (2007) explored this approach extensively, resulting in the QNet, a method that can be described as a quartet version of the NeighborNet. Hassanzadeh et al. (2012) implemented a simulated annealing algorithm to maximize a quartet-based criterion for circular “super”-networks.
7.5 Inferring circular networks from trees
As previously mentioned, we considered different coefficients for updating the distance matrices during the agglomeration step. There is also scope for different weights when computing the distances between clusters in the first selection step. Levy and Pachter (2011) explored the effect of these weights and demonstrated that the weights can be chosen such that the neighbor-joining tree is embedded in the network. A consequence (which also follows from Theorem 2) is that the neighbor-joining algorithm could, by itself, be used to help construct circular split networks (Guo and Grünewald, 2023).
The theorem suggests a new approach to infer split networks:
1. Infer an unrooted phylogenetic tree T (e.g., using the neighbor-joining).
2. Search through circular orderings which are compatible with the splits of T.
3. Estimate split weights for the corresponding circular splits.
Guo and Grünewald (2023) propose an integer linear programming algorithm for the second step. The PQ-tree–based algorithm for the TSP of Burkard et al. (2005) could also be used, though it might be worth adapting the algorithm to optimize a criterion that is not so vulnerable to random error.
7.6 Taking advantage of structure in the alignment
One of the strengths of the NeighborNet is that it only requires distance data, so it can be applied in a wide variety of contexts. This is also one of its weaknesses. The reason is that the process of computing distances from an alignment discards all of the structural information on which groups of sites support which phylogenetic signals.
As an illustration, consider the phi test (Bruen et al., 2006) for recombination, a method which performs well in many situations. The phi test evaluates a statistic that measures the extent to which nearby sites are more compatible than distant sites and tests for recombination by seeing how this statistic compares to those for the same alignment with sites randomly permuted.
In a NeighborNet analysis, permuting the sites has no effect on the distance estimation, so all of the signals in the data that the phi test uses to detect recombination is ignored by the NeighborNet. Addressing this while still maintaining the speed and practicality of the NeighborNet would represent a significant step forward.
8 Summary
The NeighborNet algorithm is related to the split decomposition (Bandelt and Dress, 1992a), neighbor-joining, and pyramidal clustering methods (Diday, 1984), yet differs substantially from all these methods. The algorithm is widely used in many different fields due to its ability to quickly visualize data and incompatibilities.
The second step of the program is computationally intensive. This has been a significant practical limitation, one which was quite challenging to overcome. There is also scope for exploring facets of the data which are not preserved in distance data.
The kind of analysis carried out using the NeighborNet is complementary to many of the accepted approaches to phylogenetic, phylogenomic, and phylodynamic analyses. The analysis more resembles a signal transform or spectral analysis than an estimation of model parameters. While the NeighborNet does not address the problem of inferring the finer parameters of a sophisticated model, it is widely used for data representation and exploration.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.
Author contributions
DB wrote the initial draft of the manuscript and did the work on split weight estimation. DH created the SplitsTreeCE software package and implemented the network visualization algorithms. Both authors developed the new formulation of the agglomerative cycle calculation algorithm and both wrote the full manuscript and approved the final version.
Funding
The authors thank the Royal Society Te Apārangi (New Zealand) for funding under the Catalyst Leader program (Agreement # ILF-UOC1901). They acknowledge support by the Open Access Publishing Fund of the University of Tübingen.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, editors, and reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Bagci, C., Bryant, D., Cetinkaya, B., and Huson, D. H. (2021). Microbial phylogenetic context using phylogenetic outlines. Genome Biol. Evol. 13, evab213. doi:10.1093/gbe/evab213
Balvociute, M., Bryant, D., and Spillner, A. (2017). When can splits be drawn in the plane? SIAM J. Discrete Math. 31, 839–856. doi:10.1137/15m1040852
Bandelt, H.-J., and Dress, A. (1992a). Split decomposition: A new and useful approach to phylogenetic analysis of distance data. Mol. Phylogenetics Evol. 1 (3), 242–252. doi:10.1016/1055-7903(92)90021-8
Bandelt, H.-J., and Dress, A. W. M. (1992b). A canonical decomposition theory for metrics on a finite set. Adv. Math. 92, 47–105. doi:10.1016/0001-8708(92)90061-o
Bandelt, H. J., Forster, P., Sykes, B. C., and Richards, M. B. (1995). Mitochondrial portraits of human populations using median networks. Genetics 141, 743–753. doi:10.1093/genetics/141.2.743
Bohne, J., Dress, A. W. M., and Fischer, S. (1992). A simple proof for de Bruijn’s dualization principle. Sankhya. Ser. A 54, 77–84.
Bruen, T. C., Philippe, H., and Bryant, D. (2006). A simple and robust statistical test for detecting the presence of recombination. Genetics 172, 2665–2681. doi:10.1534/genetics.105.048975
Bryant, D., and Dress, A. W. M. (2006). Linearly independent split systems. Eur. J. Comb. 28, 1814–1831. doi:10.1016/j.ejc.2006.04.007
Bryant, D., Huson, D., Kloepper, T., and Nieselt-Struwe, K. (2003). “Distance corrections on recombinant sequences,” in Wabi (Cham: Springer), 271–286.
Bryant, D., and Moulton, V. (2002). “NeighborNet: An agglomerative method for the construction of planar phylogenetic networks,” in Algorithms in Bioinformatics, WABI 2002. Editors R. Guigó, and D. Gusfield (Cham: Springer Science), 2452, 375–391. LNCS.
Bryant, D., and Moulton, V. (2004). NeighborNet: An agglomerative algorithm for the construction of planar phylogenetic networks. Mol. Biol. Evol. 21, 255–265.
Bryant, D., Moulton, V., and Spillner, A. (2007). Consistency of the neighbor-net algorithm. Algorithms Mol. Biol. 2, 8. doi:10.1186/1748-7188-2-8
Bryant, D. (2005). On the uniqueness of the selection criterion in neighbor-joining. J. Classif. 22, 3–15. doi:10.1007/s00357-005-0003-x
Buneman, P. (1971). “The recovery of trees from measures of dissimilarity,” in Mathematics in the archaeological and historical sciences. Editors F. R. Hodson, D. G. Kendall, and P. Tautu (Edinburgh: Edinburgh University Press), 387–395.
Burkard, R. E., Deineko, V. G., and Woeginger, G. J. (2005). “The travelling salesman and the pq-tree,” in Integer Programming and Combinatorial Optimization: 5th International IPCO Conference Vancouver, British Columbia, Canada, June 3–5, 1996 Proceedings, British Columbia, Canada, June 3–5, 1996 (Cham: Springer), 490–504.
Cartis, C., Gould, N. I. M., and Toint, P. L. (2012). An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA J. Numer. Analysis 32, 1662–1695. doi:10.1093/imanum/drr035
Chepoi, V., and Fichet, B. (1998). A note on circular decomposable metrics. Geom. Dedicata 69, 237–240. doi:10.1023/a:1004907919611
Conn, A., Gould, N. I. M., and Toint, P. L. (2000). Trust-region methods, vol. 1 of MPS-SIAM series on optimization. Philadelphia, PA: SIAM.
Dahlquist, G., and Björck, Å. (2003). Numerical methods (Courier corporation). New York, United States: Dover Publications.
de Bruijn, N. G. (1986). Dualization of multigrids. J. de Physique 47, C3-C9–C3-18. doi:10.1051/jphyscol:1986302
Diday, E. (1984). Une représentation visuelle des classes empiétantes: Les pyramides. Ph.D. thesis (INRIA.
Dress, A. W. M., and Huson, D. (2004). Constructing splits graphs. IEEE/ACM Trans. Comput. Biol. Bioinforma. 1, 109–115. doi:10.1109/tcbb.2004.27
Eslahchi, C., Habibi, M., Hassanzadeh, R., and Mottaghi, E. (2010). Mc-net: A method for the construction of phylogenetic networks based on the monte-carlo method. BMC Evol. Biol. 10, 254. doi:10.1186/1471-2148-10-254
Gambette, P., and Huson, D. H. (2008). Improved layout of phylogenetic networks. IEEE/ACM Trans. Comput. Biol. Bioinforma. 5, 472–479. doi:10.1109/tcbb.2007.1046
Grünewald, S., Forslund, K., Dress, A., and Moulton, V. (2007). Qnet: An agglomerative method for the construction of phylogenetic networks from weighted quartets. Mol. Biol. Evol. 24, 532–538. doi:10.1093/molbev/msl180
Guo, M., and Grünewald, S. (2023). Lpnet: Reconstructing phylogenetic networks from distances using integer linear programming. Methods Ecol. Evol. 14, 1276–1286. doi:10.1111/2041-210X.14086
Hassanzadeh, R., Eslahchi, C., and Sung, W.-K. (2012). Constructing phylogenetic supernetworks based on simulated annealing. Mol. phylogenetics Evol. 63, 738–744. doi:10.1016/j.ympev.2012.02.009
Hendy, M. D., and Penny, D. (1993). Spectral analysis of phylogenetic data. J. Classif. 10, 5–24. doi:10.1007/bf02638451
Huson, D. H., and Bryant, D. (2006). Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23, 254–267. doi:10.1093/molbev/msj030
Huson, D. H., Rupp, R., and Scornavacca, C. (2010). Phylogenetic networks. Cambridge: Cambridge University Press.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika 32, 241–254. doi:10.1007/bf02289588
Levy, D., and Pachter, L. (2011). The neighbor-net algorithm. Adv. Appl. Math. 47, 240–258. doi:10.1016/j.aam.2010.09.002
Mazhar, H., Heyn, T., Negrut, D., and Tasora, A. (2015). Using Nesterov’s method to accelerate multibody dynamics with friction and contact. ACM Trans. Graph. (TOG) 34, 1–14. doi:10.1145/2735627
Morach, M., Stephan, R., Schmitt, S., Ewers, C., Zschöck, M., Reyes-Velez, J., et al. (2018). Population structure and virulence gene profiles of Streptococcus agalactiae collected from different hosts worldwide. Eur. J. Clin. Microbiol. Infect. Dis. 37, 527–536.
Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course, 87. Cham: Springer Science and Business Media.
Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B., Bergman, N. H., Koren, S., et al. (2016). Mash: Fast genome and metagenome distance estimation using minhash. Genome Biol. 17, 132. doi:10.1186/s13059-016-0997-x
Phipps, P., and Bereg, S. (2011). Optimizing phylogenetic networks for circular split systems. IEEE/ACM Trans. Comput. Biol. Bioinforma. 9, 535–547. doi:10.1109/tcbb.2011.109
Keywords: NeighborNet, phylogenetic networks, SplitsTree, split networks, planar graph drawing
Citation: Bryant D and Huson DH (2023) NeighborNet: improved algorithms and implementation. Front. Bioinform. 3:1178600. doi: 10.3389/fbinf.2023.1178600
Received: 03 March 2023; Accepted: 04 August 2023;
Published: 20 September 2023.
Edited by:
Sean O’Donoghue, Garvan Institute of Medical Research, AustraliaReviewed by:
Mario Inostroza-Ponta, University of Santiago, ChileDaming Zhu, Shandong University, China
Copyright © 2023 Bryant and Huson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: David Bryant, david.bryant@otago.ac.nz
†These authors have contributed equally to this work