Hierarchical Quantification of Synergy in Channels

Perrone, Paolo; Ay, Nihat

doi:10.3389/frobt.2015.00035

ORIGINAL RESEARCH article

Front. Robot. AI, 08 January 2016

Sec. Computational Intelligence in Robotics

Volume 2 - 2015 | https://doi.org/10.3389/frobt.2015.00035

This article is part of the Research Topic Theory and Applications of Guided Self-Organisation in Real and Synthetic Dynamical Systems

Paolo Perrone¹* and Nihat Ay¹,²,³

¹Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
²Faculty of Mathematics and Computer Science, University of Leipzig, Leipzig, Germany
³Santa Fe Institute, Santa Fe, NM, USA

The decomposition of channel information into synergies of different order is an open, active problem in the theory of complex systems. Most approaches to the problem are based on information theory and propose decompositions of mutual information between inputs and outputs in several ways, none of which is generally accepted yet. We propose a new point of view on the topic. We model a multi-input channel as a Markov kernel. We can project the channel onto a series of exponential families, which form a hierarchical structure. This is carried out with tools from information geometry in a way analogous to the projections of probability distributions introduced by Amari. A Pythagorean relation leads naturally to a decomposition of the mutual information between inputs and outputs into terms, which represent single node information, pairwise interactions, and in general n-node interactions. The synergy measures introduced in this paper can be easily evaluated by an iterative scaling algorithm, which is a standard procedure in information geometry.

1. Introduction

In complex systems like biological networks, for example neural networks, a basic principle is that their functioning is based on the correlation and interaction of their different parts. While correlation between two sources is well understood, and can be quantified by Shannon’s mutual information (see, for example, Kakihara (1999)), there is still no generally accepted theory for interactions of three nodes or more. If we label one of the nodes as “output,” the problem is equivalent to determining how much two (or more) input nodes interact to yield the output. This concept is known in common language as “synergy,” which means “working together,” or performing a task that would not be feasible by the single parts separately.

There are a number of important works which address the topic, but the problem is still considered open. The first generalization of mutual information was interaction information (introduced in McGill (1954)), defined for three nodes in terms of the joint and marginal entropies:

\begin{align} I (X : Y : Z) = {} & - H (X, Y, Z) + H (X, Y) + H (X, Z) + H (Y, Z) \\ & - H (X) - H (Y) - H (Z) . \end{align}

(1)

Interaction information is defined symmetrically on the joint distribution, but most approaches interpret it by looking at a channel (X,Y) → Z rather than at a joint distribution. For example, we can rewrite equation (1) equivalently in terms of mutual information (choosing Z as “output”):

I (X : Y : Z) = I (X, Y : Z) - I (X : Z) - I (Y : Z),

(2)

where we see that it can mean intuitively “how much the whole (X,Y) gives more (or less) information about Z than the sum of the parts separately.” Another expression, again equivalent, is

I (X : Y : Z) = I (X : Y | Z) - I (X : Y),

(3)

which we can interpret as “how much conditioning over Z changes the correlation between X and Y ” (see Schneidmann et al. (2003a)). Unlike mutual information, interaction information carries a sign:

• I > 0: synergy. Conditioning on one node increases the correlation between the remaining nodes. Or, the whole gives more information than the sum of the parts. Example: XOR function.

• I < 0: redundancy. Conditioning on one node decreases, or explains away the correlation between the remaining nodes. Or, the whole gives less information than the sum of the parts. Example: X = Y = Z.

• I = 0: 3-independence. Conditioning on one node has no effect on the correlation between the remaining nodes. Or, the whole gives the same amount of information as the parts separately. The nodes can nevertheless still be conditionally dependent. Example: independent nodes.¹
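As a quick numerical check of Equation (1) and of the three sign cases listed above, the following Python sketch computes interaction information from a joint distribution stored as a dictionary. The dictionary representation and the helper names `entropy` and `interaction_information` are illustrative assumptions, not part of the original work.

```python
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy in bits of the marginal of `joint` on the coordinates `idx`."""
    marg = {}
    for cell, prob in joint.items():
        key = tuple(cell[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(v * log2(v) for v in marg.values() if v > 0)

def interaction_information(joint):
    """Equation (1) for a joint distribution over triples (x, y, z)."""
    H = lambda *idx: entropy(joint, idx)
    return (-H(0, 1, 2) + H(0, 1) + H(0, 2) + H(1, 2)
            - H(0) - H(1) - H(2))

# Z = X XOR Y with uniform, independent X and Y.
xor = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
# X = Y = Z: one fair bit copied three times.
copy = {(b, b, b): 0.5 for b in (0, 1)}
# Three independent fair bits.
independent = {cell: 0.125 for cell in product((0, 1), repeat=3)}

print(interaction_information(xor))          # positive: synergy
print(interaction_information(copy))         # negative: redundancy
print(interaction_information(independent))  # zero
```

The three test distributions are the examples named in the list: XOR, X = Y = Z, and independent nodes.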

As argued in Schneidmann et al. (2003b), Williams and Beer (2010), and Griffith and Koch (2014), however, this is not the whole picture. There are systems which exhibit both synergetic and redundant behavior, and interaction information only quantifies the difference of synergy and redundancy, with a priori no way to tell the two apart. In a system with highly correlated inputs, for example, the synergy would remain unseen, as it would be canceled by the redundancy. Moreover, this picture breaks down for more than three nodes. Another problem pointed out in Schneidmann et al. (2003b) and Amari (2001) is that redundancy (as for example in X = Y = Z) can be described in terms of pairwise interactions, not triple, while synergy (as in the XOR function) is purely threewise. Therefore, I compares and mixes information quantities of different nature.

A detailed explanation of the problem for two inputs is presented in Williams and Beer (2010) and it yields a decomposition (“Partial Information Decomposition,” PID) as follows: there exist two non-negative quantities, Synergy and Redundancy, such that

I (X, Y : Z) = I (X : Z) + I (Y : Z) + Syn - Red,

(4)

or equivalently:

I (X : Y : Z) = Syn - Red .

(5)

Moreover, they define unique information for the inputs X and Y as

UI (X) = I (X : Z) - Red,

(6)

UI (Y) = I (Y : Z) - Red,

(7)

so that the total mutual information is decomposed positively:

I (X, Y : Z) = UI (X) + UI (Y) + Red + Syn .

(8)

What these quantities intuitively mean is

• Redundancy – information available in both inputs;

• Unique information – information available only in one of the inputs;

• Synergy – information available only when both inputs are present, arising purely from their interaction.

In this formulation, if one finds a measure of synergy, one can automatically define compatible measures of redundancy and unique information (and vice versa), provided that the measure of synergy is always larger than or equal to I(X:Y:Z), and that the resulting measure of redundancy is less than or equal to both I(X:Z) and I(Y:Z). Synergy, redundancy, and unique information are defined on a channel, and choosing a different channel with the same joint distribution (e.g., (Y, Z) → X) may yield a different decomposition.
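The bookkeeping in Equations (2) and (5) to (8) can be sketched in a few lines of Python. The function name `pid_from_synergy` and the candidate synergy values are hypothetical and chosen only to illustrate the admissibility conditions stated above.

```python
def pid_from_synergy(i_joint, i_x, i_y, syn):
    """Derive Red, UI(X), UI(Y) from a candidate synergy value (all in bits)."""
    interaction = i_joint - i_x - i_y      # Equation (2)
    red = syn - interaction                # Equation (5)
    ui_x = i_x - red                       # Equation (6)
    ui_y = i_y - red                       # Equation (7)
    parts = {"UI_X": ui_x, "UI_Y": ui_y, "Red": red, "Syn": syn}
    # The decomposition is only acceptable if every part is non-negative.
    admissible = all(v >= 0 for v in parts.values())
    return parts, admissible

# XOR with uniform inputs: I(X,Y:Z) = 1 bit and I(X:Z) = I(Y:Z) = 0.
parts, ok = pid_from_synergy(1.0, 0.0, 0.0, syn=1.0)
# A candidate synergy smaller than I(X:Y:Z) forces a negative redundancy.
bad_parts, bad_ok = pid_from_synergy(1.0, 0.0, 0.0, syn=0.5)

print(parts, ok)
print(bad_parts, bad_ok)
```

In both calls the four parts sum to I(X,Y:Z), as Equation (8) requires; only the first call yields non-negative parts.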

Griffith and Koch (2014) present an overview of previous measures of synergy and their shortcomings in standard examples. The same paper then presents a newer measure of synergy, defined equivalently in Bertschinger et al. (2014) as

CI (X, Y; Z) : = I (X, Y : Z) - \min_{p^{*} \in \land} I_{p^{*}} (X, Y : Z),

(9)

where ∧ is the space of distributions with prescribed marginals:

\land = \{q \in P (X, Y, Z) | q (X, Z) = p (X, Z), q (Y, Z) = p (Y, Z)\} .

(10)

This measure satisfies interesting properties (proven in Griffith and Koch (2014) and Bertschinger et al. (2014)), which make it compatible with Williams and Beer’s PID, and with the intuition in most examples. However, it was proven in Rauh et al. (2015) that such an approach cannot work in the desired way for more than three nodes (two inputs).

Our approach uses information geometry (Amari and Nagaoka, 2000), extending previous work on hierarchical decompositions (Amari, 2001) and complexity (Ay, 2015). (Compare the related approach on information decomposition pursued in Harder et al. (2013).) The main tools of the present paper are KL-projections and the Pythagorean relation that they satisfy. This allows us, as in Amari (2001), to form hierarchies of interactions of different orders in a geometrical way. In the present problem, we decompose mutual information between inputs and outputs of a channel k, for two inputs, as

I (X, Y : Z) = d_{1} (k) + d_{2} (k),

(11)

where d₂ quantifies synergy (as in Equation (8)), and d₁ integrates all the lower order terms (UI, Red), quantifying the so-called union information (see Griffith and Koch (2014)). One may want to use this measure of synergy to form a complete decomposition analogous to Equation (8), but this does not work, as in general it is not true that d₂ ≥ I(X:Y:Z), which Equation (5) requires for the redundancy to be non-negative. For this reason, we keep the decomposition more coarse, and we do not divide union information into unique and redundant.

For more inputs X₁, … , X_N, the decomposition generalizes to

I (X_{1}, \dots, X_{N} : Z) = d_{1} (k) + \dots + d_{N} (k) = \sum_{i = 1}^{N} d_{i} (k),

(12)

where higher orders of synergy appear.

Until now, there seems to be no way of rewriting the decomposition of Griffith and Koch (2014) and Bertschinger et al. (2014) in a way consistent with information geometry, and more generally, Williams and Beer’s PID seems hard to write as a geometric decomposition. A comparison between d₂ and the measure CI of Griffith and Koch (2014) and Bertschinger et al. (2014) is presented in Section 5. There we show that d₂ ≤ CI, and we argue, with a numerical example, that CI overestimates synergy in at least one case.

For a small number of inputs (≲ 5), our quantities are easily computable with the standard algorithms of information geometry (like iterative scaling (Csiszár and Shields, 2004)). This allowed us to get precise quantities for all the examples considered.

1.1. Technical Definitions

We consider a set of N input nodes V = {1, … , N}, taking values in the sets X₁, … , X_N, and an output node, taking values in the set Y. We write the input globally as X := X₁ × … × X_N. For example, in biology, Y can be the phenotype and X can be a collection of genes determining Y. We denote by F(Y) the set of real functions on Y and by P(X) the set of probability measures on X.

We can model the channel from X to Y as a Markov kernel (called also stochastic kernel, transition kernel, or stochastic map) k, that assigns to each x ∈ X a probability measure on Y (for a detailed treatment, see Kakihara (1999)). Here, we will consider only finite systems, so we can think of a channel simply as a transition matrix (or stochastic matrix) whose rows sum to one.

k (x; y) \geq 0 \forall x, y; \sum_{y} k (x; y) = 1 \forall x .

(13)

The space of channels from X to Y will be denoted by K(X;Y). We will denote by X and Y also the corresponding random variables, whenever this does not lead to confusion.

Conditional probabilities define channels: if p(X,Y) ∈ P(X,Y) and the marginal p(X) is strictly positive, then p(Y |X) ∈ K(X;Y) is a well-defined channel. Vice versa, if k ∈ K(X;Y), given p ∈ P(X) we can form a well-defined joint probability:

pk (x, y) : = p (x) k (x; y) \forall x, y .

(14)

An “input distribution” p ∈ P(X) is crucial also to extend the notion of divergence from probability distributions to channels. The most natural way of doing it is the following.

Definition 1: Let p ∈ P(X), let k, m ∈ K(X;Y). Then

D_{p} (k | | m) : = \sum_{x, y} p (x) k (x; y) \log \frac{k (x; y)}{m (x; y)} .

(15)

Defined this way, D_p is affine in p. Moreover, it has an important compatibility property. Let p,q be joint probability distributions on X × Y, and let D be the KL-divergence. Then

D (p (X, Y) | | q (X, Y)) = D (p (X) | | q (X)) + D_{p (X)} (p (Y | X) | | q (Y | X)) .

(16)
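For finite systems, Definition 1 and the compatibility property (16) can be checked directly. The sketch below stores an input distribution as a list and a channel as a row-stochastic nested list; the numbers are hypothetical and the function names are assumptions made for illustration.

```python
from math import log2

def channel_divergence(p, k, m):
    """D_p(k || m) in bits, Equation (15); p[x] is the input, k[x][y] and m[x][y] are channels."""
    return sum(p[x] * k[x][y] * log2(k[x][y] / m[x][y])
               for x in range(len(p)) for y in range(len(k[x]))
               if p[x] > 0 and k[x][y] > 0)

def kl(a, b):
    """KL-divergence D(a || b) in bits between two probability vectors."""
    return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# Hypothetical inputs and channels with two input values and two output values.
p, k = [0.3, 0.7], [[0.9, 0.1], [0.2, 0.8]]
q, m = [0.5, 0.5], [[0.6, 0.4], [0.5, 0.5]]

# Joint distributions built as in Equation (14), flattened over (x, y).
joint_pk = [p[x] * k[x][y] for x in range(2) for y in range(2)]
joint_qm = [q[x] * m[x][y] for x in range(2) for y in range(2)]

lhs = kl(joint_pk, joint_qm)                    # left-hand side of (16)
rhs = kl(p, q) + channel_divergence(p, k, m)    # right-hand side of (16)
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding, which is the chain rule expressed by Equation (16).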

We will now illustrate our geometric ideas in channels with one, two, and three input nodes, and then we present some examples. The general case will be addressed in Section 4.

2. Geometric Idea of Synergy

2.1. Mutual Information as Motivation

It is a well-known fact in information theory that Shannon’s mutual information can be written as a KL-divergence:

I_{p} (X : Y) = D (p (X, Y) | | p (X) p (Y)) .

(17)

From the point of view of information geometry, this can be interpreted as a “distance” between the real distribution and a product distribution that has exactly the same marginals, but maximal entropy (see Figure 1). In other words, we have

I_{p} (X : Y) = \inf_{\begin{matrix} q \in P (X) \\ r \in P (Y) \end{matrix}} D (p (X, Y) | | q (X) r (Y)) .

(18)

Figure 1. For two binary nodes, the family of product distributions is a surface in a 3-dimensional simplex.

The distribution given by p(X)p(Y) is optimal in the sense that

p (X) p (Y) = \underset{\begin{matrix} q \in P (X) \\ r \in P (Y) \end{matrix}}{\arg \min} D (p (X, Y) | | q (X) r (Y)) .

(19)

The divergence between p and a submanifold is, as usual in geometry, the “distance” between p and the “closest point” on that submanifold, which in our case is the geodesic projection w.r.t. the mixture connection.

2.2. Extension to Channels

We can use the same insight with channels. Instead of a joint distribution on N nodes, we consider a channel from an input X to an output Y. Suppose we have a family $E$ of channels, and a channel k that may not be in $E$ . Then, just as in geometry, we can define the “distance” between k and $E$ .

Definition 2: Let p be an input distribution. The divergence between a channel k and a family of channels $E$ is given by

D_{p} (k | | E) : = \inf_{m \in E} D_{p} (k | | m) .

(20)

If the minimum is uniquely realized, we call the channel

π_{E} k : = \underset{m \in E}{\arg \min} D_{p} (k | | m)

(21)

the KL-projection of k on $E$ (and simply “a” KL-projection if it is not unique).

We will always work with compact families, so the minima will always be realized, and for strictly positive p they will be unique (see Section 4 for the details).

We will consider families $E$ for which the KL-divergence satisfies a Pythagorean equality (see Figure 2 below for some intuition):

D_{p} (k | | m) = D_{p} (k | | π_{E} k) + D_{p} (π_{E} k | | m)

(22)

for every m ∈ $E$ . These families (technically, closures of exponential families) are defined in Section 4.

Figure 2. Illustration of the Pythagoras theorem for projections.

2.3. One Input

Consider first one input node X, with input distribution p(X), and one output node Y. A constant channel k in K(X;Y) is a channel whose entries do not depend on X (more precisely: k(x;y) = k(x′;y) for any x,x′, y). This denomination is motivated by the following properties:

• They correspond to channels that do not use any information from the input to generate the output.

• The output distribution given by k is a probability distribution on Y which does not depend on X.

• Deterministic constant channels are precisely constant functions.

We call $E$ ₀ the family of constant channels. Take now any channel k ∈ K(X;Y). If we want to quantify the dependence in k of Y on X, we can then look at the divergence of k from the constant channels:

d_{1} (k) : = D_{p} (k | | E_{0}) .

(23)

The minimum is realized in $π_{E_{0}} k$ . We have that

d_{1} (k) = D_{p} (k | | π_{E_{0}} k) = \sum_{x, y} p (x) k (x; y) \log \frac{k (x; y)}{π_{E_{0}} k (y)}

(24)

= H_{p π_{E_{0}} k} (Y) - H_{pk} (Y | X) = I_{pk} (X : Y),

(25)

so that consistently with our intuition, the dependence of Y on X is just the mutual information. From the channel point of view, it is simply the divergence from the constant channels. (A rigorous calculation is done in Section 4.)
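The one-input case can be reproduced numerically. The sketch below assumes, as Equations (24) and (25) indicate, that the projection onto the constant channels is the output marginal of pk repeated in every row; the numbers and function names are hypothetical.

```python
from math import log2

def channel_divergence(p, k, m):
    """D_p(k || m) in bits for an input p[x] and channels k[x][y], m[x][y]."""
    return sum(p[x] * k[x][y] * log2(k[x][y] / m[x][y])
               for x in range(len(p)) for y in range(len(k[x]))
               if p[x] > 0 and k[x][y] > 0)

def project_constant(p, k):
    """Constant channel whose rows all equal the output marginal sum_x p(x) k(x; y)."""
    out = [sum(p[x] * k[x][y] for x in range(len(p))) for y in range(len(k[0]))]
    return [out[:] for _ in p]

def mutual_information(p, k):
    """I(X : Y) = H(Y) - H(Y | X) in bits for the joint p(x) k(x; y)."""
    out = project_constant(p, k)[0]
    h_y = -sum(v * log2(v) for v in out if v > 0)
    h_y_given_x = -sum(p[x] * k[x][y] * log2(k[x][y])
                       for x in range(len(p)) for y in range(len(k[x]))
                       if p[x] * k[x][y] > 0)
    return h_y - h_y_given_x

p = [0.3, 0.7]                       # hypothetical input distribution
k = [[0.9, 0.1], [0.2, 0.8]]         # hypothetical channel
k0 = project_constant(p, k)
d1 = channel_divergence(p, k, k0)

other = [[0.5, 0.5], [0.5, 0.5]]     # some other constant channel
print(d1, mutual_information(p, k), channel_divergence(p, k, other))
```

The first two printed numbers coincide, and any other constant channel, such as `other`, gives a divergence at least as large as d₁.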

2.4. Two Inputs

Consider now two input nodes with input probability p and one output node. We can again define the family $E$ ₀ of constant channels, and the same calculations give

D_{p} (k | | E_{0}) = I_{pk} (X_{1}, X_{2} : Y) .

(26)

This time, though, we can say a lot more: the quantity above can be decomposed. In analogy with the independence definition for probability distributions, we would like to define a split channel as a product channel of its parts: p(y|x₁, x₂) = p(y|x₁)p(y|x₂). Unfortunately, the term on the right would be in general not normalized, so we replace the condition by a weaker one. We call the channel k(X₁, X₂; Y) split if it can be written as

k (x_{1}, x_{2}; y) = ϕ_{0} (x_{1}, x_{2}) ϕ_{1} (x_{1}; y) ϕ_{2} (x_{2}; y)

(27)

for some functions ϕ₀, ϕ₁, and ϕ₂, which in general are not themselves channels (in particular, ϕ_i(x_i; y) ≠ p(y|x_i)). We call $E$ ₁ the family of split channels. This family corresponds to those channels that do not have any synergy. This is a special case of an exponential family, analogous to the family of product distributions of Figure 1. The examples “single node” and “split channel” in the next section belong exactly to this family. Take now any channel k(X₁, X₂; Y). In analogy with mutual information, we call synergy the divergence:

d_{2} (k) : = D_{p} (k | | E_{1}) .

(28)

Simply speaking, our synergy is quantified as the deviation of the channel from the set $E$ ₁ of channels without synergy.

We can now project k first to $E$ ₁, and then to $E$ ₀. Since $E$ ₀ is a subfamily of $E$ ₁, the following Pythagoras relation follows from Equation (22):

D_{p} (k | | π_{E_{0}} k) = D_{p} (k | | π_{E_{1}} k) + D_{p} (π_{E_{1}} k | | π_{E_{0}} k) .

(29)

If, in analogy with the one-input case, we call the last quantity d₁, we get from Equations (26) and (28):

I_{pk} (X_{1}, X_{2} : Y) = d_{2} (k) + d_{1} (k) .

(30)

The term d₁ measures how much information comes from single nodes (but it does not tell which nodes). The term d₂ measures how much information comes from the synergy of X₁ and X₂ in the channel. The example “XOR” in the next section will show this.

If we call $E$ ₂ the whole K(X;Y), we get $E$ ₀ ⊂ $E$ ₁ ⊂ $E$ ₂ and

d_{i} (k) : = D_{p} (π_{E_{i}} k | | π_{E_{i - 1}} k) .

(31)
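A minimal sketch of the two-input decomposition follows. It uses iterative proportional fitting on the joint distribution, which is one standard form of iterative scaling and is not claimed to be the authors' implementation. It assumes that the joint p·π_{E₁}k is obtained by matching the marginals on (X₁, X₂), (X₁, Y), and (X₂, Y), in line with the constraint set of Equation (60), and that p·π_{E₀}k matches the input distribution and the output marginal. All function names are hypothetical.

```python
from itertools import product
from math import log2

def marginal(q, idx):
    """Marginal of the joint table `q` on the coordinates `idx`."""
    out = {}
    for cell, v in q.items():
        key = tuple(cell[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

def kl(a, b):
    """KL-divergence D(a || b) in bits, with the convention 0 log 0 = 0."""
    return sum(v * log2(v / b[c]) for c, v in a.items() if v > 0)

def fit(target, generators, sweeps=200):
    """Iterative proportional fitting: start uniform, match each listed marginal in turn."""
    q = {c: 1.0 / len(target) for c in target}
    for _ in range(sweeps):
        for g in generators:
            want, have = marginal(target, g), marginal(q, g)
            for c in q:
                key = tuple(c[i] for i in g)
                q[c] = q[c] * want[key] / have[key] if have[key] > 0 else 0.0
    return q

def decompose(p, f):
    """(d1, d2) in bits for a deterministic binary channel y = f(x1, x2) with input p."""
    joint = {(x1, x2, y): p[(x1, x2)] * (1.0 if f(x1, x2) == y else 0.0)
             for (x1, x2) in p for y in (0, 1)}
    q0 = fit(joint, [(0, 1), (2,)])              # joint of p with the projection on E_0
    q1 = fit(joint, [(0, 1), (0, 2), (1, 2)])    # joint of p with the projection on E_1
    return kl(q1, q0), kl(joint, q1)

uniform = {x: 0.25 for x in product((0, 1), repeat=2)}
print(decompose(uniform, lambda a, b: a ^ b))   # XOR: all information is synergy
print(decompose(uniform, lambda a, b: a))       # copy of X1: all information is single node
```

For XOR the sketch returns d₁ ≈ 0 and d₂ ≈ 1 bit, and for the copy of X₁ it returns d₁ ≈ 1 bit and d₂ ≈ 0; in both cases d₁ + d₂ equals the mutual information, as in Equation (30).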

2.5. Three Inputs

Consider now three nodes X₁, X₂, and X₃ with input probability p and a channel k. We have again

D_{p} (k | | E_{0}) = I_{pk} (X_{1}, X_{2}, X_{3} : Y) .

(32)

This time we can decompose the mutual information in different ways. We can, for example, look at split channels, i.e., in the form:

k (x_{1}, x_{2}, x_{3}; y) = ϕ_{0} (x) ϕ_{1} (x_{1}; y) ϕ_{2} (x_{2}; y) ϕ_{3} (x_{3}; y)

(33)

for some ϕ₀, ϕ₁, ϕ₂, and ϕ₃. As in the previous case, we call this family $E$ ₁. Or we can look at more interesting channels, the ones in the form:

k (x_{1}, x_{2}, x_{3}; y) = ϕ_{0} (x) ϕ_{12} (x_{1}, x_{2}; y) ϕ_{13} (x_{1}, x_{3}; y) ϕ_{23} (x_{2}, x_{3}; y)

(34)

for some ϕ₀, ϕ₁₂, ϕ₁₃, and ϕ₂₃. We call this family $E$ ₂, and it is easy to see that

E_{0} \subset E_{1} \subset E_{2} \subset E_{3},

(35)

where $E$ ₀ denotes again the constant channels, and $E$ ₃ denotes the whole K(X;Y). We define again

d_{i} (k) : = D_{p} (π_{E_{i}} k | | π_{E_{i - 1}} k) .

(36)

This time, the Pythagorean relation can be nested, and it gives us

I_{pk} (X_{1}, X_{2}, X_{3} : Y) = d_{3} (k) + d_{2} (k) + d_{1} (k) .

(37)

The difference between pairwise synergy and threewise synergy is shown in the “XOR” example in the next section.

Now that we have introduced the measure for a small number of inputs, we can study the examples from the literature (Griffith and Koch, 2014) and show that our measure is consistent with the intuition. The general case is introduced more rigorously in Section 4.

3. Examples

Here, we present some examples of decomposition for well-known channels. All the quantities have been computed using an algorithm analogous to iterative scaling (as in Csiszár and Shields (2004)).

3.1. Single Node Channel

The simplest example is a channel that depends only on X₁, i.e.,

I (X : Y) = I (X_{1} : Y) .

(38)

For example, consider 3 binary input nodes X₁, X₂, and X₃ with constant input probability and one binary output node Y which is an exact copy of X₁.

Then, we have exactly one bit of single node information and no higher order terms (see Figure 3). Geometrically, k lies in $E$ ₁, so the only non-zero divergence in equation (37) is d₁(k). As one would expect, d₂(k) and d₃(k) vanish, as there is no synergy of orders 2 and 3.

Figure 3. Synergies of different orders for the single-node channel, Example 3.1.

3.2. Split Channel

The second easiest example is a more general channel which obeys equation (33). In particular, consider 3 binary input nodes X₁, X₂, and X₃ with constant input probability (so, the x_i are independent), and output Y = X₁ × X₂ × X₃. As channel, we simply take the identity map (x₁, x₂, x₃) ↦ (x₁, x₂, x₃) ∈ Y. In this particular case:

I (X : Y) = \sum_{i} I (X_{i} : Y) .

(39)

We have 3 bits of mutual information (see Figure 4), which are all single node (but from different nodes). Since

k (x_{1}, x_{2}, x_{3}; y) = ϕ_{1} (x_{1}; y_{1}) ϕ_{2} (x_{2}; y_{2}) ϕ_{3} (x_{3}; y_{3}),

(40)

which is a special case of Equation (33), k ∈ $E$ ₁, and so d₂(k) and d₃(k) in Equation (37) are again zero.

Figure 4. Synergies of different orders for the split channel, Example 3.2.

3.3. Correlated Inputs

Consider 3 perfectly correlated binary nodes, each one with uniform marginal probability. As output take a perfect copy of one (hence, all) of the inputs. We have again one bit of mutual information, which could come from any of the nodes, but no synergy, as no two nodes are interacting in the channel. The input distribution has correlation, but this has no effect on the channel, since the channel is simply copying the value of X₁ (or X₂ or X₃, equivalently). Therefore, again k ∈ $E$ ₁. Of the terms in Equation (37), again the only non-zero is d₁(k) (see Figure 5).

Figure 5. Synergies of different orders for correlated inputs, Example 3.3.

This example is used in the literature to motivate the notion of redundancy. A “redundant channel” is in our decomposition exactly equivalent to a single node channel, since it contains exactly the same amount of information.

3.4. Parity (XOR)

The standard example of synergy is given by the XOR function and more generally by the parity function between two or more nodes.

For example, consider 3 binary input nodes X₁, X₂, and X₃ with constant input probability, and one binary output node Y which is given by X₁ ⊻ X₂. We have 1 bit of mutual information, which is purely arising from a pairwise synergy (of X₁ and X₂), so this time k ∈ $E$ ₂. The function XOR is pure synergy, so d₂(k) is the only non-zero term in Equation (37) (see Figure 6).

Figure 6. Synergies of different orders for the binary XOR function, Example 3.4.

If instead Y is given by the threewise parity function, or X₁ ⊻ X₂ ⊻ X₃, we have again 1 bit of mutual information, which now is purely arising from a threewise synergy, so here k ∈ $E$ ₃, and the only non-zero term in Equation (37) is d₃(k) (see Figure 7).

Figure 7. Synergies of different orders for the threewise parity function, Example 3.4.

In these examples, there are no terms of lower order synergy, but the generic elements of $E$ ₂ and $E$ ₃ usually do contain a non-zero lower part. Consider, for instance, the next example.

3.5. AND and OR

The other two standard logic gates, AND and OR, share the same decomposition. Consider two binary nodes X₁ and X₂ with uniform probability, and let Y be X₁ ∨ X₂ (or X₁ ∧ X₂). There is again one bit of mutual information, which comes mostly from single nodes, but also from synergy (see Figure 8).

Figure 8. Synergies of different orders for the AND (and OR) function, Example 3.5.

Geometrically, this means that AND and OR are channels in $E$ ₂, which lie close to the submanifold $E$ ₁.

3.6. XorLoses

Here, we present a slightly more complicated example, coming from Griffith and Koch (2014). We have three binary nodes X₁, X₂, and X₃, where X₁ and X₂ have uniform probabilities, and an output node Y = X₁ ⊻ X₂, just like in the “XOR” example. Now we take X₃ to be perfectly correlated with Y = X₁ ⊻ X₂, so that Y could get the information either from X₃ or from the synergy between X₁ and X₂. We have one bit of mutual information, which can be seen as entirely coming from X₃, and so the synergy between X₁ and X₂ is not adding anything (see Figure 9).

Figure 9. Synergies of different orders for the XorLoses channel, Example 3.6.

3.7. XorDuplicate

Again from Griffith and Koch (2014), we have 3 binary nodes X₁, X₂, and X₃, where X₁ and X₂ have uniform probabilities, while X₃ = X₁. The output is X₁ ⊻ X₂ = X₃ ⊻ X₂, so it could get the information either from the synergy between X₁ and X₂ or from the synergy between X₂ and X₃. There is one bit of mutual information, which is coming from a pairwise interaction (see Figure 10). Again, it does not matter which pair of nodes it comes from.

Figure 10. Synergies of different orders for the XorDuplicate channel, Example 3.7.

It should be clear from the examples here that decomposing only by order, and not by the specific subsets, is crucial. For example, in the “correlated inputs” example (Section 3.3), there is no natural way to decide from which single node the information comes, even if it is clear that the interaction is of order 1.

4. General Case

Here, we give a general formulation, for N inputs, of the quantities defined in Section 2. As in the Introduction, we call the set of input nodes V of cardinality N, and we consider a subset I of the nodes. We denote the joint random variable (X_i, i ∈ I) by X_I, and we denote the complement of I in V by I^c. The case N = 3 in Section 2 should motivate the following definition.

Definition 3: Let I ⊆ V. We call F_I the space of functions that only depend on X_I and Y :

F_{I} : = \{f \in F (X, Y) | f (x_{I}, x_{I^{c}}; y) = f (x_{I}, {x'}_{I^{c}}; y) \forall x_{I^{c}}, {x'}_{I^{c}}\} .

(41)

Let 0 ≤ i ≤ N. We call $E$ _i the family of channels which can be written as a product of functions of F_I with the cardinality of I at most i:

E_{i} : = cl \{k \in K (X; Y) | \exists ϕ_{I} \in F_{I}, ϕ_{0} \in F (X) | k = ϕ_{0} \prod_{I} ϕ_{I}; | I | \leq i\},

(42)

where cl denotes the closure in K(X;Y). Intuitively, this means that $E$ _i does not only contain terms in the form given in the curly brackets, but also limits of such terms. Stated differently, the closure of a set includes not only the set itself, but also its boundary. This is important because when we project to a family, the projection may lie on the boundary. In order for the result to exist, the boundary must then be included.

This way:

• $E$ ₀ is the space of constant channels;

• $E$ _N is the whole K(X;Y);

• $E$ _i ⊆ $E$ _j if and only if i ≤ j;

• For N ≤ 3 we recover exactly the quantities of Section 2.

The family $E$ _i is also the closure of the family in the form:

\{\frac{1}{Z (X)} \exp (\sum_{I} f_{I} (X; Y)) | f_{I} \in F_{I}; | I | \leq i\},

(43)

where

Z (x) : = \sum_{y} \exp (\sum_{I} f_{I} (x; y)) .

(44)

Such families are known in the literature as exponential families (see, for example, Amari and Nagaoka (2000)). In particular, $E$ _i is compact (for finite N), so that the infimum of any function on $E$ _i is always a minimum. This means that for a channel k and an input distribution p:

D_{p} (k | | E_{i}) : = \inf_{m \in E_{i}} D_{p} (k | | m) = \min_{m \in E_{i}} D_{p} (k | | m)

(45)

always exists. If it is unique, for example if p is strictly positive, we define the unique KL-projection as

π_{E_{i}} k : = \underset{m \in E_{i}}{\arg \min} D_{p} (k | | m) .

(46)

$π_{E_{i}} k$ has the property that it defines the same output probability on Y as k.

Definition 4: Let k ∈ K(X;Y), let 1 ≤ i ≤ N. Then the i-wise synergy of k is (if the KL-projections are unique):

d_{i} (k) : = D_{p} (π_{E_{i}} k | | π_{E_{i - 1}} k) .

(47)

For more clarity, we call the 1-wise synergy “single node information” or “single-node dependence.”

For k ∈ K(X;Y) = $E$ _N, we can look at its divergence from $E$ ₀. If we denote $π_{E_{0}} k$ by k₀:

D_{p} (k | | E_{0}) = D_{p} (k | | k_{0}) = \sum_{x, y} p (x) k (x; y) \log \frac{k (x; y)}{k_{0} (y)} .

(48)

If k is not strictly positive, we take the convention 0 log(0/0) = 0, and we discard the zero terms from the sum. Since the output distributions are the same, but k₀ is constant in x, it cannot happen that for some (x;y), k₀(x;y) = 0 but k(x;y) ≠ 0. (The very same is true for all KL-projections $π_{E_{i}} k$ , since $D_{p} (π_{E_{i}} k | | k_{0}) \leq D_{p} (k | | k_{0})$ .) For all other terms, Equation (48) becomes:

D_{p} (k | | E_{0}) = \sum_{x, y} p (x) k (x; y) \log k (x; y) - \sum_{x, y} p (x) k (x; y) \log k_{0} (y)

(49)

= - H_{pk} (Y | X) - \sum_{y} k_{*} p (y) \log k_{0} (y)

(50)

= - H_{pk} (Y | X) - \sum_{y} {k_{0}}_{*} p (y) \log k_{0} (y)

(51)

= - H_{pk} (Y | X) + H_{{k_{0}}_{*} p} (Y) = - H_{pk} (Y | X) + H_{k_{*} p} (Y)

(52)

= I_{pk} (X : Y) .

(53)

On the other hand, the Pythagorean relation (22) implies:

D_{p} (k | | k_{0}) = D_{p} (k | | π_{E_{N - 1}} k) + D_{p} (π_{E_{N - 1}} k | | k_{0}),

(54)

and iterating:

D_{p} (k | | k_{0}) = D_{p} (k | | π_{E_{N - 1}} k) + D_{p} (π_{E_{N - 1}} k | | π_{E_{N - 2}} k) + \dots + D_{p} (π_{E_{1}} k | | k_{0}) .

(55)

In the end, we get

I (X : Y) = \sum_{i = 1}^{N} D_{p} (π_{E_{i}} k | | π_{E_{i - 1}} k) = \sum_{i = 1}^{N} d_{i} (k) .

(56)

Every term of this decomposition is non-negative, and the decomposition depends on the input distribution. The terms in Equation (56) can be in general difficult to compute exactly. Nevertheless, they can be approximated with iterative procedures.
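One such iterative procedure is sketched below for N binary inputs and a deterministic binary output. It is an illustration under stated assumptions rather than the authors' code: iterative proportional fitting is used on the joint distribution, and the joint p·π_{E_i}k is assumed to be characterized by matching the input distribution together with every marginal on (X_I, Y) with |I| = i. The names `fit` and `synergy_profile` are hypothetical.

```python
from itertools import combinations, product
from math import log2

def marginal(q, idx):
    """Marginal of the joint table `q` on the coordinates `idx`."""
    out = {}
    for cell, v in q.items():
        key = tuple(cell[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

def kl(a, b):
    """KL-divergence D(a || b) in bits, with the convention 0 log 0 = 0."""
    return sum(v * log2(v / b[c]) for c, v in a.items() if v > 0)

def fit(target, generators, sweeps=200):
    """Iterative proportional fitting: start uniform, match each listed marginal in turn."""
    q = {c: 1.0 / len(target) for c in target}
    for _ in range(sweeps):
        for g in generators:
            want, have = marginal(target, g), marginal(q, g)
            for c in q:
                key = tuple(c[i] for i in g)
                q[c] = q[c] * want[key] / have[key] if have[key] > 0 else 0.0
    return q

def synergy_profile(p, f, n):
    """[d_1, ..., d_n] in bits for a deterministic binary channel y = f(x) on n binary inputs."""
    joint = {x + (y,): p.get(x, 0.0) * (1.0 if f(x) == y else 0.0)
             for x in product((0, 1), repeat=n) for y in (0, 1)}
    inputs = tuple(range(n))
    proj = []
    for i in range(n + 1):
        # Level i: keep the input distribution and every (X_I, Y) marginal with |I| = i.
        gens = [inputs] + [I + (n,) for I in combinations(range(n), i)]
        proj.append(fit(joint, gens))
    return [kl(proj[i], proj[i - 1]) for i in range(1, n + 1)]

uniform = {x: 0.125 for x in product((0, 1), repeat=3)}
xor_loses = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}   # X3 = X1 xor X2
xor_duplicate = {(a, b, a): 0.25 for a, b in product((0, 1), repeat=2)}   # X3 = X1

print(synergy_profile(uniform, lambda x: x[0], 3))                 # single node channel
print(synergy_profile(uniform, lambda x: x[0] ^ x[1], 3))          # pairwise XOR
print(synergy_profile(uniform, lambda x: x[0] ^ x[1] ^ x[2], 3))   # threewise parity
print(synergy_profile(xor_loses, lambda x: x[0] ^ x[1], 3))        # XorLoses
print(synergy_profile(xor_duplicate, lambda x: x[0] ^ x[1], 3))    # XorDuplicate
```

On these five channels the sketch puts the single bit of mutual information at order 1, 2, 3, 1, and 2, respectively, in agreement with the corresponding examples of Section 3.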

5. Comparison with Two Recent Approaches

The measure of synergy, or respectively complementary information, defined in Griffith and Koch (2014) and Bertschinger et al. (2014) is

C I_{p} (Y : X_{1}, X_{2}) : = I_{p} (Y : X_{1}, X_{2}) - \min_{p^{*} \in \land} I_{p^{*}} (Y : X_{1}, X_{2}),

(57)

where ∧ is the space of prescribed marginals:

\land = \{q \in P (X_{1}, X_{2}, Y) | q (X_{1}, Y) = p (X_{1}, Y), q (X_{2}, Y) = p (X_{2}, Y)\} .

(58)

Our measure of synergy can be written, for two inputs, in a similar form:

d_{2} (k) = D_{p} (k | | π_{E_{1}} k) = I_{p} (Y : X_{1}, X_{2}) - \min_{p^{*} \in △} I_{p^{*}} (Y : X_{1}, X_{2}),

(59)

where △, in addition to the constraints of ∧, prescribes also the input:

\begin{align} △ = \{q \in P (X_{1}, X_{2}, Y) | {} & q (X_{1}, Y) = p (X_{1}, Y), q (X_{2}, Y) = p (X_{2}, Y), \\ & q (X_{1}, X_{2}) = p (X_{1}, X_{2})\} . \end{align}

(60)

Clearly △ ⊆ ∧, so

\min_{p^{*} \in △} I_{p^{*}} (Y : X_{1}, X_{2}) \geq \min_{p^{*} \in \land} I_{p^{*}} (Y : X_{1}, X_{2}),

(61)

which implies that

d_{2} (k) \leq C I_{p} (Y : X_{1}, X_{2}) .

(62)

We argue that not prescribing the input leads to overestimating synergy, because the subtraction in Equation (57) includes a possible difference in the correlation of the input distributions.
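Equation (59) can be checked numerically on a small example. The sketch below fits the three pairwise marginals that define △ by iterative proportional fitting (an assumed stand-in for the iterative scaling mentioned in Section 3), and compares d₂, computed as a divergence, with the drop in mutual information. The input distribution and channel are hypothetical numbers chosen only for illustration.

```python
from math import log2

def marginal(q, idx):
    """Marginal of the joint table `q` on the coordinates `idx`."""
    out = {}
    for cell, v in q.items():
        key = tuple(cell[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

def kl(a, b):
    """KL-divergence D(a || b) in bits, with the convention 0 log 0 = 0."""
    return sum(v * log2(v / b[c]) for c, v in a.items() if v > 0)

def fit(target, generators, sweeps=500):
    """Iterative proportional fitting: start uniform, match each listed marginal in turn."""
    q = {c: 1.0 / len(target) for c in target}
    for _ in range(sweeps):
        for g in generators:
            want, have = marginal(target, g), marginal(q, g)
            for c in q:
                key = tuple(c[i] for i in g)
                q[c] = q[c] * want[key] / have[key] if have[key] > 0 else 0.0
    return q

def mutual_information(q):
    """I(Y : X1, X2) in bits for a joint table over (x1, x2, y)."""
    px, py = marginal(q, (0, 1)), marginal(q, (2,))
    return sum(v * log2(v / (px[c[:2]] * py[c[2:]])) for c, v in q.items() if v > 0)

# Hypothetical strictly positive input distribution and channel.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
k = {(0, 0): (0.9, 0.1), (0, 1): (0.4, 0.6), (1, 0): (0.3, 0.7), (1, 1): (0.8, 0.2)}
joint = {x + (y,): p[x] * k[x][y] for x in p for y in (0, 1)}

constraints = [(0, 1), (0, 2), (1, 2)]     # the three marginals prescribed in (60)
q1 = fit(joint, constraints)
d2 = kl(joint, q1)                         # D_p(k || projection on E_1)
gap = mutual_information(joint) - mutual_information(q1)
print(d2, gap)
```

Up to the accuracy of the fitting, the two printed values agree, and the fitted joint has the same three pairwise marginals as pk.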

For example, consider X₁, X₂, Y binary and correlated, but not perfectly correlated. (For perfectly correlated nodes, as in Section 3, △ = ∧, so there is no difference between the two measures.) In detail, consider the channel:

k_{β} (x_{1}, x_{2}; y) : = \frac{\exp (β y (x_{1} + x_{2}))}{\sum_{y'} \exp (β y' (x_{1} + x_{2}))},

(63)

and the input distribution:

p_{α} (x_{1}, x_{2}) : = \frac{\exp (α x_{1} x_{2})}{\sum_{{x'}_{1}, {x'}_{2}} \exp (α {x'}_{1} {x'}_{2})} .

(64)
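The two definitions above can be written down directly. The text does not state which two values the binary variables take; the sketch below assumes the values −1 and +1, for which large α and β give perfect correlation. It also checks that k_β has the split form of Equation (27), with the illustrative factors ϕ₁(x₁; y) = exp(β y x₁), ϕ₂(x₂; y) = exp(β y x₂), and ϕ₀ equal to the inverse normalization.

```python
from itertools import product
from math import exp

S = (-1, 1)   # assumed value set for x1, x2, and y (not specified in the text)

def k_beta(beta):
    """Channel of Equation (63) as a dict: k[(x1, x2)][y]."""
    k = {}
    for x1, x2 in product(S, repeat=2):
        z = sum(exp(beta * y * (x1 + x2)) for y in S)
        k[(x1, x2)] = {y: exp(beta * y * (x1 + x2)) / z for y in S}
    return k

def p_alpha(alpha):
    """Input distribution of Equation (64) as a dict: p[(x1, x2)]."""
    z = sum(exp(alpha * a * b) for a, b in product(S, repeat=2))
    return {(a, b): exp(alpha * a * b) / z for a, b in product(S, repeat=2)}

beta, alpha = 1.2, 0.8    # hypothetical parameter values
k, p = k_beta(beta), p_alpha(alpha)

def split_form(x1, x2, y):
    """phi_0(x1, x2) * phi_1(x1; y) * phi_2(x2; y), as in Equation (27)."""
    phi0 = 1.0 / sum(exp(beta * t * x1) * exp(beta * t * x2) for t in S)
    return phi0 * exp(beta * y * x1) * exp(beta * y * x2)

max_gap = max(abs(k[(x1, x2)][y] - split_form(x1, x2, y))
              for x1, x2 in product(S, repeat=2) for y in S)
print(max_gap)            # difference between k_beta and its split form
```

The difference is zero up to rounding, which illustrates why k_β belongs to the family of split channels for every finite β.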

For α, β → ∞, the correlation becomes perfect, and the two measures of synergy are both zero. For 0 < α, β < ∞, our measure d₂(k_β) is zero, as clearly k_β ∈ $E$ ₁. CI is more difficult to compute, but we can give a (non-zero) lower bound in the following way. First, we fix two values β = β₀, α = α₀. We consider the joint distribution $p_{α_{0}} k_{β_{0}}$ , and look at the marginals:

p_{α_{0}} k_{β_{0}} (X_{1}, Y), p_{α_{0}} k_{β_{0}} (X_{2}, Y) .

(65)

We define the family ∧ as the set of joint probabilities which have exactly these marginals. If we increase β, we can always find an α such that the marginals do not change:

p_{α} k_{β} (X_{1}, Y) = p_{α_{0}} k_{β_{0}} (X_{1}, Y), p_{α} k_{β} (X_{2}, Y) = p_{α_{0}} k_{β_{0}} (X_{2}, Y),

(66)

i.e., such that p_α k_β ∈ ∧. Now we can look at the mutual information of p_α k_β and of $p_{α_{0}} k_{β_{0}}$ . If they differ, and (for example) the former is larger, then

\begin{align} I_{p_{α} k_{β}} (Y : X_{1}, X_{2}) - I_{p_{α_{0}} k_{β_{0}}} (Y : X_{1}, X_{2}) \\ \leq I_{p_{α} k_{β}} (Y : X_{1}, X_{2}) - \min_{p^{*} \in \land} I_{p^{*}} (Y : X_{1}, X_{2}) = C I_{p_{α} k_{β}} \end{align}

(67)

is a well-defined lower bound for $C I_{p_{α} k_{β}}$. With a numerical simulation, we can show graphically that the mutual information is indeed not constant within the families ∧ (see Figure 11).
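The construction of Equations (65) to (67) can be sketched numerically as follows. This is an illustration under assumptions, not the simulation used for the figures: the variables are assumed to take values ±1, in which case both marginals depend on (α, β) only through the product sigmoid(2α)·tanh(2β), so the matching α has a closed form. The starting point (α₀, β₀) = (1.0, 0.7) and all function names are hypothetical.

```python
import numpy as np

S = np.array([-1, 1])  # ASSUMPTION: binary variables encoded as -1/+1


def joint(alpha, beta):
    """Joint p_alpha(x1, x2) k_beta(x1, x2; y), indexed [x1, x2, y]."""
    x1, x2, y = S[:, None, None], S[None, :, None], S[None, None, :]
    p = np.exp(alpha * x1 * x2)
    p = p / p.sum()
    k = np.exp(beta * y * (x1 + x2))
    k = k / k.sum(axis=2, keepdims=True)
    return p * k


def mutual_information(j):
    """I(Y : X1, X2) in bits for a strictly positive joint j[x1, x2, y]."""
    p_x = j.sum(axis=2, keepdims=True)
    p_y = j.sum(axis=(0, 1), keepdims=True)
    return float(np.sum(j * np.log2(j / (p_x * p_y))))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def matching_alpha(beta, alpha0, beta0):
    """alpha such that p_alpha k_beta keeps the (X1, Y) and (X2, Y) marginals
    of p_alpha0 k_beta0 (closed form valid for the assumed +/-1 encoding)."""
    s = sigmoid(2 * alpha0) * np.tanh(2 * beta0) / np.tanh(2 * beta)
    if not 0.0 < s < 1.0:
        raise ValueError("no alpha keeps the marginals fixed for this beta")
    return 0.5 * np.log(s / (1.0 - s))  # inverts s = sigmoid(2 * alpha)


alpha0, beta0 = 1.0, 0.7  # hypothetical starting point
j0 = joint(alpha0, beta0)
i0 = mutual_information(j0)
for beta in (0.7, 1.0, 1.5, 2.0, 3.0):
    alpha = matching_alpha(beta, alpha0, beta0)
    j = joint(alpha, beta)
    # Both pairwise marginals with Y are unchanged, so j lies in the same family.
    assert np.allclose(j.sum(axis=1), j0.sum(axis=1))  # (X1, Y) marginal
    assert np.allclose(j.sum(axis=0), j0.sum(axis=0))  # (X2, Y) marginal
    bound = mutual_information(j) - i0  # left-hand side of Equation (67)
    print(f"beta={beta:.1f}  alpha={alpha:.3f}  lower bound on CI={bound:.4f} bits")
```

In this sketch the difference grows with β along the family of fixed marginals, which is the behavior that makes the left-hand side of Equation (67) a non-trivial lower bound.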

Figure 11. Mutual information and fixed marginals. The shades of blue represent the amount of I_p(Y : X₁, X₂) as a function of α, β (brighter is higher). Each red line represents a family ∧ of fixed marginals. While the lines of fixed mutual information and the families of fixed marginals look qualitatively similar, they do not coincide exactly, which means that I_p varies within each family ∧.

From the picture, we can see that the red lines (families ∧ for different initial values) approximate well the lines of constant mutual information, at least qualitatively, but they are not exactly equal. This means that for most points p of ∧, the quantity:

C I_{p} (Y : X_{1}, X_{2}) : = I_{p} (Y : X_{1}, X_{2}) - \min_{p^{*} \in \land} I_{p^{*}} (Y : X_{1}, X_{2})

(68)

will be non-zero. More explicitly, we can plot the increase in mutual information as p varies in ∧, for example, as a function of β (see Figure 12). By Equation (67), this increase is always smaller than or equal to the difference between the mutual information and its minimum in ∧ (i.e., CI). We can see that it is positive, which implies that CI_p is also positive.

Figure 12. Lower bound for CI versus β. For each β ∈ [0.7, 3] we can find an α such that the joint p_α k_β lies in ∧. The increase in mutual information as β varies is a lower bound for CI, which is then in general non-zero.

We can see in Figure 11 that especially for very large or very small values of α and β (i.e., very strong or very weak correlation), CI captures the behavior of mutual information quite well. These limits are precisely deterministic and constant kernels, for which most approaches to quantifying synergy coincide. This is the reason why the examples studied in Griffith and Koch (2014) give quite a satisfying result for CI (in their notation, S_VK). For the less studied (and computationally more complex) intermediate values, like 1 < α, β < 2, the approximation is instead far from accurate, and in that interval (see Figure 12) there is a sharp increase in I, which leads to overestimating synergy.

6. Conclusion

Using information geometry, we have defined a non-negative decomposition of the mutual information between inputs and outputs of a channel.

The decomposition divides the mutual information into contributions of the different orders of synergy in a unique way. It does not, however, divide the mutual information into contributions of the different subsets of input nodes as Williams and Beer’s PID (Williams and Beer, 2010) would require.

For two inputs, our measure of synergy is closely related to the well-received quantification of synergy in Griffith and Koch (2014) and Bertschinger et al. (2014). Our measure, however, works in the desired way for an arbitrary (finite) number of inputs. Unlike Griffith and Koch (2014) and Bertschinger et al. (2014), we do not define a measure for redundant or “shared” information, nor for the unique information of single inputs or subsets.

The decomposition depends on the choice of an input distribution. In case of input correlation, redundant information is counted automatically only once. This way, there is no need to quantify redundancy separately.

In general, there is no way to compute our quantities in closed form, but they can be approximated by an iterative-scaling algorithm (see, for example, Csiszár and Shields (2004)). The results are consistent with the intuitive properties of synergy, outlined in Williams and Beer (2010) and Griffith and Koch (2014).
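A minimal sketch of such an iterative-scaling computation for two inputs is given below. It assumes that, for a fixed input distribution, the projection relevant for the pairwise synergy term is the joint distribution that matches the (X₁, X₂), (X₁, Y), and (X₂, Y) marginals and has no three-way interaction, and that the synergy is the Kullback-Leibler divergence from the original joint to this projection. The function names, the iteration count, and the two example channels are illustrative choices, not taken from the paper.

```python
import numpy as np


def iterative_scaling(p, n_iter=2000):
    """Fit q[x1, x2, y] to the (X1,X2), (X1,Y) and (X2,Y) marginals of p.

    Starts from the uniform distribution and cyclically rescales q so that
    one pairwise marginal at a time agrees with the corresponding one of p.
    """
    q = np.full(p.shape, 1.0 / p.size)
    # Summing out axis 2, 1, 0 gives the (X1,X2), (X1,Y), (X2,Y) marginals.
    targets = {ax: p.sum(axis=ax, keepdims=True) for ax in (2, 1, 0)}
    for _ in range(n_iter):
        for ax, target in targets.items():
            current = q.sum(axis=ax, keepdims=True)
            # Cells whose current marginal is zero are left at zero.
            factor = np.divide(target, current,
                               out=np.zeros_like(target), where=current > 0)
            q = q * factor
    return q


def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits, skipping cells with p = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


# Example 1: XOR of two independent, uniform input bits.
xor = np.zeros((2, 2, 2))
for a in range(2):
    for b in range(2):
        xor[a, b, a ^ b] = 0.25
print("XOR:", kl_bits(xor, iterative_scaling(xor)))

# Example 2: the joint p_alpha k_beta of Equations (63) and (64),
# with alpha = beta = 0.5 and an ASSUMED -1/+1 encoding.
s = np.array([-1, 1])
x1, x2, y = s[:, None, None], s[None, :, None], s[None, None, :]
p_in = np.exp(0.5 * x1 * x2)
p_in = p_in / p_in.sum()
k = np.exp(0.5 * y * (x1 + x2))
k = k / k.sum(axis=2, keepdims=True)
additive = p_in * k
print("k_beta:", kl_bits(additive, iterative_scaling(additive)))
```

In this sketch the XOR channel yields a divergence of one bit, since all its pairwise marginals are uniform, while the channel k_β yields a value that is zero up to numerical precision, in line with k_β lying in the first family of the hierarchy.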

Author Contributions

The research was initiated by NA. It was carried out by both authors, with main contribution by PP, who also wrote the article. Both authors read and approved the final manuscript.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The handling editor Mikhail Prokopenko declares that despite having collaborated with author Nihat Ay, the review process was handled objectively and no conflict of interest exists.

Footnote

For an example in which I = 0 but the nodes are not independent, see Williams and Beer (2010).

References

Amari, S. (2001). Information geometry on a hierarchy of probability distributions. IEEE Trans. Inf. Theory 47, 1701–1709. doi: 10.1109/18.930911

Amari, S., and Nagaoka, H. (2000). Methods of Information Geometry. Translations of Mathematical Monographs, vol. 191. Oxford: American Mathematical Society.

Ay, N. (2015). Information geometry on complexity and stochastic interaction. Entropy 17, 2432. doi:10.3390/e17042432

Bertschinger, N., Rauh, J., Olbrich, E., Jost, J., and Ay, N. (2014). Quantifying unique information. Entropy 16, 2161. doi:10.3390/e16042161

Csiszár, I., and Shields, P. C. (2004). Information theory and statistics: a tutorial. Found. Trends Commun. Inf. Theory 1, 417–528. doi:10.1561/0100000004

Griffith, V., and Koch, C. (2014). “Quantifying synergistic mutual information,” in Guided Self-Organization: Inception, Emergence, Complexity and Computation Series, vol. 9. Springer.

Harder, M., Salge, C., and Polani, D. (2013). Bivariate measure of redundant information. Phys. Rev. E 87, 012130. doi:10.1103/PhysRevE.87.012130

Kakihara, Y. (1999). Abstract Methods in Information Theory. Series on Multivariate Analysis, vol. 4. Singapore: World Scientific.

McGill, W. J. (1954). Multivariate information transmission. Psychometrika 19, 97–116. doi:10.1007/BF02289159

Rauh, J., Bertschinger, N., Olbrich, E., and Jost, J. (2014). “Reconsidering unique information: towards a multivariate information decomposition,” in International Symposium on Information Theory (ISIT). IEEE.

Schneidmann, E., Bialek, W., and Berry, M. J. II (2003a). Synergy, redundancy, and independence in population codes. J. Neurosci. 23, 11539–11553. doi:10.1523/JNEUROSCI.5319-04.2005

Schneidmann, E., Still, S., Berry, M. J. II, and Bialek, W. (2003b). Network information and connected correlations. Phys. Rev. Lett. 91, 238701–238704. doi:10.1103/PhysRevLett.91.238701

Williams, P. L., and Beer, R. D. (2010). Nonnegative decomposition of multivariate information. Preprint available on arXiv:1004.2151.

Keywords: synergy, redundancy, hierarchy, projections, divergences, interactions, iterative scaling, information geometry

Citation: Perrone P and Ay N (2016) Hierarchical Quantification of Synergy in Channels. Front. Robot. AI 2:35. doi: 10.3389/frobt.2015.00035

Received: 14 October 2015; Accepted: 08 December 2015;
Published: 08 January 2016

Edited by:

Mikhail Prokopenko, University of Sydney, Australia

Reviewed by:

Daniele Marinazzo, University of Ghent, Belgium
Rick Quax, University of Amsterdam, Netherlands

Copyright: © 2016 Perrone and Ay. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Paolo Perrone, perrone@mis.mpg.de
