CLR is a mutual-information-based algorithm first described in "Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles". The software and documentation can be downloaded from the Gardner Lab website after completing an MTA.
1. On a purely technical level, we remove all z-scores < 0 before computing sqrt(Z_i^2 + Z_j^2). Read on for the 'why'.
2. This lack of clarity has its roots in the ‘tailedness’ of the distribution. The normal distribution is two-tailed (and symmetric), whereas mutual information is one-tailed (0 to infinity). Therefore, in the strictest sense, it is improper to use a Gaussian to model the distribution of mutual information. As explained in our supplement, the right distribution to use would be a smooth empirical one (e.g. kernel density) or a one-tailed analytical model (e.g. Rayleigh). However, the empirical distribution falls apart at the tails (on both sides), while the Rayleigh is difficult to work with owing to very small probability values (often uncomputable in MATLAB). Z-scores are a good abstraction and easy to work with, so we tried the Gaussian to see how bad the fit was.
It turns out the right tail (corresponding to large MI values) fits fine. The next question we asked was: what about the significant left tail?
Well, the z-scores for E. coli extend much farther to the right than to the left: less than 3.5-4 z-scores out on the left, versus 9 or more on the right (these figures are from memory).
Even more importantly, the left tail is meaningless on an intuitive level. Mutual information captures just that, common information, and to have little common information may biologically mean simply that the two genes are very differently regulated and not even related in behavior through pleiotropic effects (changes in salinity or pressure, etc.). In other words, the behavior of one as compared to the other is random. Such genes' relationships are not biologically significant (except in the sense that they must represent very different cellular processes). They are certainly not connected in the network sense. We looked into such gene pairs with low z-scores and found that keeping their relationships had a very slight adverse effect on network reconstruction. The small magnitude of the impact owes to the fact that very few such gene pairs achieved anything resembling a real significance level: certainly nothing at 60% likelihood, and maybe only a handful at 15-20%.
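To make the scoring step described above concrete, here is a minimal Python sketch. It is an illustration only and is not the code in clr.m. The function name clr_scores, the helper positive_z, and the toy matrix example_mi are hypothetical. The sketch also assumes that each gene's background mean and standard deviation are taken over its own row of MI values with the diagonal excluded, and that "removing" a negative z-score amounts to treating it as zero.

```python
import math
from statistics import mean, pstdev


def positive_z(value, mu, sigma):
    """Z-score of value against a background, with the left tail set to zero."""
    if sigma == 0:
        return 0.0  # constant background: no meaningful z-score
    return max(0.0, (value - mu) / sigma)


def clr_scores(mi):
    """Combine per-gene z-scores into sqrt(Z_i^2 + Z_j^2) for every gene pair.

    mi is a symmetric list of lists of mutual information values.
    Assumption: background statistics come from each gene's own row,
    excluding the diagonal entry.
    """
    n = len(mi)
    mu, sigma = [], []
    for i in range(n):
        row = [mi[i][j] for j in range(n) if j != i]
        mu.append(mean(row))
        sigma.append(pstdev(row))

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            z_i = positive_z(mi[i][j], mu[i], sigma[i])
            z_j = positive_z(mi[i][j], mu[j], sigma[j])
            scores[i][j] = math.hypot(z_i, z_j)  # sqrt(z_i**2 + z_j**2)
    return scores


# Toy MI matrix (made-up numbers): genes 0 and 1 share a lot of information,
# genes 2 and 3 share very little with anything.
example_mi = [
    [0.00, 0.90, 0.10, 0.20],
    [0.90, 0.00, 0.15, 0.10],
    [0.10, 0.15, 0.00, 0.05],
    [0.20, 0.10, 0.05, 0.00],
]
example_scores = clr_scores(example_mi)
```

In this toy matrix the pair (2, 3) falls below both genes' background means, so both z-scores are set to zero and the pair gets a score of exactly zero. The pair (0, 1) sits in the right tail of both backgrounds and gets a positive score.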
There are two workarounds:
a) In clr.m, in the 'rayleigh' method section, replace the line
A = A + A' - A.*A';
with the line
A = log(1 - A) + log(1 - A');
This will give you log-p-values instead of probability estimates. There are certain downsides to data transformed in this way, but the values will be rank-equivalent to the original output.
b) Use 'normal' mode to obtain z-score-scaled estimates.
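The rank-equivalence claimed for workaround (a) follows from a simple identity. For a pair of entries a = A(i,j) and b = A(j,i), we have log(1 - a) + log(1 - b) = log((1 - a)(1 - b)) = log(1 - (a + b - a*b)). The new value is therefore a monotonically decreasing function of the original one: the ordering of gene pairs is preserved, but reversed, so the most negative log value corresponds to the largest original probability estimate. The Python sketch below checks this on a few made-up pairs. It is not part of clr.m, the function names are hypothetical, and it assumes the entries of A are probabilities strictly below 1.

```python
import math


def union_probability(a, b):
    """Element-wise counterpart of A + A' - A.*A' for one pair of entries."""
    return a + b - a * b


def log_complement(a, b):
    """Element-wise counterpart of log(1 - A) + log(1 - A').

    math.log1p(-x) evaluates log(1 - x) accurately when x is small.
    """
    return math.log1p(-a) + math.log1p(-b)


# Made-up (A(i,j), A(j,i)) pairs, assumed to be probabilities below 1.
pairs = [(0.10, 0.30), (0.90, 0.95), (0.50, 0.20), (0.999, 0.9999)]
indices = range(len(pairs))

# Strongest pair first under the original output (largest probability) ...
by_probability = sorted(
    indices, key=lambda k: union_probability(*pairs[k]), reverse=True
)
# ... and strongest pair first under workaround (a) (most negative value).
by_log_value = sorted(indices, key=lambda k: log_complement(*pairs[k]))

same_ranking = by_probability == by_log_value
```

Here same_ranking is True. The last pair, whose probabilities are very close to 1, stays well separated from the others on the log scale.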