CLR Algorithm

CLR is a mutual-information-based algorithm first described in "Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles". The software and documentation can be downloaded from the Gardner Lab website after completing a material transfer agreement (MTA).


CLR FAQ

  1. Q: After z-score normalization, matrix entries with sufficiently low mutual information ought to become ‘significant’: after all, the square of a negative z-score is positive, and thus sqrt(Z_i^2 + Z_j^2) ought to be high when Z_i << 0 or Z_j << 0. Are these edges important? Is there a bug? What's going on here?

    A: There are two answers to this: a technical answer and a more complete ‘background’.

    1. On a purely technical level, we remove all z-scores < 0 before computing sqrt(Z_i^2 + Z_j^2). Read on for the 'why'.

    2. This lack of clarity has its roots in the ‘tailedness’ of the distribution. The normal distribution is two-tailed (and symmetric), whereas mutual information is one-tailed (0 to infinity). Therefore, in the strictest sense, it is improper to use a Gaussian to model the distribution of mutual information. As explained in our supplement, the right distribution to use would be a smooth empirical one (e.g. kernel density) or a one-tailed analytical model (e.g. Rayleigh). However, the empirical distribution falls apart at the tails (on both sides), while the Rayleigh is difficult to work with owing to very small probability values (often uncomputable in Matlab). The z-scores are a good abstraction and easy to work with, so we tried the Gaussian to see how bad the fit was.

    It turns out the right tail (corresponding to large MI values) fits fine. The next question we asked was: what about a significant left tail?

    Well, the z-scores for E. coli extend much farther to the right than to the left: no more than about 3.5-4 z-scores out on the left, versus 9 or more on the right (these figures are from memory).

    Even more importantly, the left tail is meaningless on the intuitive level. Mutual information captures just that - common information - and to have little common information may biologically mean simply that the two genes are very differently regulated and not even related in behavior through pleiotropic effects (changes in salinity or pressure, etc.). In other words, the behavior of one as compared to the other is random. Such genes' relationships are not biologically significant (except in the sense that they must represent very different cellular processes). They are certainly not connected in the network sense. We looked into such gene pairs with low z-scores and found that keeping their relationships had a very slight adverse effect on network reconstruction. The impact is small because very few such gene pairs achieved anything resembling a real significance level: certainly nothing at 60% likelihood, and maybe only a handful at 15-20%.
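    The technical answer above can be illustrated with a small sketch. This is not the code from clr.m: the function name `clr_scores`, the use of the population standard deviation, and the choice to set negative z-scores to zero (one way of removing them) are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def clr_scores(mi):
    """Return CLR-style scores for a symmetric mutual-information matrix.

    mi is a list of lists. The diagonal is ignored when estimating each
    gene's background distribution of mutual information.
    """
    n = len(mi)
    # Background mean and standard deviation of MI for each gene (row).
    mu, sigma = [], []
    for i in range(n):
        background = [mi[i][j] for j in range(n) if j != i]
        mu.append(mean(background))
        sigma.append(pstdev(background))

    def z(i, j):
        # z-score of MI(i, j) against gene i's background. Negative values
        # are set to zero (an assumption about how they are "removed"),
        # so the left tail cannot inflate the combined score.
        if sigma[i] == 0:
            return 0.0
        return max(0.0, (mi[i][j] - mu[i]) / sigma[i])

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                scores[i][j] = math.sqrt(z(i, j) ** 2 + z(j, i) ** 2)
    return scores
```

    With this clipping in place, a gene pair whose mutual information lies below both genes' backgrounds scores exactly zero instead of appearing ‘significant’.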

  2. Does CLR only work with Affymetrix arrays?

    No. It works with any tabular data. The Matlab script accepts a data matrix of type double in which the genes are in rows and the experiments are in columns. The command-line executable accepts CSV files in the same format (see README). The data should satisfy these constraints:

    a) It should be log-scaled (log-ratio for spotted arrays or log-scale for Affy/Nimblegen/other commercial single-channel data).

    b) It should not contain non-numeric values of any kind (NaN, +/-Inf, blanks, etc.). Any missing values are best approximated from the data - e.g., replace -Inf with the smallest numeric value found in the data, or else with the smallest value available on the hardware platform.

    c) 2-color arrays should be referenced to a common control.

    d) The number of probes analyzed should not be fewer than ~1000 for the Matlab script in Rayleigh mode, or fewer than ~2500 for the standalone script or the Matlab script in Gaussian mode. Below these numbers, the background distributions of mutual information may not be reliable.

    e) Finally, it helps to know what the transcription factors are. The network acquires a slightly different meaning when this information is unavailable. For instance, in bacteria, CLR will detect operons, snippets of metabolic pathways, and ribosomal subunits.
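    Constraints b) and d) above can be checked before running CLR. The following is a minimal sketch in Python, not part of the CLR distribution; the function names `check_expression_matrix` and `replace_neg_inf` and the `min_probes` parameter are hypothetical, and the matrix is assumed to be a list of rows (genes) of numeric entries.

```python
import math

def check_expression_matrix(rows, min_probes=1000):
    """Return a list of problems found in a genes-by-experiments matrix."""
    problems = []
    # Constraint d): too few probes makes the MI background unreliable.
    if len(rows) < min_probes:
        problems.append(
            f"only {len(rows)} probes; at least ~{min_probes} suggested"
        )
    # Every gene should have the same number of experiments.
    if len({len(row) for row in rows}) > 1:
        problems.append("rows have differing numbers of experiments")
    # Constraint b): no NaN, +/-Inf, blanks, or other non-numeric values.
    bad = sum(
        1
        for row in rows
        for v in row
        if not isinstance(v, (int, float)) or not math.isfinite(v)
    )
    if bad:
        problems.append(f"{bad} non-numeric or non-finite entries")
    return problems

def replace_neg_inf(rows):
    """Replace -Inf entries with the smallest finite value in the data.

    Assumes every entry is numeric; NaN and +Inf are left untouched.
    """
    finite = [v for row in rows for v in row if math.isfinite(v)]
    floor = min(finite)
    return [[floor if v == -math.inf else v for v in row] for row in rows]
```

    Pass `min_probes=2500` when checking data intended for the standalone script or the Gaussian mode.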

  3. I’m having trouble understanding fig 2b - I can’t see how at 60% precision CLR recovers 1079 interactions. The curves suggest to me that as the level of recall increases, the precision decreases.

    Your interpretation of figure 2b is correct with respect to precision and recall. Note that this figure doesn’t consider all 1079 interactions predicted by the algorithm at 60% precision. Only about 320 of the 1079 predicted interactions are between genes contained in RegulonDB (RegulonDB is incomplete - it only provides information on about 1200 of E. coli’s 4300 genes). 60% (~190) of these 320 predicted interactions are correct according to the database, and 40% are false positives (though they might also be true new interactions). Thus, for genes that are actually described in the database, we find about 6.5% of the ~3200 known interactions with about 40% FPs among our predictions for these genes.
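    The arithmetic behind these figures can be retraced with the rounded counts quoted above. This is only a sanity check written in Python; the variable names are illustrative, and because the inputs are rounded, the resulting recall (roughly 6%) only approximates the figure quoted in the answer.

```python
# Rounded counts taken from the answer above.
predicted_in_db = 320      # predictions between genes covered by the database
precision = 0.60           # fraction of those predictions that are known
known_interactions = 3200  # approximate number of known interactions

true_positives = precision * predicted_in_db
false_positives = predicted_in_db - true_positives
recall = true_positives / known_interactions

print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"recall:          {recall:.1%}")
```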


    At 60% precision, we also find >700 additional interactions between genes that aren’t included in the database, so there is no way to calculate precision and recall for these genes directly. We are confident that there is real regulation among them: we believe that our precision figures carry over to these genes as well, because we performed chromatin immunoprecipitation experiments on many of them and verified that the precision is about the same.

  4. In Rayleigh mode, the Matlab clr script returns a lot of probabilities maxed out at 1. Is this a bug?


    Not exactly. Rayleigh distribution fits to mutual information often return very low p-values (occasionally exactly 0). This is normal. However, when computing joint probabilities, the products of marginal probabilities tend to ‘underflow’. This problem shows up most vividly when the number of probes or probesets on a given chip is large, causing marginal p-values for statistically significant values of mutual information to be very low. In other words, this problem occurs when the network is very sparse.

    There are two workarounds:

    a) In clr.m, 'rayleigh' method section, replace the line

    A = A + A' - A.*A';

    with the line

    A = log(1 - A) + log(1 - A');


    This will give you log-p-values instead of probability estimates. There are certain downsides to data transformed in this way, but the values will be rank-equivalent to the original output (with the order reversed: the most negative values correspond to the highest probabilities).


    b) Use ‘normal’ mode to obtain z-score-scaled estimates.
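    The saturation at 1 and the benefit of working in log space can be reproduced in any double-precision environment. The sketch below uses Python rather than Matlab and is not taken from clr.m; the function names are hypothetical, and it assumes the tail p-values q = 1 - p are available before 1 - q has been rounded to 1.

```python
import math

def joint_probability(p_i, p_j):
    """Direct combination of two probabilities: p_i + p_j - p_i * p_j."""
    return p_i + p_j - p_i * p_j

def joint_log_tail(q_i, q_j):
    """Log of the product of two tail p-values (q = 1 - p), in log space."""
    return math.log(q_i) + math.log(q_j)

# Two very different pairs of tail p-values.
strong = (1e-30, 1e-30)
weaker = (1e-20, 1e-20)

# In probability space both pairs saturate: 1 - q rounds to exactly 1.0.
print(joint_probability(1 - strong[0], 1 - strong[1]))
print(joint_probability(1 - weaker[0], 1 - weaker[1]))

# Multiplying very small p-values directly underflows to zero.
print(1e-200 * 1e-200)

# In log space the pairs stay finite and distinguishable.
print(joint_log_tail(*strong))
print(joint_log_tail(*weaker))
print(joint_log_tail(1e-200, 1e-200))
```

    The direct combination cannot separate the two pairs, whereas the log-space values preserve their ranking.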