ChIPMunk program description

ChIPMunk is a fast heuristic DNA motif digger based on a greedy approach accompanied by bootstrapping. ChIPMunk identifies the strong motif with the maximum Discrete Information Content in a set of DNA sequences. ChIPMunk uses (extended) multifasta as the input format and supports IUPAC DNA letters in the input sequences.

Looking for a cool GUI that is able to call ChIPMunk for motif discovery?
Check out the all-in-one toolbox MotifLab, which has ChIPMunk integrated as a motif discovery tool.

A BioUML perspective is available for more ChIP-Seq-oriented analysis.

Here you can find a short online description of ChIPMunk. For a more detailed overview, please check the 'downloads' section and the following links:

ChIPMunk integration

ChIPMunk is integrated into the Nebula NGS analysis pipeline developed at Institut Curie.

The following independent pipeline is a motif discovery platform integrating different tools (including ChIPMunk). Check the corresponding benchmark showing how ChIPMunk performs on selected datasets versus other available tools.

The BioUML platform has integrated ChIPMunk as a motif discovery algorithm.

ChIPMunk-related papers and benchmarks

Program parameters and motif length estimation

ChIPMunk iteratively checks different motif lengths to find the longest strong motif. The web version of ChIPMunk always starts from the specified minimum motif length. The stand-alone version can test motif lengths in both directions (minimum-to-maximum and vice versa). Each sequence in the set has an associated floating-point weight that determines its contribution to the final motif. In the OOPS (one-occurrence-per-sequence) mode, the motif strength check is performed twice: once using a motif built from the weighted sequences, and once using a motif rebuilt directly with the sequence weights omitted. In the ZOOPS (zero-or-one-occurrence-per-sequence) mode, the motif strength check is performed only for motifs built from the unweighted word list.
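To illustrate the difference between a motif built from weighted sequences and one rebuilt without weights, here is a minimal Python sketch. It is not ChIPMunk's own code: the function name `count_matrix` and the example words and weights are hypothetical, and the sketch handles only the plain letters A, C, G and T (no IUPAC ambiguity codes).

```python
from typing import Dict, List, Optional, Sequence

ALPHABET = "ACGT"  # assumption: plain nucleotides only, no IUPAC ambiguity letters


def count_matrix(
    words: Sequence[str],
    weights: Optional[Sequence[float]] = None,
) -> List[Dict[str, float]]:
    """Build per-column nucleotide counts from aligned words of equal length.

    Each word adds its weight to the matching letter in every column.
    With weights=None every word counts as 1.0 (the unweighted motif).
    """
    if not words:
        return []
    if weights is None:
        weights = [1.0] * len(words)
    if len(weights) != len(words):
        raise ValueError("need exactly one weight per word")

    length = len(words[0])
    matrix = [{base: 0.0 for base in ALPHABET} for _ in range(length)]
    for word, weight in zip(words, weights):
        if len(word) != length:
            raise ValueError("all words must have the same length")
        for position, base in enumerate(word.upper()):
            matrix[position][base] += weight
    return matrix


# Hypothetical aligned words (one per sequence) and their sequence weights.
words = ["ACG", "ACT", "TCG"]
weights = [2.0, 1.0, 0.5]

weighted = count_matrix(words, weights)  # motif built from weighted sequences
unweighted = count_matrix(words)         # motif rebuilt with weights omitted
```

In this sketch a strength check in the OOPS mode would be run on both `weighted` and `unweighted`, while in the ZOOPS mode it would be run on `unweighted` only.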


Advanced parameters

Definition of the "strong" motif

ChIPMunk searches for the motif with the highest Kullback Discrete Information Content (KDIC). We call a motif strong if it has high-conservation alignment columns somewhere to the left and somewhere to the right of any no-conservation column. So we allow non-conserved regions (i.e. fixed-length gaps) only if they are surrounded by columns with "high" conservation. In particular, this means that in a strong motif the first and the last alignment columns have at least "low" conservation.
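The definition above can be read as a simple rule over per-column conservation labels. The following Python sketch applies that rule literally; the function name `is_strong` and the string labels are hypothetical, and treating an empty motif as not strong is an assumption of this sketch rather than something stated here.

```python
from typing import Sequence

HIGH, LOW, NONE = "high", "low", "none"  # hypothetical labels for column classes


def is_strong(column_classes: Sequence[str]) -> bool:
    """Return True if every no-conservation column has a high-conservation
    column somewhere to its left and somewhere to its right."""
    if not column_classes:
        return False  # assumption: an empty motif is not considered strong

    high_positions = [i for i, c in enumerate(column_classes) if c == HIGH]
    for i, c in enumerate(column_classes):
        if c != NONE:
            continue
        # A gap column needs at least one "high" column on each side.
        if not high_positions:
            return False
        if high_positions[0] > i or high_positions[-1] < i:
            return False
    return True
```

For example, `is_strong(["high", "none", "none", "high"])` holds because the gap is flanked by "high" columns, while `is_strong(["none", "high", "high"])` does not, since the first column has no conservation and nothing lies to its left.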

A column has "high" conservation if its KDIC is greater than or equal to the high conservation threshold Thc. A column has "low" conservation if its KDIC is below Thc but at least the low conservation threshold Tlc. A column has no conservation if its KDIC is lower than Tlc.
Thc is defined as the discrete information content of a column where only 3 of the 4 possible nucleotides are present, i.e. [N/3, N/3, N/3, 0]. Tlc is defined as the discrete information content of a column where one pair of nucleotides is 2 times more frequent than the other pair, i.e. [2N/6, 2N/6, N/6, N/6]. Here N is the total weight of the sequence set (the total number of sequences).
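As a rough illustration of how the two thresholds and the three column classes fit together, here is a Python sketch. The exact KDIC formula is not given in this description, so the sketch substitutes a plain discrete information content of the form (sum of log x_a! minus log N!) / N, computed with the log-gamma function so that non-integer weights work. That formula, and the names `discrete_information_content`, `thresholds` and `classify_column`, are assumptions made for illustration only.

```python
import math
from typing import Sequence, Tuple


def discrete_information_content(counts: Sequence[float]) -> float:
    """Assumed stand-in for KDIC: (sum_a log(x_a!) - log(N!)) / N.

    lgamma(x + 1) equals log(x!) for integers and extends it to float weights.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("column must have positive total weight")
    log_factorials = sum(math.lgamma(c + 1.0) for c in counts)
    return (log_factorials - math.lgamma(total + 1.0)) / total


def thresholds(total_weight: float) -> Tuple[float, float]:
    """Return (Thc, Tlc) for a sequence set of total weight N."""
    n = total_weight
    thc = discrete_information_content([n / 3, n / 3, n / 3, 0.0])
    tlc = discrete_information_content([2 * n / 6, 2 * n / 6, n / 6, n / 6])
    return thc, tlc


def classify_column(value: float, thc: float, tlc: float) -> str:
    """Label a column by comparing its information content to Thc and Tlc."""
    if value >= thc:
        return "high"
    if value >= tlc:
        return "low"
    return "none"


# Example with a hypothetical set of 300 equally weighted sequences.
thc, tlc = thresholds(300.0)
conserved = discrete_information_content([300, 0, 0, 0])
uniform = discrete_information_content([75, 75, 75, 75])
```

Under this assumed formula Thc comes out larger than Tlc, a fully conserved column is classified as "high", and a uniform column falls below Tlc and is classified as "none".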