## ChIPMunk program description

ChIPMunk is a fast heuristic DNA motif digger based on a greedy approach accompanied by bootstrapping.
ChIPMunk identifies the strong motif with the maximum Discrete Information Content in a set of DNA sequences.
ChIPMunk uses (extended) multifasta as the input format and supports IUPAC DNA letters in the input sequences.

Looking for a cool GUI that is able to call ChIPMunk for motif discovery?

Check the all-in-one toolbox MotifLab, which has ChIPMunk integrated as a motif discovery tool.

A BioUML perspective is available for more ChIP-Seq-oriented analysis.

**Here you can find a short description of ChIPMunk online. For a more detailed overview, please check
the 'downloads' section and the following links:**

- ChIPMunk paper illustrating usage on ChIP-Seq data: Bioinformatics, 2010
- Supplementary text (includes KDIC description): Supplementary_text.pdf
- Supplementary data: click here
- Paper on data integration using ChIPMunk: Biophysics, 2009
- Pre-ChIPMunk motif discovery paper (with the DIC description): Bioinformatics, 2009
- LIX Bioinformatics Colloquium 2010 presentation: Kulakovskiy_LIX_8nov2010.pdf

## ChIPMunk integration

ChIPMunk is integrated into the Nebula NGS analysis pipeline developed at Institut Curie.

The independent pipeline at http://cmotifs.tchlab.org/ is a motif discovery platform integrating different tools (including ChIPMunk). See the corresponding benchmark showing how ChIPMunk performs on selected datasets versus other available tools.

The BioUML platform has integrated ChIPMunk as a motif discovery algorithm.

## ChIPMunk-related papers and benchmarks

- Weirauch et al, *Nat Biotech*, 2013, PubMed
- Ma et al, *NAR*, 2012, PubMed
- Kuttippurathu et al, *Bioinformatics*, 2011, PubMed
- Bi et al, *PLoS One*, 2011, PubMed
- Carvalho and Oliveira, *Al Mol Biol*, 2011, PubMed

## Program parameters and motif length estimation

ChIPMunk iteratively checks different motif lengths to find the longest *strong* motif.
The web-version of ChIPMunk always starts from the specified minimum motif length. The stand-alone version can
test motif lengths in both directions (minimum-to-maximum and vice versa).
Each sequence in the set has an associated float *weight* that determines its contribution to the final motif.
In the OOPS (one-occurrence-per-sequence) mode, the motif strength check is performed twice: once using a motif built from
weighted sequences and once using a motif rebuilt directly without sequence weights.
In the ZOOPS (zero-or-one-occurrence-per-sequence) mode, the motif strength check is performed only for motifs built from the unweighted word list.
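The difference between a motif built from weighted sequences and one rebuilt without weights can be illustrated with a minimal Python sketch. The function `count_matrix` and the example words and weights are hypothetical and are not part of ChIPMunk; the sketch only shows how the same aligned words yield two different count matrices.

```python
ALPHABET = "ACGT"


def count_matrix(words, weights=None):
    """Build a position count matrix from aligned words of equal length.

    If weights is None, every word contributes 1.0 (the unweighted motif).
    """
    if weights is None:
        weights = [1.0] * len(words)
    length = len(words[0])
    matrix = [{base: 0.0 for base in ALPHABET} for _ in range(length)]
    for word, weight in zip(words, weights):
        for pos, base in enumerate(word):
            matrix[pos][base] += weight
    return matrix


# Hypothetical motif occurrences, one per sequence, with per-sequence weights.
words = ["ACGT", "ACGA", "TCGT"]
weights = [2.0, 1.0, 0.5]

weighted = count_matrix(words, weights)   # motif built from weighted sequences
unweighted = count_matrix(words)          # motif rebuilt without weights
```

In this sketch a strength check would be run on both `weighted` and `unweighted` in OOPS mode, and on `unweighted` only in ZOOPS mode.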

### Parameters

- *data model*:
  - *"simple"*: for sequences of comparable quality (for example, footprinting data);
  - *"weighted"*: for sequences with known quality, so each sequence receives a weight specified in the fasta header (usable for SELEX or generic ChIP-derived data);
  - *"peak"*: for sequences with known positional weighting, so each position in each sequence receives a weight specified in the fasta header (usable for ChIP-Seq peaks containing information about base coverage).
- *occurrence-per-sequence*: zero-or-one-occurrence-per-sequence (ZOOPS). ZOOPS is based on a position weight matrix (PWM) self-consistency check and can be either more strict (strong motifs closer to consensus) or more flexible, depending on pseudocounts.
- *background GC percent*: predefined background GC content.
- *motif shape prior*: flat (no prior, default), single box, or double box motif shapes. Note that the motif shape prior won't affect very strong motifs or well-consistent data sets.
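As an illustration of the *background GC percent* parameter, the sketch below converts a GC percentage into per-nucleotide background probabilities. It assumes a strand-symmetric background (A and T equally likely, C and G equally likely); the function name `background_from_gc` is hypothetical, and how ChIPMunk uses the value internally is not described here.

```python
def background_from_gc(gc_percent):
    """Convert a GC percentage (0..100) into background nucleotide probabilities.

    Assumption: P(C) == P(G) and P(A) == P(T).
    """
    if not 0.0 <= gc_percent <= 100.0:
        raise ValueError("GC percent must be between 0 and 100")
    gc = gc_percent / 100.0
    at = 1.0 - gc
    return {"A": at / 2, "C": gc / 2, "G": gc / 2, "T": at / 2}


# A GC content of 50% corresponds to the uniform background.
uniform = background_from_gc(50)
```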

### Advanced parameters

- *try-limit*: number of random matrices to generate. 100 is suitable for good sets. May be extended up to 200 (unlimited in the stand-alone version) for a more precise motif search.
- *step-limit*: number of double-optimization steps for each turn of the random matrix generation. The default value is 10. May be increased to 20 (unlimited in the stand-alone version) for a more precise motif search.
- *iteration-limit*: maximum number of iterations on the random subset at each double-optimization step. Should be one (default) or two (rare cases). Larger values will slow down the convergence of the double-step optimization.

## Definition of the "strong" motif

ChIPMunk searches for the motif with the highest Kullback Discrete Information Content (KDIC).
We call a motif *strong* if it has high-conservation alignment columns somewhere to the left
and somewhere to the right of every no-conservation column. So we allow non-conserved regions (i.e. fixed-length gaps)
only if they are surrounded by columns with "high" conservation. In particular, this means that in a strong motif the first and the last
alignment columns have at least "low" conservation.
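The rule above can be sketched in Python, assuming each alignment column has already been labelled `"high"`, `"low"`, or `"none"`. The function `is_strong` and the label strings are hypothetical names used only for illustration; this is a literal reading of the definition, not ChIPMunk's own code.

```python
def is_strong(levels):
    """Return True if every no-conservation column has a high-conservation
    column somewhere to its left and somewhere to its right.

    levels: list of "high", "low", or "none", one entry per alignment column.
    """
    high = [i for i, level in enumerate(levels) if level == "high"]
    for i, level in enumerate(levels):
        if level != "none":
            continue
        # A gap column must lie strictly between the first and last high column.
        if not high or not (high[0] < i < high[-1]):
            return False
    return True


print(is_strong(["low", "high", "none", "none", "high", "low"]))  # True
print(is_strong(["none", "high", "high", "low"]))                 # False
```

A motif with a no-conservation first or last column is always rejected by this check, which matches the remark that the outermost columns have at least "low" conservation.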

A column has "high" conservation if its KDIC is greater than or equal to the high conservation threshold $T_{hc}$.
A column has "low" conservation if its KDIC is between $T_{hc}$ and the low conservation threshold $T_{lc}$.
A column has no conservation if its KDIC is lower than $T_{lc}$.

$T_{hc}$ is defined as the discrete information content of a column where only 3 of 4 possible nucleotides are present, i.e. [N/3, N/3, N/3, 0].

The $T_{lc}$ threshold is defined as the discrete information content of a column where one pair of nucleotides is 2 times more frequent than the other pair, i.e. [2N/6, 2N/6, N/6, N/6]. N here is the total weight of the sequence set (the total number of sequences).
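The threshold logic can be sketched in Python under one loudly stated assumption: the exact KDIC formula is given in the supplementary text and is not reproduced here, so the sketch substitutes the Kullback-Leibler divergence of the column frequencies from a uniform background (in nats). The names `column_information`, `thresholds`, and `classify` are hypothetical. With this stand-in the thresholds do not depend on N, which may not hold for the real KDIC.

```python
import math


def column_information(counts, background=(0.25, 0.25, 0.25, 0.25)):
    """Stand-in for KDIC (assumption): KL divergence of column frequencies
    from the background, in nats. counts are (weighted) A, C, G, T counts."""
    total = sum(counts)
    info = 0.0
    for count, q in zip(counts, background):
        if count > 0:
            p = count / total
            info += p * math.log(p / q)
    return info


def thresholds(n=1.0):
    """Reference columns from the text: 3-of-4 nucleotides for the high
    threshold, a 2:1 pair ratio for the low threshold."""
    t_hc = column_information([n / 3, n / 3, n / 3, 0.0])
    t_lc = column_information([2 * n / 6, 2 * n / 6, n / 6, n / 6])
    return t_hc, t_lc


def classify(counts):
    """Label a column as "high", "low", or "none" conservation."""
    t_hc, t_lc = thresholds(sum(counts))
    info = column_information(counts)
    if info >= t_hc:
        return "high"
    if info >= t_lc:
        return "low"
    return "none"


t_hc, t_lc = thresholds()
print(classify([10, 0, 0, 0]), classify([4, 4, 1, 1]), classify([5, 5, 5, 5]))
```

Under this stand-in measure the 3-of-4 reference column carries more information than the 2:1 pair column, so the high threshold lies above the low one, consistent with the definitions above.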