Cross-mapping correction software
---------------------------------
Michiel de Hoon
RIKEN Omics Science Center
2010.04.22.

The Python script cmc.py implements the cross-mapping correction algorithm.
This Python script reads a SAM file as input, and writes out a SAM file with
the mapping weight, as calculated by the cross-mapping correction algorithm,
added to each line in the SAM file. For each short RNA, the weights for its
mapping locations sum to 1.

The cross-mapping correction uses both expression and alignment information to
calculate the mapping weights. For the former, each line in the SAM file
should contain an XC:i:<number> field, where <number> is the count of the
sequence (i.e., how often it appeared in the library). If this field is
missing, the count is assumed to be 1. For the alignment information, the
input SAM file should contain an MD tag to describe the alignment between the
short RNA and the genome. You can use samtools to generate these MD tags.

To demonstrate how the MD tags can be generated, assume that s05.sam contains
the mapping results of short RNAs against the human genome, stored in the
Fasta file hg18.fa. First create a samtools index:

samtools faidx hg18.fa

If s05.sam is your SAM file, you can use the following commands to add the MD
tag to s05.sam:

samtools import hg18.fa.fai s05.sam s05.bam
samtools sort s05.bam s05.sorted
samtools fillmd s05.sorted.bam hg18.fa > s05.md.sam


Then use the cross-mapping correction software to get the weights:

python cmc.py -i s05.md.sam -o s05.weighted.sam

Here, the -i argument specifies the name of the input SAM file, and the -o
argument specifies the name of the output file, which will also be in the SAM
format. An example s05.md.sam input script is included with this package.

The cmc.py script has two optional arguments:
        -n iterations      number of iterations to run (default: 1)
        -e ends            number of nucleotides at the ends of the alignment
                           for which a separate error rate is calculated
                           (default: 3)
The cross-mapping correction uses iteration to obtain the mapping weights. In
practice, we found that a single iteration is usually sufficient to obtain
reasonable weights. You can use the -n option if you want to run more
iterations. Upon each iteration, the cmc.py script will show the overall
relative change in the mapping weights, as well as the maximum change seen in
any of the mapping weights:

$ python cmc.py -i s05.md.sam -o s05.weighted.sam -n 10
Running iteration 1 of 10 max change: 0.744, relative change: 0.0112
Running iteration 2 of 10 max change: 0.172, relative change: 0.00338
Running iteration 3 of 10 max change: 0.065, relative change: 0.00147
Running iteration 4 of 10 max change: 0.036, relative change: 0.00074
Running iteration 5 of 10 max change: 0.020, relative change: 0.000409
Running iteration 6 of 10 max change: 0.015, relative change: 0.000243
Running iteration 7 of 10 max change: 0.011, relative change: 0.000151
Running iteration 8 of 10 max change: 0.008, relative change: 9.79e-05
Running iteration 9 of 10 max change: 0.006, relative change: 6.53e-05
Running iteration 10 of 10 max change: 0.005, relative change: 4.47e-05
Iteration finished; saving results to s05.weighted.sam

To make use of the alignment information, the cross-mapping correction script
calculates an error profile that specifies the average number of errors
(mismatches, insertions, or deletions) depending on the position along the
alignment. By default, the script calculates separate error rates for the
first and last three nucleotides, as more alignment errors tend to occur there,
in addition to the average error rate for all other positions. You can use the
-e option to change the default value 3; we found that in practice, the weights
do not depend strongly on the choice for this parameter.

This software is covered by the Python license (the same license as Python
itself).
