Hi,
For SAM files, if a query is aligned to multiple positions would I have multiple entries for the same query or would I get multiple alignment positions in the same entry for the query?
Thanks!
A SAM file will contain one line for each alignment. So if a read aligned to more than one position, it would show up multiple times in the SAM file.
BWA returns only one hit per read. In the SAM output, the XA field contains the alternative hits (format: (chr,pos,CIGAR,NM;)* )
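To make the XA format above concrete, here is a minimal Python sketch that splits an XA value into its (chr,pos,CIGAR,NM) entries. The function name `parse_xa` and the example line are made up for illustration, and the sketch assumes that the position carries a leading '+' or '-' for the strand, which is how I understand BWA writes it; check your own output before relying on that.

```python
def parse_xa(sam_line):
    """Return the alternative hits listed in the XA tag of one SAM line."""
    fields = sam_line.rstrip("\n").split("\t")
    hits = []
    # Optional tags start at column 12 (index 11).
    for tag in fields[11:]:
        if not tag.startswith("XA:Z:"):
            continue
        for entry in tag[len("XA:Z:"):].split(";"):
            if not entry:
                continue  # the trailing ';' leaves an empty last item
            # rsplit keeps reference names intact even if they contain commas
            chrom, pos, cigar, nm = entry.rsplit(",", 3)
            hits.append({
                "chrom": chrom,
                # assumption: a leading '-' marks the reverse strand
                "strand": "-" if pos.startswith("-") else "+",
                "pos": abs(int(pos)),
                "cigar": cigar,
                "nm": int(nm),
            })
    return hits


# Hypothetical SAM line, not taken from a real run.
example_line = "\t".join([
    "read1", "0", "chr1", "100", "0", "23M", "*", "0", "0",
    "TGAGGTAGTAGGTTGTATAGTTA", "*",
    "NM:i:0", "XA:Z:chr2,+500,23M,0;chr3,-900,23M,1;",
])

for hit in parse_xa(example_line):
    print(hit)
```

A read without an XA tag simply gives an empty list, so the same function can be run over every line of the file.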
I think it depends on your mapping tool and parameters. If your mapping tool selects one random position when a read maps to several positions in the reference genome, then you will have one entry (one row for that read) in the SAM file. But if your mapping tool reports all the different positions, then, as Pierre said, you will get the XA field (which contains the alternative hits) in the same entry (no extra row) in the SAM file.
Also note that if your read file contains two identical reads with different identifiers, they will be reported as two different entries (rows) in the SAM file. This case is different from the same read being mapped to multiple positions in the reference genome.
Just note that the XA field isn't a tag you can rely on universally, as it is specific to the aligner. Cf. section 1.5 of the SAM spec, where you'll see that any tag that starts with 'X', 'Y', or 'Z' is for "local use only" and "will not be formally defined in any future version of this specification."
Ah, that's funny, because my colleagues and I just calculated something similar. Have a look at the ratio a/b calculated on your SAM file:
a: samtools view [?] | sort | uniq | wc -l
b: samtools view [?] | cut -f 1 | wc -l
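A closely related number is how many alignment lines there are per distinct query name. Here is a rough Python sketch of that count, working on SAM text lines (for example the output of samtools view). It is not the exact a/b ratio above; the function name and the example lines are made up for illustration. Keep in mind that paired-end mates usually share one query name, so for paired data a value above 1 does not by itself mean multiple hits.

```python
def count_reads_and_alignments(sam_lines):
    """Return (number of distinct query names, number of alignment lines)."""
    names = set()
    total = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        total += 1
        names.add(line.split("\t", 1)[0])  # column 1 is the query name
    return len(names), total


# Hypothetical lines: read1 is reported twice, read2 once.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t10\t1\t5M\t*\t0\t0\tACGTA\t*",
    "read1\t256\tchr1\t90\t1\t5M\t*\t0\t0\tACGTA\t*",
    "read2\t0\tchr1\t40\t60\t5M\t*\t0\t0\tTTGCA\t*",
]

distinct, total = count_reads_and_alignments(example)
print(distinct, total, total / distinct)
```

If the ratio is exactly 1 on single-end data, your aligner is writing one line per read and any alternative hits would have to be in an optional tag.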
Some aligners return multiple hits on multiple lines. This is novoalign SAM output using default parameters, for example:
W39CP:1373:2520 256 hsa-let-7a-1 6 1 23M * 0 0 TGAGGTAGTAGGTTGTATAGTTA >>>>>>=======9;;>===>== PG:Z:novoalign AS:i:30 UQ:i:30 NM:i:1 MD:Z:22T0 CC:Z:hsa-let-7a-2 CP:i:5 ZS:Z:R ZN:i:3 NH:i:3 HI:i:2 IH:i:3
W39CP:1373:2520 256 hsa-let-7a-2 5 1 23M * 0 0 TGAGGTAGTAGGTTGTATAGTTA >>>>>>=======9;;>===>== PG:Z:novoalign AS:i:30 UQ:i:30 NM:i:1 MD:Z:22T0 ZS:Z:R ZN:i:3 NH:i:3 HI:i:3 IH:i:3
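For output like this, the lines of one read can be collected by query name, and the standard NH (number of reported alignments) and HI (hit index) tags can be read alongside the secondary-alignment flag bit (0x100, which is set in flag 256). The Python below is only a sketch: the function names are made up, and the two example lines are the ones above rebuilt with tabs and with most optional tags left out.

```python
from collections import defaultdict

SECONDARY = 0x100  # SAM flag bit for "secondary alignment"


def parse_tags(fields):
    """Turn optional fields such as 'NH:i:3' into a {tag: value} dict."""
    tags = {}
    for item in fields[11:]:
        name, typ, value = item.split(":", 2)
        tags[name] = int(value) if typ == "i" else value
    return tags


def group_alignments(sam_lines):
    """Group alignment lines by query name (column 1)."""
    groups = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        tags = parse_tags(fields)
        groups[fields[0]].append({
            "ref": fields[2],
            "pos": int(fields[3]),
            "secondary": bool(int(fields[1]) & SECONDARY),
            "NH": tags.get("NH"),
            "HI": tags.get("HI"),
        })
    return dict(groups)


seq = "TGAGGTAGTAGGTTGTATAGTTA"
qual = ">>>>>>=======9;;>===>=="
lines = [
    "\t".join(["W39CP:1373:2520", "256", "hsa-let-7a-1", "6", "1", "23M",
               "*", "0", "0", seq, qual, "NH:i:3", "HI:i:2"]),
    "\t".join(["W39CP:1373:2520", "256", "hsa-let-7a-2", "5", "1", "23M",
               "*", "0", "0", seq, qual, "NH:i:3", "HI:i:3"]),
]

for name, hits in group_alignments(lines).items():
    print(name, len(hits), [h["ref"] for h in hits])
```

Not every aligner writes NH or HI, so the sketch stores None when a tag is missing rather than failing.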
This is a great question with great answers. I have written up my learning notes on the SAM format, which might be helpful to others. Here they are:
http://onetipperday.blogspot.com/2012/07/deeply-understanding-sam-tags.html
(unless it doesn't)
I stand corrected. I hadn't seen multiple alignments being presented as an optional tag before. :)