Question

Unmapped Reads And Sequence Name

1

Entering edit mode

13.1 years ago

Stefano Berri 4.4k

Hi. I have two questions that might or might not be related.

I want to filter all and only unmapped reads from a pair end file.

I type

samtools view -f 4 myFile.bam

According to samtools, -f 4 should 'only output alignments' where the query itself is unmapped.

However the beginning of the produced output is like

HWI-ST300_0110:7:21:1528:160264#0       117     chr10   67604   0       *       =       67604   0       <SEQ>    <QUAL>
HWI-ST300_0110:7:7:10624:6076#0 117     chr10   78098   0       *       =       78098   0       <SEQ>    <QUAL>
HWI-ST300_0110:7:5:15368:185558#0       69      chr10   78778   0       *       =       78778   0       <SEQ>    <QUAL>

These are not unmapped. The mate is, not the query, right? Later in the output there are also the unmapped reads. With flag -f 12, I correctly get those where both query and mate pair are unmapped. Why is this happening?

The second question is about sequence name. The fastq files I was provided with, have the name like:

@HWI-ST300:130:B08M9ABXX:2:1101:1137:1993 2:Y:0:

with the space where it is shown.

That means I have both pairs that have the very same name in the resulting bam files (HWI-ST300:130:B08M9ABXX:2:1101:1170:1992). Can this be an issue? What are the specifications about sequence names in fastq format? According to this paper spaces don't seem to be an issue, but yet bwa trim after the space.

Thanks

fastq bwa samtools bam next-gen sequencing • 8.4k views

ADD COMMENT • link updated 13.1 years ago by Swbarnes2 ★ 1.6k • written 13.1 years ago by Stefano Berri 4.4k

score 2 · Answer 1 · 2011-11-14

2

Entering edit mode

13.1 years ago

Istvan Albert 102k

The easiest way to see what a flag means is to visit the Explain SAM flags page and verify what your flags (in this case 117 and 69) stand for. For example:

117: read paired, read unmapped, read on reverse strand, mate on reverse strand, first in pair

So the output is correct, what it probably means is that the read and its mate are both on the reverse strand therefore the mapping of this read (and its mate) are questionable, thus it got the unmapped designation.

Question 2: while in principle the space should not matter by the standard,in general these should be avoided since some tools make assumptions (plus the spaces can be visually misleading). I would replace the spaces with an underscore or minus

ADD COMMENT • link 13.1 years ago by Istvan Albert 102k

0

Entering edit mode

Thanks! Very useful!. But so, they are "officially" unmapped, despite having a mapping field. Confusing. 69 means: read paired, read unmapped, first in pair. Even more confusing. Anyway, I guess I will filter the others with -F 4 to make sure I don't have the same sequence in two different files after splitting.

ADD REPLY • link 13.1 years ago by Stefano Berri 4.4k

0

Entering edit mode

I think there is no standard designation for what 'unmapped' should mean - I usually take it as in indication that the measurement is unreliable.

ADD REPLY • link 13.1 years ago by Istvan Albert 102k

score 2 · Answer 2 · 2011-11-14

You might want to split posts to Biostar if they contain different, not related questions. That helps to better find relevant Q & A after some time.

Anyway, regarding question number 2: in their latest pipeline, Illumina changed the naming scheme of their reads by leaving out the pair-identifier from the name component and pushing it to the 'comment' component.

This change is most unfortunate as it breaks all tools I know which use the naming of a read to determine whether it is in a pair or not. Furthermore, it makes searching for reads by name impossible as in paired-end/mate-pair sequencing you'll get two reads as answer. Oh, the fun. Illumina devs should be tarred and feathered for this.

If you want to avoid problems: rename your reads to follow "usual" naming styles with a postfix by pulling the pair identifier to the read name and concatenating it with a slash. I.e.: change

@HWI-ST300:130:B08M9ABXX:2:1101:1137:1993 1:Y:0: @HWI-ST300:130:B08M9ABXX:2:1101:1137:1993 2:Y:0:

to

@HWI-ST300:130:B08M9ABXX:2:1101:1137:1993/1 1:Y:0: @HWI-ST300:130:B08M9ABXX:2:1101:1137:1993/2 2:Y:0:

score 2 · Answer 3 · 2011-11-14

"For a unmapped paired-end or mate-pair read whose mate is mapped, the unmapped read should have RNAME and POS identical to its mate."

http://samtools.sourceforge.net/SAM1.pdf

That's why unmapped reads can have coordinates. If the 4 is flagged, it's unmapped, regardless of what positions and reference names, and CIGAR strings suggest.

Two reads with the same name are totally fine. You will be able to tell which came from which fastq by the flag (reads from the first fastq have 64 flagged, reads from the second fastq have 128 flagged). In fact, for some functions, the reads have to be the same. I had a bunch of reads with :a and :b to indicate which read they were, and Picard's MarkDuplicates refused to flag any of them as duplicates, because it wanted pairs to have the exact same name.