Question

Output of samtools view, what does the third column actually represent?

1

Entering edit mode

3.4 years ago

DNAngel ▴ 250

I am realigning my fastq reads against contigs generated from metaspades to see which reads mapped to which contig.

After running the following code to get mapped reads:

samtools view -h -F 4 in.bam > mapped.bam

I am confused by the file output and my next step. Here is a shortened example of the mapped.bam file (please note I cut off the end of the file just to make things neat as I am only interested in understanding the first few columns):

A00977:183:HLLKYDSXY:3:1503:16658:5071  73      NODE_5156_length_78_cov_28      1       60      78M27S  =       1       0
A00977:183:HLLKYDSXY:3:2178:28248:9142  369     NODE_5159_length_78_cov_5       31      0       48M98H  NODE_691_length_1085_cov_115.969        17      0

So the first column is clearly the read name, followed by read length (?), then the reference sequence name. I am not sure what columns 4, 5, 6 represent. I also don't understand why in the 7th column, some have the NODE contig name appear but others do not - what exactly does that mean?

If my end goal is to determine which contig a read mapped to, I should be looking at the 3rd column or 7th column? I notice that when I grep a read name, sometimes the read appears multiple times leading me to believe a read has mapped to multiple contigs at a time (fine okay), but which column is giving me the actual contig name that it mapped to?

I just want to get this type of info to determine read depth for the contigs.

samtools • 4.3k views

ADD COMMENT • link updated 3.4 years ago by i.sudbery 21k • written 3.4 years ago by DNAngel ▴ 250

2

Entering edit mode

Hi @DNAngel. I see below that you stated that you have tried to read the SAM format specification. You'll probably find in future that you get more positive responses if you make reference to this in your post, preferably stating which bits of the document you had trouble following and where precisely you need clarification.

ADD REPLY • link 3.4 years ago by i.sudbery 21k

0

Entering edit mode

Yes, learned my lesson. I'll avoid that in the future.

ADD REPLY • link 3.4 years ago by DNAngel ▴ 250

1

Entering edit mode

3.4 years ago

swbarnes2 14k

1) Google "sam format"

2) read the documentation

Much faster than posting and waiting hours for someone to spoonfeed you what you want to know.

ADD COMMENT • link 3.4 years ago by swbarnes2 14k

1

Entering edit mode

I'm not sure an RTM without further clarification is of much help to anyone. Clearly the answer to this question is in the documentaiton if you know how to read it,and finding and understanding the documentation is going to be more useful in the long run than just getting answers here. And in general the more effort an OP has put into a question, the more likely it is to be answered. But I'm not convinced the answer here actaully increases the chances of the OP reading the manual (if they havn't already, see their comment below), or provides any pointers for someone else coming with the same question in the future.

ADD REPLY • link 3.4 years ago by i.sudbery 21k

0

Entering edit mode

I wish I could downvote unhelpful comments. I'd downvote yours into oblivion.

I did read the documentation but the language isn't exactly perfectly clear to me, so I posted here. I just don't know if what is mapped is the third column or 7th column as both give reference sequence names. All the documentation says is that column 7 is the ref name of the next read in the template - I don't understand what that means. If the third column is just the reference name - which column is showing which one the actual query sequence is mapped to?

If you don't want to help, then don't post such useless comments just to make someone feel bad for asking in the first place.

ADD REPLY • link 3.4 years ago by DNAngel ▴ 250

1

Entering edit mode

Even Wikipedia has a definition of the format, col 3 is the reference name which first read is mapped. If the reads are paired-ends, col 7 is used to define the reference name where the second pair is mapped, if it is the same as col 3, it will be a "=", if the reads are single ends (only one read) or second pair is missing/unmapped, it will be a "*".

Ideally, you use col 3 as the reference where sequences are mapped.

ADD REPLY • link 3.4 years ago by JC 13k

1

Entering edit mode

Nothing you wrote gave the slightest indication that you'd read anything. Column 2 is not the read length, the documentation specifically says it's the binary flag, and goes into detail about what the flag means. Column 7 is what reference the mate mapped to, you'd see that if you just checked the entry for the mate.

I imagine all your read names turn up at least twice, once for each mate of a pair. (How do you know which row belongs to which mate? Read the documentation)

ADD REPLY • link 3.4 years ago by swbarnes2 14k

0

Entering edit mode

Okay I typo'd when I wrote length when I meant that for the other column that I cut out; yes that's my own fault for not being clear and checking my own description before posting. I did read the documentation though - but the language used just confuses me since I am not used to it. Took me a while to even learn the different tools used in mapping reads. But thank you for the clarification!

ADD REPLY • link 3.4 years ago by DNAngel ▴ 250

Ram · Accepted Answer · 2021-09-02

3

Entering edit mode

3.4 years ago

i.sudbery 21k

The samtools view outputs information from SAM and BAM files in SAM format. You can find a description of the SAM format here: http://samtools.github.io/hts-specs/SAMv1.pdf

Section 1.4 deals with the meaning of each of the mandatory columns. It includes the following table:

 Col  Field  Type    Regexp/Range                  Brief description
|---|------|-------|----------------------------|----------------------------------------|
1    QNAME  String  [!-?A-~]{1,254}               Query template NAME
2    FLAG   Int     [0, 216 − 1]                  bitwise FLAG
3    RNAME  String  \*|[:rname:∧*=][:rname:]*     Reference sequence NAME11
4    POS    Int     [0, 231 − 1]                  1-based leftmost mapping POSition
5    MAPQ   Int     [0, 28 − 1]                   MAPping Quality
6    CIGAR  String  \*|([0-9]+[MIDNSHPX=])+       CIGAR string
7    RNEXT  String  \*|=|[:rname:∧*=][:rname:]*   Reference name of the mate/next read
8    PNEXT  Int     [0, 231 − 1]                  Position of the mate/next read
9    TLEN   Int     [−231 + 1, 231 − 1]           observed Template LENgth
10   SEQ    String  \*|[A-Za-z=.]+                segment SEQuence
11   QUAL   String  [!-~]+                        ASCII of Phred-scaled base QUALity+33

Column 12 contains a space separated list of optional informational tags about the read.

We can see that the first column is, as you have guessed, the name of the read (or query name). The second column is the FLAG - this is a bitwise flag that encode information about the status of the alignment. Things like is it a successful alignment, is its pair mapped, is it a read1 or a read2?

Finally the third column is the Reference sequence (i.e. the name of the contig the read is aligned to). This is the column you are interested in if you want to know which contig a read is aligned to. And you are correct that reads can align to more than one contig (depending on the configuration of the aligner).

The 7th column, which you note sometimes does and sometimes doesn't contain contig information gives us information about the contig to which the mate of this read aligns. It only contains a contig name if the mate aligns to a different contig. If it aligns to the same contig, this columns will contain =. If the mate is unaligned, it will contain *.

ADD COMMENT • link updated 3.4 years ago by Ram 44k • written 3.4 years ago by i.sudbery 21k

0

Entering edit mode

Ah thanks so much! So I can probably now just work on using simple commands like grep/awk to get the counts for each of the contigs in column 3 and use that as my read depth for each one.

ADD REPLY • link 3.4 years ago by DNAngel ▴ 250

0

Entering edit mode

samtools idxstats should give you the number of reads that map to each contig.

ADD REPLY • link 3.4 years ago by i.sudbery 21k

0

Entering edit mode

I should mention - in case others read this - that when checking if a read maps multiple times with a simple grep, it looks like it does not exactly mean that it mapped to DIFFERENT contigs. I checked one read and saw it showed up twice in the mapped file, but based on column 3 it gave the same contig name. I checked a few more and turns out the minimum is 2, so this probably represents the pairs mapping to the same contig - although maybe someone can confirm this.

ADD REPLY • link 3.4 years ago by DNAngel ▴ 250

1

Entering edit mode

It could mean either. As the two reads in a pair have the same name, counting with grep with count both read1 and read2 as the same read.

However, it is also possible for a single read to map to multiple locations within the same contig.

ADD REPLY • link 3.4 years ago by i.sudbery 21k

0

Entering edit mode

Interesting - I wonder if I can just treat the R1 and R2 reads separately (like single-end files) to see how they map individually to the contigs. This would eliminate the confusion if I am grepping a read that mapped twice to something, or I am simply grepping read pairs. idxstats seems easier but it also states that it "...may count reads multiple times if they are mapped more than once or in multiple fragments." But doesn't clarify if it can parse out R1 and R2 reads. But reading more on this...I think samtools flags/the flag column will help with this. Anyways just food for thought!

ADD REPLY • link 3.4 years ago by DNAngel ▴ 250

0

Entering edit mode

Yes, you can use the flag together with the -f and -F swtiches to achieve this. You might find this site useful: https://broadinstitute.github.io/picard/explain-flags.html

ADD REPLY • link 3.4 years ago by i.sudbery 21k