How to extract certain tag value by key from sam/bam file?
3
0
Entering edit mode
6.3 years ago
yech1990 ▴ 40

One line of the input file:

m54071222/4194368/0_197      4       *       0       255     *       *       0       0       AAGAGGAAGGGGGAGAGAGAGGAGGAGAGGGGGGAAGAGGTTGGGATGGAAAATAGGTGGTTAGAGGGAGAAAGG   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!         np:i:1   qe:i:197        qs:i:0  rq:f:0  BS:Z:ATGGCCAATTGCAGAA    BQ:Z:JJJKKKKKKKKLLLLL  zm:i:4194368    RG:Z:1f1bf15c   sc:A:L  sz:A:N

I want to extract certain tag (BS and BQ in this example) in each read of sam file and pass it to a new file.

I can do that with pysam. But is it possible to do this job with one line linux command? Tool that accept stdin and stdout operating will be better.

There is build in method getTag in bamtools, however, it is only for filtering reads.


The expect output:

@m54071222/4194368/0_197
ATGGCCAATTGCAGAA 
+
JJJKKKKKKKKLLLLL
sam awk samtools linux bash • 11k views
ADD COMMENT
2
Entering edit mode
6.3 years ago
yech1990 ▴ 40

An awk solution:

samtools view test.bam | awk '{for (i=12; i<=NF; ++i) { if ($i ~ "^BS:|^BQ:"){ split($i, tc, ":"); td[tc[1]] = tc[3]; } }; print "@"$1"\n"td["BS"]"\n+\n"td["BQ"] }'

edited: more robust way:

awk '{for (i=12; i<=NF; ++i) { if ($i ~ "^US:Z:|^UQ:Z:"){ td[substr($i,1,2)] = substr($i,6,length($i)-5); } }; print "@"$1"\n"td["US"]"\n+\n"td["UQ"] }'
ADD COMMENT
2
Entering edit mode
6.3 years ago

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e \
 'stream().forEach(R->println("@"+R.getReadName()+"\n"+R.getAttribute("BS")+"\n+\n"+R.getAttribute("BQ")));' in.bam
ADD COMMENT
1
Entering edit mode

Wow, that is a powerful tool. Although it looks like a one-line python command that wrap pysam...

ADD REPLY
1
Entering edit mode

that wrap pysam...

it wraps htsjdk

ADD REPLY
0
Entering edit mode

still not perfect... If the tag is not exist, the output will but "null".

ADD REPLY
1
Entering edit mode
6.3 years ago

From each line, a fastq record is created and each fastq record is separated by double line. You can remove "\n\n" if you don't want it.

output:

$ awk -v OFS="\n" -v ORS="\n\n" '{gsub("[BQ:Z|BQ:S]","",$0); {print "@"$1,"+",$16,$17}}' test.sam 

@m54071222/4194368/0_197
+
ATGGCCAATTGCAGAA
JJJKKKKKKKKLLLLL

@m54071222/4194368/0_197
+
ATGGCCAATTGCAGAA
JJJKKKKKKKKLLLLL

Input (for testing, I duplicated the lines):

$ cat test.sam 

m54071222/4194368/0_197 4   *   0   255 *   *   0   0   AAGAGGAAGGGGGAGAGAGAGGAGGAGAGGGGGGAAGAGGTTGGGATGGAAAATAGGTGGTTAGAGGGAGAAAGG !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!   np:i:1  qe:i:197    qs:i:0  rq:f:0  BS:Z:ATGGCCAATTGCAGAA   BQ:Z:JJJKKKKKKKKLLLLL   zm:i:4194368    RG:Z:1f1bf15c   sc:A:L  sz:A:N
m54071222/4194368/0_197 4   *   0   255 *   *   0   0   AAGAGGAAGGGGGAGAGAGAGGAGGAGAGGGGGGAAGAGGTTGGGATGGAAAATAGGTGGTTAGAGGGAGAAAGG !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!   np:i:1  qe:i:197    qs:i:0  rq:f:0  BS:Z:ATGGCCAATTGCAGAA   BQ:Z:JJJKKKKKKKKLLLLL   zm:i:4194368    RG:Z:1f1bf15c   sc:A:L  sz:A:N

However I suggest you to move + to 2nd line and sequence to 3rd line so that the output in standard (almost) fastq format. This would help you in running fastq stats on output. For eg.

$ awk -v OFS="\n" -v ORS="\n\n" '{gsub("[BQ:Z|BQ:S]","",$0); {print "@"$1,$16,"+",$17}}' test.sam | seqkit stats

file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTQ   DNA          2       32       16       16       16

$ awk -v OFS="\n" -v ORS="\n\n" '{gsub("[BQ:Z|BQ:S]","",$0); {print "@"$1,$16,"+",$17}}' test.sam

@m54071222/4194368/0_197
ATGGCCAATTGCAGAA
+
JJJKKKKKKKKLLLLL

@m54071222/4194368/0_197
ATGGCCAATTGCAGAA
+
JJJKKKKKKKKLLLLL
ADD COMMENT
0
Entering edit mode

thank you @cpad0112

,but this code will work only if the tags are in fix order.

ADD REPLY
0
Entering edit mode

okay. Assuming that ID is present always in first column and tags (BS:Z and BQ:Z) appear in random order and a little bit CPU intensive:

$ paste  <(awk '{print $1}' test.sam) <( grep -wo "\<BS\W\w\W\w\+\>" test.sam) <(grep -wo "\<BQ\W\w\W\w\+\>" test.sam) | sed 's/:/\t/g' | awk -v OFS="\n" -v ORS="\n\n" '{print "@"$1,$4,"+",$7}'

@m54071222/4194368/0_197
ATGGCCAATTGCAGAA
+
JJJKKKKKKKKLLLLL

@m54071222/4194368/0_197
ATGGCCAATTGCAGAA
+
JJJKKKKKKKKLLLLL
ADD REPLY

Login before adding your answer.

Traffic: 2310 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6