Hi all,
I have a BAM/SAM file in which the molecular barcode (MBC), i.e. the unique molecular identifier (UMI), is stored in the RX tag (after preprocessing with AGeNT Trimmer). Here is an example record, where RX:Z: is the tag containing the UMI TCA-TTA.
A00620:188:HVMYMDSX2:4:2213:4255:9392 147 chr1 10000 0 96M = 10005 -91 ATAACCCTAACCCTAACACTAACACTAACCCTAACCCTATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC ,;-<--9:=<9-8:,;9+.:86.,9;><,,@;8<A-@;7(--A<?+-9/28>--A<?6-AA<?6A8.;><@@@;=;@?@:=;?>>9=:>==:=;>9 ZA:Z:TCACT ZB:Z:TTAGT BC:Z:GTCTCTTC+GAGGCTTC MC:Z:96M MD:Z:17C5C15A56 RG:Z:@A00620:188:HVMYMDSX2:4:1101:6470:1016 MI:Z:TCATTA NM:i:3 AS:i:81 XS:i:80 QX:Z:FFF FFF RX:Z:TCA-TTA
I am trying to append the UMI to the read name (a requirement for a tool I am trying to implement). This is what I have tried:
samtools view -H tmp.sam > tmp_header.sam
samtools view tmp.sam | awk '{OFS = "\t"} {for(i=1;i<=NF;i++) if ($i ~ /^RX:Z:/) {$1=$1"_"$i; gsub("RX:Z:","",$1); print}}' >> tmp_header.sam
(see also this Bioinformatics Stack Exchange answer)
The awk step itself runs, but samtools can no longer parse the resulting file and reports:
[E::aux_parse] unrecognized type '\t'
[W::sam_read1_sam] Parse error at line 101
This is the line that cannot be parsed when opened with vim:
A00620:188:HVMYMDSX2:4:2213:4255:9392_TCA-TTA 147 chr1 10000 0 96M = 10005 -91 ATAACCCTAACCCTAACACTAACACTAACCCTAACCCTATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC ,;-<--9:=<9-8:,;9+.:86.,9;><,,@;8<A-@;7(--A<?+-9/28>--A<?6-AA<?6A8.;><@@@;=;@?@:=;?>>9=:>==:=;>9 ZA:Z:TCACT ZB:Z:TTAGT BC:Z:GTCTCTTC+GAGGCTTC MC:Z:96M MD:Z:17C5C15A56 RG:Z:@A00620:188:HVMYMDSX2:4:1101:6470:1016 MI:Z:TCATTA NM:i:3 AS:i:81 XS:i:80 QX:Z:FFF FFF RX:Z:TCA-TTA
This looks very close to what I am trying to do: the UMI has been appended to the read name. However, I cannot figure out what causes the parsing error. Any help would be appreciated.
Best wishes
Christian
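
A possible cause, offered as a hedged suggestion rather than a confirmed diagnosis: by default awk splits fields on any whitespace, not only on tabs. The record above contains QX:Z:FFF FFF, whose value includes a space, so awk treats the second FFF as a separate field. As soon as $1 is modified, awk rebuilds the line using OFS, and that space becomes a tab. samtools then sees a stray field "FFF" where it expects TAG:TYPE:VALUE, which would be consistent with the message unrecognized type '\t'. The sketch below sets both FS and OFS to a tab so that spaces inside tag values are left alone. The function name append_umi is hypothetical, and unlike the original command it passes header lines and records without an RX tag through unchanged.

```shell
# Hypothetical helper (name is an assumption, not part of samtools):
# append the RX value to the read name, splitting on tabs only.
append_umi() {
  awk 'BEGIN { FS = OFS = "\t" }      # split and rejoin on tabs only
       /^@/ { print; next }           # pass header lines through unchanged
       {
         # optional tags start at column 12 in a SAM record
         for (i = 12; i <= NF; i++) {
           if ($i ~ /^RX:Z:/) {
             $1 = $1 "_" substr($i, 6)   # drop the 5-character "RX:Z:" prefix
             break
           }
         }
         print
       }'
}

# Toy record (made up for illustration) with a space inside the QX value
printf 'read1\t147\tchr1\t10000\t0\t4M\t=\t10005\t-91\tACGT\tFFFF\tQX:Z:FFF FFF\tRX:Z:TCA-TTA\n' | append_umi
```

On real data this could be used as `samtools view -h tmp.sam | append_umi | samtools view -b -o tmp_renamed.bam -`, where tmp_renamed.bam is a placeholder output name. Streaming the header with -h also avoids writing the header and the records to the file in two separate steps.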