Good afternoon to everybody. I am working on paired ended whole genome seq data. Specifically I am focusing on structural variant calling and I am at the moment testing different callers. One of the tools that I am using is Speedseq/Lumpy and, in order to use it, I aligned my fastq files with Speedseq align (which is BWA-based) and generated the three BAM files necessary as input: - a main BAM file - a BAM file containing splitted reads - a BAM file containing discordant reads All the other callers need a single BAM file as input. So, my strategy was simply to merge into one single BAM file all the three BAM files generated by speedseq. However I found a bit weird the content of the splitted and discordant BAM files, and maybe I am misunderstanding the meaning of splitted and discordant reads. Here is an example for one read. These are the reads as they come from the fastq files:
R1 fastq file:
@H0E3UALXX:2:1201:2151855:0 1
GTTAGGGTTAGGGTTAGGGTTAGGGTTAGAGATCGGAAGAGCGTCGTGGAGGGAAAGAGTGTTCAGAGCCGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAGGGGGAGTGTGAGGGGGAGGGATCGAGGTAGGGTGCGG
AAFFFFJJJJFJJJJJJFJFJJJJJJJFF<JAF<FJ-AJJJ-JJJA-J-F<JJ-<<-JJ-JJJ<-J-7<FF<JFJ<A--FAFFFJ<FAJ--J-F--AJAJ7A-F-JA-A--7J7---A7-<-----7----F----7-7---77A-----
and R2 fastq file:
@H0E3UALXX:2:1201:2151855:0 2
CTAACCCTAACCCTAACCCTAACCCTAACAGAACGGAAGAGCACAAGTCTGAACTCCAGTCACTCCGGAGAACCTCGTATGCCGTCTTCTGCTTGAAAAAAAATAACCCTAACCCTAACCCTAACCCCAACACTAACACTAACCCTAACC
FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ-7----A<A-J-J<-J-7AJJAJ-JF<-J--JJ--JJAFJ7------AA-J<-F--A-AJ<-<-7AJ-AJJ----J--7777F7-----A<----7-7--A77F-<-A-----<-------
These are the reads aligned in the main BAM file:
H0E3UALXX:2:1201:2151855:0 129 1 10000 0 102S48M 12 95711 0 CTAACCCTAACCCTAACCCTAACCCTAACAGAACGGAAGAGCACAAGTCTGAACTCCAGTCACTCCGGAGAACCTCGTATGCCGTCTTCTGCTTGAAAAAAAATAACCCTAACCCTAACCCTAACCCCAACACTAACACTAACCCTAACC FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ-7----A<A-J-J<-J-7AJJAJ-JF<-J--JJ--JJAFJ7------AA-J<-F--A-AJ<-<-7AJ-AJJ----J--7777F7-----A<----7-7--A77F-<-A-----<------- NM:i:3 MD:Z:25T3C5C12 AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119S MQ:i:0
H0E3UALXX:2:1201:2151855:0 2177 7 10055 0 30M120H 12 95711 0 CTAACCCTAACCCTAACCCTAACCCTAACA FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ- NM:i:0 MD:Z:30 AS:i:30 XS:i:30 RG:Z:id SA:Z:1,10000,+,102S48M,0,3; MC:Z:31M119S MQ:i:0
H0E3UALXX:2:1201:2151855:0 65 12 95711 0 31M119S 1 10000 0 GTTAGGGTTAGGGTTAGGGTTAGGGTTAGAGATCGGAAGAGCGTCGTGGAGGGAAAGAGTGTTCAGAGCCGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAGGGGGAGTGTGAGGGGGAGGGATCGAGGTAGGGTGCGG AAFFFFJJJJFJJJJJJFJFJJJJJJJFF<JAF<FJ-AJJJ-JJJA-J-F<JJ-<<-JJ-JJJ<-J-7<FF<JFJ<A--FAFFFJ<FAJ--J-F--AJAJ7A-F-JA-A--7J7---A7-<-----7----F----7-7---77A----- NM:i:0 MD:Z:31 AS:i:31 XS:i:30 RG:Z:id MC:Z:102S48M MQ:i:0
These are the reads aligned in the splitted BAM file:
H0E3UALXX:2:1201:2151855:0_2 129 1 10000 0 102S48M 12 95711 0 * * NM:i:3 MD:Z:25T3C5C12 AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119SMQ:i:0
H0E3UALXX:2:1201:2151855:0_2 2177 7 10055 0 30M120H 12 95711 0 * * NM:i:0 MD:Z:30 AS:i:30 XS:i:30 RG:Z:id SA:Z:1,10000,+,102S48M,0,3; MC:Z:31M119S MQ:i:0
These are the reads aligned in the discordant BAM file:
H0E3UALXX:2:1201:2151855:0 129 1 10000 0 102S48M 12 95711 0 * * NM:i:3 MD:Z:25T3C5C12 AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119SMQ:i:0
H0E3UALXX:2:1201:2151855:0 65 12 95711 0 31M119S 1 10000 0 * * NM:i:0 MD:Z:31 AS:i:31 XS:i:30 RG:Z:id MC:Z:102S48M MQ:i:0
Within the splitted and discordant BAM files "all" the reads (I didn't check if really all of them) have an "*" in the seq and quality score fields, which should be present in secondary alignments, from what I know (in order to remove redundant information). However these alignments between the main BAM file and the discordant and splitted ones look exactly the same. Does this mean that all the alignments present in the splitted and discordant BAM files are anyway stored in the main BAM file? If so, why if I run a SV caller (not lumpy) in the main BAM file or after merging the three BAM files, the output is different?
Thanks a lot for any suggestion and sorry for the long post.
DO NOT SHOUT, please.
how about your previous question: PLINK RECODING ISSUE , did that work ? add a comment or mark the post as answered.
These people vanish into thin air
Are these three bam files generated by Speedseq itself, or are you processing them after the program generates them?