Question

Filtering reads aligned to specific chromosome

0

Entering edit mode

8.7 years ago

signauncreal • 0

Hello,

First of all: Im quite new to this field, so please apologize any unexact formulations or missing informations.

i want to solve the following task: In order to save time, i want to do alignments not to the whole genome, but only to chromsome 21.

My strategy so far consisted of the following steps: Align NA12878 reads to whole genome. Filter .bam file for reads where at least one read-pair mapps to chr 21 (done using awk). Filter .bam file for unmapped reads (using samtools view -f 12). Merge .bam files and to bedtools bamtofastq conversion. The problem that comes here now (and is also described online) is that bamtofastq only converts those reads to fastq which have the read-pair next to it in the .bam file.

Note here: I specifically need all reads to be retained for technical reasons later.

I tried to solve the problems by taking the reads which lack their respective read pair from the original whole genome alignment file (using picard filtersamreads includeReadList) and adding them onto the file described above. This approach reduced the number of reads lacking its read pair, but only by about one third, meaning i still loose a lot of reads.

My questions are: Shouldnt my approach solve the problem? Why are there still reads which lack their mate? Also, does anybody have an idea what i can do? is there a bamtofastq transformation which does not require every read to be paired?

Im really grateful for any input, thanks.

snp bam sam bamtofastq bedtools • 4.0k views

ADD COMMENT • link updated 4.3 years ago by Biostar 20 • written 8.7 years ago by signauncreal • 0

score 0 · Answer 1 · 2016-03-09

0

Entering edit mode

8.7 years ago

Amitm ★ 2.3k

hi, its probably due to chimeric alignments. If you do samtools flagstat on your BAM, you would see that the last two line sate that some number of reads/ mates have chimeric algn (specificially "mate mapped to a different chr"). This is a strategy I use -

samtools view -h -F 12 my.bam

(are you sure you meant -f 12. Small caps f is required flag, which means unmapped reads would be retained. Capital F is filtering flag. So you need F) After that step, I parse with a perl script which is basically -

        @split_samline = split("\t",$samline);
        if($split_samline[6] eq '=')
        {
                print $samline,"\n";
                #$n++;
        }else{}

Col. 7 of samfile is RNEXT meaning the chr of the mate. In case of algn. where the mate is NOT on a different chr, the RNEXT value is "="

After that, I am left with bam/sam that has ONLY those reads which are algn AND their mate algnd too, and both algn on the same chr.

ADD COMMENT • link 8.7 years ago by Amitm ★ 2.3k

0

Entering edit mode

Thank you for your quick answer:

First of all, isn't it true that what you described can be solved simply by something like this (which is what i did):

samtools view input.bam | awk '{if ($3==YourChromosomeNumber && $7== "=") print $0} > outfile.sam Correct me if im wrong or i am missing your point.

Furthermore, the problem is that i do NOT want to filter reads out. I want my fastq files to have all the the reads, especially those which are chimeric.

ADD REPLY • link 8.7 years ago by signauncreal • 0

0

Entering edit mode

If i have understood correctly, of reads aligned to whole genome, you want those that aligned to chr21 only. And after this you want to get those specific reads in fastq out. One quick check is that you are using bedtools bamtofastq. It requires query name sorted BAM and if you have a coordinate sorted BAM, there would be problem. Probably you are doing alright. Just query-name sort the BAM file first.

ADD REPLY • link 8.7 years ago by Amitm ★ 2.3k

0

Entering edit mode

Im really sorry if im not expressing myself accurately. It is true that i want those file that align to chr21. Those i obtain with the the commands stated above. Also i want all unmapped reads and all split reads that map to chr21 (and somewhere else). The problem is not the extraction of those reads from the bam file, but the conversion of those reads into fastq. I know about the namesorting issue and i have done that.

Essentially what i need is a bamtofastq transformation which does not require a read to have its read-pair adjacent to it in the bam file (as is the case for every split read)

ADD REPLY • link 8.7 years ago by signauncreal • 0

0

Entering edit mode

Picard SamtoFastq maybe? It doesn't require name sorted, but coordinate-sorted data. See that helps

ADD REPLY • link 8.7 years ago by Amitm ★ 2.3k

score 0 · Answer 2 · 2016-03-09

0

Entering edit mode

8.7 years ago

Chris Miller 22k

Most tools don't handle this well, as you're finding out. What if you name-sort the whole bam, then just use a little perl script (or python, et al) to look through the SAM file and output:

a) your header and b) any pair of reads where at least one of them maps to chr21 (third column).

Then you're guaranteed to have both ends, and can move on to generating a fastq from your new bam in whatever way is most convenient.

ADD COMMENT • link 8.7 years ago by Chris Miller 22k

0

Entering edit mode

You are right, which is why i am trying to do what i described above. I do not want to map reads of the whole genome on chromosome 21.

ADD REPLY • link 8.7 years ago by signauncreal • 0

0

Entering edit mode

Ah, I see. Apologies for misunderstanding. Editing my above response with some ideas

ADD REPLY • link 8.7 years ago by Chris Miller 22k