Hello,
First of all: Im quite new to this field, so please apologize any unexact formulations or missing informations.
i want to solve the following task: In order to save time, i want to do alignments not to the whole genome, but only to chromsome 21.
My strategy so far consisted of the following steps: Align NA12878 reads to whole genome. Filter .bam file for reads where at least one read-pair mapps to chr 21 (done using awk). Filter .bam file for unmapped reads (using samtools view -f 12). Merge .bam files and to bedtools bamtofastq conversion. The problem that comes here now (and is also described online) is that bamtofastq only converts those reads to fastq which have the read-pair next to it in the .bam file.
Note here: I specifically need all reads to be retained for technical reasons later.
I tried to solve the problems by taking the reads which lack their respective read pair from the original whole genome alignment file (using picard filtersamreads includeReadList) and adding them onto the file described above. This approach reduced the number of reads lacking its read pair, but only by about one third, meaning i still loose a lot of reads.
My questions are: Shouldnt my approach solve the problem? Why are there still reads which lack their mate? Also, does anybody have an idea what i can do? is there a bamtofastq transformation which does not require every read to be paired?
Im really grateful for any input, thanks.
Thank you for your quick answer:
First of all, isn't it true that what you described can be solved simply by something like this (which is what i did):
samtools view input.bam | awk '{if ($3==YourChromosomeNumber && $7== "=") print $0} > outfile.sam Correct me if im wrong or i am missing your point.
Furthermore, the problem is that i do NOT want to filter reads out. I want my fastq files to have all the the reads, especially those which are chimeric.
If i have understood correctly, of reads aligned to whole genome, you want those that aligned to chr21 only. And after this you want to get those specific reads in fastq out. One quick check is that you are using bedtools bamtofastq. It requires query name sorted BAM and if you have a coordinate sorted BAM, there would be problem. Probably you are doing alright. Just query-name sort the BAM file first.
Im really sorry if im not expressing myself accurately. It is true that i want those file that align to chr21. Those i obtain with the the commands stated above. Also i want all unmapped reads and all split reads that map to chr21 (and somewhere else). The problem is not the extraction of those reads from the bam file, but the conversion of those reads into fastq. I know about the namesorting issue and i have done that.
Essentially what i need is a bamtofastq transformation which does not require a read to have its read-pair adjacent to it in the bam file (as is the case for every split read)
Picard SamtoFastq maybe? It doesn't require name sorted, but coordinate-sorted data. See that helps