Hi! I was wondering how I can extract reads from a bam file (the paired end reads were mapped to the reference genome and then the sam file was converted to bam and sorted) that have a high GC% content. I loaded my bam file into FASTQC are I have 2 distinct peaks when analysing GC content. I would like to BLAST these reads and see where they are mapping.
Do your bam also contain unmapped reads? What is the organism and type of sequencing / library? There are a number of reasons which could account for this pattern, and knowing the above information would help provide more accurate / direct solutions to your problem.
As for your question, you can do this with reformat.sh from BBTools / BBMap:
Adjust
mingc
andmaxgc
as needed.Thank you very much. I first extracted mapped reads from the sorted bam file (that came from bowtie) and I added the mapped reads to the FASTQC and I still have a peak of around 55%GC content. I am working on Dictyostelium discoideum which is very AT rich (27% GC content) and the sequencing was done on DNA. I should also say that even in the sequences that map really well (>90%) I see the same pattern. I am interested to see where are located and maybe if they are the same between strains. This is a link to the image I get image