Hi guys,
my question is quite similar to some other questions that have been asked here before. Unfortunately, none of them answers it fully, which is why I'm making this post.
- What I have is a BAM file with ~100 million reads.
- What I need is to extract the reads that have information about a certain genomic position, along with a tag from each read.
The reason I'm saying "read that has information about a certain genomic position" is that samtools view in.bam chrX:22222 also extracts all reads that merely span this position but have no aligned base at it. These reads are useless to me. Ideally, I'd like to get only the information at that position instead of the whole read.
Additionally, I need to carry over a barcode that is saved as a tag in the read so I can link this information together. It would also be fine to just keep all the tags and parse them later.
Does anybody know of a way to do this? It looks to me like I'll have to write my own little script, but I'd like to avoid reinventing the wheel if this already exists somewhere. samtools mpileup helps in that it uses only the informative reads, but it returns only the nucleotides at that position and throws away the read tags.
Thanks a lot!
Cool. You also got some answers to earlier questions (e.g. TCGA: Stratify breast cancer cases based on presence of Her2 mutation and get transcription data?); please validate your previous questions using the green mark on the left.
My bad, done! Don't know why I didn't do it before.
Forgive my French: what is the difference?
Not clear: by 'extract', do you want to keep them or remove them?
Again, sorry, I could have made that clearer. Some of the reads in my BAM file are split into pieces, with one piece aligning before the position of interest and the other piece after it. The CIGAR string then looks like this: 12M1441N39M. samtools view also reports these reads even though they have no aligned base at the position of interest. Does that make more sense?
Regarding the second question: by useless reads I meant the reads I just described. They span the position of interest but don't contain information about it. I would like to keep only the reads from which I can extract information about the nucleotide position of interest.
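To make the distinction concrete, here is a minimal sketch in plain Python of the check I have in mind. It is not an existing tool: the function names base_at and extract are made up for illustration, and the CB:Z: barcode tag in the comments is only an assumed example of what the tag might look like. It walks the CIGAR string of one SAM text record and returns the read base at a 1-based reference position, or None when the position falls inside an N skip or a deletion.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM spec).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def base_at(pos, start, cigar, seq):
    """Return the read base aligned to 1-based reference position `pos`,
    or None if the read has no aligned base there (N skip, deletion,
    or the position lies outside the alignment)."""
    ref = start  # 1-based reference coordinate where the next operation begins
    qry = 0      # 0-based offset into the read sequence
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":          # consumes both reference and read
            if ref <= pos < ref + length:
                return seq[qry + (pos - ref)]
            ref += length
            qry += length
        elif op in "DN":         # consumes reference only: no read base here
            if ref <= pos < ref + length:
                return None
            ref += length
        elif op in "IS":         # consumes read only
            qry += length
        # H and P consume neither reference nor read
    return None


def extract(sam_line, pos):
    """Hypothetical helper: for one tab-separated SAM record, return
    (read name, base at `pos`, list of raw tag strings), or None if the
    read is uninformative at `pos`. Tags (e.g. an assumed 'CB:Z:...'
    barcode) are kept as-is for later parsing."""
    fields = sam_line.rstrip("\n").split("\t")
    start, cigar, seq = int(fields[3]), fields[5], fields[9]
    base = base_at(pos, start, cigar, seq)
    if base is None:
        return None
    return fields[0], base, fields[11:]
```

With a read starting at position 100 and the CIGAR 12M1441N39M from above, positions 100 to 111 and 1553 to 1591 yield a base, while anything in between yields None. In principle the text output of samtools view in.bam chrX:22222-22222 could be fed line by line into extract, though I have not tested this on the full 100 million reads.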