Hi all,
I have a bam file and a bed file with several regions of interest to me. I would like to get the sequence IDs of the reads that fall in each region included in the bed file. I was able to get the sequence ID, chr and start coordinate for each read, using this command:
samtools view file.bam | cut -f1,3,4
An example of what I get in output:
readA chr1 3985548
readB chr1 48349456
readC chr1 3985568
readD chr1 4932093
I would like to create for each region in the bed file a different txt file that includes the sequence IDs that map in a given region.
example of the top 5 lines included in my bed file:
chr1 3985538 3985593 chr1:3985538-3985593
chr1 3986091 3986158 chr1:3986091-3986158
chr1 4834946 4834972 chr1:4834946-4834972
chr1 4932073 4932105 chr1:4932073-4932105
chr1 7398192 7398238 chr1:7398192-7398238
example of the desired output:
Region: chr1:3985538-3985593
file -> chr1:3985538-3985593.txt
readA
readC
Region: chr1:4834946-4834972
file -> chr1:4834946-4834972.txt
readB
Region: chr1:4932073-4932105
file -> chr1:4932073-4932105.txt
readD
So how can I achieve the same? Any help is highly appreciated.
Thank you!
hi pingu77,
what have you tried/searched for so far? this question has been asked quite a number of times here already. Also some googling will point you to the right direction.
If you are stuck at any point you can ask here for specific help.
Hi! Yes, I tried to find an answer on google/forum. However, I couldn't find the answer to my problem. In the questions asked, people usually have a specific genomic region, in my case I have 60 000 regions in the same bed file and I would like to create 60 000 txt file reporting the read ID that map to the considered region..
use a loop over each bed. https://www.cyberciti.biz/faq/bash-for-loop/