I have bed intersect file wherein the gene feature is intersected by multiple reads. I assumed based on similarity I can filter the data however tools like blat and magic-blast didn't help as the reads span more than 5000 bp and covered outside of the gene feature / or covers the gene feature in fragments.
Now I wish to retrieve the fasta sequence of gene features that has been covered by the read. any suggestions on how to select the coordinates?
Additionally, if a gene feature is covered multiple fragments of reads at different coordinates without overlaps is there a way to assign alphabets to the gene feature ID?
File1:
<contig_name> <querystart> <querystop><queryID> <querystrand_direction> <featurestart> <featurestop> <featureID> <feature_strand>
input
contig 13000 14000 pac34 + 10000 16000 ID_84 +
contig 14500 15784 pac75 + 10000 16000 ID_84 +
out
contig 13000 14000 pac34 + 10000 16000 ID_84a +
contig 14500 15784 pac75 + 10000 16000 ID_84b +
You can use samtools faidx to retrieve sequences from the fasta file. http://www.htslib.org/doc/samtools-faidx.html
What do you mean by assigning alphabets?
Your file has 9 columns but your header has 8 names.
I used getfasta to retrieve sequence however i am not sure about selecting the coordinates in case a gene feature is hit by different reads at the same location. It would mean I should compare the similarity however tools like blat/magic-blast returns identity only for a small percentage of sequence in the mapped region.
I have edited the question for the requirement.