I did a mapping on my query reads to the genome using bwa. I wish to extract the query sequence that matched with the gene feature. I found column 10 in samtools to provide information on the query sequence that matched to genome (which in many cases covers outside of gene feature).
Since, I am interested in extracting the query base sequence mapped only to the gene feature. Is there a possible way to perform I have been looking in seqkit locate and grep but they aren't feasible.
genome mappedcoordinate_start mappedcoordinate_end query ID gene_ID gene_start gene_end
Xgenome 26 3000 read_1 gene_1 30 740
I want to extract the query sequence corresponding to chr:30:740 instead chr:26-3000. I have 2 million hence I need a tool can't do this manually.
I need the query sequence as we decided to probe on it.
Appreciate all help.
What do you mean ?
The mapping was performed between the query reads and the reference genome. I have long read sequences and they cover more than one gene at time within the genome. I want to extract query sequences that mapped to each particular gene.
do you want https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker ? using
-L your.gene.bed
?If you are interested in extracting all query sequences that fall within a region (gene) then you can try this: Isolating reads from specific region from bam file
samtools ampliconclip
may also be applicable if you want to truncate the reads (http://www.htslib.org/doc/samtools-ampliconclip.html )