extracting sequence from fast file
2
1
Entering edit mode
6.6 years ago

im newbie to linux and NGS. I want to extract specific sequences from genome.fasta file. the IDs which i have are as follows

gnl|BGIA2|CA_chr1:329729-329840
gnl|BGIA2|CA_chr1:359729-379840
.............so on...........

where gnl|BGIA2|CA_chr1 is the name of the chromosome and 329729-329840 is the range of the nucleotides needed to be extracted in separate file alonge with its ID. can someone kindly help me how to do this with samtools or any other tool or command/script.

many thanks

next-gen • 2.5k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Hello daisy,

is this connected to your other question?

fin swimmer

ADD REPLY
0
Entering edit mode

Dear Fin Swimmer, thank you so much for your kind help and precious time. no it is not related to the previous question as that was to extract sequence reads that have sequences spanning specific range. this question is to extract that range from fasta file/genome.fasta..

ADD REPLY
0
Entering edit mode
6.6 years ago

Here a quick solution with python using pyfaidx module:

import re
from pyfaidx import Fasta


genome = Fasta('genome.fa')

with open('list.txt', "r") as list,  open('output.fa', "w") as output:
    for id in list:
        region = re.split("[:|-]", id.strip())

        seq = genome[region[0]][int(region[1])-1:int(region[2])].seq
        output.write(">"+id.strip()+"\n")
        output.write(seq+"\n")

EDIT:

I always forget bedtools.

So we need again to convert your ID list to a bed file:

 awk -v OFS="\t" '{n=split($0,a,/[:\-]/); print a[1],a[2]-1,a[3],a[0]}' list.txt > ranges.bed

And can than use bedtools getfasta

bedtools getfasta -o output.fa -fi genome.fa -bed ranges.bed

fin swimmer

ADD COMMENT
0
Entering edit mode

Seqtk also has a nice and fast command to go from BED to sequences :

 seqtk subseq in.fa reg.bed > out.fa
ADD REPLY
0
Entering edit mode
6.6 years ago
JJ ▴ 710

Check out the gffread utility. gffread can be used to generate a FASTA file with the DNA sequences for all transcripts in a GFF file. So you would only need to convert your id file to a gff.

ADD COMMENT

Login before adding your answer.

Traffic: 1987 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6