Extract nucleotides from fasta file by a single position
5
1
Entering edit mode
7.5 years ago
misbahabas ▴ 70

Asslamu Alikum,

I have a tab separated file like

chr38 456

chr38 555

and I want fetch the nucleotide at this position from fasta file. Is there any perl script or linux command to do this work?.
Please tell me if have any idea about it

thankx in advance

sequence snp genome • 6.4k views
ADD COMMENT
0
Entering edit mode

thanks for reply

Actually I have only single position of SNP not starting and ending position, file not in bed format

ADD REPLY
0
Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

I said pretty close, I guess you can easily make a bed format from your file.

ADD REPLY
0
Entering edit mode

Bed file require starting and ending position , but i have only one position , sorry i am new in bioinformatics , i have no idea about it

kindly tell me how make bed file from this single position

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

seqtk subseq in.fa reg.bed > out.fa from https://github.com/lh3/seqtk

ADD REPLY
1
Entering edit mode
7.5 years ago

bedtools getfasta comes pretty close to what you ask for

ADD COMMENT
1
Entering edit mode
7.5 years ago

You can format the table like UCSC region strings (chr:start-end) which are 1-based closed coordinates. This way you don't have to worry about converting to [0,1) half-open coordinates like BED requires. Then you can pipe the region strings to either samtools faidx or faidx from the pyfaidx package.

pip install pyfaidx
awk '{printf "%s:%s-%s\n",$1,$2,$2}' < regions.txt | xargs faidx ref.fasta
ADD COMMENT
0
Entering edit mode
7.5 years ago
st.ph.n ★ 2.7k

A Python solution. Linearize your fasta file, if it has more than one line per sequence. Save this code as get_pos.py. Change the names of pos.txt to your tab-delimited file, and input.fasta to the name of your fasta file. Run as python get_pos.py > nucl_at_pos.txt. nucl_at_pos.txt can be changed to whatever output file name you wish.

#!/usr/bin/env python

with open('pos.txt', 'r') as f:
    pos = {}
    for line in f:
        pos[line.strip().split('\t')[0]] = int(line.strip().split('\t')[1])

with open('input.fasta', 'r') as fh:
    seqs = {}
    for line in fh:
        if line.startswith(">"):
            seqs[line.strip().split('>')[1]] = next(fh).strip()

for i in pos:
    print i, '\t', seqs[i][pos[i]]
ADD COMMENT
3
Entering edit mode

Just a suggestion for simplifying this process: you don't need to linearize the fasta file, and you don't need to read the whole thing into memory - especially if you're just extracting a few single nucleotides.

from pyfaidx import Fasta
with Fasta('input.fasta') as fasta:
with open('pos.txt', 'r') as nucleotides:
for line in nucleotides:
chrom, pos = line.rstrip().split()
nuc = fasta[chrom][int(pos)].seq
print("{chrom}\t{pos}\t{nuc}".format(**locals()))
view raw answer.py hosted with ❤ by GitHub

ADD REPLY
1
Entering edit mode

Thanks for the tips, Matt. I should learn more about pyfaidx. Having a single-line fasta to work with is just a preference of mine, otherwise I would read the file in with Biopython. I like also to write parts out so the OP can learn something and understand what's going on, and without knowing the size of either file, I agree can be overly taxing for both time and computing power when simpler options are available.

This part could also be changed to this, to iterate through the sequences instead of loading to memory, again depending on the size of the files:

with open('input.fasta', 'r') as fh:
    for line in fh:
        if line.startswith(">"):
            chrom = line.strip().split('>')[1]
            if chrom in pos:
                print chrom, '\t', pos[chrom], '\t', next(f).strip()[pos[chrom]]
ADD REPLY
4
Entering edit mode

It's good to do the file parsing yourself, especially as a learning exercise. I think for production or publication it's nice to use parsers that are tested and hopefully eliminate a broad class of simple bugs.

ADD REPLY
0
Entering edit mode
6.7 years ago
drkennetz ▴ 560

for this line:

 chrom = line.strip().split('>')[1]

It could be replaced with:

chrom = line.strip().strip('>').split()[0]

Which would get rid of the '>' and just return the value (for example when running on an actual genome). printing this line for genome hg19 yields: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 MT X Y.

They are all printed to new line. The next issue with the parsing above is that if you have a large list of key:value pairs in your pos dictionary you will run into a problem with duplicates. It will not increment keys, something else to think about although I haven't worked on a fix just yet.

ADD COMMENT
0
Entering edit mode
6.7 years ago

This should work.

awk '{print $1 "\t" $2-1 "\t" $2}' input.file | bedtools getfasta -fi genome.fa -bed stdin -fo output.fa

bedtools's getfasta is here.

ADD COMMENT

Login before adding your answer.

Traffic: 3613 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6