Extract nucleotides from fasta file by a single position

1

Entering edit mode

7.5 years ago

misbahabas ▴ 70

Asslamu Alikum,

I have a tab separated file like

chr38 456

chr38 555

and I want fetch the nucleotide at this position from fasta file. Is there any perl script or linux command to do this work?.
Please tell me if have any idea about it

thankx in advance

sequence snp genome • 6.4k views

ADD COMMENT • link updated 6.7 years ago by Rashedul Islam ▴ 480 • written 7.5 years ago by misbahabas ▴ 70

0

Entering edit mode

thanks for reply

Actually I have only single position of SNP not starting and ending position, file not in bed format

ADD REPLY • link 7.5 years ago by misbahabas ▴ 70

0

Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

I said pretty close, I guess you can easily make a bed format from your file.

ADD REPLY • link 7.5 years ago by WouterDeCoster 47k

0

Entering edit mode

Bed file require starting and ending position , but i have only one position , sorry i am new in bioinformatics , i have no idea about it

kindly tell me how make bed file from this single position

ADD REPLY • link 7.5 years ago by misbahabas ▴ 70

0

Entering edit mode

You are going to enjoy Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems

ADD REPLY • link 7.5 years ago by WouterDeCoster 47k

0

Entering edit mode

seqtk subseq in.fa reg.bed > out.fa from https://github.com/lh3/seqtk

ADD REPLY • link 7.5 years ago by cpad0112 21k

1

Entering edit mode

7.5 years ago

WouterDeCoster 47k

bedtools getfasta comes pretty close to what you ask for

ADD COMMENT • link 7.5 years ago by WouterDeCoster 47k

1

Entering edit mode

7.5 years ago

Matt Shirley 10k

You can format the table like UCSC region strings (chr:start-end) which are 1-based closed coordinates. This way you don't have to worry about converting to [0,1) half-open coordinates like BED requires. Then you can pipe the region strings to either samtools faidx or faidx from the pyfaidx package.

pip install pyfaidx
awk '{printf "%s:%s-%s\n",$1,$2,$2}' < regions.txt | xargs faidx ref.fasta

ADD COMMENT • link 7.5 years ago by Matt Shirley 10k

0

Entering edit mode

7.5 years ago

st.ph.n ★ 2.7k

A Python solution. Linearize your fasta file, if it has more than one line per sequence. Save this code as get_pos.py. Change the names of pos.txt to your tab-delimited file, and input.fasta to the name of your fasta file. Run as python get_pos.py > nucl_at_pos.txt. nucl_at_pos.txt can be changed to whatever output file name you wish.

#!/usr/bin/env python

with open('pos.txt', 'r') as f:
    pos = {}
    for line in f:
        pos[line.strip().split('\t')[0]] = int(line.strip().split('\t')[1])

with open('input.fasta', 'r') as fh:
    seqs = {}
    for line in fh:
        if line.startswith(">"):
            seqs[line.strip().split('>')[1]] = next(fh).strip()

for i in pos:
    print i, '\t', seqs[i][pos[i]]

ADD COMMENT • link 7.5 years ago by st.ph.n ★ 2.7k

3

Entering edit mode

Just a suggestion for simplifying this process: you don't need to linearize the fasta file, and you don't need to read the whole thing into memory - especially if you're just extracting a few single nucleotides.

	from pyfaidx import Fasta
	with Fasta('input.fasta') as fasta:
	with open('pos.txt', 'r') as nucleotides:
	for line in nucleotides:
	chrom, pos = line.rstrip().split()
	nuc = fasta[chrom][int(pos)].seq
	print("{chrom}\t{pos}\t{nuc}".format(**locals()))

view raw answer.py hosted with ❤ by GitHub

ADD REPLY • link 7.5 years ago by Matt Shirley 10k

1

Entering edit mode

Thanks for the tips, Matt. I should learn more about pyfaidx. Having a single-line fasta to work with is just a preference of mine, otherwise I would read the file in with Biopython. I like also to write parts out so the OP can learn something and understand what's going on, and without knowing the size of either file, I agree can be overly taxing for both time and computing power when simpler options are available.

This part could also be changed to this, to iterate through the sequences instead of loading to memory, again depending on the size of the files:

with open('input.fasta', 'r') as fh:
    for line in fh:
        if line.startswith(">"):
            chrom = line.strip().split('>')[1]
            if chrom in pos:
                print chrom, '\t', pos[chrom], '\t', next(f).strip()[pos[chrom]]

ADD REPLY • link 7.5 years ago by st.ph.n ★ 2.7k

4

Entering edit mode

It's good to do the file parsing yourself, especially as a learning exercise. I think for production or publication it's nice to use parsers that are tested and hopefully eliminate a broad class of simple bugs.

ADD REPLY • link 7.5 years ago by Matt Shirley 10k

0

Entering edit mode

6.7 years ago

drkennetz ▴ 560

for this line:

 chrom = line.strip().split('>')[1]

It could be replaced with:

chrom = line.strip().strip('>').split()[0]

Which would get rid of the '>' and just return the value (for example when running on an actual genome). printing this line for genome hg19 yields: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 MT X Y.

They are all printed to new line. The next issue with the parsing above is that if you have a large list of key:value pairs in your pos dictionary you will run into a problem with duplicates. It will not increment keys, something else to think about although I haven't worked on a fix just yet.

ADD COMMENT • link 6.7 years ago by drkennetz ▴ 560

0

Entering edit mode

6.7 years ago

Rashedul Islam ▴ 480

This should work.

awk '{print $1 "\t" $2-1 "\t" $2}' input.file | bedtools getfasta -fi genome.fa -bed stdin -fo output.fa

bedtools's getfasta is here.

ADD COMMENT • link 6.7 years ago by Rashedul Islam ▴ 480

Login before adding your answer.