and I want fetch the nucleotide at this position from fasta file. Is there any perl script or linux command to do this work?.
Please tell me if have any idea about it
Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal.
Adding an answer should only be used for providing a solution to the question asked.
I said pretty close, I guess you can easily make a bed format from your file.
You can format the table like UCSC region strings (chr:start-end) which are 1-based closed coordinates. This way you don't have to worry about converting to [0,1) half-open coordinates like BED requires. Then you can pipe the region strings to either samtools faidx or faidx from the pyfaidx package.
A Python solution. Linearize your fasta file, if it has more than one line per sequence. Save this code as get_pos.py. Change the names of pos.txt to your tab-delimited file, and input.fasta to the name of your fasta file. Run as python get_pos.py > nucl_at_pos.txt. nucl_at_pos.txt can be changed to whatever output file name you wish.
#!/usr/bin/env python
with open('pos.txt', 'r') as f:
pos = {}
for line in f:
pos[line.strip().split('\t')[0]] = int(line.strip().split('\t')[1])
with open('input.fasta', 'r') as fh:
seqs = {}
for line in fh:
if line.startswith(">"):
seqs[line.strip().split('>')[1]] = next(fh).strip()
for i in pos:
print i, '\t', seqs[i][pos[i]]
Just a suggestion for simplifying this process: you don't need to linearize the fasta file, and you don't need to read the whole thing into memory - especially if you're just extracting a few single nucleotides.
Thanks for the tips, Matt. I should learn more about pyfaidx. Having a single-line fasta to work with is just a preference of mine, otherwise I would read the file in with Biopython. I like also to write parts out so the OP can learn something and understand what's going on, and without knowing the size of either file, I agree can be overly taxing for both time and computing power when simpler options are available.
This part could also be changed to this, to iterate through the sequences instead of loading to memory, again depending on the size of the files:
with open('input.fasta', 'r') as fh:
for line in fh:
if line.startswith(">"):
chrom = line.strip().split('>')[1]
if chrom in pos:
print chrom, '\t', pos[chrom], '\t', next(f).strip()[pos[chrom]]
It's good to do the file parsing yourself, especially as a learning exercise. I think for production or publication it's nice to use parsers that are tested and hopefully eliminate a broad class of simple bugs.
Which would get rid of the '>' and just return the value (for example when running on an actual genome).
printing this line for genome hg19 yields:
1
10
11
12
13
14
15
16
17
18
19
2
20
21
22
3
4
5
6
7
8
9
MT
X
Y.
They are all printed to new line. The next issue with the parsing above is that if you have a large list of key:value pairs in your pos dictionary you will run into a problem with duplicates. It will not increment keys, something else to think about although I haven't worked on a fix just yet.
thanks for reply
Actually I have only single position of SNP not starting and ending position, file not in bed format
Please use
ADD COMMENT
orADD REPLY
to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.I said pretty close, I guess you can easily make a bed format from your file.
Bed file require starting and ending position , but i have only one position , sorry i am new in bioinformatics , i have no idea about it
kindly tell me how make bed file from this single position
You are going to enjoy Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems
seqtk subseq in.fa reg.bed > out.fa
fromhttps://github.com/lh3/seqtk