Retrieve the FASTA transcript sequences of a list of NCBI genes
3
0
Entering edit mode
9.8 years ago
Sibs • 0

Hello everyone,

I want to design a microarray probe set for Coccomyxa subellipsoidea C-169 which has been fully sequenced. I need to make a target sequence data file in FASTA format or a TDT file of GenBank accessions. I got a list of 10091 potential genes to design the microarray (NCBI-->Gene). How can I make this file? Do you know any step by step guideline that I can use? Thanks

gene Probe-set RNA Microarray • 4.6k views
ADD COMMENT
0
Entering edit mode

Do you need to pull the genes themselves using identifiers or just get the transcript sequence from your fasta file?

ADD REPLY
1
Entering edit mode
9.6 years ago

Say you are able to extract the identifiers with one identifier/line (perhaps in a file). I am just using the echo to generate the list:

$ echo -e "A0QTF8\nA0QTG1\nA0QU63"
A0QTF8
A0QTG1
A0QU63

Then you can use the following one-liner to get the sequences:

$ echo -e "A0QTF8\nA0QTG1\nA0QU63" | while read l; do echo -e ">"$l"\n"$(curl -s http://www.uniprot.org/uniprot/$l.txt | awk '/SQ/,/\/\//{if ($0!~/^\/\// && $0!~/^SQ/) {gsub(" ","",$0); printf $0}}'); done
>A0QTF8
MDLINGMGTSPGYWRTPREPGNDHRRARLDVMAQRIVITGAGGMVGRVLADQAAAKGHTVLALTSSQCDITDEDAVRRFVANGDVVINCAAYTQVDKAEDEPERAHAVNAVGPGNLAKACAAVDAGLIHISTDYVFGAVDRDTPYEVDDETGPVNIYGRTKLAGEQAVLAAKPDAYVVRTAWVYRGGDGSDFVATMRRLAAGDGAIDVVADQVGSPTYTGDLVGALLQIVDGGVEPGILHAANAGVASRFDQARATFEAVGADPERVRPCGSDRHPRPAPRPSYTVLSSQRSAQAGLTPLRDWREALQDAVAAVVGATTDGPLPSTP
>A0QTG1
MSAAANAEHGAADRVEILPVPGLPEFRPGDDLVGSLAEAAPWLRDGDVLVVTSKVVSKCEGRIVAAPSDPEERDTLRRKLIDDEAVRVLARKGRTLITENAIGLVQAAAGVDGSNVGSTELALLPVDPDRSAATLREGLRERLGVTVGVVITDTMGRAWRTGQTDFAIGASGLTVLQGYAGSRDRHGNELVVTEVAVADEIAAAADLVKGKLTAIPVAVVRGLRLPDDGSTAHRLVRAGEDDLFWLGTAEAIELGRRQAQLLRRSVRRFSAEPVPHDAIEAAVGEALTAPAPHHTRPVRFVWVQDSETRTRLLDRMKEQWRADLTADGLDADAVDRRVARGQILYDAPELVIPFLVPDGAHSYPDDARTAAEHTMFTVAVGAAVQGLLVALAVRDIGSCWIGSTIFAADLVRAELELPDDWEPLGAIAIGYPEQTPQPLGPRDPVPTDELLVRK
>A0QU63
MTKKSASSNNKVVATNRKARHNYTILDTYEAGIVLMGTEVKSLREGQASLADAFATVDDGEIWLRNVHIAEYHHGTWTNHAPRRNRKLLLHRKQIDNLIGKIRDGNLTLVPLSIYFTDGKVKVELALARGKQAHDKRQDLARRDAQREVIRELGRRAKGKI

Best Wishes,
Umer

ADD COMMENT
0
Entering edit mode
9.6 years ago
Dan D 7.4k

Step 1: Gene a track from the UCSC table browser, which contains the following for the hg19 genome:

chromosome
txStart (transcription start)
txEnd (transcription end)
gene name

Step 2:

With that data in hand, and with a local copy of the hg19 reference genome, use samtools faidx along with the chromosome, start, and end positions to get the FASTA sequence for each gene. By simply launching a separate samtools process for each gene and capturing the output, you can programmatically build a FASTA containing all of the gene sequences

If you need clarification or coding help for any of these steps, just let me know what specific questions you have.

ADD COMMENT
0
Entering edit mode

You can directly tell the UCSC Genome Browser to give you the sequences instead of the genomic loci.

Just select 'sequence' as 'output format'.

ADD REPLY
0
Entering edit mode
9.6 years ago
st.ph.n ★ 2.7k

This python snippet, using BioPython, will pull the sequence record in genbank format from NCBI using an input file containing a list of accession numbers.

import sys
from Bio import Entrez
from Bio import SeqIO

Entrez.email = #put your email address here

input_list = sys.argv[1]

with open(input_list, "r") as f:
        idlist = [line.strip() for line in f]

handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text",id=idlist, email=Entrez.email)

seq_rec_list = []
for seq_record in SeqIO.parse(handle, "gb"):
    seq_rec_list.append(seq_record)

out_handle = open("output.fasta", "w")
SeqIO.write(seq_rec_list, out_handle, "fasta")
out_handle.close()
ADD COMMENT

Login before adding your answer.

Traffic: 1837 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6