I did it like that:
1- Download refGene.txt.gz and hg19.fasta from the UCSC goldenpath. ( note: convert hg19.2bit to hg19.fa using twoBitToFa )
2- Create a bed file with exon coordiniate using my awk script
// to_transcript.awk
BEGIN
{
OFS ="\t"
}
{
name=$2
name2=$13
sens = $4 =="+" ? "forward" : "reverse"
chrom=$3
split($10,exon_starts,",")
split($11,exon_ends,",")
for (i=1; i<length(exon_starts); i++)
{
start = exon_starts[i];
end = exon_ends[i];
print(chrom, start, end, name2,sens, name)
}
}
And run it :
zcat refGene.txt.gz|awk -f to_transcript.awk > exon.bed
3 - Extract sequence using bedtools
bedtools getfasta -fi hg19.fa -bed exon.bed -fo exon.fa -name
4 - You may want to reverse sequence for gene placed on the reverse strand. Split exon.bed into and exon_minus.bed and exon_plus.bed using grep on column 5th. And do same as I described above to have exon_minus.fa and exon_plus.fa
You can then revert your sequence using :
seqtk seq -r exon_minus.fasta> exon_minus.rev.fasta
and concat both file :
cat exon_plus.fasta exon_minus.rev.fasta > all_exons.fa
Hi where you gtf file come from ? There is few possibility for this error like not the same NM gene/transcript version so the coordinates from gtf files and ucsc are not the same.
i have the gtf file from gencode ...so what do you suggest?
You can extract coordinates of exon from ucsc (https://genome.ucsc.edu/cgi-bin/hgTables ) to be sure you compare the same ENST
Have you looked at bedtools getfasta? This does what you are looking for if I understood the question correctly.
no i haven;t looked into it so i have this problem which is as such , I want to get the exon coordinates and the respective sequence ,then i want to use that exon sequence for primer designing.So is it possible for the bedtools getFasta to help in that?
You need a gtf, gff or bed of the exons of interest. That shouldn't be too hard to find. Then bedtools getfasta can be used to slice out a fasta record per exon (or any other interval) which you can then use for primer design.
okay so you saying i need the exon coordinates and using those i can slice fasta sequence from a fasta file ?
Yes, you use the reference genome fasta and slice out the fragments of interest from the bed/gtf.
thank you will try your suggestion
I have extracted the field containing chr, start, end , +/-, and the column containing exon/transcript/cds . I have a made field , now i want to extract only those rows which contain exon only , I did it R its straight forward but i'm not sure if I can retain the bed format in R or not so how do i parse only the exon only ?from the bed file...
its simple awk so i did it anyways thank you for your input im using that only