Question

Getting exon sequences for all human genes

1

Entering edit mode

8.5 years ago

valerie ▴ 100

Hi everyone,

I would be very grateful if you could help me.

I want to download the sequences of all the exons for each human gene. I went to ensembl biomart and tried to do it for BRCA2 first (this was a random choice). First I selected the following attributes: Unspliced(gene), Exon start, Exon end, strand, Gene start. I thought that this information will be enough to 'cut out' all the exons. The first thing I noticed is that the number of exons is almost twice larger then the number I see on the Wiki. Then I tried to download directly Exon sequences, but the number of sequences was again larger the the number of exons should be and moreover several exon ids correspond to the same sequence. I also tried to download coding sequence and cDNA, but the length of both sequences is not consistent with 'official' BRCA2 length!

These all drives me crazy and I have absolutely no idea on what to do. All I want is to get for each gene the sequences of its exons. Help me please!

Thanks!

genome gene sequence exon • 4.8k views

ADD COMMENT • link updated 3.6 years ago by asalimih ▴ 60 • written 8.5 years ago by valerie ▴ 100

1

Entering edit mode

8.5 years ago

WouterDeCoster 48k

The issue you have with more sequences and exons is most likely due to alternative transcripts. I think bedtools getfasta can do what you want, if you supply a reference fasta and corresponding gtf or bed file of all exons (http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html).

ADD COMMENT • link 8.5 years ago by WouterDeCoster 48k

0

Entering edit mode

Thank you very much!!

ADD REPLY • link 8.5 years ago by valerie ▴ 100

0

Entering edit mode

in bedtools getfasta it seems that the only option to have the exons concatenated is using -split which is available for bed12 format. my bed file is not in bed12 and each exon is in a separate row (though the name column of the bed file is the gene ID). how can I get the concatenated sequence?

ADD REPLY • link 3.6 years ago by asalimih ▴ 60

score 4 · Accepted Answer · 2016-10-22

4

Entering edit mode

8.5 years ago

sacha ★ 2.4k

Download refGene.txt.gz from UCSC :

 wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/refGene.txt.gz

Then keep only uniq Gene name ( column 13 ) and extract coordinate ( chrom - exonstart - exonend) to a bed file. Column 10 and 11 contains exonsStart and exonEnd position separated by a comma.

zcat refGene.txt.gz|sort -u -k13,13|cut -f3,10,11|awk 'BEGIN{OFS="\t"}{split($2,start,",");split($3,end,","); for(i=1;i<length(start);++i){print $1,start[i],end[i]}}' > exons.bed

Finally, you can get exons sequence using bedtools getFasta :

bedtools getfasta -fi hg19.fa -bed exons.bed -fo exon.fa

paid attention to work only with chromsome name in range 1-22,X,Y.

ADD COMMENT • link 8.5 years ago by sacha ★ 2.4k

0

Entering edit mode

I just count the size of exome using the following commands :

zcat refGene.txt.gz|sort -u -k 13,13|awk '{SUM=0;split($10,s,","); split($11,e,",");for(i=1;i<length(s);i++){SUM+=e[i] - s[i]};print SUM}'|paste -sd "+"|bc

Divided by the human genom size from hg19, I get : 2.32 %

ADD REPLY • link 8.5 years ago by sacha ★ 2.4k

0

Entering edit mode

Dear Sacha, thank you very much, this is exactly what I need! May I also ask you one more question? I have headers in exons.fa file like "chr19:58346805-58347029". Is there any possibility to have a heading in a format "BRCA exon1"? Thank you in advance!

ADD REPLY • link 8.5 years ago by valerie ▴ 100

0

Entering edit mode

Solved this issue myself :) In case anyone needs this, you should run:

zcat refGene.txt.gz|sort -u -k13,13|cut -f3,10,11,13|awk 'BEGIN{OFS="\t"}{split($2,start,",");split($3,end,","); for(i=1;i<length(start);++i){print $1,start[i],end[i],$4="" "_"="" i}}'="" >="" exons.bed<="" p="">

to add geneName_exodId to the bed file and then use -name option when running bedtools

ADD REPLY • link 8.5 years ago by valerie ▴ 100