Hi everyone,
I would be very grateful if you could help me.
I want to download the sequences of all the exons for each human gene. I went to ensembl biomart and tried to do it for BRCA2 first (this was a random choice). First I selected the following attributes: Unspliced(gene), Exon start, Exon end, strand, Gene start. I thought that this information will be enough to 'cut out' all the exons. The first thing I noticed is that the number of exons is almost twice larger then the number I see on the Wiki. Then I tried to download directly Exon sequences, but the number of sequences was again larger the the number of exons should be and moreover several exon ids correspond to the same sequence. I also tried to download coding sequence and cDNA, but the length of both sequences is not consistent with 'official' BRCA2 length!
These all drives me crazy and I have absolutely no idea on what to do. All I want is to get for each gene the sequences of its exons. Help me please!
Thanks!
I just count the size of exome using the following commands :
Divided by the human genom size from hg19, I get : 2.32 %
Dear Sacha, thank you very much, this is exactly what I need! May I also ask you one more question? I have headers in exons.fa file like "chr19:58346805-58347029". Is there any possibility to have a heading in a format "BRCA exon1"? Thank you in advance!
Solved this issue myself :) In case anyone needs this, you should run:
zcat refGene.txt.gz|sort -u -k13,13|cut -f3,10,11,13|awk 'BEGIN{OFS="\t"}{split($2,start,",");split($3,end,","); for(i=1;i<length(start);++i){print $1,start[i],end[i],$4="" "_"="" i}}'="" >="" exons.bed<="" p="">
to add geneName_exodId to the bed file and then use -name option when running bedtools