Question

extracting exome from genome

0

Entering edit mode

7.0 years ago

zeelo ▴ 10

Hello Biostars community,

Time to end my lurker-existence. I have a quick question about extracting particular exonic regions from annotated genomes and I would appreciate any help or comments!

I want to extract all exons from the honeybee genome that fulfill certain categories. The stringency of these categories will probably change at some point, as I want to explore the data under different parameters. For example, I would like to get all exons of amel 4.5 that fulfill the following: (1) The respective LOC has at least 5 exons, (2) 2 exons of at least 500 bp length. Ideally as a .fasta file.

To be honest, I am a little overwhelmed by all the tools/databases that are out there. Just a tip what tool could accomplish this task would help me a lot.

Thanks!

Z

genome Assembly alignment gene • 2.6k views

ADD COMMENT • link updated 7.0 years ago by Pierre Lindenbaum 164k • written 7.0 years ago by zeelo ▴ 10

0

Entering edit mode

what is "LOC" ?

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Pierre,

Thanks for your fast reply. LOC describes the gene locus of the honeybee genes. E.g., LOC726504 corresponds to GB41140. I am looking for gene loci that have at least a certain number of exons. I also just saw your solution down on the page. I am excited to try it out. Thanks!

ADD REPLY • link 7.0 years ago by zeelo ▴ 10

score 2 · Accepted Answer · 2017-12-03

2

Entering edit mode

7.0 years ago

Pierre Lindenbaum 164k

get all exons,from gtf, extract name-of-the-transcript+length

    curl -s "ftp://ftp.ensemblgenomes.org/pub/metazoa/release-37/gff3/apis_mellifera/Apis_mellifera.Amel_4.5.37.gff3.gz" | gunzip -c | awk '($3=="exon") {n=split($9,a,/[;\:=]/);for(i=1;i<n;i++) if(a[i]="="transcript")" {printf("%s\t%d\n",a[i+1],1+int($5)-int($4));}}'="" |="" sort="" |="" uniq=""> tmp1.tsv

count the number of transcripts having 5 exons, extract those transcripts, extract, the exons (length>500), sort+uniq, get those having a count>2

    join -t $'\t' -1 1 -2 1 tmp1.tsv <(cut -f 1 tmp1.tsv  | sort | uniq -c | awk '($1==5) {print($2);}'  | sort | uniq) | awk '(int($2)>500)' | cut -f 1 | sort | uniq -c | awk '(int($1)>2) {print $2;}'

GB40269-RA
GB41718-RA
GB41966-RA
GB42025-RA
GB42112-RA
GB42527-RA
GB42613-RA
GB43158-RA
GB44220-RA
GB44690-RA
GB44885-RA
GB44888-RA
GB44987-RA
GB45023-RA
GB45437-RA
GB47779-RA
GB50302-RA
GB50387-RA
GB50561-RA
GB50769-RA
GB51160-RA
GB51985-RA
GB52153-RA
GB52307-RA
GB52435-RA
GB53160-RA
GB54131-RA
GB54351-RA
GB54395-RA
GB54551-RA
GB54715-RA
GB54937-RA
GB55357-RA
GB55624-RA
GB55826-RA

eg.: GB40269-RA http://metazoa.ensembl.org/Apis_mellifera/Transcript/Exons?db=core;g=GB40269;r=7:18936-22560;t=GB40269-RA

GB41718-RA : http://metazoa.ensembl.org/Apis_mellifera/Transcript/Exons?db=core;g=GB41718;r=14:9600024-9604327;t=GB41718-RA

ADD COMMENT • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Pierre,

Thanks for the suggested solution. That was fast. I see the idea, however, I get a syntax error in the curl-

ADD REPLY • link 7.0 years ago by zeelo ▴ 10

1

Entering edit mode

I get a syntax error in the curl-

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hi Pierre,

Thank you for your follow-up. Of course I'd be happy to provide some more detail:

awk: cmd. line:1: ($3=="exon") {n=split($9,a,/[;\:=]/);for(i=1;i<n;i++) if(a[i]="="transcript")" {printf("%s\t%d\n",a[i+1],1+int($5)-int($4));}}= awk: cmd. line:1: ^ syntax error and the ^syntax error matches the { before the printf

awk: cmd. line:1: ($3=="exon") {n=split($9,a,/[;\:=]/);for(i=1;i<n;i++) if(a[i]="="transcript")" {printf("%s\t%d\n",a[i+1],1+int($5)-int($4));}}= awk: cmd. line:1: ^ syntax error and the ^syntax error matches the last =

=: command not found =: command not found

Maybe that helps. Thanks again

zeelo

ADD REPLY • link 7.0 years ago by zeelo ▴ 10

0

Entering edit mode

ah ! I see the biostars engine has messed-up the code ! sorry for this: I'll post a gist in a few minutes.

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

here is the gist :

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hi Pierre,

Wow that works like a charm. Thanks again for your help. Neat!

ADD REPLY • link 7.0 years ago by zeelo ▴ 10

0

Entering edit mode

Hi Pierre,

It's me again. I'm trying to explore the join command, however, I am having some troubles. Given the case I want to parse out all transcript that have at least 20 exons. Using your code I'd do:

join -t $'\t' -1 1 -2 1 tmp1.tsv <(cut -f 1 tmp1.tsv | sort | uniq -c | awk '($1=>20) {print($2);}' | sort | uniq) >> test.tsv

The output table is a little confusing to me, several lines appear multiple times. Do you have a hint what I'm doing wrong?

Thanks again,

Z

ADD REPLY • link 7.0 years ago by zeelo ▴ 10

0

Entering edit mode

this

>> test.tsv

looks wrong to me

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I see. Another thanks, Pierre.

ADD REPLY • link 7.0 years ago by zeelo ▴ 10