Question

Extract Information from fasta File

0

Entering edit mode

8.5 years ago

saamar.rajput ▴ 80

I have big fasta file with information like this,

>ENST00000448914.1 cdna chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC

I want to retrieve the transcript ID that is the first entry and then i want to extract the gene Id (fourth entry) from the file to a new file. I know it works something like this

zcat file.fa.gz | grep ">" | perl -lane 'if***************{print join("\t", $1, $4)}' > transcripts2genes.

What I dont know is what come on the part of the asterisk. Can somebody help me with this?

i want the output be like

ENST00000448914.1  ENSG00000228985.1
ENST00000631435.1  ENSG00000282253.1

RNA-Seq genome gene • 3.4k views

ADD COMMENT • link updated 8.5 years ago by Charles Plessy ★ 2.9k • written 8.5 years ago by saamar.rajput ▴ 80

0

Entering edit mode

Can you provide an example of the desired output?

ADD REPLY • link 8.5 years ago by pld 5.1k

score 4 · Accepted Answer · 2016-11-02

4

Entering edit mode

8.5 years ago

pld 5.1k

Here's one way to do this:

zcat myfile.fa.gz | grep ">" | cut -f 1,4 -d" " | tr -d '>gene:' | tr ' ' '\t'

Input

>ENST00000448914.1 cdna chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC

Output

ENST00000448914.1       ENSG00000228985.1
ENST00000631435.1       ENSG00000282253.1
ENST00000632684.1       ENSG00000282431.1

EDIT: Forgot that last call to tr in the original post

ADD COMMENT • link 8.5 years ago by pld 5.1k

0

Entering edit mode

Shouldn't the tr not simply be 'gene:' instead of '>gene:'?

ADD REPLY • link 8.5 years ago by WouterDeCoster 47k

1

Entering edit mode

Nope, here's the output of cut:

>ENST00000448914.1 gene:ENSG00000228985.1
>ENST00000631435.1 gene:ENSG00000282253.1
>ENST00000632684.1 gene:ENSG00000282431.1

ADD REPLY • link 8.5 years ago by pld 5.1k

0

Entering edit mode

Oh yes, you're completely right. I had a not-so-bright moment apparently :-D

ADD REPLY • link 8.5 years ago by WouterDeCoster 47k

score 2 · Accepted Answer · 2016-11-02

2

Entering edit mode

8.5 years ago

WouterDeCoster 47k

Can you check if this is the desired output?

I'm not sure if you want to keep the sequence information...

paste <(zcat file.fa.gz | cut -f1 -d' ') <(zcat file.fa.gz | cut -f4 -d' ' | sed 's/gene://') > output.fa

Note that you can pipe it into gzip again to get a .gz fasta file again if you would want to.

ADD COMMENT • link 8.5 years ago by WouterDeCoster 47k

0

Entering edit mode

this is the output

ENST00000448914.1 ENSG00000228985.1 ACTGGGGGATACG ACTGGGGGATACG ENST00000631435.1 ENSG00000282253.1 GGGACAGGGGGC GGGACAGGGGGC ENST00000632684.1 ENSG00000282431.1 GGGACAGGGGGC GGGACAGGGGGC i do not need the sequence after the gene information but it shows in the output file.

ADD REPLY • link 8.5 years ago by saamar.rajput ▴ 80

0

Entering edit mode

In that case the answer from joe.cornish826 is most optimal :-)

ADD REPLY • link 8.5 years ago by WouterDeCoster 47k

score 2 · Accepted Answer · 2016-11-02

2

Entering edit mode

8.5 years ago

Rohit ★ 1.5k

Probably

grep '^>' infile | sed -e 's/^>//' -e 's/ .*gene://' | sed 's/ gene_biotype.*//'

ADD COMMENT • link 8.5 years ago by Rohit ★ 1.5k

score 2 · Accepted Answer · 2016-11-02

2

Entering edit mode

8.5 years ago

Charles Plessy ★ 2.9k

In Perl:

zcat file.fa.gz | perl -anE 'next unless />/ ; $F[0] =~ s/>// ; $F[3] =~ s/gene:// ; say join "\t", $F[0], $F[3]'

ADD COMMENT • link 8.5 years ago by Charles Plessy ★ 2.9k

1

Entering edit mode

a more compact perl alternative:

zcat file.fa.gz | perl -ne '/^>(\S+).+ gene:(\S+)/ and print "$1\t$2\n"'

ADD REPLY • link 8.5 years ago by Jorge Amigo 14k

0

Entering edit mode

Slightly more compact: zcat file.fa.gz | perl -nE '/^>(\S+).+ gene:(\S+)/ and say "$1\t$2"' :)

ADD REPLY • link 8.5 years ago by Charles Plessy ★ 2.9k