Extract Information from fasta File
4
0
Entering edit mode
8.1 years ago

I have big fasta file with information like this,

>ENST00000448914.1 cdna chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC

I want to retrieve the transcript ID that is the first entry and then i want to extract the gene Id (fourth entry) from the file to a new file. I know it works something like this

zcat file.fa.gz | grep ">" | perl -lane 'if***************{print join("\t", $1, $4)}' > transcripts2genes.

What I dont know is what come on the part of the asterisk. Can somebody help me with this?

i want the output be like

ENST00000448914.1  ENSG00000228985.1
ENST00000631435.1  ENSG00000282253.1
RNA-Seq genome gene • 3.1k views
ADD COMMENT
0
Entering edit mode

Can you provide an example of the desired output?

ADD REPLY
4
Entering edit mode
8.1 years ago
pld 5.1k

Here's one way to do this:

zcat myfile.fa.gz | grep ">" | cut -f 1,4 -d" " | tr -d '>gene:' | tr ' ' '\t'

Input

>ENST00000448914.1 cdna chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC

Output

ENST00000448914.1       ENSG00000228985.1
ENST00000631435.1       ENSG00000282253.1
ENST00000632684.1       ENSG00000282431.1

EDIT: Forgot that last call to tr in the original post

ADD COMMENT
0
Entering edit mode

Shouldn't the tr not simply be 'gene:' instead of '>gene:'?

ADD REPLY
1
Entering edit mode

Nope, here's the output of cut:

>ENST00000448914.1 gene:ENSG00000228985.1
>ENST00000631435.1 gene:ENSG00000282253.1
>ENST00000632684.1 gene:ENSG00000282431.1
ADD REPLY
0
Entering edit mode

Oh yes, you're completely right. I had a not-so-bright moment apparently :-D

ADD REPLY
2
Entering edit mode
8.1 years ago

Can you check if this is the desired output?

I'm not sure if you want to keep the sequence information...

paste <(zcat file.fa.gz | cut -f1 -d' ') <(zcat file.fa.gz | cut -f4 -d' ' | sed 's/gene://') > output.fa

Note that you can pipe it into gzip again to get a .gz fasta file again if you would want to.

ADD COMMENT
0
Entering edit mode

this is the output

ENST00000448914.1 ENSG00000228985.1 ACTGGGGGATACG ACTGGGGGATACG ENST00000631435.1 ENSG00000282253.1 GGGACAGGGGGC GGGACAGGGGGC ENST00000632684.1 ENSG00000282431.1 GGGACAGGGGGC GGGACAGGGGGC i do not need the sequence after the gene information but it shows in the output file.

ADD REPLY
0
Entering edit mode

In that case the answer from joe.cornish826 is most optimal :-)

ADD REPLY
2
Entering edit mode
8.1 years ago
Rohit ★ 1.5k

Probably

grep '^>' infile | sed -e 's/^>//' -e 's/ .*gene://' | sed 's/ gene_biotype.*//'
ADD COMMENT
2
Entering edit mode
8.1 years ago
Charles Plessy ★ 2.9k

In Perl:

zcat file.fa.gz | perl -anE 'next unless />/ ; $F[0] =~ s/>// ; $F[3] =~ s/gene:// ; say join "\t", $F[0], $F[3]'
ADD COMMENT
1
Entering edit mode

a more compact perl alternative:

zcat file.fa.gz | perl -ne '/^>(\S+).+ gene:(\S+)/ and print "$1\t$2\n"'
ADD REPLY
0
Entering edit mode

Slightly more compact: zcat file.fa.gz | perl -nE '/^>(\S+).+ gene:(\S+)/ and say "$1\t$2"' :)

ADD REPLY

Login before adding your answer.

Traffic: 1788 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6