I have a big FASTA file with entries like this:
>ENST00000448914.1 cdna chromosome:GRCh38:14:22449113:22449125:1 gene:ENSG00000228985.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD3 description:T cell receptor delta diversity 3 [Source:HGNC Symbol;Acc:HGNC:12256]
ACTGGGGGATACG
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
>ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC
I want to retrieve the transcript ID (the first entry of each header line) and the gene ID (the fourth entry), and write both to a new file. I know it works something like this:
zcat file.fa.gz | grep ">" | perl -lane 'if***************{print join("\t", $1, $4)}' > transcripts2genes.
What I don't know is what goes in place of the asterisks. Can somebody help me with this?
I want the output to look like this:
ENST00000448914.1 ENSG00000228985.1
ENST00000631435.1 ENSG00000282253.1
Can you provide an example of the desired output?