I have a file, eg.txt, which contains the following:
gene chr transcript
GeneA chr1 transcript1
GeneA chr1 transcript2
GeneC chr3 transcript3
I also have a big GTF file, from which I'm trying to create a new GTF containing only the transcripts listed in the third column of eg.txt.
For this, I'm using awk and grep in two commands:
awk '{print $3}' eg.txt | tail --lines=+2 | sort -u > eg2.txt
And then grep:
grep -Ff eg2.txt eg.gtf > subset.gtf
I actually have the output I wanted, but is there a way to get the output in a single command?
Here is what the eg.gtf file looks like:
chr1 Cufflinks exon 2494 5622 . - . transcript_id "transcript1"; gene_id "XLOC_000002"; gene_name "XLOC_000002"; exon_number "1";
chr1 Cufflinks transcript 2494 5622 . - . transcript_id "transcript1"; gene_id "XLOC_000002"; gene_name "XLOC_000002"; oId "TCONS_00000002"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 27425 27528 . + . transcript_id "transcript2"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; exon_number "1";
chr1 Cufflinks transcript 27425 27904 . + . transcript_id "transcript2"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; oId "TCONS_00000001"; class_code "u"; tss_id "TSS1";
chr1 Cufflinks exon 27612 27904 . + . transcript_id "transcript2"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; exon_number "2";
chr3 ENSEMBL exon 56140 58376 . - . transcript_id "transcript3"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; exon_number "1";
chr3 ENSEMBL transcript 56140 58376 . - . transcript_id "transcript3"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; oId "ENST00000618686.1"; tss_id "TSS3";
chr4 ENSEMBL exon 37434 37534 . - . transcript_id "transcript4"; gene_id "ENSG00000277428.1"; gene_name "ENSG00000277428.1"; exon_number "1";
chr4 ENSEMBL transcript 37434 37534 . - . transcript_id "transcript4"; gene_id "ENSG00000277428.1"; gene_name "ENSG00000277428.1"; oId "ENST00000618679.1"; tss_id "TSS5";
So this is more of a code golf challenge?
I wanted to do it in one command. Do you have any idea?
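One possible single-command approach, offered as a sketch rather than a definitive answer: let awk read both files in one pass. It first stores the third column of eg.txt (skipping the header line), then prints only the GTF lines whose quoted transcript_id is exactly one of the stored IDs. This assumes every GTF line carries its ID in the form transcript_id "..."; as in the sample above. The example below writes small stand-in files to a scratch directory so it can be run safely; the transcript10 line is hypothetical and is there only to show that exact matching does not pick it up, whereas grep -F matches substrings and would also select it via transcript1.

```shell
# Small stand-ins for eg.txt and eg.gtf, written to a scratch directory
# so the real files are not touched.
tmp=$(mktemp -d)
printf '%s\n' 'gene chr transcript' 'GeneA chr1 transcript1' \
  'GeneA chr1 transcript2' 'GeneC chr3 transcript3' > "$tmp/eg.txt"
printf '%s\n' \
  'chr1 Cufflinks exon 2494 5622 . - . transcript_id "transcript1"; gene_id "XLOC_000002";' \
  'chr3 ENSEMBL exon 56140 58376 . - . transcript_id "transcript3"; gene_id "ENSG00000278704.1";' \
  'chr4 ENSEMBL exon 37434 37534 . - . transcript_id "transcript4"; gene_id "ENSG00000277428.1";' \
  'chr5 ENSEMBL exon 100 200 . + . transcript_id "transcript10"; gene_id "hypothetical";' \
  > "$tmp/eg.gtf"

# First file (NR == FNR): remember column 3 of every line except the header.
# Second file: extract the quoted transcript_id and print the line only
# if that exact ID was remembered.
awk '
  NR == FNR { if (FNR > 1) keep[$3]; next }
  match($0, /transcript_id "[^"]+"/) {
    # 15 = length of the prefix transcript_id " ; 16 adds the closing quote
    id = substr($0, RSTART + 15, RLENGTH - 16)
    if (id in keep) print
  }
' "$tmp/eg.txt" "$tmp/eg.gtf" > "$tmp/subset.gtf"
```

With the real files, the same awk program would be run with eg.txt and eg.gtf as its two arguments and redirected to subset.gtf. No sort -u step is needed, because duplicate IDs simply land on the same array key.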