Question

How to subset a gtf based on names in the third column of another file in a single command?


2.8 years ago

Vasu ▴ 790

I have a file, eg.txt, with the following contents:

gene     chr    transcript
GeneA    chr1   transcript1   
GeneA    chr1   transcript2
GeneC    chr3   transcript3

I also have a large GTF file, from which I'm trying to create a new GTF containing only the transcripts listed in the third column of eg.txt.

For this, I'm using awk and grep in two commands.

awk '{print $3}' eg.txt | tail --lines=+2 | sort -u > eg2.txt

And then using grep

grep -Ff eg2.txt eg.gtf > subset.gtf

I actually have the output I wanted, but is there a way to get the output in a single command?

Here is what the eg.gtf file looks like:

chr1     Cufflinks       exon    2494    5622    .       -       .       transcript_id "transcript1"; gene_id "XLOC_000002"; gene_name "XLOC_000002"; exon_number "1";
chr1     Cufflinks       transcript      2494    5622    .       -       .       transcript_id "transcript1"; gene_id "XLOC_000002"; gene_name "XLOC_000002"; oId "TCONS_00000002"; class_code "u"; tss_id "TSS2"; 
chr1     Cufflinks       exon    27425   27528   .       +       .       transcript_id "transcript2"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; exon_number "1"; 
chr1     Cufflinks       transcript      27425   27904   .       +       .       transcript_id "transcript2"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; oId "TCONS_00000001"; class_code "u"; tss_id "TSS1"; 
chr1    Cufflinks       exon    27612   27904   .       +       .       transcript_id "transcript2"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; exon_number "2"; 
chr3      ENSEMBL exon    56140   58376   .       -       .       transcript_id "transcript3"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; exon_number "1"; 
chr3      ENSEMBL transcript      56140   58376   .       -       .       transcript_id "transcript3"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; oId "ENST00000618686.1"; tss_id "TSS3"; 
chr4      ENSEMBL exon    37434   37534   .       -       .       transcript_id "transcript4"; gene_id "ENSG00000277428.1"; gene_name "ENSG00000277428.1"; exon_number "1"; 
chr4      ENSEMBL transcript      37434   37534   .       -       .       transcript_id "transcript4"; gene_id "ENSG00000277428.1"; gene_name "ENSG00000277428.1"; oId "ENST00000618679.1"; tss_id "TSS5";

grep rnaseq awk • 2.3k views

ADD COMMENT • link updated 2.8 years ago by GenoMax 151k • written 2.8 years ago by Vasu ▴ 790


"I actually have the output I wanted"

So this is more a code golf challenge?

ADD REPLY • link 2.8 years ago by GenoMax 151k


I wanted to do it in one command. Do you have any idea?

ADD REPLY • link 2.8 years ago by Vasu ▴ 790

score 2 · Answer 1 · 2022-09-01


2.8 years ago

Juke34 9.2k

It is difficult to keep track of parent and child features with a simple bash command.
I recommend using agat_sp_filter_feature_from_keep_list.pl from AGAT, which does this properly in a single command.

agat_sp_filter_feature_from_keep_list.pl --gff infile.gff --keep_list file.txt

file.txt should contain only the third column of your text file, e.g.:

ENST00000673477
ENST00000472194
ENST00000275493
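
As a minimal sketch of how such a keep list could be built, assuming eg.txt is whitespace-delimited with a single header line as in the question (the toy IDs and the scratch directory are only for illustration):

```shell
# Toy copy of eg.txt in a scratch directory (illustration only).
workdir=$(mktemp -d)
printf 'gene\tchr\ttranscript\n'            >  "$workdir/eg.txt"
printf 'GeneA\tchr1\tENST00000673477\n'     >> "$workdir/eg.txt"
printf 'GeneB\tchr2\tENST00000472194\n'     >> "$workdir/eg.txt"
printf 'GeneB\tchr2\tENST00000472194\n'     >> "$workdir/eg.txt"

# Skip the header (NR > 1), keep column 3, and drop duplicate IDs.
awk 'NR > 1 { print $3 }' "$workdir/eg.txt" | sort -u > "$workdir/file.txt"

cat "$workdir/file.txt"
```

The resulting file.txt can then be passed to --keep_list as in the command above.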

ADD COMMENT • link 2.8 years ago by Juke34 9.2k


Vasu: This will ensure that your subset GFF/GTF is correctly formatted.

ADD REPLY • link 2.8 years ago by GenoMax 151k

score 0 · Answer 2 · 2022-09-01


2.8 years ago

Pierre Lindenbaum 166k

"but is there a way to get the output in a single command?"

Not tested:

grep -Ff <( awk '(NR>=2) {printf("transcript_id \"%s\"\n",$3);}'  eg.txt | sort | uniq) gtf-file
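
If process substitution is not available in your shell, a pipe-based variant of the same idea might look like the sketch below. It assumes GNU grep (where `-f -` reads the patterns from standard input) and a GTF in which transcript_id and its quoted value are separated by a single space. The toy files, IDs and coordinates are made up for illustration.

```shell
# Toy inputs in a scratch directory (illustration only).
workdir=$(mktemp -d)
printf 'gene\tchr\ttranscript\nGeneA\tchr1\ttranscript1\nGeneC\tchr3\ttranscript3\n' > "$workdir/eg.txt"
for t in transcript1 transcript10 transcript3 transcript4; do
  printf 'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\ttranscript_id "%s"; gene_id "geneX";\n' "$t"
done > "$workdir/eg.gtf"

# Build fixed-string patterns such as: transcript_id "transcript1"
# and feed them to grep on standard input instead of via <( ... ).
awk 'NR > 1 { printf "transcript_id \"%s\"\n", $3 }' "$workdir/eg.txt" \
  | sort -u \
  | grep -Ff - "$workdir/eg.gtf" > "$workdir/subset.gtf"

cat "$workdir/subset.gtf"
```

Because each pattern includes the closing quote, transcript1 does not also pull in transcript10, which a bare list of IDs would.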

ADD COMMENT • link 2.8 years ago by Pierre Lindenbaum 166k


But it didn't work.

ADD REPLY • link 2.8 years ago by Vasu ▴ 790

score 0 · Answer 3 · 2022-09-01

You could probably do something like this, but it is horribly inefficient, since it loops through the whole array of values from eg.txt for each line of the gtf:

awk 'NR==FNR{if(FNR>1)a[$3];next}{for(i in a)if(index($14,i)){print;break}}' eg.txt Homo_sapiens.GRCh38.107.gtf > subset.gtf

Change $3 to use a different column of eg.txt, and $14 to match something other than the transcript ID. Since GTF files can differ quite a bit in layout, the command will likely need adapting for other files. I used this one:

curl http://ftp.ensembl.org/pub/release-107/gtf/homo_sapiens/Homo_sapiens.GRCh38.107.gtf.gz | gzip -d > Homo_sapiens.GRCh38.107.gtf

and tested with eg.txt:

gene     chr    transcript
GeneA    chr1   ENST00000673477
GeneB    chr2   ENST00000472194
GeneC    chr3   ENST00000275493
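
One way to avoid the inner loop is to pull the transcript ID out of each GTF line and look it up directly in the array. The following is only a sketch under assumptions: it expects the attribute to be written as transcript_id "ID" (one space, double quotes), a single header line in eg.txt, and it uses made-up toy files rather than the Ensembl GTF.

```shell
# Toy inputs in a scratch directory (illustration only).
workdir=$(mktemp -d)
printf 'gene\tchr\ttranscript\nGeneA\tchr1\ttranscript1\nGeneC\tchr3\ttranscript3\n' > "$workdir/eg.txt"
for t in transcript1 transcript10 transcript3 transcript4; do
  printf 'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\ttranscript_id "%s"; gene_id "geneX";\n' "$t"
done > "$workdir/eg.gtf"

awk '
  # First file: remember column 3 of every non-header line.
  NR == FNR { if (FNR > 1) keep[$3]; next }
  # Second file: extract the ID between the quotes and do one array lookup.
  match($0, /transcript_id "[^"]+"/) {
    # The prefix transcript_id " is 15 characters; the closing quote adds 1.
    id = substr($0, RSTART + 15, RLENGTH - 16)
    if (id in keep) print
  }
' "$workdir/eg.txt" "$workdir/eg.gtf" > "$workdir/subset.gtf"

cat "$workdir/subset.gtf"
```

Since the lookup is an exact match on the whole ID, it does not depend on which field the ID lands in, and transcript1 will not match transcript10. Lines without a transcript_id attribute (for example gene lines or comment lines) are dropped by this sketch.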