How to subset a gtf based on names in the third column of another file in a single command?
3
0
Entering edit mode
2.2 years ago
Vasu ▴ 790

I have eg.txt file which has information:

gene     chr    transcript
GeneA    chr1   transcript1   
GeneA    chr1   transcript2
GeneC    chr3   transcript3

I also have a big gtf file from which I'm trying to create a new gtf only for the transcripts present in third column of eg.txt file.

For this, I'm using awk and grep in two commands.

awk '{print $3}' eg.txt | tail --lines=+2 | sort -u > eg2.txt

And then using grep

grep -Ff eg2.txt eg.gtf > subset.gtf

I actually have the output I wanted, but is there a way to get the output in a single command?

Here is how eg.gtf file look:

chr1     Cufflinks       exon    2494    5622    .       -       .       transcript_id "transcript1"; gene_id "XLOC_000002"; gene_name "XLOC_000002"; exon_number "1";
chr1     Cufflinks       transcript      2494    5622    .       -       .       transcript_id "transcript1"; gene_id "XLOC_000002"; gene_name "XLOC_000002"; oId "TCONS_00000002"; class_code "u"; tss_id "TSS2"; 
chr1     Cufflinks       exon    27425   27528   .       +       .       transcript_id "transcript2"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; exon_number "1"; 
chr1     Cufflinks       transcript      27425   27904   .       +       .       transcript_id "transcript2"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; oId "TCONS_00000001"; class_code "u"; tss_id "TSS1"; 
chr1    Cufflinks       exon    27612   27904   .       +       .       transcript_id "transcript2"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; exon_number "2"; 
chr3      ENSEMBL exon    56140   58376   .       -       .       transcript_id "transcript3"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; exon_number "1"; 
chr3      ENSEMBL transcript      56140   58376   .       -       .       transcript_id "transcript3"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; oId "ENST00000618686.1"; tss_id "TSS3"; 
chr4      ENSEMBL exon    37434   37534   .       -       .       transcript_id "transcript4"; gene_id "ENSG00000277428.1"; gene_name "ENSG00000277428.1"; exon_number "1"; 
chr4      ENSEMBL transcript      37434   37534   .       -       .       transcript_id "transcript4"; gene_id "ENSG00000277428.1"; gene_name "ENSG00000277428.1"; oId "ENST00000618679.1"; tss_id "TSS5";
grep rnaseq awk • 1.8k views
ADD COMMENT
0
Entering edit mode

I actually have the output I wanted

So this is more a code golf challenge?

ADD REPLY
0
Entering edit mode

I wanted to do it in one command. Do you have any idea?

ADD REPLY
2
Entering edit mode
2.2 years ago
Juke34 8.9k

It is difficult to track the parent and child features with a simple bash command.
I recommend you to use agat_sp_filter_feature_from_keep_list.pl from AGAT that will do it properly in a single command.

agat_sp_filter_feature_from_keep_list.pl --gff infile.gff --keep_list file.txt

The file.txt should only contain the 3rd column of your txt file e.g.:

ENST00000673477
ENST00000472194
ENST00000275493
ADD COMMENT
0
Entering edit mode

Vasu : This will ensure that your subset GFF/GTF will be in correct format.

ADD REPLY
0
Entering edit mode
2.2 years ago

but is there a way to get the output in a single command?

not tested.

grep -Ff <( awk '(NR>=2) {printf("transcript_id \"%s\"\n",$3);}'  eg.txt | sort | uniq) gtf-file  
ADD COMMENT
0
Entering edit mode

but it didn't work

ADD REPLY
0
Entering edit mode
2.2 years ago

You could probably do something like this, but it is horribly inefficient, since it loops through the whole array of values from eg.txt for each line of the gtf:

awk 'NR==FNR{a[$3]=$3;next}{for (i in a){ if($14 ~ a[i])print $0}}'  eg.txt Homo_sapiens.GRCh38.107.gtf > subset.gtf

Change $3 for a different column of eg.txt and $14 for matching something else than the Transcript ID. Since .gtf files can look quite different, it likely needs adaptions to work with various ones. I used this:

curl http://ftp.ensembl.org/pub/release-107/gtf/homo_sapiens/Homo_sapiens.GRCh38.107.gtf.gz | gzip -d > Homo_sapiens.GRCh38.107.gtf

and tested with eg.txt:

gene     chr    transcript
GeneA    chr1   ENST00000673477
GeneB    chr2   ENST00000472194
GeneC    chr3   ENST00000275493 
ADD COMMENT
0
Entering edit mode

I added an eg.gtf above.

ADD REPLY

Login before adding your answer.

Traffic: 1600 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6