How to extract specific lines from a GTF file?
1
0
Entering edit mode
2.9 years ago
Shraddha ▴ 90

Hello all,

I'm trying to extract the lines containing a specific list of genes, from a gtf file. My pattern file looks like this:

LOC115705147 
LOC115716814 
LOC115713105

My reference file looks like this (gene IDs in the 9th column next to gene_id):

NC_044371.1     Gnomon  gene    3273    3552    .       -       .       gene_id "LOC115705401"; db_xref "GeneID:115705401"; gbkey "Gene"; gene "LOC115705401"; gene_biotype "lncRNA";
NC_044371.1     Gnomon  exon    3539    3552    .       -       .       gene_id "LOC115705401"; transcript_id "XR_004009451.1"; db_xref "GeneID:115705401"; gbkey "ncRNA"; gene "LOC115705401"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; product "uncharacterized LOC115705401"; exon_number "1";
NC_044371.1     Gnomon  exon    3384    3456    .       -       .       gene_id "LOC115705401"; transcript_id "XR_004009451.1"; db_xref "GeneID:115705401"; gbkey "ncRNA"; gene "LOC115705401"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; product "uncharacterized LOC115705401"; exon_number "2";

If I use grep on a single gene ID, then it gives me a set of results I need. If I try to use it on my list of genes, though, I get an empty file. Here are the things I've tried:

  1. grep -f genes.txt anno.gtf > results.txt

  2. cat genes.txt | xargs -I {} grep -f {} anno.gtf > results.txt

  3. while read -r line; do grep -f $line anno.gtf > results.txt done < genes.txt

I'm struggling to see what the issue could be, since the parts work individually (e.g., when I use the while read loop and put in an echo $line command, it works fine). Does someone know what the issue is/another way to extract these?

Thanks in advance!

command-line grep loop • 3.3k views
ADD COMMENT
1
Entering edit mode

Except the problem related to the CR/LF you needed to fix you might prefer to use a tool like agat_sp_filter_feature_from_keep_list.pl from AGAT.
It might avoid problems like in a case you want to catch a gene "LOC115705401" and "LOC115705401.2" would exist too (with grep both would be taken).

ADD REPLY
0
Entering edit mode

. If I try to use it on my list of genes, though, I get an empty file.

your genes are not in the gtf.

ADD REPLY
0
Entering edit mode

They're in the 9th column under gene_id. If I just grep any one of those genes with the reference annotation file, I get all the lines that contain that gene in the 9th column.

ADD REPLY
0
Entering edit mode

then there is probably something like a space or any invisible character in "genes.txt", or "genes.txt" is a windows file with CRLF end of lines etc...

ADD REPLY
0
Entering edit mode

show me the output of

grep -F LOC115705147 genes.txt anno.gtf | cat -A
ADD REPLY
0
Entering edit mode

Responding to both comments here:

  1. That seems likely. I got these lists by saving sheets of an excel file as a csv, which I then converted to a txt (not the optimal process, of course, but that's just sort of how it happened). Do you know how I can fix that?

  2. Did you mean grep -F LOC115605147 anno.gtf | cat -A? If so, the results are:

    NC_044371.1^IGnomon^Igene^I22449283^I22450610^I.^I+^I.^Igene_id "LOC115705147"; db_xref "GeneID:115705147"; gbkey "Gene"; gene "LOC115705147"; gene_biotype "protein_coding"; $ NC_044371.1^IGnomon^Iexon^I22449283^I22449484^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "1"; $ NC_044371.1^IGnomon^Iexon^I22449570^I22449792^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "2"; $ NC_044371.1^IGnomon^Iexon^I22449904^I22450049^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "3"; $ NC_044371.1^IGnomon^Iexon^I22450118^I22450610^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "4"; $ NC_044371.1^IGnomon^ICDS^I22449314^I22449484^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "1"; $ NC_044371.1^IGnomon^ICDS^I22449570^I22449792^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "2"; $ NC_044371.1^IGnomon^ICDS^I22449904^I22450049^I.^I+^I2^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "3"; $ NC_044371.1^IGnomon^ICDS^I22450118^I22450252^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "4"; $ NC_044371.1^IGnomon^Istart_codon^I22449314^I22449316^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "1"; $ NC_044371.1^IGnomon^Istop_codon^I22450253^I22450255^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "4"; $

I assume the ^| there are stand ins for the tab that separates the columns?

ADD REPLY
0
Entering edit mode

Did you mean grep -F LOC115605147 anno.gtf

no I really meant

grep -F LOC115705147 genes.txt anno.gtf | cat -A
ADD REPLY
0
Entering edit mode

Oh sorry, in that case, it didn't show anything.

ADD REPLY
0
Entering edit mode
2.9 years ago

I got these lists by saving sheets of an excel file as a csv

you're probably having a CR/LF problem. https://en.wikipedia.org/wiki/Newline#Issues_with_different_newline_formats

tr -d '\r' < genes.txt > new.genes.txt
ADD COMMENT
0
Entering edit mode

Looks like that's the issue exactly, thanks so much!

ADD REPLY

Login before adding your answer.

Traffic: 2003 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6