Hello all,
I'm trying to extract the lines containing a specific list of genes, from a gtf file. My pattern file looks like this:
LOC115705147
LOC115716814
LOC115713105
My reference file looks like this (gene IDs in the 9th column next to gene_id):
NC_044371.1 Gnomon gene 3273 3552 . - . gene_id "LOC115705401"; db_xref "GeneID:115705401"; gbkey "Gene"; gene "LOC115705401"; gene_biotype "lncRNA";
NC_044371.1 Gnomon exon 3539 3552 . - . gene_id "LOC115705401"; transcript_id "XR_004009451.1"; db_xref "GeneID:115705401"; gbkey "ncRNA"; gene "LOC115705401"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; product "uncharacterized LOC115705401"; exon_number "1";
NC_044371.1 Gnomon exon 3384 3456 . - . gene_id "LOC115705401"; transcript_id "XR_004009451.1"; db_xref "GeneID:115705401"; gbkey "ncRNA"; gene "LOC115705401"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 2 samples with support for all annotated introns"; product "uncharacterized LOC115705401"; exon_number "2";
If I use grep on a single gene ID, then it gives me a set of results I need. If I try to use it on my list of genes, though, I get an empty file. Here are the things I've tried:
grep -f genes.txt anno.gtf > results.txt
cat genes.txt | xargs -I {} grep -f {} anno.gtf > results.txt
while read -r line; do grep -f $line anno.gtf > results.txt done < genes.txt
I'm struggling to see what the issue could be, since the parts work individually (e.g., when I use the while read loop and put in an echo $line command, it works fine). Does someone know what the issue is/another way to extract these?
Thanks in advance!
Except the problem related to the CR/LF you needed to fix you might prefer to use a tool like
agat_sp_filter_feature_from_keep_list.pl
from AGAT.It might avoid problems like in a case you want to catch a gene "LOC115705401" and "LOC115705401.2" would exist too (with grep both would be taken).
your genes are not in the gtf.
They're in the 9th column under gene_id. If I just grep any one of those genes with the reference annotation file, I get all the lines that contain that gene in the 9th column.
then there is probably something like a space or any invisible character in "genes.txt", or "genes.txt" is a windows file with CRLF end of lines etc...
show me the output of
Responding to both comments here:
That seems likely. I got these lists by saving sheets of an excel file as a csv, which I then converted to a txt (not the optimal process, of course, but that's just sort of how it happened). Do you know how I can fix that?
Did you mean grep -F LOC115605147 anno.gtf | cat -A? If so, the results are:
NC_044371.1^IGnomon^Igene^I22449283^I22450610^I.^I+^I.^Igene_id "LOC115705147"; db_xref "GeneID:115705147"; gbkey "Gene"; gene "LOC115705147"; gene_biotype "protein_coding"; $ NC_044371.1^IGnomon^Iexon^I22449283^I22449484^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "1"; $ NC_044371.1^IGnomon^Iexon^I22449570^I22449792^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "2"; $ NC_044371.1^IGnomon^Iexon^I22449904^I22450049^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "3"; $ NC_044371.1^IGnomon^Iexon^I22450118^I22450610^I.^I+^I.^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "mRNA"; gene "LOC115705147"; model_evidence "Supporting evidence includes similarity to: 1 EST, 26 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 76 samples with support for all annotated introns"; product "cysteine proteinase inhibitor 12"; exon_number "4"; $ NC_044371.1^IGnomon^ICDS^I22449314^I22449484^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "1"; $ NC_044371.1^IGnomon^ICDS^I22449570^I22449792^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "2"; $ NC_044371.1^IGnomon^ICDS^I22449904^I22450049^I.^I+^I2^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "3"; $ NC_044371.1^IGnomon^ICDS^I22450118^I22450252^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "4"; $ NC_044371.1^IGnomon^Istart_codon^I22449314^I22449316^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "1"; $ NC_044371.1^IGnomon^Istop_codon^I22450253^I22450255^I.^I+^I0^Igene_id "LOC115705147"; transcript_id "XM_030632388.1"; db_xref "GeneID:115705147"; gbkey "CDS"; gene "LOC115705147"; product "cysteine proteinase inhibitor 12"; protein_id "XP_030488248.1"; exon_number "4"; $
I assume the ^| there are stand ins for the tab that separates the columns?
no I really meant
Oh sorry, in that case, it didn't show anything.