Hi all,
I've been wondering about best practices in pattern matching in bash. On sizable files and with lots of strings to search for, the issue becomes computationally burdensome quickly. Here's an example to illustrate a problem I've had. Hoping to learn about alternative solutions here.
I generated a set of genome annotations using BRAKER2 and then ran through a set of filters to whittle away at the false positive models. I arrived at final gene model sets that only included ~ 30-40% of the BRAKER2 gene models, which is in line with the expected number of genes for the taxa I'm working with. Now that I have the final set of models, I want to subset the full GFF3s that BRAKER2 produced to exclude the gene models that I filtered out. I have a python script that can use a list of accepted genes to subset GFF3, but I really want to perform the same task in a bash command. I thought I'd try using grep, since it's -f flag to grep whole lists (like my gene list) seemed like the tool for the job.
Here's what the GFF3 looks like. It has 9 fields total, but only the 9th contains any useful info for this subsetting.
scaffold3 AUGUSTUS gene 3808 4464 0.48 - . ID=g1;
scaffold3 AUGUSTUS mRNA 3808 4464 0.48 - . ID=g1.t1;Parent=g1
scaffold3 AUGUSTUS stop_codon 3808 3810 . - 0 Parent=g1.t1;
scaffold3 AUGUSTUS CDS 3808 3976 0.48 - 1 ID=g1.t1.CDS1;Parent=g1.t1
scaffold3 AUGUSTUS exon 3808 3976 . - . ID=g1.t1.exon1;Parent=g1.t1;
scaffold3 AUGUSTUS intron 3977 4306 0.48 - . Parent=g1.t1;
scaffold3 AUGUSTUS CDS 4307 4464 0.57 - 0 ID=g1.t1.CDS2;Parent=g1.t1
scaffold3 AUGUSTUS exon 4307 4464 . - . ID=g1.t1.exon2;Parent=g1.t1;
scaffold3 AUGUSTUS start_codon 4462 4464 . - 0 Parent=g1.t1;
scaffold3 AUGUSTUS gene 12096 12527 0.82 - . ID=g2;
scaffold3 AUGUSTUS mRNA 12096 12527 0.82 - . ID=g2.t1;Parent=g2
scaffold3 AUGUSTUS stop_codon 12096 12098 . - 0 Parent=g2.t1;
scaffold3 AUGUSTUS CDS 12096 12527 0.82 - 0 ID=g2.t1.CDS1;Parent=g2.t1
scaffold3 AUGUSTUS exon 12096 12527 . - . ID=g2.t1.exon1;Parent=g2.t1;
scaffold3 AUGUSTUS start_codon 12525 12527 . - 0 Parent=g2.t1;
There are a few difficulties that this formatting poses for subsetting
Problem: I didn't always accept all of the isoforms presented under the same "Parent" gene, so I can't just search for the parent (e.g. 'g1234'). Instead, I need to indicate the isoform in the search as well (e.g. 'g1234.t2'). This is problematic, because to preserve GFF3 format, I also need to pull the row that indicated gene, which is ambiguous to isoform. `scaffold3 AUGUSTUS gene 3808 4464 0.48 - . ID=g1;
Solution: By duplicating the gene model name 'g1234.t2' and excluding the isoform call in the duplicated name then appending a ';' at the back side, I can target that gene row that would have otherwise been excluded. The semicolon happens to also resolve the issue of accidentally matching with a longer gene name when searching for a shorter name that is contained within it (e.g. matching 'g12345' when looking for 'g123').
Here's a look at the gene list formatting, for clarity:
g33215;
g33215.t1
g37800;
g37800.t1
g22539;
g22539.t1
I ended up wrapping grep in a shell script and submitting it to the cluster that way, since (I thought) job time restrictions in interactive mode were killing grep, but it turns out grep was being killed anyways.
command: grep -f genemodelnames.txt Jure.gff3 > subset/Jure.gff3
Looking at the error files made it pretty clear that 8GB of memory wasn't enough for this job. Unsubsetted GFF3 files are ~75MB each & the lists of genes are ~40,000 each. Its obvious to me now that grep does not scale well. I should have stuck to python.
Question: If you had to do this in bash, how would you have done it?