Entering edit mode
4.5 years ago
newbie
▴
130
I have a Final.gtf
that looks like below:
chr17 StringTie transcript 48525369 48528212 1000 - . gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1"; class_code "u"; transcript_length "863"; lncRNA_type "LincRNA";
chr17 StringTie exon 48525369 48525461 1000 - . gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1"; exon_number "1"; class_code "u"; transcript_length "863"; lncRNA_type "LincRNA";
chr17 StringTie exon 48527443 48528212 1000 - . gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1"; exon_number "2"; class_code "u"; transcript_length "863"; lncRNA_type "LincRNA";
chr17 StringTie transcript 48539225 48540614 1000 + . gene_id "MSTRG.100008"; transcript_id "MSTRG.100008.1"; class_code "u"; transcript_length "1390"; lncRNA_type "LincRNA";
chr17 StringTie exon 48539225 48540614 1000 + . gene_id "MSTRG.100008"; transcript_id "MSTRG.100008.1"; exon_number "1"; class_code "u"; transcript_length "1390"; lncRNA_type "LincRNA";
chr17 StringTie transcript 49197365 49198218 1000 - . gene_id "MSTRG.100023"; transcript_id "MSTRG.100023.1"; class_code "u"; transcript_length "854"; lncRNA_type "LincRNA";
chr17 StringTie exon 49197365 49198218 1000 - . gene_id "MSTRG.100023"; transcript_id "MSTRG.100023.1"; exon_number "1"; class_code "u"; transcript_length "854"; lncRNA_type "LincRNA";
chr17 StringTie transcript 49191640 49196359 1000 + . gene_id "MSTRG.100024"; transcript_id "MSTRG.100024.1"; class_code "u"; transcript_length "4720"; lncRNA_type "LincRNA";
chr17 StringTie exon 49191640 49196359 1000 + . gene_id "MSTRG.100024"; transcript_id "MSTRG.100024.1"; exon_number "1"; class_code "u"; transcript_length "4720"; lncRNA_type "LincRNA";
chr17 StringTie transcript 49198593 49199192 1000 - . gene_id "MSTRG.100025"; transcript_id "MSTRG.100025.1"; class_code "u"; transcript_length "600"; lncRNA_type "LincRNA";
I also have a Int.csv
file with ids like below: (The original csv file is with 55,000 ids)
"t_id"
"MSTRG.100023.1"
"MSTRG.100031.2"
"MSTRG.100038.1"
"MSTRG.100066.1"
"MSTRG.10013.2"
"MSTRG.10013.3"
"MSTRG.100143.1"
"MSTRG.100147.1"
"MSTRG.100150.1
I am working on cluster (SLURM). I used following command.
grep -f Int.csv Final.gtf > Transcripts_step.gtf
It is taking huge time so, I wrote a script and submitted a job using sbatch
. For this I created run.sh
script like below:
#!/bin/bash
# Set shell for job
#SBATCH --cpus-per-task=8
#SBATCH --mem=240G
#SBATCH --time=05:59:59
if [ -f ~/.bashrc ] ; then
. ~/.bashrc
fi
echo "##CMD:>" $*
echo "##" `date`
eval $*
echo "##" `date`
Then I submitted a job like below:
sbatch run.sh "grep -f Int.csv Final.gtf > Transcripts_step.gtf"
The job had run more than 2 hours and the job got killed.
##CMD:> grep -f Int.csv Final.gtf > Transcripts_step.gtf
/var/lib/slurm/slurmd/job13500184/slurm_script: line 21: 21411 Killed
grep -f Int.csv Final.gtf > Transcripts_step.gtf
slurmstepd: error: Detected 1 oom-kill event(s) in step 13500184.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
What could be the problem and how to resolve this?
Even with 240G of RAM the job was killed. Well ...
You may want to take a look at AGAT toolkit and see if any of the filter options would work here. You may also want to try a fixed string (
-F
) search withgrep
to speed things up.you mean
grep -F Int.csv Final.gtf > Transcripts_step.gtf
?In general,
grep -f
does not scale well. For tasks similar to this, here's what I had done previously:awk
for this. One column is just a serial number (if you are usingawk
you can useNR
for this) and the second column in your case would be the MSTRG identifier copied from column 9 of GTF.join
command to join the ID file and the modified GTF file