Problem in subsetting a gtf based on ids in another file
0
0
Entering edit mode
4.4 years ago
newbie ▴ 130

I have a Final.gtf that looks like below:

chr17   StringTie   transcript  48525369    48528212    1000    -   .   gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1";  class_code "u"; transcript_length "863"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    48525369    48525461    1000    -   .   gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1"; exon_number "1";  class_code "u"; transcript_length "863"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    48527443    48528212    1000    -   .   gene_id "MSTRG.100005"; transcript_id "MSTRG.100005.1"; exon_number "2";  class_code "u"; transcript_length "863"; lncRNA_type "LincRNA";
chr17   StringTie   transcript  48539225    48540614    1000    +   .   gene_id "MSTRG.100008"; transcript_id "MSTRG.100008.1";  class_code "u"; transcript_length "1390"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    48539225    48540614    1000    +   .   gene_id "MSTRG.100008"; transcript_id "MSTRG.100008.1"; exon_number "1";  class_code "u"; transcript_length "1390"; lncRNA_type "LincRNA"; 
chr17   StringTie   transcript  49197365    49198218    1000    -   .   gene_id "MSTRG.100023"; transcript_id "MSTRG.100023.1";  class_code "u"; transcript_length "854"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    49197365    49198218    1000    -   .   gene_id "MSTRG.100023"; transcript_id "MSTRG.100023.1"; exon_number "1";  class_code "u"; transcript_length "854"; lncRNA_type "LincRNA";
chr17   StringTie   transcript  49191640    49196359    1000    +   .   gene_id "MSTRG.100024"; transcript_id "MSTRG.100024.1";  class_code "u"; transcript_length "4720"; lncRNA_type "LincRNA"; 
chr17   StringTie   exon    49191640    49196359    1000    +   .   gene_id "MSTRG.100024"; transcript_id "MSTRG.100024.1"; exon_number "1";  class_code "u"; transcript_length "4720"; lncRNA_type "LincRNA"; 
chr17   StringTie   transcript  49198593    49199192    1000    -   .   gene_id "MSTRG.100025"; transcript_id "MSTRG.100025.1";  class_code "u"; transcript_length "600"; lncRNA_type "LincRNA";

I also have a Int.csv file with ids like below: (The original csv file is with 55,000 ids)

"t_id"
"MSTRG.100023.1"
"MSTRG.100031.2"
"MSTRG.100038.1"
"MSTRG.100066.1"
"MSTRG.10013.2"
"MSTRG.10013.3"
"MSTRG.100143.1"
"MSTRG.100147.1"
"MSTRG.100150.1

I am working on cluster (SLURM). I used following command.

grep -f Int.csv Final.gtf > Transcripts_step.gtf

It is taking huge time so, I wrote a script and submitted a job using sbatch. For this I created run.sh script like below:

#!/bin/bash

# Set shell for job
#SBATCH --cpus-per-task=8
#SBATCH --mem=240G
#SBATCH --time=05:59:59

if [ -f ~/.bashrc ] ; then
    . ~/.bashrc
fi

echo "##CMD:>" $*
echo "##" `date`
eval $*
echo "##" `date`

Then I submitted a job like below:

sbatch run.sh "grep -f Int.csv Final.gtf > Transcripts_step.gtf"

The job had run more than 2 hours and the job got killed.

##CMD:> grep -f Int.csv Final.gtf > Transcripts_step.gtf

/var/lib/slurm/slurmd/job13500184/slurm_script: line 21: 21411 Killed                  
grep -f Int.csv Final.gtf > Transcripts_step.gtf

slurmstepd: error: Detected 1 oom-kill event(s) in step 13500184.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

What could be the problem and how to resolve this?

gtf slurm grep shell bash • 1.2k views
ADD COMMENT
0
Entering edit mode

Even with 240G of RAM the job was killed. Well ...

You may want to take a look at AGAT toolkit and see if any of the filter options would work here. You may also want to try a fixed string (-F) search with grep to speed things up.

ADD REPLY
0
Entering edit mode

you mean grep -F Int.csv Final.gtf > Transcripts_step.gtf ?

ADD REPLY
0
Entering edit mode

In general, grep -f does not scale well. For tasks similar to this, here's what I had done previously:

  1. Modify the GTF file to add two more columns. You can use awk for this. One column is just a serial number (if you are using awk you can use NR for this) and the second column in your case would be the MSTRG identifier copied from column 9 of GTF.
  2. Sort the "table" on identifier column
  3. Use join command to join the ID file and the modified GTF file
  4. Sort the output from step 3 on the line number column added in step 1 (this is so that the lines in GTF remain in the same order they were in the starting file) and remove that column from the output to get a valid GTF file.
ADD REPLY

Login before adding your answer.

Traffic: 1633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6