Hi there,
you can do two-pass mapping manually by splitting up the process. I read about it in a Google group thread with the STAR developer Alex.
The example below is for single-end data with an average read length of 150 nt.
First, run the first STAR alignment:
cd /path_to_project/gap_table/
while read SAMPLE; do
  FILEBASE=$(basename "${SAMPLE%.fastq}")
  mkdir -p /path_to_project/gap_table/"$FILEBASE".STAR
  cd /path_to_project/gap_table/"$FILEBASE".STAR
  /path_to/STAR --outFilterType BySJout --outFilterMismatchNmax 10 --outFilterMismatchNoverLmax 0.04 --alignEndsType EndToEnd --runThreadN 8 --outSAMtype BAM SortedByCoordinate --alignSJDBoverhangMin 4 --alignIntronMax 300000 --alignSJoverhangMin 8 --genomeDir /path_to_STAR_index_directory/star_index_hg38_r149/ --sjdbOverhang 149 --sjdbGTFfile /path_to_genome_GTF_file/Homo_sapiens.GRCh38.91.gtf --outFileNamePrefix /path_to_project/gap_table/"$FILEBASE".STAR/ --readFilesIn "$SAMPLE"
done < Textfile_with_sample_names.txt
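The sample list read by the loop is just a plain-text file with one FASTQ path per line. One way to create it, assuming a hypothetical /path_to_project/fastq/ directory (adjust to wherever your FASTQ files actually live):

```shell
# Build the sample list: one FASTQ path per line.
# NOTE: /path_to_project/fastq/ is a hypothetical location used for
# illustration - point it at your real FASTQ directory.
ls /path_to_project/fastq/*.fastq > Textfile_with_sample_names.txt
```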
After a STAR alignment, a file called SJ.out.tab is generated. This file holds information about every gap (splice junction) that passed the criteria of the first STAR alignment, with at least one read overlapping it. During 2-pass mapping, you want to map the reads of all samples against all gaps found in the first alignment of all samples together. Therefore, in the next step, you gather all SJ.out.tab files and merge them into a single unique one. This can be done, for example, like this:
Generate the merged, unique SJ.out.tab:
cd /path_to_project/gap_table/
mkdir -p SJ_tabs
for i in /path_to_project/gap_table/*.STAR; do
  b=$(basename "${i%.STAR}")
  cp "$i"/SJ.out.tab /path_to_project/gap_table/SJ_tabs/"$b".sj_out1.tab
done
cd /path_to_project/gap_table/SJ_tabs/
cat *.tab > all_SJ_tabs.tab
awk '!x[$0]++' all_SJ_tabs.tab > Unique_SJ_all.tab
awk 'BEGIN {OFS="\t"; strChar[0]="."; strChar[1]="+"; strChar[2]="-";} {if($5>0){print $1,$2,$3,strChar[$4]}}' Unique_SJ_all.tab > SJin.tab
awk '!x[$0]++' SJin.tab > Unique_SJin.tabs
rm *.tab
mv Unique_SJin.tabs Unique_SJin.tab
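To make the awk filter above concrete, here is a toy run on two made-up SJ.out.tab lines. Per the STAR manual, the columns are: chromosome, intron start, intron end, strand (0/1/2 = undefined/+/-), intron motif (0 = non-canonical), annotated flag, uniquely mapping reads, multi-mapping reads, maximum overhang. The $5>0 test therefore keeps only canonical junctions:

```shell
# Two hypothetical junction lines: the first canonical (motif 1, + strand),
# the second non-canonical (motif 0), which the filter should drop.
printf 'chr1\t100\t200\t1\t1\t0\t5\t0\t30\nchr1\t300\t400\t2\t0\t0\t1\t0\t20\n' > toy_SJ.tab
awk 'BEGIN {OFS="\t"; strChar[0]="."; strChar[1]="+"; strChar[2]="-";} {if($5>0){print $1,$2,$3,strChar[$4]}}' toy_SJ.tab
# Only the first junction survives, printed as: chr1  100  200  +
```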
We now generate a new STAR index for the second STAR run, which incorporates all gaps found in the previous STAR run:
Generate the new STAR index:
cd /path_to_project/gap_table/
mkdir Genome2
/path_to/STAR --runMode genomeGenerate --genomeDir Genome2 --genomeFastaFiles /path_to_genome_fasta_file/GRCh38_r91.fa --runThreadN 8 --sjdbGTFfile /path_to_genome_GTF_file/Homo_sapiens.GRCh38.91.gtf --sjdbOverhang 149 --sjdbFileChrStartEnd /path_to_project/gap_table/SJ_tabs/Unique_SJin.tab --limitSjdbInsertNsj 2000000
Using the newly generated STAR index, we now run the STAR alignment a second time:
Second STAR alignment:
cd /path_to_project/gap_table/
while read SAMPLE; do
  FILEBASE=$(basename "${SAMPLE%.trim.fq}")
  mkdir -p /path_to_project/gap_table/"$FILEBASE".STAR_second_pass
  cd /path_to_project/gap_table/"$FILEBASE".STAR_second_pass
  /path_to_STAR/STAR --outFilterType BySJout --outFilterMismatchNmax 10 --outFilterMismatchNoverLmax 0.04 --alignEndsType EndToEnd --runThreadN 8 --outSAMtype BAM SortedByCoordinate --alignSJDBoverhangMin 4 --alignIntronMax 300000 --alignSJoverhangMin 8 --genomeDir /path_to_project/gap_table/Genome2/ --sjdbOverhang 149 --quantMode GeneCounts --sjdbGTFfile /path_to_genome_GTF_file/Homo_sapiens.GRCh38.91.gtf --limitSjdbInsertNsj 2000000 --readFilesIn "$SAMPLE"
done < Text_file_with_sample_names.txt
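Because --quantMode GeneCounts is enabled in the second pass, each sample folder will also contain a ReadsPerGene.out.tab with per-gene read counts. Column 2 holds the unstranded counts (columns 3 and 4 the two stranded options), and the first four rows are summary lines (N_unmapped, N_multimapping, N_noFeature, N_ambiguous). A sketch for pulling out the unstranded counts, with gene_counts.txt as a hypothetical output name:

```shell
# Skip the four summary rows, then keep gene ID (column 1) and the
# unstranded count (column 2).
tail -n +5 ReadsPerGene.out.tab | awk 'BEGIN{OFS="\t"} {print $1,$2}' > gene_counts.txt
```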
And that's it :)
I assume you are referring to a physical node on the cluster that has those specs (hopefully you have many more nodes like it). That does not automatically mean that all those resources are available to your job, especially if you are using a job scheduler. Are you using one? If so, which one, and what parameters are you using for the job submission?
Hi, I am using interactive mode on the cluster, not a batch script (I do not know if that answers your question). To start the interactive session I request the resources like this:
srun --pty -p interactive --mem 250G -c 20 -t 0-12:00:00 /bin/bash
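As a side note, you can confirm inside the interactive session what SLURM actually granted: SLURM exports the allocation as environment variables (names per the SLURM documentation; they are only set inside a job, so outside one they print "unset" here):

```shell
# SLURM_CPUS_ON_NODE: CPUs allocated on this node;
# SLURM_MEM_PER_NODE: memory allocated per node in MB (set when --mem is used).
echo "CPUs granted on this node: ${SLURM_CPUS_ON_NODE:-unset}"
echo "Memory granted (MB):       ${SLURM_MEM_PER_NODE:-unset}"
```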
Let me ask you this: do you see any other user processes running on that node (if you run top or htop)? Since you are not using the --exclusive[=user] option, my assumption is that there may be some other things running on that node. BTW, that srun command seems to indicate that your cluster uses the SLURM job scheduler, but you are not submitting a non-interactive job.

Exactly, my cluster uses SLURM. When I ran top and htop, it seemed that I was the only one using that node. I am going to try to open a new session with --exclusive and see what I get.

Try to lower the number of threads to 16, or even lower. STAR creates a lot of temporary files for each thread, and you may be hitting the limit of ulimit -n.

"No success" means trying to increase the limits resulted in an error message? Or no error message, but the limits were left unchanged? Or the limits were changed, but STAR crashed anyway? You can only increase the limits up to what is defined in /etc/security/limits.conf, so maybe you cannot increase them anyway - you may need to contact the system administrators.
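For reference, a sketch of how to inspect and raise the open-file limit in the current shell. The soft limit is what STAR actually runs against; a user can raise it only up to the hard limit, which the administrators set (e.g. in /etc/security/limits.conf):

```shell
ulimit -Sn                  # current soft limit on open files
ulimit -Hn                  # hard ceiling set by the admins
ulimit -n "$(ulimit -Hn)"   # raise the soft limit to the hard limit
ulimit -Sn                  # confirm the new value
```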
Hi! Thanks for your reply. By "no success" I meant that the limits were changed but STAR crashed anyway. I will try to lower the number of threads.
Hi guys, just an update: I tried both suggestions. I ran the SLURM job with the --exclusive flag, and I also tried running with a reduced number of cores. I still get the same Log.out error, and after trying to raise ulimit -v and ulimit -n, STAR still crashed. If anyone has any insight into this problem, I will be happy to read it. Anyway, thank you for helping me.

Can you post the actual STAR command you are using? What genome are you trying to align against (size)?