Entering edit mode
4.4 years ago
screadore
▴
20
I need help converting histat aligner into star aligner for this script. Please let me know what I could do. Link to the full code is here: https://github.com/JustinGibbons/USF_Omics_Hub/blob/master/RNA-Seq_Pipeline/rna_seq_pipeline.py
def write_out_hisat_array_slurm_file(arguments_file,outfile="hisat_array_commands.sh"):
"""Uses data from input file to write out the hisat commands as an array job"""
arguments_dic=read_key_word_input_file(arguments_file,write_output_dir=True)
#hisat_array_name="hisat_array"
#sam_array="sam_array" I don't think I need this for right now
samtools_sort_array_name="samtools_sort_array"
#bam_array="bam_array" I don't think I need this for anything right now
task_id="$SLURM_ARRAY_TASK_ID"
workdir=arguments_dic["main_output_dir"]
print workdir
#Create the directory for the alignment files
alignment_dir="Alignments"
os.mkdir(workdir+"/"+alignment_dir)
#Dictionary pointing from sample names to reads
if "Data_Paired_End" not in arguments_dic:
data_paired_end=True
read_dict=find_matching_reads(get_fastq_files(arguments_dic["sample_dir"]))
else:
data_paired_end=arguments_dic["Data_Paired_End"]
if data_paired_end=="TRUE" or data_paired_end=="true" or data_paired_end=="True" or data_paired_end=="T":
data_paired_end=True
read_dict=find_matching_reads(get_fastq_files(arguments_dic["sample_dir"]))
else:
data_paired_end=False
read_dict=find_unmatched_reads(get_fastq_files(arguments_dic["sample_dir"]))
#List of slurm options to be written on their own lines
slurm_commands=create_slurm_header(workdir=arguments_dic["main_output_dir"],
time=":".join([arguments_dic["hours"],arguments_dic["minutes"],"00"]),
ntasks=arguments_dic["threads"],
mem=arguments_dic["mem"],
mail_user=arguments_dic["email_to_send_output_to"],submit_to_rra=arguments_dic["submit_to_rra"])
slurm_commands2=create_slurm_header(workdir=arguments_dic["main_output_dir"],
time=":".join([arguments_dic["hours"],arguments_dic["minutes"],"00"]),
ntasks=arguments_dic["threads"],
mem=arguments_dic["mem"],
mail_user=arguments_dic["email_to_send_output_to"],submit_to_rra=arguments_dic["submit_to_rra"])
hisat_command_input,sam_files=construct_hisat_array_command(index_files_path_and_basename=arguments_dic["genome_index"],
read_dict=read_dict,
number_of_processors=arguments_dic["threads"],
input_dict=arguments_dic,alignment_dir=alignment_dir,data_paired_end=data_paired_end)
if data_paired_end==True:
read1_array,read2_array,hisat_output_array,hisat_command,deref_hisat_outfiles=hisat_command_input
elif data_paired_end==False:
unpaired_reads_array,hisat_output_array,hisat_command,deref_hisat_outfiles=hisat_command_input
#print hisat_command_input
bam_array,samtools_sort_command,bam_files=construct_samtools_commands(sam_files,deref_hisat_outfiles,threads=arguments_dic["threads"])
arguments_dic["alignment_files"]=bam_files
job_size,cufflinks_input_array_string,cufflinks_outputdirs_array_string,cufflinks_command=construct_cufflinks_array_command(arguments_dic)
gff_file=arguments_dic["gff"]
if "gff_annotation" in arguments_dic:
id_value=arguments_dic["gff_annotation"]
else:
id_value="gene_id"
featurecounts_command_list=construct_feature_counts_command(gff_file,bam_files,id_value=id_value)
featurecounts_command=" ".join(featurecounts_command_list)
normalization_groups=arguments_dic["normalization_groups"]
normalization_method=arguments_dic["normalization_method"]
cuffnorm_command_list=construct_cuffnorm_command(gff_file,bam_files,alignment_dir+"/",groupings=normalization_groups,normalization_method=normalization_method)
cuffnorm_command=" ".join(cuffnorm_command_list)
narray=len(sam_files)
##May want to include a warning here if only one job is allowed to run at a time
narray_arg="#SBATCH --array=0-"+str(narray-1)+"%"+str(narray)
slurm_commands.append(narray_arg)
slurm_commands.append("#SBATCH --output=rna-seq_pipeline.out")
slurm_commands.append("module purge")
slurm_commands.append("module load apps/hisat2/2.1.0")
slurm_commands.append("module load apps/samtools/1.3.1")
slurm_commands.append("module load apps/cufflinks/2.2.1")
slurm_commands2.append("#SBATCH --output=rna-seq_pipeline2.out")
slurm_commands2.append("module load apps/cufflinks/2.2.1")
slurm_commands2.append("module load apps/subread/1.6.3")
This is a huge mess of a python script that writes a script to be submitted to slurm based on input parameters. You're better off using the sample script file in the github repo and replacing hisat2 with STAR in that. Changing this script is going to take a TON of time as it is insanely specific to the hisat2 pipeline and the environment it's running. It's spaghetti code at its best (worst?).
most of this script has nothing to do with HISAT2 or STAR and is just the developer reinventing what pipeline frameworks do
Exactly. This is what obliterates any sense of portability in the scripts - they're tied too closely to the tools AND the environment. There's no plug and play here.
Thank you very much for the reply. Could you give me some insight into how to go about replacing the hisat2 with STAR in the sample script file? I'd greatly appreciate it. Thank you again.
Sorry, that is not a level of help I'm willing to offer. Please read the hisat2 and STAR manual and see what changes need to be made.
Just use nf-core RNAseq - it can do both
The only issue is for the lab we need to use the pipeline provided above.
Dare I ask why? If you are replacing the core program, it is not the same pipeline anymore.
I honestly don't know it's one of my colleagues issue and I'm just trying to help her figure it out.
I would try to figure out why she is doing this in the first place. If change one step, downstream steps may break. You could be troubleshooting for a while and it may be irrelevant in the end.
If that's the case, your colleague should be the one making the post here. Any effort put into making this script work with STAR is wasted, as you'll spend 10x the time understanding this script as you will into switching to STAR.
then you'll need to look at the
construct_hisat_array_command
and write a correspondingconstruct_star_array_command
and obviously change themodule
command