Hello everyone, I need some help to set up and run a hisat2 pipeline with Snakemake.
In fact I have to run 3 different hisat2 mappings on 3 assemblies.
For that I have a dataframe with the read IDs I want to map for each assembly name, such as:
SRA_accession Assembly
SRR1 Assembly1
SRR2 Assembly1
SRR3 Assembly1
SRR1 Assembly2
SRR2 Assembly2
SRR3 Assembly2
SRR1 Assembly3
SRR2 Assembly3
where the assemblies can be found here: /data/Genomes/AssemblyN/Assembly.fa
and the SRA reads for each assembly here:
/data/Genomes/AssemblyN/reads/
So for now I use a Python script to generate the following commands (here is the example for Assembly1):
#I download all the reads by doing (the SRA accessions here are fake, of course):
/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/prefetch --max-size 100000000 SRR1 && /beegfs/data/bguinet/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/fasterq-dump -t /data/Genomes/Assembly1/reads/ --threads 10 -f SRR1 -O /data/Genomes/Assembly1/reads/
pigz --best /data/Genomes/Assembly1/reads/SRR1*
/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/prefetch --max-size 100000000 SRR2 && /beegfs/data/bguinet/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/fasterq-dump -t /data/Genomes/Assembly1/reads/ --threads 10 -f SRR2 -O /data/Genomes/Assembly1/reads/
pigz --best /data/Genomes/Assembly1/reads/SRR2*
/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/prefetch --max-size 100000000 SRR3 && /beegfs/data/bguinet/TOOLS/sratoolkit.2.10.8-ubuntu64/bin/fasterq-dump -t /data/Genomes/Assembly1/reads/ --threads 10 -f SRR3 -O /data/Genomes/Assembly1/reads/
pigz --best /data/Genomes/Assembly1/reads/SRR3*
#Then I run the hisat2 software
hisat2 --dta -k 1 -q -x mapping_index -1 SRR1_1.fastq.gz,SRR2_1.fastq.gz,SRR3_1.fastq.gz -2 SRR1_2.fastq.gz,SRR2_2.fastq.gz,SRR3_2.fastq.gz 2> stats_mapping.txt | samtools view -b -o mapping_Assembly1.bam -
But I wondered if someone had an idea of how to do this simply with a Snakemake pipeline? I do not know how to handle the fact that I want to run specific reads against specific assemblies, nor how to pass a list of read IDs to hisat2. I'm really new to this topic and it would be amazing if someone could help me with it.
Thank you very much and have a nice day.
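A minimal Snakefile sketch along those lines, assuming the table above is saved as samples.tsv, that the reads already sit in /data/Genomes/{assembly}/reads/ as {srr}_1.fastq.gz / {srr}_2.fastq.gz, and that a hisat2 index has been built at /data/Genomes/{assembly}/mapping_index (all rule and file names here are illustrative, not the actual setup):

import pandas as pd

samples = pd.read_csv("samples.tsv", sep="\t")
# map each assembly name to the list of SRA accessions to align against it
ASSEMBLIES = {assembly: group["SRA_accession"].tolist()
              for assembly, group in samples.groupby("Assembly")}

rule all:
    input:
        expand("/data/Genomes/{assembly}/mapping_{assembly}.bam",
               assembly=ASSEMBLIES.keys())

# one hisat2 job per assembly, fed comma-separated lists of that assembly's reads
rule hisat2_map:
    input:
        r1=lambda wc: expand("/data/Genomes/{assembly}/reads/{srr}_1.fastq.gz",
                             assembly=wc.assembly, srr=ASSEMBLIES[wc.assembly]),
        r2=lambda wc: expand("/data/Genomes/{assembly}/reads/{srr}_2.fastq.gz",
                             assembly=wc.assembly, srr=ASSEMBLIES[wc.assembly]),
    output:
        "/data/Genomes/{assembly}/mapping_{assembly}.bam"
    params:
        index="/data/Genomes/{assembly}/mapping_index",  # assumed pre-built index
        r1=lambda wc, input: ",".join(input.r1),
        r2=lambda wc, input: ",".join(input.r2),
    log:
        "/data/Genomes/{assembly}/stats_mapping.txt"
    threads: 10
    shell:
        "hisat2 --dta -k 1 -q -p {threads} -x {params.index} "
        "-1 {params.r1} -2 {params.r2} 2> {log} "
        "| samtools view -b -o {output} -"

The downloading of the reads is handled further down in the thread and can be added as its own rule, so that the fastq files listed as input here get produced automatically.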
Thank you very much, it helps a lot!
Hello @Jeremy Leipzig, I used the code, but now when I try to run the prefetch rule it says that the file does not exist. Which is normal, since the files do not exist yet; they are the ones that will be written.
Here is the rule I used:
Ok yes, sorry, can you change input: to params: in your prefetch rule (and input.read becomes params.read)?
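A minimal sketch of what that change looks like, assuming a prefetch rule that fetches all accessions of one assembly in a single job (it reuses the ASSEMBLIES dict from the sketch above; the marker output file is purely illustrative). The point is that the accessions are identifiers to download, not files that already exist, so they belong under params:, not input::

rule prefetch:
    output:
        # dummy marker file, since prefetch itself decides where the .sra files go
        touch("/data/Genomes/{assembly}/reads/prefetch.done")
    params:
        # was input: read=...; moved to params.read because these are
        # accession IDs to download, not files already present on disk
        read=lambda wc: ASSEMBLIES[wc.assembly]
    shell:
        "prefetch --max-size 100000000 {params.read}"

When a params entry is a list, Snakemake joins it with spaces in the shell command, which is exactly the form prefetch expects.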
Ok thank you. Now, if I understand correctly, the code will have only one job where it downloads all the reads at once, right? How can I manage to create one job for each read? For instance I have 13 reads to download and I would like to create 13 jobs, because the code here is given a list of read names, so the prefetch command downloads them one after another instead of downloading only one.
please put your whole snakefile up in github so we can see it
Sure, here is the address: https://github.com/StromTroopers/biostar_ex/blob/main/README.md
Maybe you want to communicate in a chat instead?
So prefetch is multithreaded; you just need to provide it a space-delimited list of accessions, so just write a function that returns that for each set of SRRs you need to fetch for an assembly.
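For example, a small helper along those lines (a hypothetical name, again reusing the ASSEMBLIES dict from the earlier sketch), usable as a params function in the prefetch rule:

# return the accessions of one assembly as a single space-delimited string,
# ready to be passed to one prefetch call
def srr_list(wildcards):
    return " ".join(ASSEMBLIES[wildcards.assembly])

In the rule it would then be referenced as params: read=srr_list and used as {params.read} in the shell command.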
Yes, but in order to speed up the process I would like to create one job (with multiple cores) for each read independently. Do you think that is possible?
Yes, but then the download rule cannot be looking up multiple files; just offload that to another rule or function and then simply ask for all the files in your all rule.
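In other words, something like the sketch below, where the download rule only ever produces the files of a single accession, while the hisat2 rule (or rule all) from the earlier sketch requests all of them, so Snakemake can schedule 13 independent download jobs for 13 reads. The sratoolkit binaries are assumed to be on the PATH here; the full paths from the original commands work just as well.

# one job per (assembly, accession): download, dump and compress a single SRR
rule get_reads:
    output:
        r1="/data/Genomes/{assembly}/reads/{srr}_1.fastq.gz",
        r2="/data/Genomes/{assembly}/reads/{srr}_2.fastq.gz",
    params:
        outdir="/data/Genomes/{assembly}/reads"
    threads: 10
    shell:
        "prefetch --max-size 100000000 {wildcards.srr} && "
        "fasterq-dump -t {params.outdir} --threads {threads} -f "
        "{wildcards.srr} -O {params.outdir} && "
        "pigz --best {params.outdir}/{wildcards.srr}_*.fastq"

Because the hisat2 rule's input already lists every {srr}_1.fastq.gz / {srr}_2.fastq.gz for an assembly, rule all only has to ask for the final BAM files and the per-read download jobs fall out of the dependency graph on their own.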