I want to perform alignment using Hisat2 for single-ended thousands of samples and each sample distributed among different libraries.
I have modified this script (https://www.biostars.org/p/223404/#224169):
#!/bin/bash
for f in `ls data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/*.fastq.bz2_trimmed.fq.gz | sed 's/.fastq.bz2_trimmed.fq.gz//g' | sort -u`
do
echo HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U ${f}.fastq.bz2_trimmed.fq.gz -S ${f}.bam
done
It gives me echo as:
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460433.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460433.bam
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460434.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460434.bam
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460435.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460435.bam
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460436.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460436.bam
As this is the same sample (Alst-1_6989) is distributed among different lanes (SRR3460433,SRR3460434,SRR3460435,SRR3460436) which should be joined by comma not as a separate command as follows and I want the name of the sample (Alst-1_6989) in the output file (Alst-1_6989.bam), currently its name of the distributed library. Its just one example I have thousands of sample with a variable number of distributed library, so we need to keep this thing in mind.
HISAT/hisat2-2.1.0/hisat2 --p 8 --min-intronlen 60 --max-intronlen 6000 --dta -x Hisat2_index/arabidopsis -U data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460433.fastq.bz2_trimmed.fq.gz,data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460434.fastq.bz2_trimmed.fq.gz,data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460435.fastq.bz2_trimmed.fq.gz,data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/SRR3460436.fastq.bz2_trimmed.fq.gz -S data/1547_2015/Accessions/Alst-1_6989/transcriptome/fastq/trim/Alst-1_6989.bam
Any help will be highly appericated.
I have made a layout.txt as:
Modified script:
Error:
ATpoint Can you please help with it?
should do it, so double quotes around the $p.
ATpoint I got this error now, does that mean we need to adjust .bam file in the code?
Output of
samtools --version
?ATpoint the previous version of samtools was 1.7 but now i updated to latest version (1.9) but still getting this error:
The error means basically that no data arrive at samtools, so there is an issue with the alignment. The commands work fine for me, maybe something with your files not being properly catted. Difficult to tell without being in front of your computer, sorry. Maybe try to put all the relevant fastq files into one folder, then
cd
into that folder and run the script from there.ATpoint I am trying to understand the script, can you please comment so that I can understand it properly. Normally we pass single ended fastq files seperated by comma, but I didn't see anything after -U - , can you please also explain this thing?
The
-
indicates reading fromstdin
. Thecat
command concatenated the individual fastq files that are then passed over tohisat
(that is called a Unix pipe, stdin/stdout). This is a simplified version of dealing with multiple lane replicates (well at least if it works... :-D ) avoiding any comma-separated lists. I frequently use these kinds of scripts and have good experience with them.But look, if it doesn't work for you skip my suggestion and try something else. You need a certain experience with these pipes and nested scripts to properly handle them. It is only one of many options to solve your task. Just try something else ;-) The simplest thing would be to
cat
together the individual runs into one fastq file, save that to disk and then align as you did with the individual runs.