I'm new to RNA-seq data sets (and to programming in general), and so far I have only analyzed sample data from a single chromosome, taken from various papers. My question is: aside from the extra time needed to process the samples, how different would a basic shell script look for a full data set? For instance, here is a simple script I wrote for aligning data from a single chromosome:
#!/usr/bin/env bash
set -euo pipefail

SAMPLES=chrX_data/files.txt        # one sample name per line
IDX=chrX_data/indexes/chrX_tran    # HISAT2 index prefix
CPUS=8
mkdir -p sam

for SAMPLE in $(cat "$SAMPLES")
do
    R1=chrX_data/samples/${SAMPLE}_chrX_1.fastq
    R2=chrX_data/samples/${SAMPLE}_chrX_2.fastq
    SAM=sam/${SAMPLE}_chrX.sam     # write into the sam/ directory created above
    hisat2 -p "$CPUS" --dta -x "$IDX" -1 "$R1" -2 "$R2" -S "$SAM"
done
How much would I have to rework this script for data from all chromosomes? Assuming I use an Illumina sequencer, does each chromosome get its own FASTQ file, which I would then need to concatenate, or does all the data from one sample come in a single FASTQ file (assuming single-end reads)?
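For what it's worth, the sequencer has no notion of chromosomes: after demultiplexing, Illumina output is one FASTQ per sample and read direction (often also split per lane), and a read's chromosome is only known once it has been aligned. So the script above should mostly need different paths and a whole-genome index. Here is a rough sketch of what that could look like. The `genome_data/` layout, the `genome_tran` index prefix, the `${sample}_1.fastq` naming, and the `DRY_RUN` switch are all assumptions made up for illustration, not anything a real data set guarantees.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical layout: adjust these to wherever your data actually lives.
SAMPLES=${SAMPLES:-genome_data/files.txt}        # one sample name per line
IDX=${IDX:-genome_data/indexes/genome_tran}      # whole-genome HISAT2 index prefix
FASTQ_DIR=${FASTQ_DIR:-genome_data/samples}
OUT_DIR=${OUT_DIR:-sam}
CPUS=${CPUS:-8}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the command, 0 = actually run hisat2

# Align one paired-end sample, or just print the command in dry-run mode.
align_sample() {
    local sample=$1
    local cmd=(hisat2 -p "$CPUS" --dta -x "$IDX"
               -1 "$FASTQ_DIR/${sample}_1.fastq"
               -2 "$FASTQ_DIR/${sample}_2.fastq"
               -S "$OUT_DIR/${sample}.sam")
    if [[ "$DRY_RUN" == 1 ]]; then
        echo "${cmd[*]}"
    else
        mkdir -p "$OUT_DIR"
        "${cmd[@]}"
    fi
}

if [[ -r "$SAMPLES" ]]; then
    # Read sample names line by line, skipping empty lines.
    while IFS= read -r sample || [[ -n "$sample" ]]; do
        [[ -n "$sample" ]] || continue
        align_sample "$sample"
    done < "$SAMPLES"
else
    echo "sample list not found: $SAMPLES" >&2
fi
```

Running it as is only prints the hisat2 commands, which is a cheap way to check the paths; `DRY_RUN=0` runs the alignments. For single-end reads, the `-1`/`-2` pair would be replaced by `-U` with one file. If a sample is split across lanes, hisat2 accepts comma-separated file lists for `-1`, `-2`, and `-U`, so concatenating beforehand is optional.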
Thanks for the answers guys. Much appreciated.