7.3 years ago
Paul
I have a number of paired-end and single-end read sets that I want to map to a reference genome.
TaxonID SRR files
1448592 SRR1172918 SRR1175065 SRR1184297 SRR1196515
1448462 SRR1180190 SRR1181352 SRR1181404 SRR1183042
1402586 SRR1011524 SRR1019194
1448524 SRR1172749 SRR1173120 SRR1184340 SRR1196497
1295800 SRR833218 SRR1011520
Each SRR number is a folder of its own, containing paired-end and single-end reads. My aim is to read every SRR folder for a particular TaxonID and map the reads to a single reference genome.
Please suggest a way to do this.
I have the following script, but I think it requires all the files to be in a single folder:
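# Loop over the forward-read files directly instead of parsing ls output
for R1 in SRR*_P1.fastq ; do
    [ -e "$R1" ] || continue       # skip if the glob matched nothing
    F=${R1%_P1.fastq}              # sample prefix, e.g. SRR1172918
    R2=${F}_P2.fastq
    # Align the read pair against the bowtie index "Trinity" and write SAM
    bowtie --all -S Trinity -1 "$R1" -2 "$R2" > "${F}.sam"
    # Convert SAM to BAM
    samtools view -S -b "${F}.sam" > "${F}.bam"
done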
Please suggest a way to map the multiple read sets to a reference genome, whether with an R package or a shell script.
Do you want to generate a separate BAM file for every SRR subfolder?
Oh! Can I generate a single BAM file for all the subfolders? Would that be fine? In that case, I could put all the SRR files in a single folder and map them against a single reference sequence. But I have thousands of SRR files.
It depends on the data you have. If each SRR represents a different experiment or condition, it is better to generate a separate BAM for every SRR. You can also merge replicates (if you have any) after generating the BAMs.
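For the one-BAM-per-SRR route, something along these lines might work without moving any files. It is only a sketch under several assumptions that are not confirmed in the thread: the TaxonID table is saved as a whitespace-separated text file (called taxon_table.txt here, a made-up name), each SRR folder contains <SRR>_P1.fastq and <SRR>_P2.fastq, and "Trinity" is the bowtie index name as in the script from the question. The function name emit_mapping_commands is also just illustrative. Rather than running anything directly, it prints one command line per SRR so the list can be checked first; single-end files are not handled because their naming is not given.

```shell
# Name of the bowtie index; "Trinity" is taken from the script in the question.
INDEX=${INDEX:-Trinity}

# Read "TaxonID SRR1 SRR2 ..." lines from stdin and print one mapping
# command per SRR folder. Nothing is executed here.
emit_mapping_commands() {
    while read -r taxon rest; do
        # Skip the header line and blank lines (TaxonID must be all digits)
        case $taxon in
            ''|*[!0-9]*) continue ;;
        esac
        for srr in $rest; do
            # Assumed layout: <SRR>/<SRR>_P1.fastq and <SRR>/<SRR>_P2.fastq
            echo "bowtie --all -S $INDEX -1 $srr/${srr}_P1.fastq -2 $srr/${srr}_P2.fastq | samtools view -S -b - > $srr/$srr.bam"
        done
    done
}
```

To use it, run emit_mapping_commands < taxon_table.txt to inspect the commands, then pipe the same output to sh once they look right. With thousands of SRR folders, the printed list could also be split up and submitted to a cluster. If some SRRs turn out to be replicates, their BAMs can be sorted and combined afterwards with samtools sort and samtools merge.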