Question

Confused with output of STAR aligner for paired end reads and need advice on closely related SNPs analyses

0

3.0 years ago

mohsamir2016 ▴ 30

Dear all,

I am aligning paired-end reads from two closely related chicken breeds against the chicken reference genome. In each breed, I have 5 individuals (samples), and each individual has two files (R1 and R2).

I run the alignment with STAR from the directory containing all the fastq files of the 5 samples, using a for loop with this command:

STAR --runMode alignReads --genomeDir IndexRef/ --outSAMtype BAM SortedByCoordinate --readFilesIn ${file} --outFileNamePrefix mapped/L10/${file} --runThreadN 12

It produced two files for each sample (ending in R1.Aligned.sortedByCoord.out.bam and R2.Aligned.sortedByCoord.out.bam). As I understand it, these two files are unsorted BAMs, each with its own statistics on % mapping, etc. I am confused about which of the two is considered the final alignment file for the sample. Are the two files combined afterwards when running samtools on them? I assume there should be a single BAM file per sample to be analyzed and visualized in a genome browser?

Another question: my two breeds are closely related, so sensitivity is important for picking up the SNPs that differ between them. Do you think the command above gives a highly sensitive alignment, or do I need to add more options?

Thanks

RNA-seq • 4.1k views

ADD COMMENT • link updated 3 months ago by GenoMax 154k • written 3.0 years ago by mohsamir2016 ▴ 30

0

I actually tried to run the 5 samples (each paired-end) in one go using a bash script. The script was:

#!/bin/bash
STAR --runMode alignReads --genomeDir IndexRef/ --outSAMtype BAM SortedByCoordinate --readFilesIn R0629-S0001_L10AU1_A56592_1_HGFCJDSX2_TTACCGAC-CGTATTCG_L003_R1_trimmed.fastq R0629-S0001_L10AU1_A56592_1_HGFCJDSX2_TTACCGAC-CGTATTCG_L003_R2_trimmed.fastq --outFileNamePrefix mapped/L10/BAM_L10 --runThreadN 12
STAR --runMode alignReads --genomeDir IndexRef/ --outSAMtype BAM SortedByCoordinate --readFilesIn R0629-S0005_L10BU1_A56596_1_HGFCJDSX2_AAGACCGT-CAATCGAC_L003_R1_trimmed.fastq R0629-S0005_L10BU1_A56596_1_HGFCJDSX2_AAGACCGT-CAATCGAC_L003_R2_trimmed.fastq --outFileNamePrefix mapped/L10/BAM_L10 --runThreadN 12
STAR --runMode alignReads --genomeDir IndexRef/ --outSAMtype BAM SortedByCoordinate --readFilesIn R0629-S0009_L10CU1_A56600_1_HGFCJDSX2_CAGGTTCA-GGCGTTAT_L003_R1_trimmed.fastq R0629-S0009_L10CU1_A56600_1_HGFCJDSX2_CAGGTTCA-GGCGTTAT_L003_R2_trimmed.fastq --outFileNamePrefix mapped/L10/BAM_L10 --runThreadN 12
STAR --runMode alignReads --genomeDir IndexRef/ --outSAMtype BAM SortedByCoordinate --readFilesIn R0629-S0014_L10DU1_A56605_1_HGFCJDSX2_AGCCTATC-GTTACGCA_L003_R1_trimmed.fastq R0629-S0014_L10DU1_A56605_1_HGFCJDSX2_AGCCTATC-GTTACGCA_L003_R2_trimmed.fastq --outFileNamePrefix mapped/L10/BAM_L10 --runThreadN 12
STAR --runMode alignReads --genomeDir IndexRef/ --outSAMtype BAM SortedByCoordinate --readFilesIn R0629-S0017_L10EU1_A56608_1_HGFCJDSX2_TTGCGAGA-GTGCCATA_L003_R1_trimmed.fastq R0629-S0017_L10EU1_A56608_1_HGFCJDSX2_TTGCGAGA-GTGCCATA_L003_R2_trimmed.fastq --outFileNamePrefix mapped/L10/BAM_L10 --runThreadN 12

This actually produced just one BAM file! I was expecting 5 BAM files for the 5 samples. Any comment on the code? Shall I leave space after the command?

Thanks

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 3.0 years ago by mohsamir2016 ▴ 30

1

As noted by @swbarnes2, you need to use a unique name in --outFileNamePrefix mapped/L10/**UNIQUE_NAME_HERE** in each command so that the five sets of result files end up with unique names.
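
To make the overwriting concrete, here is a minimal sketch. It relies on STAR naming its coordinate-sorted output as the prefix followed by Aligned.sortedByCoord.out.bam. The helper bam_path and the short sample IDs are only illustrative (the IDs are lifted from the fastq names in the script above), and nothing here runs STAR.

```bash
#!/bin/bash
# Hypothetical helper: STAR names a coordinate-sorted BAM
# <prefix>Aligned.sortedByCoord.out.bam, so the prefix alone decides the path.
bam_path() {
    printf '%sAligned.sortedByCoord.out.bam\n' "$1"
}

# Sample IDs taken from the fastq names in the script above.
samples="L10AU1 L10BU1 L10CU1 L10DU1 L10EU1"

# Shared prefix: all five runs map to one path, so each run overwrites the last.
for s in $samples; do bam_path "mapped/L10/BAM_L10"; done | sort -u

# Per-sample prefix: five distinct paths.
for s in $samples; do bam_path "mapped/L10/${s}_"; done | sort -u
```

The first loop prints a single path and the second prints five, which is the difference between one surviving BAM and one BAM per sample.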

ADD REPLY • link 3.0 years ago by GenoMax 154k

0

Can you explain exactly what part of the code you think tells the software to make 5 different BAMs, instead of overwriting the same one over and over again?

ADD REPLY • link 3.0 years ago by swbarnes2 15k

Ram · Answer 1 · 2022-10-23

1

3.0 years ago

swbarnes2 15k

You need to redo the alignments. STAR needs R1 and its matching R2 together as input.
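
A rough sketch of what that could look like as a loop, assuming the mates are named *_R1_trimmed.fastq and *_R2_trimmed.fastq as in the file names posted in this thread. The helper star_cmd is a made-up name, and the STAR options are simply the ones from the original command. As written it is a dry run that only prints the commands.

```bash
#!/bin/bash
# Hypothetical helper: print the STAR command for one sample, given its R1 file.
# Assumes mates are named *_R1_trimmed.fastq and *_R2_trimmed.fastq.
star_cmd() {
    sample=${1%_R1_trimmed.fastq}          # unique per-sample stem
    echo STAR --runMode alignReads --genomeDir IndexRef/ \
        --outSAMtype BAM SortedByCoordinate \
        --readFilesIn "${sample}_R1_trimmed.fastq" "${sample}_R2_trimmed.fastq" \
        --outFileNamePrefix "mapped/L10/${sample}_" --runThreadN 12
}

# Dry run: only prints the commands. Remove "echo" in star_cmd to run STAR.
for r1 in *_R1_trimmed.fastq; do
    [ -e "$r1" ] || continue               # skip the literal pattern if nothing matches
    star_cmd "$r1"
done
```

Looping over the R1 files only, and deriving the R2 name from each one, keeps every pair together and gives each run its own output prefix.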

ADD COMMENT • link 3.0 years ago by swbarnes2 15k

0

Dear @swbarnes2,

Thanks for the answer. So you are telling me that in each alignment job I need to supply only two fastq files, R1 and R2 of the same sample? So what do I do if I have 5 samples and need to align all of them at once? That is why I made the loop.

Thanks

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 3.0 years ago by mohsamir2016 ▴ 30

1

STAR manual has this relevant section:

Multiple samples can be mapped in one job. For single-end reads use a comma separated list (no spaces around commas), e.g. --readFilesIn sample1.fq,sample2.fq,sample3.fq. For paired-end reads, use comma separated list for read1 /space/ comma separated list for read2, e.g.: --readFilesIn sample1read1.fq,sample2read1.fq,sample3read1.fq sample1read2.fq,sample2read2.fq,sample3read2.fq
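
As a small illustration of the paired-end form, the sketch below joins the manual's example file names with commas and prints the resulting argument. The function join_by_comma is a hypothetical helper, not part of STAR.

```bash
#!/bin/bash
# Hypothetical helper: join all arguments with commas and no spaces.
# The body runs in a subshell, so the IFS change does not leak out.
join_by_comma() (
    IFS=,
    printf '%s\n' "$*"
)

# File names from the manual's example. Both lists must be in the same sample order.
read1=$(join_by_comma sample1read1.fq sample2read1.fq sample3read1.fq)
read2=$(join_by_comma sample1read2.fq sample2read2.fq sample3read2.fq)

# read1 list, one space, then read2 list.
printf '%s\n' "--readFilesIn $read1 $read2"
```

Keep in mind that one job writes one set of output files, so this form fits best when the samples do not need to be kept apart. For per-individual SNP work, separate runs (one BAM per sample) may well be the simpler route.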

ADD REPLY • link 3.0 years ago by GenoMax 154k

0

Thanks a lot

ADD REPLY • link 3.0 years ago by mohsamir2016 ▴ 30

0

I have 24 samples (paired-end), do I need to specify name of every file separated by commas? Is there any automated way, like it works for single-ended?

ADD REPLY • link 3 months ago by Aastha • 0

0

do I need to specify name of every file separated by commas?

Yes. That said, if you are using a compute cluster it may be simpler to submit 24 separate alignment jobs to parallelize the process.

Is there any automated way, like it works for single-ended?

You can easily create a list of comma separated filenames by doing something like

$ ls -1 *R1*.fastq | paste -sd, -
file01_R1.fastq,file02_R1.fastq,file03_R1.fastq,file04_R1.fastq,file05_R1.fastq,file06_R1.fastq,file07_R1.fastq,file08_R1.fastq,file09_R1.fastq,file10_R1.fastq,file11_R1.fastq

Repeat for *R2* files. You can then copy paste these names into your STAR command line.
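
If you would rather not copy and paste, both lists can be built in one pass. This is only a sketch: it assumes the mates differ by just _R1 versus _R2 in the file name, and read_files_in is a made-up function name. Deriving each R2 name from its R1 guarantees the two lists are in the same order, and the function stops if a mate is missing. The demo runs on empty placeholder files in a temporary directory.

```bash
#!/bin/bash
# Hypothetical helper: print "<R1 list> <R2 list>" for the *_R1*.fastq files
# in the current directory. Assumes mates differ only by _R1 / _R2.
read_files_in() {
    r1_list="" r2_list=""
    for r1 in *_R1*.fastq; do
        [ -e "$r1" ] || { echo "no *_R1*.fastq files found" >&2; return 1; }
        r2="${r1%%_R1*}_R2${r1#*_R1}"      # swap the first _R1 for _R2
        [ -f "$r2" ] || { echo "missing mate for $r1" >&2; return 1; }
        r1_list="${r1_list:+$r1_list,}$r1" # comma only between entries
        r2_list="${r2_list:+$r2_list,}$r2"
    done
    printf '%s %s\n' "$r1_list" "$r2_list"
}

# Demo on empty placeholder files in a temporary directory.
demo=$(mktemp -d)
( cd "$demo" &&
  touch file01_R1.fastq file01_R2.fastq file02_R1.fastq file02_R2.fastq &&
  read_files_in )
rm -r "$demo"
```

In a real directory, something like STAR ... --readFilesIn $(read_files_in) would drop both lists into the command, provided the file names contain no spaces.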

ADD REPLY • link 3 months ago by GenoMax 154k