weird STAR output, bash scripting problem
2
0
Entering edit mode
7.2 years ago

Hi all. I am trying to align PE reads with STAR. I have trimmed files (used trimmomatic) that look like *_1.1P.fastq.gz *_1.1U.fastq.gz *_2.2P.fastq.gz *_2.2U.fastq.gz . I only want to align the*_1.1P.fastq.gzand*_2.2P.fastq.gz files. Here is my code:

#!/bin/bash
mkdir -p alignments
path='/trimmed/alignments/'
for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq)

do
STAR --genomeDir /indices/human --readFilesIn ${i}.fastq.gz ${i}.fastq.gz --runThreadN 8 --outFileNamePrefix /trimmed/alignments/${i%.fastq.gz}.cd177negtreg_tumor.Out --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat

done

When I run this, STAR seems to be aligning each _1 and _2 read separately instead of treating them as paired end reads. For instance, my output files look like this:

filtered.ABC4454232_1.1P.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam filtered.ABC4454232_2.2P.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam

It should just be outputing one .bam file for this sample, such as. filtered.ABC4454232.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam

Any ideas what is wrong? Thanks!

RNA-Seq STAR bash • 3.2k views
ADD COMMENT
2
Entering edit mode
--readFilesIn ${i}.fastq.gz ${i}.fastq.gz

You are instructing STAR to use the same fastq twice for a given value of ${i}

ADD REPLY
1
Entering edit mode

Have you looked at what values i is getting for each iteration of the loop?

On a different note: Are you just running these scripts (or did you write them yourself)?

In general it would be simpler to do something like ls -1 *_1.1P.fastq.gz | sed 's/_1.1P.fastq.gz//' to grab the file names.

ADD REPLY
0
Entering edit mode

I found the script online but modified it to try to fit my data. echo $i spits out values such as these:

filtered.ABC4454232_1.1P
filtered.ABC4454232_2.2P
filtered.ABC4454233_1.1P
filtered.ABC4454233_2.2P
....

Which are the correct values, i guess i do not understand how to tell bash to differentiate between the _1.1P and its corresponding _2.2P

ADD REPLY
0
Entering edit mode

Have you tried basename that you were using in a different question you had asked yesterday to differentiate between the names?

ADD REPLY
0
Entering edit mode

Yes, I thought of that but not sure how to incorporate basename into my existing code.

ADD REPLY
3
Entering edit mode
7.2 years ago

Grab the file name, an then do like:

--readFilesIn ${i}_1.1P.fastq.gz ${i}_2.2P.fastq.gz
ADD COMMENT
0
Entering edit mode

Thank you, but when I try this the output .bam file keeps getting written over.

ADD REPLY
0
Entering edit mode

Can you show an example of a full name?

A good method of testing a loop is:

for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq)
do
echo STAR --genomeDir /indices/human --readFilesIn ${i}.fastq.gz ${i}.fastq.gz --runThreadN 8 --outFileNamePrefix /trimmed/alignments/${i%.fastq.gz}.cd177negtreg_tumor.Out --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat
done

Using the echo the command will not run but you will see how it looks like and what you should tweak further.

ADD REPLY
0
Entering edit mode

Exactly, because at every iteration STAR will output the file with the same name, and in the end for the loop you will get the file from the last iteration. To avoid this you can rename final files at each iteration.

for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq)
do
STAR --genomeDir /indices/human --readFilesIn ${i}_1.1P.fastq.gz ${i}_2.2P.fastq.gz other_commands 
mv Aligned.out.bam Log_${i}.bam
mv Log.final.out Log_${i}.final.out
mv Log.out Log_${i}.out
mv Log.progress.out Log_{i}.progress.out

done

Bear in mind that you'd need to rename all output files in each iteration, so that nothing would be overwritten by the next iteration.

ADD REPLY
0
Entering edit mode

Erm and what about just using ${i} as part of the --outFileNamePrefix? You are overcomplicating this by renaming files.

ADD REPLY
0
Entering edit mode

Yes, your option is definitely more clean. Nevertheless, the trick with renaming files might be also useful in other situations.

ADD REPLY
1
Entering edit mode
7.2 years ago
h.mon 35k

Your current question is essentially a duplicate of your previous question, and thats the reason I argued your previous question was not a bioinformatics question: you are just manipulating bash variables, and the bioinformatics software you want to use is of no importance to this. In fact, you could use the same solution as before for this problem, instead of the cumbersome $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq):

#!/bin/bash
mkdir -p alignments
path='/trimmed/alignments/'
for i in *_1.1P.fastq.gz
do
  base=${i%%_1.1P.fastq.gz}
  STAR --genomeDir /indices/human --readFilesIn ${base}_1.1P.fastq.gz ${base}_2.2P.fastq.gz \
    --runThreadN 8 --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat \
    --outFileNamePrefix /trimmed/alignments/${base}.cd177negtreg_tumor.Out 
done

As I hinted in my previous answer (and as others explicitly suggested here), have a look at your variables before running your commands:

for i in *_1.1P.fastq.gz
do
  base=${i%%_1.1P.fastq.gz}
  echo "file1 is ${base}_1.1P.fastq.gz, file2 is ${base}_2.2P.fastq.gz"
  echo "output will be /trimmed/alignments/${base}.cd177negtreg_tumor.Out"
done

P.S.: it is generally (though not unanimous) considered bad practice to parse the output os ls, the main reason being file names can have characters which will break your script (such as spaces) unless you account for them.

ADD COMMENT

Login before adding your answer.

Traffic: 1936 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6