Question

weird STAR output, bash scripting problem

0

Entering edit mode

7.2 years ago

nikelle.petrillo ▴ 110

Hi all. I am trying to align PE reads with STAR. I have trimmed files (used trimmomatic) that look like *_1.1P.fastq.gz *_1.1U.fastq.gz *_2.2P.fastq.gz *_2.2U.fastq.gz . I only want to align the*_1.1P.fastq.gzand*_2.2P.fastq.gz files. Here is my code:

#!/bin/bash
mkdir -p alignments
path='/trimmed/alignments/'
for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq)

do
STAR --genomeDir /indices/human --readFilesIn ${i}.fastq.gz ${i}.fastq.gz --runThreadN 8 --outFileNamePrefix /trimmed/alignments/${i%.fastq.gz}.cd177negtreg_tumor.Out --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat

done

When I run this, STAR seems to be aligning each _1 and _2 read separately instead of treating them as paired end reads. For instance, my output files look like this:

filtered.ABC4454232_1.1P.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam filtered.ABC4454232_2.2P.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam

It should just be outputing one .bam file for this sample, such as. filtered.ABC4454232.cd177negtreg_tumor.OutAligned.sortedByCoord.out.bam

Any ideas what is wrong? Thanks!

RNA-Seq STAR bash • 3.2k views

ADD COMMENT • link updated 7.2 years ago by h.mon 35k • written 7.2 years ago by nikelle.petrillo ▴ 110

2

Entering edit mode

--readFilesIn ${i}.fastq.gz ${i}.fastq.gz

You are instructing STAR to use the same fastq twice for a given value of ${i}

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

1

Entering edit mode

Have you looked at what values i is getting for each iteration of the loop?

On a different note: Are you just running these scripts (or did you write them yourself)?

In general it would be simpler to do something like ls -1 *_1.1P.fastq.gz | sed 's/_1.1P.fastq.gz//' to grab the file names.

ADD REPLY • link 7.2 years ago by GenoMax 147k

0

Entering edit mode

I found the script online but modified it to try to fit my data. echo $i spits out values such as these:

filtered.ABC4454232_1.1P
filtered.ABC4454232_2.2P
filtered.ABC4454233_1.1P
filtered.ABC4454233_2.2P
....

Which are the correct values, i guess i do not understand how to tell bash to differentiate between the _1.1P and its corresponding _2.2P

ADD REPLY • link 7.2 years ago by nikelle.petrillo ▴ 110

0

Entering edit mode

Have you tried basename that you were using in a different question you had asked yesterday to differentiate between the names?

ADD REPLY • link 7.2 years ago by GenoMax 147k

0

Entering edit mode

Yes, I thought of that but not sure how to incorporate basename into my existing code.

ADD REPLY • link 7.2 years ago by nikelle.petrillo ▴ 110

1

Entering edit mode

7.2 years ago

h.mon 35k

Your current question is essentially a duplicate of your previous question, and thats the reason I argued your previous question was not a bioinformatics question: you are just manipulating bash variables, and the bioinformatics software you want to use is of no importance to this. In fact, you could use the same solution as before for this problem, instead of the cumbersome $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq):

#!/bin/bash
mkdir -p alignments
path='/trimmed/alignments/'
for i in *_1.1P.fastq.gz
do
  base=${i%%_1.1P.fastq.gz}
  STAR --genomeDir /indices/human --readFilesIn ${base}_1.1P.fastq.gz ${base}_2.2P.fastq.gz \
    --runThreadN 8 --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat \
    --outFileNamePrefix /trimmed/alignments/${base}.cd177negtreg_tumor.Out 
done

As I hinted in my previous answer (and as others explicitly suggested here), have a look at your variables before running your commands:

for i in *_1.1P.fastq.gz
do
  base=${i%%_1.1P.fastq.gz}
  echo "file1 is ${base}_1.1P.fastq.gz, file2 is ${base}_2.2P.fastq.gz"
  echo "output will be /trimmed/alignments/${base}.cd177negtreg_tumor.Out"
done

P.S.: it is generally (though not unanimous) considered bad practice to parse the output os ls, the main reason being file names can have characters which will break your script (such as spaces) unless you account for them.

ADD COMMENT • link 7.2 years ago by h.mon 35k

score 3 · Accepted Answer · 2017-08-24

3

Entering edit mode

7.2 years ago

grant.hovhannisyan ★ 2.6k

Grab the file name, an then do like:

--readFilesIn ${i}_1.1P.fastq.gz ${i}_2.2P.fastq.gz

ADD COMMENT • link 7.2 years ago by grant.hovhannisyan ★ 2.6k

0

Entering edit mode

Thank you, but when I try this the output .bam file keeps getting written over.

ADD REPLY • link 7.2 years ago by nikelle.petrillo ▴ 110

0

Entering edit mode

Can you show an example of a full name?

A good method of testing a loop is:

for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq)
do
echo STAR --genomeDir /indices/human --readFilesIn ${i}.fastq.gz ${i}.fastq.gz --runThreadN 8 --outFileNamePrefix /trimmed/alignments/${i%.fastq.gz}.cd177negtreg_tumor.Out --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat
done

Using the echo the command will not run but you will see how it looks like and what you should tweak further.

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Entering edit mode

Exactly, because at every iteration STAR will output the file with the same name, and in the end for the loop you will get the file from the last iteration. To avoid this you can rename final files at each iteration.

for i in $(ls | egrep '.[1|2]P.fastq.gz' | rev | cut -c 10- | rev | uniq)
do
STAR --genomeDir /indices/human --readFilesIn ${i}_1.1P.fastq.gz ${i}_2.2P.fastq.gz other_commands 
mv Aligned.out.bam Log_${i}.bam
mv Log.final.out Log_${i}.final.out
mv Log.out Log_${i}.out
mv Log.progress.out Log_{i}.progress.out

done

Bear in mind that you'd need to rename all output files in each iteration, so that nothing would be overwritten by the next iteration.

ADD REPLY • link 7.2 years ago by grant.hovhannisyan ★ 2.6k

0

Entering edit mode

Erm and what about just using ${i} as part of the --outFileNamePrefix? You are overcomplicating this by renaming files.

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Entering edit mode

Yes, your option is definitely more clean. Nevertheless, the trick with renaming files might be also useful in other situations.

ADD REPLY • link 7.2 years ago by grant.hovhannisyan ★ 2.6k