Question

Calculating number of reads for paired end reads?

1

Entering edit mode

8.5 years ago

Genosa ▴ 160

Sorry if this may appear like a really dumb question, but I am new to RNAseq and feeling lost about some of the terminologies and calculations used in literature.

I understand what's the difference between the principle behind paired-end and single-end reads, however I need to clarify the method to calculate number of reads in paired end vs. single end reads.

For single end reads, I was told that I could calculate the no. of reads by: wc -l <name of="" fastq="" file="">, divided by 4.

For paired end reads, because 2 files are generated (R1, R2), so is it correct that I calculate the number of reads by adding the read count derived from R1 and R2?

Thank you very much for your clarification.

RNA-Seq • 12k views

ADD COMMENT • link updated 8.5 years ago by Devon Ryan 104k • written 8.5 years ago by Genosa ▴ 160

GenoMax · Accepted Answer · 2016-05-13

8

Entering edit mode

8.5 years ago

Devon Ryan 104k

In paired-end sequencing its often less confusing to talk about fragment numbers rather than the more ambiguous read numbers. So count the number of reads in R1. That's the number of fragments sequenced (it's also the number of reads in R2). If you really prefer the words "reads", you could use "we sequenced XXX read pairs", so it's clear what's meant.

ADD COMMENT • link 8.5 years ago by Devon Ryan 104k

1

Entering edit mode

Hi Devon,

Thank you very much for your reply. Just for clarification, I calculated the number of fragments sequenced in the fastQ file for R1 and R2 for one sample and derived this: R1 - 499385 fragments. R2 - 487958 fragments. Because there's a difference between the no. of fragments sequenced for R1 and R2, so how should I correctly describe the number of fragments sequenced? Thank you very much again.

ADD REPLY • link 8.5 years ago by Genosa ▴ 160

1

Entering edit mode

You apparently screwed up the read syncing between the files at some point, likely during trimming. You'll need to resync the files. The reformat.sh tool from BBMap is convenient for this.

ADD REPLY • link 8.5 years ago by Devon Ryan 104k

1

Entering edit mode

Small correction. @Genosa: Tool from BBMap you would want is repair.sh.

repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq

bbduk.sh is also a great option for trimming (and it is PE aware so this sort of situation would not happen).

ADD REPLY • link 8.5 years ago by GenoMax 147k

0

Entering edit mode

Thank you very much. I'll go and take a look at BBMap and try to fix it. What do you mean by error in 'read syncing' and what is the reason why it was introduced in the first place? I got the files back from the sequencing service and went through all the samples. So apparently, ALL the samples R1/R2 have very different fragment counts and I am not sure why this is happening. If I do not run reformat.sh, will this adversely affect the subsequent QC/mapping/Alignment/Differential Gene Expression analysis? Thank you!

ADD REPLY • link 8.5 years ago by Genosa ▴ 160

0

Entering edit mode

The Nth read in each of the files should originate from the same cluster. This was the case when the files were originally created and pretty much all aligners assume that this is true and will give non-sensical results if you, for example, delete a read in only one file. Having the Nth read in each file arising from the same cluster (btw, you can tell if this is the case because the read names are more or less the same) is referred to as "the reads being in sync."

The most common way of getting paired-end files out of sync is by using a trimmer that processes only one fastq file at a time (e.g., fastx tools). If you use something like "Trim Galore!" or "trimmomatic" then this shouldn't happen.

ADD REPLY • link 8.5 years ago by Devon Ryan 104k

0

Entering edit mode

Hi Ryan, I am not sure if I've done it correctly but I'm having a problem running the bbmap program. I have zero computer commands knowledge and I'm self-teaching as I'm working along. So I downloaded the BBmap package from source forge, and basically followed the read-me.txt and did as it instructed by unzipping. I am currently using Mac OS X. So I entered terminal and changed the parent directory to where the BBMAP *.sh files re kept. Then I entered "run repair.sh" A new X-code screen now pops up and something like this shows up:

==================================
#!/bin/bash
#repair in=<infile> out=<outfile>

usage(){
echo "
Written by Brian Bushnell
Last modified June 2, 2015

Description:  Re-pairs reads that became disordered or had some mates eliminated.

Usage:  repair.sh in=<input file> out=<pair output> outs=<singleton output>

Input may be fasta or fastq, compressed or uncompressed.


Parameters:
in=<file>       The 'in=' flag is needed if the input file is not the first 
                parameter.  'in=stdin' will pipe from standard in.
in2=<file>      Use this if 2nd read of pairs are in a different file.
out=<file>      The 'out=' flag is needed if the output file is not the second
                parameter.  'out=stdout' will pipe to standard out.
out2=<file>     Use this to write 2nd read of pairs to a different file.
outs=<file>     (outsingle) Write singleton reads here.
overwrite=t     (ow) Set to false to force the program to abort rather than
                overwrite an existing file.
showspeed=t     (ss) Set to 'f' to suppress display of processing speed.
ziplevel=2      (zl) Set to 1 (lowest) through 9 (max) to change compression
                level; lower compression is faster.
fint=f          (fixinterleaving) Fixes corrupted interleaved files using read
                names.  Only use on files with broken interleaving - correctly
                interleaved files from which some reads were removed.
repair=t        (rp) Fixes arbitrarily corrupted paired reads by using read 
                names.  Uses much more memory than 'fint' mode.
ain=f           (allowidenticalnames) When detecting pair names, allows 
                identical names, instead of requiring /1 and /2 or 1: and 2:
monitor=f       Kill this process if it crashes.  monitor=600,0.01 would kill
                after 600 seconds under 1% usage.

Java Parameters:
-Xmx            This will be passed to Java to set memory usage, overriding the program's automatic memory detection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}

pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
  cd "$(dirname "$DIR")"
  DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null

#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"

z="-Xmx4g"
z2="-Xms4g"
EA="-ea"
set=0

if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
    usage
    exit
fi

calcXmx () {
    source "$DIR""/calcmem.sh"
    parseXmx "$@"
    if [[ $set == 1 ]]; then
        return
    fi
    freeRam 4000m 84
    z="-Xmx${RAM}m"
    z2="-Xms${RAM}m"
}
calcXmx "$@"

repair() {
    #module unload oracle-jdk
    #module load oracle-jdk/1.7_64bit
    #module load pigz
    local CMD="java $EA $z -cp $CP jgi.SplitPairsAndSingles rp $@"
    echo $CMD >&2
    eval $CMD
===================================

Then at this stage I become a bit stuck. I tried entering the command (I have tried both terminal and X-code): "repair.sh in1=<file name="" of="" r1="">.fq in2=<file name="" of="" r2="">.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq" However, none works.

My files are named in this format : SAMPLENAME_REPLICATE#_R1.fastq.gz.

I wonder what have I done wrong in any of the steps? Thank you very much

ADD REPLY • link updated 8.5 years ago by GenoMax 147k • written 8.5 years ago by Genosa ▴ 160

1

Entering edit mode

Don't do run repair.sh.
Do you have java installed on your Mac? BBMap is a java program.
As shown in the example above you just need to run (provided the BBMap directory is in your $PATH otherwise use full path to repair.sh). There should be no spaces between any of the names, options and = sign.

/path_to/repair.sh in1=SAMPLENAME_REPLICATE#_R1.fastq.gz in2=SAMPLENAME_REPLICATE#_R2.fastq.gz out1=fixed_SAMPLENAME_REPLICATE#_R1.fastq.gz out2=fixed_SAMPLENAME_REPLICATE#_R2.fastq.gz outsingle=singletons_SAMPLENAME_REPLICATE#.fastq.gz

ADD REPLY • link 8.5 years ago by GenoMax 147k

0

Entering edit mode

If your sample files are not in the same directory as BBMap programs then modify your command to accommodate the file paths.

/path_to/repair.sh in1=/path_to/SAMPLENAME_REPLICATE#_R1.fastq.gz in2=/path_to/SAMPLENAME_REPLICATE#_R2.fastq.gz out1=/path_to/fixed_SAMPLENAME_REPLICATE#_R1.fastq.gz out2=/path_to/fixed_SAMPLENAME_REPLICATE#_R2.fastq.gz outsingle=/path_to/singleteons_SAMPLENAME_REPLICATE#.fastq.gz

Spend a couple of hours here: http://korflab.ucdavis.edu/Unix_and_Perl/current.html#part1 familiarizing yourself with a bit of command line.

ADD REPLY • link 8.5 years ago by GenoMax 147k