Hi,
Will always R2 file contain sense transcript and R1 reverse complement of transcript?
If R2 file contains sense transcript, I will have to run HTSeq.scripts.count with "reverse" setting to quantify sense transcription. Quite confusing, isn't it?
Library creation kit -> E7420S NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina®
Thanks, Bernardo
P.S. Here is the pipeline:
set -ue
#################################################
#BWA test aligner with paired-end data
#################################################
# Get the genome file from the command line
genome_file=$1
# Get the fastq file from the command line
fastq_file_R1=$2
# Get the fastq file from the command line
fastq_file_R2=$3
# Get the fastq file from the command line
fastq_file_R3=$4
# Get the fastq file from the command line
fastq_file_R4=$5
#get gff
GFF=$6
#BWA default settings
bwa index $genome_file
#BWA input
bwa mem -t 8 $genome_file $fastq_file_R1 $fastq_file_R2 | gzip -3 > P_S1_L001_aln-pe.sam.gz
bwa mem -t 8 $genome_file $fastq_file_R3 $fastq_file_R4 | gzip -3 > V_S1_L001_aln-pe.sam.gz
################################################################
#Flagstat
################################################################
#Convert .sam to .bam to input to Flagstat
samtools view -b -S -o P_S1_L001_aln-pe.bam P_S1_L001_aln-pe.sam.gz
samtools flagstat P_S1_L001_aln-pe.bam
samtools view -b -S -o V_S1_L001_aln-pe.bam V_S1_L001_aln-pe.sam.gz
samtools flagstat V_S1_L001_aln-pe.bam
################################################################
#Count reads mapped with htseq-count
################################################################
samtools sort -n V_S1_L001_aln-pe.bam invivo.sorted
python -m HTSeq.scripts.count -m intersection-nonempty -f bam -a 10 -t mRNA -i Parent -r name -s reverse invivo.sorted.bam $GFF | awk 'n>=5 { print a[n%5] } { a[n++%5]=$0 }' > invivo_R.counts
samtools sort -n P_S1_L001_aln-pe.bam plate.sorted
python -m HTSeq.scripts.count -m intersection-nonempty -f bam -a 10 -t mRNA -i Parent -r name -s reverse plate.sorted.bam $GFF | awk 'n>=5 { print a[n%5] } { a[n++%5]=$0 }' > plate_R.counts
################################################################
#IGV
################################################################
#samtools sort -n sorts by name, not coordinate.. index requires sort by coordinate..
samtools sort V_S1_L001_aln-pe.bam V_S1_L001_aln-pe.bam.sorted
samtools index V_S1_L001_aln-pe.bam.sorted.bam
samtools sort P_S1_L001_aln-pe.bam P_S1_L001_aln-pe.bam.sorted
samtools index P_S1_L001_aln-pe.bam.sorted.bam
A file should never contain 3'->5' nucleotide sequence. Whether read #1 or read #2 will dictate the originating strand when a directional kit is used will depend on the kit. When in doubt, just look at a couple genes in IGV or another browser. It should be immediately obvious which which read dictates strand.
Thanks for your answer Devon, the question has been reformulated. You answered to the first question of my post. I performed the IGV inspection step and it's clear that R2 contains sense transcript. I will have to answer how library was prepared to confirm this IGV test.