Hi there.
I am seeking for help as I don't understand the unexpected summary stats after alignment of the bulk RNAseq data with the Rsubjunc function.
The samples are coming from Human. Reference genome and annotation are thus also from human (gencode version 45, primary assembly).
Kit for library prep: Illumina stranded mRNA pre, ligation (first read of pair on antisense strand) Sequencer : NovaSeq
The quality over all reads and bases was amazing (close to 40 phred-score) according to Fastqc (not shown here)
Interestingly, when I don't add the nTrim5 argument to trim the first 10 bases (since this part is not showing an equal distribution of all bases in the fastqc plot), the quantity of the "unexpected read order" drops to 10% of the total fragment number, which is better but still pretty high and I don't know why that is or how i should deal with it.
Using featurecount later, I still get above 80% of fragments that were able to be used for counting. With featurecount, I need to use "strandSpecific" = 2, cause this library prep protocol results in the R1 of the two fastq files having antisense direction.
What I also realized is that I only detect 15k different expressed genes and many HLA and mitochondria genes being active (fibroblasts used). This at least might explain the high duplication rate according to fastqc.
What is happening, what should I try, or is this okay and I can take this data as is?
I will post the sessionInfo( ) later, as mapping is currently running in my Rsession.
Anyway, code until subjunc() below. Thank you.
buildindex(reference = "/home/chuddy/bioinformatics/ref_genomes/human_38/genome/gencode_45/GRCh38.primary_assembly.genome.fa", basename = "hGencode45")
# manual cut and paste of index files into a new folder
fastq_files <- list.files("./fastq/raw/", full.names = T, include.dirs = F, pattern = ".fastq.gz$")
fastq_files_R1 <- fastq_files[grepl(fastq_files, pattern = "_R1.")]
fastq_files_R2 <- fastq_files[grepl(fastq_files, pattern = "_R2.")]
fastq_files_R1_abs <- normalizePath(fastq_files_R1)
fastq_files_R2_abs <- normalizePath(fastq_files_R2)
subjunc(index = "/home/chuddy/bioinformatics/ref_genomes/human_38/genome/subread_gencode_45/hGencode45",
nthreads = 20,
readfile1 = fastq_files_R1_abs,
readfile2 = fastq_files_R2_abs,
sortReadsByCoordinates = TRUE,
reportAllJunctions = F,
isGTF = T,
nTrim5 = 10,
useAnnotation = T,
annot.ext = "/home/chuddy/bioinformatics/ref_genomes/human_38/annotation/gencode_45/gencode.v45.primary_assembly.annotation.gtf")
This run is killing me. or these samples. One sample got a .bai file double the size of its corresponding .bam file. I re-alligned and that one sample looked good. At the featureCount step one sample is printing out something I have never seen before in R (see below).
See https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/
In RNAseq you expect there to be some duplication show up since there can be multiple copies of RNAs. So that part is normal. You can take a look at https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/ for additional ideas.