Question

Rsubread with unexpected read order results

0

Entering edit mode

4 months ago

thomas.heigl.ibk • 0

Hi there.

I am seeking for help as I don't understand the unexpected summary stats after alignment of the bulk RNAseq data with the Rsubjunc function.

The samples are coming from Human. Reference genome and annotation are thus also from human (gencode version 45, primary assembly).

Kit for library prep: Illumina stranded mRNA pre, ligation (first read of pair on antisense strand) Sequencer : NovaSeq

The quality over all reads and bases was amazing (close to 40 phred-score) according to Fastqc (not shown here)

Interestingly, when I don't add the nTrim5 argument to trim the first 10 bases (since this part is not showing an equal distribution of all bases in the fastqc plot), the quantity of the "unexpected read order" drops to 10% of the total fragment number, which is better but still pretty high and I don't know why that is or how i should deal with it.

Using featurecount later, I still get above 80% of fragments that were able to be used for counting. With featurecount, I need to use "strandSpecific" = 2, cause this library prep protocol results in the R1 of the two fastq files having antisense direction.

What I also realized is that I only detect 15k different expressed genes and many HLA and mitochondria genes being active (fibroblasts used). This at least might explain the high duplication rate according to fastqc.

What is happening, what should I try, or is this okay and I can take this data as is?

I will post the sessionInfo( ) later, as mapping is currently running in my Rsession.

Anyway, code until subjunc() below. Thank you.

buildindex(reference = "/home/chuddy/bioinformatics/ref_genomes/human_38/genome/gencode_45/GRCh38.primary_assembly.genome.fa", basename = "hGencode45") 
# manual cut and paste of index files into a new folder

fastq_files <- list.files("./fastq/raw/", full.names = T, include.dirs = F, pattern = ".fastq.gz$")

fastq_files_R1 <- fastq_files[grepl(fastq_files, pattern = "_R1.")]
fastq_files_R2 <- fastq_files[grepl(fastq_files, pattern = "_R2.")]

fastq_files_R1_abs <- normalizePath(fastq_files_R1)
fastq_files_R2_abs <- normalizePath(fastq_files_R2)


subjunc(index = "/home/chuddy/bioinformatics/ref_genomes/human_38/genome/subread_gencode_45/hGencode45",  
      nthreads = 20, 
      readfile1 = fastq_files_R1_abs,
      readfile2 = fastq_files_R2_abs,
      sortReadsByCoordinates = TRUE, 
      reportAllJunctions = F, 
      isGTF = T, 
      nTrim5 = 10,
      useAnnotation = T,
      annot.ext = "/home/chuddy/bioinformatics/ref_genomes/human_38/annotation/gencode_45/gencode.v45.primary_assembly.annotation.gtf")

rsubjunc

rsubjunc_summary

rsubread • 362 views

ADD COMMENT • link updated 4 months ago by GenoMax 147k • written 4 months ago by thomas.heigl.ibk • 0

0

Entering edit mode

This run is killing me. or these samples. One sample got a .bai file double the size of its corresponding .bam file. I re-alligned and that one sample looked good. At the featureCount step one sample is printing out something I have never seen before in R (see below).

    counts_r_gene <- featureCounts(bam_files_abs[9],
                          annot.ext = "/home/chuddy/bioinformatics/ref_genomes/human_38/annotation/gencode_45/gencode.v45.primary_assembly.annotation.gtf",
                          GTF.attrType = "gene_name", # summarized over gene names
                          isGTFAnnotationFile = TRUE,  # if external annotation file is a gtf
                          isPairedEnd = T, 
                          GTF.attrType.extra = c("gene_type", "gene_name", "gene_id", "remap_original_id", "mgi_id"), 
                          strandSpecific = 2, # for NEB ultra stranded mRNA library prep kit
                          nthreads = 20)



> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=de_CH.UTF-8     LC_TIME=de_CH.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_CH.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_CH.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Zurich
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] edgeR_4.2.0          limma_3.60.3         gghighcontrast_0.1.0 Rsubread_2.18.0      lubridate_1.9.3      forcats_1.0.0       
 [7] stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2          readr_2.1.5          tidyr_1.3.1          tibble_3.2.1        
[13] ggplot2_3.5.1        tidyverse_2.0.0     

loaded via a namespace (and not attached):
 [1] utf8_1.2.4         generics_0.1.3     stringi_1.8.4      lattice_0.22-5     hms_1.1.3          magrittr_2.0.3    
 [7] grid_4.4.0         timechange_0.3.0   RColorBrewer_1.1-3 Matrix_1.7-0       fansi_1.0.6        scales_1.3.0      
[13] textshaping_0.4.0  cli_3.6.3          rlang_1.1.4        crayon_1.5.3       munsell_0.5.1      withr_3.0.0       
[19] tools_4.4.0        tzdb_0.4.0         colorspace_2.1-0   locfit_1.5-9.10    vctrs_0.6.5        R6_2.5.1          
[25] lifecycle_1.0.4    ragg_1.3.2         pkgconfig_2.0.3    pillar_1.9.0       gtable_0.3.5       glue_1.7.0        
[31] Rcpp_1.0.12        systemfonts_1.1.0  statmod_1.5.0      xfun_0.45          tidyselect_1.2.1   rstudioapi_0.16.0 
[37] knitr_1.47         farver_2.1.2       labeling_0.4.3     compiler_4.4.0

enter image description here

ADD REPLY • link updated 4 months ago by GenoMax 147k • written 4 months ago by thomas.heigl.ibk • 0

0

Entering edit mode

since this part is not showing an equal distribution of all bases in the fastqc plot

See https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

This at least might explain the high duplication rate according to fastqc.

In RNAseq you expect there to be some duplication show up since there can be multiple copies of RNAs. So that part is normal. You can take a look at https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/ for additional ideas.

ADD REPLY • link 4 months ago by GenoMax 147k