Hello. I'm working on an RNA-seq pipeline and I would like to determine the strandedness of the data before aligning with HISAT2.
If you want to work out strandedness within a pipeline, then maybe how_are_we_stranded_here would work? I believe it does pretty much exactly what @benformatics and @swbarnes2 said above re aligning and figuring out
Hello, thank you so much for responding. I downloaded the GUESSmyLT Docker image on Pop!_OS. I am using the following script to run it:
#!/usr/bin/env bash
cd ~/Desktop/fastq/
for infile in *_1.fastq
do
base=$(basename ${infile} _1.fastq)
docker run quay.io/biocontainers/guessmylt:0.2.5--py_0 GUESSmyLT \
GUESSmyLT --reads ${infile} ${base}_2.fastq \
--reference Glycine_max.Glycine_max_v2.1.dna.toplevel.fa --mode genome \
--annotation Glycine_max.Glycine_max_v2.1.55.gff3 --organism euk
done
But it keeps giving me this error:
edit: solved the previous error, now have this new one.
Error. Cannot open --reference '/home/x/Desktop/fastq/Glycine_max.Glycine_max_v2.1.dna.toplevel.fa'. Make sure it exists.
The file does exist in that directory.
How do I go about solving this?
You want the pipeline to make a suggestion (so a script decides the stranded-ness)?
What I do is subset roughly the first 10-100k reads from the two read files (assuming standard paired-end RNA-seq here). Then I align them to my genome + GTF (I use the aligner STAR). Then I have a small R script that counts the number of overlaps (gene counts) across the transcriptome. I split the counts into fragments aligning in the sense direction and fragments aligning in the antisense direction. (You could make it a bit more advanced if you have some custom RNA-seq library kits.) In theory, the reads could be FF, FR, RF or RR, but in general that is not applicable to most situations. However, if you think it applies to you, you could code that in and separate the reads from each pair by strand to verify.
Anyway, for the standard case, take a random gene as an example:
SRSF2 Plus-strand fragment counts: 498 Minus-strand fragment counts: 512
I take the ratio of all these counts genome-wide (+/-). If the fraction is 0.4-0.6, I assume an unstranded library. If the fraction is < 0.4, it is stranded-antisense, and if it is > 0.6, it is stranded-sense. In this case it's close to 50%, so this is unstranded.
Also, the standard/most common Illumina RNA-seq library kit produces antisense fragments, so that is what you will likely see most often.
You should run a few test cases but this works for me in 99% of situations.
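For illustration, the thresholding step could be sketched in bash like this. This is only a sketch, not the R script mentioned above: the function name `classify_strandedness` is made up, and it assumes the "fraction" means sense counts divided by the total of sense plus antisense counts, summed genome-wide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a library from genome-wide sense and
# antisense fragment counts, using the 0.4 / 0.6 cut-offs described above.
# Assumption: fraction = sense / (sense + antisense).
classify_strandedness() {
    local sense=$1 antisense=$2
    awk -v s="$sense" -v a="$antisense" 'BEGIN {
        total = s + a
        # Avoid dividing by zero when nothing overlapped a gene.
        if (total == 0) { print "undetermined"; exit }
        f = s / total
        if (f < 0.4)      print "stranded-antisense"
        else if (f > 0.6) print "stranded-sense"
        else              print "unstranded"
    }'
}

# Counts from the SRSF2 example (a single gene, used here only as a demo).
classify_strandedness 498 512
```

In practice you would pass the totals over all genes rather than one gene's counts, since a single gene can be noisy.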
Did you mount your data into the container? E.g., if your data are all within the current folder:
docker run -v ${PWD}:/data quay.io/biocontainers/guessmylt:0.2.5--py_0 GUESSmyLT \
--reads /data/${infile} /data/${base}_2.fastq \
--reference /data/Glycine_max.Glycine_max_v2.1.dna.toplevel.fa --mode genome \
--annotation /data/Glycine_max.Glycine_max_v2.1.55.gff3 --organism euk
If your data are spread across several folders, list all of them and mount each path with its own -v Docker option.
Read the Methods section of the publication associated with your SRA files, where they should list the kit used for the library preparation. Then, based on the protocol, you should be able to surmise what strand orientation is expected.
Otherwise, as @swbarnes2 already stated, it is much easier to align a small subset of the FASTQ and identify the strand bias that way.
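Taking a small subset can be as simple as keeping the first N reads of each file. A minimal sketch, assuming uncompressed FASTQ with standard 4-line records and mates stored in the same order in both files; the function name `subset_fastq` and the file names are placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the first N reads of a FASTQ file to a new file.
# Assumes uncompressed input with exactly 4 lines per read.
subset_fastq() {
    local infile=$1 n_reads=$2 outfile=$3
    head -n "$(( n_reads * 4 ))" "$infile" > "$outfile"
}

# Example usage (file names are placeholders):
# subset_fastq sample_1.fastq 100000 sample_1.sub.fastq
# subset_fastq sample_2.fastq 100000 sample_2.sub.fastq
```

Using the same N for both mate files keeps the pairs in sync, as long as the two files list the reads in the same order.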
RSeQC's infer_experiment.py: https://rseqc.sourceforge.net/
Ask the people who did the library prep what kit they used.
Otherwise, just align and figure it out.
That’s a pity they didn’t mention our work (GUESSmyLT) in their paper.