Hi everyone
I just begin with the analysis of my RNAseq data. I have my own reference genome and its annotated file. I've been reading and I decided to use the following pipeline:
FastQC ->Trimmomatic -> Bowtie2 -> HTSEQ
But, I've had some problems, for example
My reads have good quality but there are some duplicates sequences. When I use Trimmomatic it does not change, the duplicate sequences are still there and I donĀ“t know why.
java -jar /path/trimmomatic.jar PE control_rep1_1.fq.gz control_rep1_2.fq.gz control_rep1_1_paired.fq.gz control_rep1_1_unpaired.fq.gz control_rep1_2_paired.fq.gz control_rep1_2_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36
Despite that I went ahead with the analysis just for practice.
When I used HTSeq the results were like this
Warning: Mate records missing for 26418 records; first such record: <SAM_Alignment object: Paired-end read 'A00877:83:HNHLVDSXX:3:1519:15230:29340' aligned to tig00000001:[616,766)/+>.
8382045 BAM alignment pairs processed.
no feature 7727004
ambiguous 0
Too low aQual 278161
not aligned 376880
alignment_not_unique 0
Some forum say that it could be for the annotation file structure. Somebody knows what is happening?
Pd. I really need help and would like to talk with someone maybe for skype or something like that. If there is somebody with some free time, please let me know.
Thanks
Observing duplication in RNA-seq is normal and expected, don't worry. Did you see any adapter contamination in your data based on
fastqc
? If not then there is no need to run trimmomatic. Please share the command lines for both bowtie and htseq.Recommended literature:
https://www.annualreviews.org/doi/abs/10.1146/annurev-biodatasci-072018-021255
https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html
Hi ATpoint This is the command lines for bowtie and htseq
bowtie2 -x /home/liz/RNASEQBA135/genoma/aminobacter_index -1 control_rep1_1_paired.fq.gz -2 control_rep1_2_paired.fq.gz -S control_rep1_paired.sam
samtools view -Sb control_rep1_paired.sam > control_rep1_paired.bam samtools sort control_rep1_paired.bam -o control_rep1_paired_sorted.bam samtools index control_rep1_paired_sorted.bam
htseq-count --format bam --order pos --mode intersection-nonempty -s reverse -t gene -s no control_rep1_paired_sorted.bam /home/liz/RNASEQBA135/anotacion/BA135_aminobacter.gtf > control1_htseq.tsv
I have 2 conditions (2 replicates each one) Control and Stress. I have to found wich genes are expressed differentially for my bacteria
Then simply follow the linked DESeq2 workflow (the second link). It includes everything you need.
Thanks, the DESeq2 workflow is very detailed But they start using Salmon. I can't do that because I don't have a previous reference transcriptome. Think that necessarily I have to pass by alignment step and start from my genome fasta file. My problem is about the count step, using HTSeq. I don't know how figure out the Warning message
But you have a GTF, right? It should contain gene annotations. Can you show some lines from this GTF? There are other tools that can make a count matrix such as featureCounts. HTSeq is old, no need to stick with it. Lets first of all see if the GTF is fine.
I got the GTF file from rast annotation server