Correcting erroneous k-mers and removing rRNA sequences
1
0
Entering edit mode
21 months ago
m.habib • 0

Hi all,

I have 112 reads from two animal species (56 reads/each). Those reads represent different time points (control, 6h, 1d, 2d) with 7 replicates in each point.

I want to analyze gene expression in the two species but I do not have a reference genome. I need to make a de novo transcriptome assembly. What I did so far was:

  • QC using FastQC and MultiQC using the following slurm scripts:
#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH --ntasks=40
#SBATCH --time=24:10:00
#SBATCH --mem=200G
#SBATCH --mail-user=
#SBATCH --mail-type=All
#SBATCH --output=fastqc.out
#SBATCH --error=fastqc.err
#SBATCH --partition=ceti

echo Node list: $SLURM_JOB_NODELIST
echo Number of tasks: $SLURM_NTASKS

module load fastqc-0.11.9-gcc-10.2.0-kdeze47

#This script runs 'fastqc' and 'multiqc' on the raw RNA-seq reads

# Define PATH to fastq data directory
 DATA_DIR=/users/habibm/taos-scratch/fastq

 for file in $DATA_DIR/*R1.fastq.gz
 do
         withpath="${file}"
         filename="${withpath##*/}"
         base="${filename%*_*.fastq.gz}"
         echo "${base}"

        fastqc -t 8 \
          $DATA_DIR/"${base}"R1.fastq.gz \
          $DATA_DIR/"${base}"R2.fastq.gz \
          -o $DATA_DIR
 done

 exit 0

=====================

MultiQC

#!/bin/bash
#SBATCH --job-name= multiqc
#SBATCH --partition=ceti
#SBATCH --ntasks=40
#SBATCH --mem=200G
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#SBATCH --output=multiqc.out
#SBATCH --error=multiqc.err

echo Node list: $SLURM_JOB_NODELIST
echo Number of tasks: $SLURM_NTASKS
# Define PATH to fastq data directory
DATA_DIR=/users/habibm/taos-scratch/QC
OUT_DIR=/users/habibm/taos-scratch/multiqc

module load miniconda3-4.7.12.1-intel-19.0.5-fdz2vxj
conda create --name multi_test python=3.7
source activate multi_test
pip install multiqc

#Run multiqc to get a summary for all samples
 cd $DATA_DIR
 multiqc .
done
exit 0

==========================

Now I need to filter and trim my samples but I have problems with this issue. Some reads have overrepresnted sequences (rRNA). I downloaded the SSU and LSU rRNA for my species as fasta files and I now need to build a slurm script for these jobs. Any suggestions for script that can work with my data?

Thanks,
Mohamed

QC • 1.1k views
ADD COMMENT
0
Entering edit mode

A few questions/comments:

  1. Why do you create a conda env and install multiqc in the script? Why not create the environment with all necessary software before running the script and merely activate the env in the script?
  2. Why are you creating a conda env then using pip to install multiqc? Why not conda create -n multiqc_env -c bioconda multiqc? (Again, you'd run this command before running the script; in the script, you'd just source activate multiqc_env)
ADD REPLY
0
Entering edit mode

I tried to install MultiQC through the conda environment but it failed (I do not know why!?). So, I used pip to install it. I forgot to remove the commands for conda env.

ADD REPLY
0
Entering edit mode

but it failed (I do not know why!?)

The error message will tell you why. Bypassing that error using pip is only inviting more errors to your env. Fix the conda error and things will work smoother.

ADD REPLY
0
Entering edit mode
21 months ago
Mensur Dlakic ★ 28k

I hope that by 56 reads/each you mean 56 datasets of reads for each. Because if you have a total of 56 reads, there is not much you can do that would pass any statistical test.

By far the best tool I tried for rRNA removal is RiboDetector:

https://github.com/hzi-bifo/RiboDetector

However, it may be difficult to get it to work on a cluster. If so, I suggest you read this thread as it has several solutions.

ADD COMMENT
0
Entering edit mode

Many Thanks Mensur. Yes I mean 56 datasets of reads for each. I will look for your suggestions.

ADD REPLY
0
Entering edit mode
  1. Please do not add answers unless you're answering the top level question. Use Add Comment or Add Reply as appropriate.
  2. Please use the formatting bar (especially the code option) to present your posts better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted. I've done it for you this time.
    code_formatting
ADD REPLY

Login before adding your answer.

Traffic: 2331 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6