Question

Bacterial RNASeq doubts

0

Entering edit mode

4 months ago

Julián • 0

Hello, I am a bioinformatics student and I am learning how to do rnaseq so I am new and I have doubts that I don't know if it is my fault or because of the quality of the sequencer. I have to do an Rna-seq of chromohalobacter salexigens DM3043. They are paired readings. I follow the next algoritm for each sample:

bwa mem ../../Genoma_referencia_gtf/GCF_000055785.1_ASM5578v1_genomic.fna combinedRpoS25C_R1.fastq combinedRpoS25C_R2.fastq > aligned_reads_RpoS25C.sam

samtools view -Sb aligned_reads_RpoS25C.sam | samtools sort -o aligned_reads_RpoS25C_sorted.bam

samtools flagstat aligned_reads_RpoS25C_sorted.bam > aligned_stats_RpoS25C.txt

samtools stats aligned_reads_RpoS25C_sorted.bam > alignment_stats_RpoS25C_full.txt

featureCounts -a ../../Genoma_referencia_gtf/genomic.gtf -o counts_bwa_RpoS25C.txt -t gene -g gene_id -p aligned_reads_RpoS25C_sorted.bam

I also tried trimming the data although it did not vary much and with hisat2 and bowtie and in both cases in some samples I got a very low alignment % of 20 and then in 4 samples when I do the FeatureCounts I got very low counts with featureCounts and in another 4 samples if I get a high number of accounts applying the same protocol. Also when I do FeatureCounts I dont know if I should use --countReadPairs because it takes too much time to perform de analyisis.

What could be the reason for this difference when I apply the same protocol and everyone in the previous FASTQC has good qualities.

samtools bacteria featureCounts rnaseq • 855 views

ADD COMMENT • link updated 4 months ago by GenoMax 147k • written 4 months ago by Julián • 0

1

Entering edit mode

I dont know if I should use --countReadPairs

If this is paired end data then you have to use that option with -p in order not to double count reads.

When a situation like this arises (low % mapping), take a sampling of unmapped reads and blast them at NCBI to make sure they are aligning to correct genome i.e. you don't have unexpected contamination.

Are the original reads being aligned as is or did you do any trimming/scanning? If yes, then were the reads trimmed together?

ADD REPLY • link 4 months ago by GenoMax 147k

0

Entering edit mode

Hi. At first I had each sample with 4 or 5 .fastq R1 files and 4 or 5 .fastq R2 files.

Then I did this to combine them: (for example with fordward(R1))

cat Wt25_CCGTCC_L008_R1_001.fastq Wt25_CCGTCC_L008_R1_002.fastq Wt25_CCGTCC_L008_R1_003.fastq Wt25_CCGTCC_L008_R1_004.fastq Wt25_CCGTCC_L008_R1_005.fastq > combinedWt25_R1.fastq

And then the FASTQC:

../../FastQC/./fastqc combinedWt25_R1.fastq combinedWt25_R2.fastq

But I am working with the untrimmed samples because the number of counts does not vary too much and it is more direct.With the original combined R1 and R2.

ADD REPLY • link updated 4 months ago by GenoMax 147k • written 4 months ago by Julián • 0

0

Entering edit mode

With the bwa I do sample by sample, not all at once.Another thing I can say is that in some samples I had 30 million counts and then in other samples I had 5 million which doesn't make much sense.

ADD REPLY • link 4 months ago by Julián • 0

0

Entering edit mode

Here I show a summary of my data. enter image description here

ADD REPLY • link 4 months ago by Julián • 0

0

Entering edit mode

Looking at the 001/002 .. in file names this must be old data. Those kind of file pieces have not been in use for quite some time.

As long as the order of the files catted was exactly identical for the two reads (R1/R2) thing should be OK. You can use a tool like repair.sh from BBMap suite to confirm that read identifiers are matching.

Another thing I can say is that in some samples I had 30 million counts and then in other samples I had 5 million which doesn't make much sense.

Are you referring to variable number of reads across samples or file pieces for individual samples?

ADD REPLY • link 4 months ago by GenoMax 147k

0

Entering edit mode

For example when I do: featureCounts -a ../../Genoma_referencia_gtf/genomic.gtf -o counts_bwa_RpoS25C.txt -t gene -g gene_id -p aligned_reads_RpoS25C_sorted.bam

I got an excel where I get the counts and in some samples I have 30373011 counts with a lot of MQ0 and in others I have the same with low MQ0. The brown bars in the image.

ADD REPLY • link 4 months ago by Julián • 0

0

Entering edit mode

And when I use hisat2 to align I have counts with 30373011 total counts and others with 5373011 of a same replicate. I also have 6 samples with SOLID samples that were already preprocessed and I have to work with the rest of Illumina and it is difficult because those already processed give a very high number of counts and reads and I am applying another protocol to them.

ADD REPLY • link 4 months ago by Julián • 0

0

Entering edit mode

I also have 6 samples with SOLID samples

It would be best not to try and pool illumina/SOLiD data.

ADD REPLY • link 4 months ago by GenoMax 147k

0

Entering edit mode

I have 30373011 counts with a lot of MQ0

If you are referring to mapping quality there then MQ0 means those reads are multi-mapping. featureCounts will not count those reads unless you add -M option. Multi-mapping reads have special considerations in RNAseq.

If you have overlapping genes then that could also be an issue.

Have you considered using a program like https://cs.wellesley.edu/~btjaden/Rockhopper/ that is targetted for bacterial RNAseq.

ADD REPLY • link 4 months ago by GenoMax 147k

0

Entering edit mode

Yes I used Rockhopper and I some samples It appear like 20-10% protein coding and the rest 80% RNAr. Its a university project I need to use both data.

ADD REPLY • link 4 months ago by Julián • 0

0

Entering edit mode

But If I want to study differential expression I shouldnt use Multimapping reads no? Is there any way I could contact to you to show how Im doing the analysis?

ADD REPLY • link 4 months ago by Julián • 0

0

Entering edit mode

It appear like 20-10% protein coding and the rest 80% RNAr.

That is the answer. Not much you can do about data you did not generate yourself. If this is a project assigned to you then show what you did and the issues you ran into.

ADD REPLY • link 4 months ago by GenoMax 147k

0

Entering edit mode

And what should I do with the data? I also tried to give more or less missmatches but it varies very little.

I will try to use the countReadPairs of FeatureCounts although it takes too long for a single sample because until now I used only -p.

And when I normalize the data and do a pca, the results come out a little different.And there are some samples that are grouped in another group. I used the DESEQ2 normalizer. Could this disparity be due to the lack of counts and the use of SOLID and Illumina data?

ADD REPLY • link 4 months ago by Julián • 0

0

Entering edit mode

My teacher told me that the samples were depleted, so why do I get those results in RockHopper?

ADD REPLY • link 4 months ago by Julián • 0

0

Entering edit mode

My teacher told me that the samples were depleted

Experiments don't work as expected every time. If your result is accurate then it is possible that the depletion did not work in some samples.

Part of learning data analysis involves being able to provide logical explanation for results you are observing (may not be possible 100% of time). Perhaps your teacher is expecting something similar from all of you.

ADD REPLY • link 4 months ago by GenoMax 147k

score 0 · Answer 1 · 2024-07-08

When posting here, you can omit the flagstat and stats command. They aren't relevent for troubleshooting.

Also, for a long time now, samtools sort can handle sam format. And you pretty much should never write a full sam file; it should be piped to something that will convert it right away without writing a sam file.

so your first command should be

bwa mem ... | samtools sort -o sorted.bam