Question

How to fill no represented regions?

0

Entering edit mode

5.9 years ago

apl00028 ▴ 90

With the aim to analyze the different polymorphisms that I have in my RNAseq data.

I have done an alignment using bowtie to alignm reads against a reference sequence:

 bowtie2 -f -x scaffold_filt_PV014_OD_DB.fa PV002_cmv3.fa -S scaffold_nofilt_OD_PV002.sam

I change sam format to bam format:

 samtools view -bS scaffold_nofilt_OD_PV002.sam > scaffold_nofilt_OD_PV002.bam

I removed unpaired reads: samtools view -F 0x04 -b scaffold_nofilt_OD_PV002.bam >scaffold_nofilt_OD_rm_PV002.bam

I select those with a map quality higher than 10

samtools view -q 10 -b scaffold_nofilt_OD_rm_PV002.bam > scaffold_nofilt_OD_MapQ10_PV002.bam

I sort and I index it:

 samtools sort scaffold_nofilt_OD_MapQ10_PV002.bam scaffold_nofilt_OD_PV002_index
 samtools index scaffold_nofilt_OD_PV002_index.bam

When I visualized this library and others (using IGV tool) I get this results:

enter image description here

Each row is a different Library, as you can see there are some regions that there are not well represented

How can I fill in the no representated regions? Is there any algorithm for get this goal? I need this information

Thanks a lot in advantage.

RNA-Seq alignment • 1.2k views

ADD COMMENT • link updated 5.9 years ago by ATpoint 88k • written 5.9 years ago by apl00028 ▴ 90

0

Entering edit mode

How can I fill in the no representated regions? Is there any algorithm for get this goal?

If there is no sequence data representing the region in your current dataset then there no "algorithm" that can fill that data in. You can sequence more deeply (sample more) to see if you get reads in that area. If that does not work you may need to make a new library to see if that works better.

You are aligning to scaffolds (did these come from a de novo transcriptome assembly?). Do you know that the entire region represents expressed sequences?

ADD REPLY • link 5.9 years ago by GenoMax 151k

0

Entering edit mode

No, I am aligning my reads against my reference sequence. Answering the second question, the reference sequence is divided in 5 different regions: these regions go from: 1region goes 1 to 119 bp, 2 region goes 120 to 959bp, 3 region goes 960 to 1256bp, 4 region goes 1257 to 1913bp and, 5 region goes 1914 to 2214bp. I am interesting only in the regions 2 and 4 but their read depth vary among the different libraries. What do you suggest ? Thanks a lot

ADD REPLY • link 5.9 years ago by apl00028 ▴ 90

0

Entering edit mode

I don't completely understand what you are saying above. Where did your reference come from? De novo RNAseq assembly? What exactly does it present? Without a gene model/annotation to refer to it is not clear what the regions you are referring to represent. Are they exons? Or is this a simple prokaryotic gene? Are these mixed amplicon from 5 regions that are being sequenced?

If each block above represents an independent sample then it looks like you have significant variation between the libraries. That may be expected (differences in expression) or it may be due to the characteristic of the libraries. One sample seems to have no reads in the region.

I select those with a map quality higher than 10

Did you check the BAM from before this step to see if you are removing reads that multi-map in the regions you are interested in? They may be simply getting removed after this step.

ADD REPLY • link 5.9 years ago by GenoMax 151k

0

Entering edit mode

I am sorry for my briev answer, My reference comes from do an assembly of one of my libraries (using idba_ud). I used this reference for alignm my data. This regions refers to a Cucumber Mosaic Virus genome (3º segment).The different libraries comes from different habitats, it is normal that this virus is low represented in some habitats than others... but I can not compare these habitats because they have different depth reads. Forget the sample that looks it that it has not reads, I will discard them.

ADD REPLY • link 5.9 years ago by apl00028 ▴ 90

0

Entering edit mode

Ah. This is an RNA virus. So the entire sequence/regions should be present?

The different libraries comes from different habitats, it is normal that this virus is low represented in some habitats than others... but I can not compare these habitats because they have different depth reads

I don't know enough about RNA viruses but if this was a regular RNAseq experiment then a package like DESeq2 takes into consideration variations in read depth (which is common) when calculating differential expression.

Are you looking for differential presence (if that is right word) of this RNA virus in different samples? Perhaps not since you are only interested in two specific regions (are those genes?).

Note: Did you see my comment about the filtering you did on mapq above?

ADD REPLY • link 5.9 years ago by GenoMax 151k

0

Entering edit mode

Yes, This virus have 3 different genomic segments. In this case I show only information about the 3rd genomic segment which shows different regions, that are always presented in the 3rd genomic segment.

I am looking for polymorphisms that this virus can present in different habitats. But in some habitats, It does not show a representation or lower represented of some regions. This does that my diversity analysis will be falses when I comparing all habitats. Answering your question about MAPQ, that figure comes after remove map quality <10

I have thought two different alternatives: To compare the polymorphisms that have a minimum coverage but I don't see any article that stablish a threeshold about it. Another alternative is create a reference sequence that comes after combining all sequences together.

What do you suggest?

Thanks a lot

ADD REPLY • link 5.9 years ago by apl00028 ▴ 90

0

Entering edit mode

Answering your question about MAPQ, that figure comes after remove map quality <10

Question is is that filtering removing reads in the regions you are interested in? Perhaps that is why it looks like there is no/uneven coverage in some regions. Have you looked at the original BAM files in IGV before you did any manipulations. Since you did not change the default, the way bowtie2 handles multi-mapping likely applies.

You could do both. Use a coverage cutoff (say 20 reads covering a base) to call SNP's. Your virus likely differs from the reference in GenBank so trying to create a reference by de novo assembly is another option. I thought you have already done that with IDBA_UD and that is what you are using as reference in these alignments?

ADD REPLY • link 5.9 years ago by GenoMax 151k

0

Entering edit mode

Yes, I did my reference with IDBA_UD but I used the biggest scaffold of one library. I though bin all libraries and create a reference. I will prove dont remove those reads.

Do you know how to stablish the cutoff of 20 reads covering a base to call SNP's? I mean, after visualized this aligment I use this function:

    samtools mpileup -g -f scaffold_filt_PV014_OD_DB.fa scaffold_nofilt_OD_PV002_index.bam > scaffold_filt_OD_PV078_q10_variant.bcf

And then I call the variants:

   bcftools view -v -c -g scaffold_filt_OD_PV078_q10_variant.bcf > scaffold_filt_OD_PV003_q10_variant.vcf

How can I add that cutoff? There is an article that stablish a cutoff of MapQ, SNPquality and read depth? Thanks a lot

ADD REPLY • link 5.9 years ago by apl00028 ▴ 90