Question

Read counts an order of magnitude higher on one chromosome

0

Entering edit mode

2.5 years ago

Samuel • 0

Hi,

I am having an issue with a sequencing run that when demultiplexed, aligned, and filtered each individual has 1-2 million reads, but these reads are predominantly on one chromosome. For background these are oncorhynchus mykiss and o. clarki samples. The 6th chromosome has an order of magnitude higher reads than the other chromosomes. The libraries were prepared for rad-seq (sbf1) with 19 plates on one novaseq 6000 lane.

The demultiplexing was done using deML. Aligning to the O. mykiss reference genome used bwa and filtering/sorting in samtools.

I can't figure out if the issue is a library prep issue, sequencing issue, or something that I am doing wrong in the splitting and creation of the bam files? If anyone has had a similar problem please let me know.

Thanks, Sam

novaseq alignment Rad-seq sequencing demultiplex • 2.1k views

ADD COMMENT • link 2.5 years ago by Samuel • 0

0

Entering edit mode

I can't figure out if the issue is a library prep issue, sequencing issue, or something that I am doing wrong in the splitting and creation of the bam files?

You should also consider possible biological reasons for this observation, like aneuploidy, repeat sequences, nucleotide composition biases, etc.

I strongly suggest visualizing the read coverage of this chromosome. Is the increase in read coverage uniform or are there specific regions with lots of reads?

ADD REPLY • link 2.5 years ago by jv ★ 1.8k

0

Entering edit mode

Hi jv,

I selected two individuals at random from the plate to look at the depth across chromosome 6. It seems that there is one small region that is accounting for the majority of the reads. The individuals are colored differently. Others have sequenced these species using Rad-seq without this issue so not sure it is biological, but I am new to this. I appreciate your insight.

This is the depth on chromosome 6. Position along chromosome on x axis and depth on the y.

ADD REPLY • link 2.5 years ago by Samuel • 0

1

Entering edit mode

I mean it's clearly not the whole chromosome it's specific locus. How long is this region? What does the underlying sequence look like?

ADD REPLY • link 2.5 years ago by benformatics 4.0k

0

Entering edit mode

These regions seem to be ~60 bases long. The sequence for the omy06 region is "ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT". When I do a blast search it comes up as H. pylori, COVID-19, or synthetic construct. Seems to be pointing toward contamination.

ADD REPLY • link 2.5 years ago by Samuel • 0

0

Entering edit mode

Check my answer. These are adapters and should be trimmed prior to alignment. Sometimes one round of trimming isn't enough (in the case that you already trimmed your reads).

ADD REPLY • link 2.5 years ago by benformatics 4.0k

0

Entering edit mode

Could you do this plot for other chromosomes as well? The peak might be there, too.

ADD REPLY • link 2.5 years ago by Michael 55k

0

Entering edit mode

zoomed in at depth greater than 1000 depth at positions greater than 300 depth

ADD REPLY • link 2.5 years ago by Samuel • 0

score 2 · Answer 1 · 2022-05-26

2

Entering edit mode

2.5 years ago

benformatics 4.0k

Trim your adapters before you do your alignment.

ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT is the read2 adapter for some types of RAD-seq: https://github.com/MikkelSchubert/adapterremoval/issues/54 and https://patents.google.com/patent/US20160122753A1/en

I'm guessing this is the same for your case. Why the adapter sequence exists in the genome... who knows.

ADD COMMENT • link 2.5 years ago by benformatics 4.0k

0

Entering edit mode

Good catch. @OP, did you use bwa (or any local aligner) for alignment?

ADD REPLY • link 2.5 years ago by ATpoint 85k

0

Entering edit mode

Yes, I used bwa for alignment.

ADD REPLY • link 2.5 years ago by Samuel • 0

score 1 · Answer 2 · 2022-05-24

Others have sequenced these species using Rad-seq without this issue so not sure it is biological

Others who? Other researchers in your group? How do you know this isn't something others have observed?

Presumably you are doing research, and now you've made an observation, one that you assume varies from others' observations and that is ok, that is research, now you need to follow up on the observation and get a better understanding as to the source of these over-represented reads in your Rad-seq data. Only once you investigate can you know if it's artifact or something real.

Did you do quality control of your raw sequencing data? It is very important that you do. With QC you'll get a better sense of whether library prep issues or sequencing issues exist. Currently I wouldn't expect library prep or sequencing issues to cause this result but I've never worked with rad-seq data before so can't say for sure.

Some additional questions to consider: What genes or sequences are located in this region of chromosome 6? Do all of your samples have it?

score 1 · Answer 3 · 2022-05-24

Possibly this is not a complete explanation but keep in mind that salmonids have undergone several rounds of whole-genome duplications, see the genome paper (Berthelot et al 2014). In particular, see Fig. 2b of the same. Chromosome 6, 11, and partially sex, 6, 12, and 5, were all derived from a single ancestral karyotype by Ts3R and Ss4R duplication events. A collapsed and conserved repetitive region might be sufficient to attract a lot of reads.

score 1 · Answer 4 · 2022-05-24

1

Entering edit mode

2.5 years ago

ATpoint 85k

I am neither a RAD-seq person, nor familiar with these organisms but regions that are prone to accumulate excessive read counts are not uncommon for NGS experiments. For human and mouse there are even established blacklists to remove known artifact regions. These could be some kind of low-complexity regions that accumulate false alignments that then pile up or these regions have sequence similarities with regions that are not properly represented in the reference genome, yet present in the genome, hence also prone to accumulate false alignments. Did you check for the alignment scores of the reads in these regions? Are these low? Are there other datasets you could check to see whether this is a common issue?

ADD COMMENT • link 2.5 years ago by ATpoint 85k

0

Entering edit mode

The alignment seems good, 98% mapped. No it is not a common problem with other datasets I've checked.

607841 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 591688 + 0 supplementary 0 + 0 duplicates 597497 + 0 mapped (98.30% : N/A) 16153 + 0 paired in sequencing 8449 + 0 read1 7704 + 0 read2 0 + 0 properly paired (0.00% : N/A) 5809 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 5538 + 0 with mate mapped to a different chr 5343 + 0 with mate mapped to a different chr (mapQ>=5)

ADD REPLY • link 2.5 years ago by Samuel • 0