Genome 2 GEnome Alignment and Unmapped contigs extraction
0
0
Entering edit mode
10 weeks ago
SomeOne ▴ 170

HI,

I have two genome assemblies which i am aligning (query.fasta vs Ref.fasta) to identify contigs which are not part of Ref.fasta.

Background

  • The Ref.fasta fungal genome has 11 chromsomes in NCBI RefSeq assembly.
  • The Query genome is the same fungal strain as Ref but has an EXTRA chromosome transfered to it. CHR1-11 same as Ref.fasta + Chr1 (extra transfered)
  • Then Sequenced with illumina 150bpx2 and genome assembled De-novo using SPAdes v4.0. (Contigs < 5000bp removed)

What I did so far

1. Aligned the query.fasta to Ref.fasta using minimap2 via command:

minimap2 -x asm5 -t 92 $Ref_fasta $Qry_fasta > $output_dir/${prefix}.alignment.paf

2. Extracted the unmapped contigs from Query.fasta using this python snippet.

python3 - << EOF
import pandas as pd
from Bio import SeqIO

### Load PAF file
paf_file = "$output_dir/${prefix}.alignment.paf"
paf = pd.read_csv(paf_file, sep="\t", header=None)
### Columns of interest: Query sequence name (0), Target sequence name (5)
mapped_contigs = set(paf[0].unique())
### Load all contigs from the query genome (Chr14-47-3a.fasta)
query_fasta = "$Qry_fasta"
contigs = list(SeqIO.parse(query_fasta, "fasta"))
### Identify unmapped contigs
unmapped_contigs = [contig for contig in contigs if contig.id not in mapped_contigs]
### Save unmapped contigs to a new FASTA file
output_fasta = "$output_dir/${Qry_name}.unmapped_contigs.fasta"
with open(output_fasta, "w") as output_handle:
    SeqIO.write(unmapped_contigs, output_handle, "fasta")
print(f"Unmapped contigs saved to '{output_fasta}'.")
EOF

3. Results i get:

With this approach, i was able to identify 5 contigs in query.fasta which showed no alignment at all with the Ref.fasta. but these 5 contigs do not cover up the entire transfered chromosome.

4. Question

  • For alignment, what criteria i should add. like if contig 2 chromosome alignment is less then a certain %identity, I should consider it as unmapped ?
  • Is it a good approach or if you have some other ideas to tackle this, kindly share your views.
assemblies minimap2 genome alignment fungus • 523 views
ADD COMMENT
0
Entering edit mode

Hi,

excluding contigs smaller 5 kbp sounds too strict for me, you might miss some regions of your transferred chromosome .

As an alternative, you can try FastANI; you can visualize your covered regions.

ADD REPLY
0
Entering edit mode

Hi, Can you suggest some criteria or method to identify a good cutoff limit to exclude contigs smaller than that limit.

I will have a look at FastANI. Thanks

ADD REPLY
0
Entering edit mode

From working with bacteria and especially Prokka, I remember the minimal contig length is 200 bp.

ADD REPLY
0
Entering edit mode

CHR1-11 same as Ref.fasta + Chr1 (extra transfered)

What does this exactly mean? There is an extra copy of Chr1 that is stably maintained (even if your fungus is normally haploid)? Are these sequence differences/rearrangements in this extra chromosome (or you don't know that).

but these 5 contigs do not cover up the entire transfered chromosome.

Does that mean they do partly cover the extra Chr1? How do you know they are from that extra copy?

It would be tricky to detect this extra copy as strictly 2x coverage for Chr1 compared to other chromosomes.

ADD REPLY
0
Entering edit mode

Actially,

The ref.fasta is an endophyte strain. having total 11 chromosomes in genome assembly. We took the same strain and transfered an accessory chromosome from another fungus to it. so Chr1-11 + chr12= 12 in transformant assembly.

now i wanted to mapp this transformant genome assembly against the reference.

theoratically the contigs in transformant assembly which are part of chr1-11 of ref would map to these 11 chromosomes and the extra transfered chromosome wont map.

in that way i can extract the contigs of that chromosome.

ADD REPLY
0
Entering edit mode

Just to confirm. You are aligning against the original endophyte strain as reference and are interested in contigs that do not map? In theory those would belong to the 12th Chr.

How similar are sequences from Chr12 with original genome? When you do an assembly there is a good likely hood that some of the extraneous sequence is going to be assembled with main genome, if there are sequence similarities.

If you have the sequence of the 12th chromosome you may want to use bbsplit.sh or seal.sh (that uses k-mer based read classification) from BBMap suite (guide here: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/seal-guide/ ) to bin your reads prior to assembly so you can remove reads belonging to Chr12 and then assemble them independently.

ADD REPLY
0
Entering edit mode

Just to confirm. You are aligning against the original endophyte strain as reference and are interested in contigs that do not map? In theory those would belong to the 12th Chr.

Yes. That is the goal.

How similar are sequences from Chr12 with original genome? When you do an assembly there is a good likely hood that some of the extraneous sequence is going to be assembled with main genome, if there are sequence similarities.

Well, theoratically, the endophyte strain does not has this chromosome at all (only 11 chr). this Chr12 is from another strain and is part of its accessory chromosomes. the genome assembly of donner strain is incomplete (scaffold level) where only 3 scaffolds are known to be part of chr12. (small ones)

As far as genomic comparison is conserned: the core genome (11 chromosomes) are considered conserved and remaining are accessory. with high Transposons and low gene density. and this transfered chr is accessory.

If you have the sequence of the 12th chromosome you may want to use bbsplit.sh or seal.sh (that uses k-mer based read classification) from BBMap suite (guide here: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/seal-guide/ ) to bin your reads prior to assembly so you can remove reads belonging to Chr12 and then assemble them independently.

i tried something similar,

  1. Mapped raw reads to Reference genome. and extracted all unmapped reads.

    Reference size 50,356,703
    Number of reads 44,835,528
    Mapped reads 42,827,164 / 95.52%
    Unmapped reads 2,008,364 / 4.48%
    Mapped paired reads 42,827,164 / 95.52%
    
  2. Assembled them using SPAdes v4.0.

these assemblies do map to required resion but number of contigs is too much.

ADD REPLY

Login before adding your answer.

Traffic: 1624 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6