Hi everyone,
As you may know, NCBI offers an "assembly-assembly remapping service", which maps the contigs of one genome (let's call it Genome 1) onto the assembly of another genome (let's call it GenomeRef). My goal is to generate a new FASTA file that contains the sequences of all my mapped contigs, in the order given by the mapping, with every GenomeRef position that was not mapped replaced by N.
I downloaded the GFF file provided on the NCBI website.
Example of a line:
Contig1 RefSeq match 49000 49020 . + . ID=xxxxxxxxxxxx;Target=GenomeRefContig5 6467734 6467754 +;best_on_query=1;best_on_query_same_unit=1;best_on_subject=1;best_on_subject_same_unit=1;gap_count=0;genomic_to_genomic=1;num_ident=21;num_mismatch=0;pct_coverage=0.000282069;pct_coverage_hiqual=0.000282069;pct_ident_quantized=98;pct_identity_gap=100;pct_identity_gapopen_only=100;pct_identity_ungap=100;reciprocity=3;same_unit_reciprocity=3
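For illustration, here is a minimal Python sketch of pulling the Genome 1 and GenomeRef coordinates out of one such line. It assumes the real file is tab-separated (the tabs are not visible in the line as pasted here) and that the `Target` attribute follows the GFF3 layout `name start end [strand]`. The function name `parse_gff_match` and the dictionary keys are made up for this example.

```python
def parse_gff_match(line):
    """Split one GFF3 'match' line into query (Genome 1) and target (GenomeRef) fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF columns")

    # Column 9 holds key=value pairs separated by ';'
    attrs = dict(item.split("=", 1) for item in cols[8].split(";") if item)

    # Target looks like: "GenomeRefContig5 6467734 6467754 +" (strand is optional)
    parts = attrs["Target"].split()
    t_strand = parts[3] if len(parts) > 3 else "+"

    return {
        "query": cols[0],
        "q_start": int(cols[3]),
        "q_end": int(cols[4]),
        "q_strand": cols[6],
        "target": parts[0],
        "t_start": int(parts[1]),
        "t_end": int(parts[2]),
        "t_strand": t_strand,
    }


# Shortened version of the example line, with explicit tabs
example = (
    "Contig1\tRefSeq\tmatch\t49000\t49020\t.\t+\t.\t"
    "ID=xxxxxxxxxxxx;Target=GenomeRefContig5 6467734 6467754 +;gap_count=0"
)
record = parse_gff_match(example)
```

With records in this shape, sorting by `(record["target"], record["t_start"])` would give the GenomeRef order rather than the Genome 1 order.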
The problems I am facing are:
The GFF file is sorted by the Genome 1 contigs, not by the GenomeRef sequences.
I think I could use samtools faidx in a for loop to retrieve all my contig sequences from the Genome 1 FASTA file, but how do I fill the gaps between those contigs with N, based on the gap lengths in GenomeRef?
[There will be a lot of gaps, since Genome 1 is a very poorly sequenced genome while GenomeRef is a high-quality assembly (PacBio).]
I hope this is clear.
Thanks a lot!
What I have already been able to do:
Remove the extra information I don't need:
Sort the simplified file by GenomeRef scaffold and remove the headers:
Then I split each piece of information into a separate file:
Then, for each line, I can run samtools to fetch the corresponding Genome 1 sequence and append it to a file named after the GenomeRef scaffold:
But I really don't see how I can add the Ns...
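To make the N-filling step concrete, here is a rough Python sketch of one possible approach for a single GenomeRef scaffold. It is only an illustration under several assumptions: each match is ungapped (so the fetched Genome 1 sequence has exactly the length of the GenomeRef interval), minus-strand pieces have already been reverse-complemented, coordinates are 1-based and inclusive as in GFF, and the scaffold length is known (for example from a `.fai` index). The name `pad_with_n` is hypothetical.

```python
def pad_with_n(ref_length, placements):
    """Lay mapped pieces out at their GenomeRef coordinates, with N elsewhere.

    placements: iterable of (t_start, t_end, seq), 1-based inclusive,
    where seq is the Genome 1 sequence mapped to that GenomeRef interval.
    """
    out = []
    pos = 1  # next GenomeRef position that still has to be written

    for t_start, t_end, seq in sorted(placements, key=lambda p: p[0]):
        span = t_end - t_start + 1
        if len(seq) != span:
            raise ValueError("sequence length does not match the target interval")

        if t_start < pos:
            # Overlaps the previous piece: keep what is already written
            skip = pos - t_start
            if skip >= span:
                continue  # fully covered by earlier pieces
            seq = seq[skip:]
            t_start = pos

        out.append("N" * (t_start - pos))  # unmapped stretch before this piece
        out.append(seq)
        pos = t_end + 1

    out.append("N" * (ref_length - pos + 1))  # unmapped tail of the scaffold
    return "".join(out)


# Toy scaffold of length 10 with two mapped pieces
padded = pad_with_n(10, [(8, 9, "TT"), (3, 5, "ACG")])
print(padded)  # NNACGNNTTN
```

Sorting by the GenomeRef start inside the function means the input order of the GFF lines does not matter. How overlapping matches should really be resolved (here the earlier piece simply wins) is a design choice that depends on the data.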