Hi,
I have done an assembly by using reads from a bird infected with a parasite (by using Trinity) >> Trinity.fasta
I have created a index for a genome by using:
gmap_build -d genome -k 15 GCF_000534875.1_SCA1_genomic.fa
A directory called "genome" has been create where now there is:
ll
total 5772532
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 12774633 Jan 15 11:28 genome.chromosome
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 20699247 Jan 15 11:28 genome.chromosome.iit
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 6 Jan 15 11:28 genome.chrsubset
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 19461467 Jan 15 11:28 genome.contig
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 21602471 Jan 15 11:28 genome.contig.iit
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 432037584 Jan 15 11:29 genome.genomebits128
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 432037548 Jan 15 11:29 genome.genomecomp
drwxr-xr-x. 2 luz_garcia_longoria luz_garcia_longoria 4096 Jan 15 11:27 genome.maps
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 72007680 Jan 15 11:53 genome.ref081locoffsets64meta
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 660530864 Jan 15 11:53 genome.ref081locoffsets64strm
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 2249041724 Jan 15 11:53 genome.ref081locpositions
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 140648 Jan 15 11:53 genome.ref081loctable
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 134217744 Jan 15 11:35 genome.ref153offsets64meta
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 360216272 Jan 15 11:35 genome.ref153offsets64strm
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 1496235468 Jan 15 11:44 genome.ref153positions
-rw-rw-r--. 1 luz_garcia_longoria luz_garcia_longoria 7 Jan 15 11:27 genome.version
(hope I did it correctly)
Now, in order to "clean" my assembly (Trinity.fasta file) I want to align the Trinity transcript contigs to the genome BUT I only want to keep those contigs that DO NOT match with the genome. I have been reading the --help information that gmap provides but I don't have a clear idea about what output options use. My idea is something like:
gmap -D . -d genome trinity_out_dir/Trinity.fasta --failsonly -f samse > trinity_gmap.sam
According to gmap help:
--failsonly Print only failed alignments, those with no results
So, in theory, by using this command the program should align my assembly to the genome and create an output file with only those contigs that do not match with the genome, right?
Please, let me know if I am doing something wrong.
Thanks,
Luz
Is your "genome" - assembled with trinity - actually a trasncriptome assembly ?
No. I downloaded the genome from NCBI.
Very very specific question. Just try it ?
I remember from using Gmap extensively in the past it does a great job of creating GFF3 files of correctly mapped transcripts with intron exon structures which you can view in IGV JBrowse etc. Excellent.
It would also print a list of contigs which no mapping was found for on the screen. Even if the option you suggested above does not work, you could use the list of standard contigs output and write a script, eg in biopython, to select the unmapped contigs.
I would certainly use multiple mappers to get a robust result though, eg bwa mem or minimap2 as well if they perform well on your dataset.