Entering edit mode
7.1 years ago
Thomas B.
▴
30
Hi everyone,
I mapped coding sequences (CDSs) onto a genome using GMAP and got a report in psl format.
Is there a readily available tool / way to extract the genome sequences that matched - I mean those that correspond to blocks - then recompute coding sequences ?
You could convert your PSL file to BED using BEDOPS: http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/psl2bed.html
Then, convert the BED to GTF or GFF and extract FASTA sequence over these regions using gffread: http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread
Hi Kevin,
Thanks for your reply!
I converted my PSL file to BED using psl2bed as you suggested.
psl2bed < my_report.psl > my_report.bed
I then used the online bed_to_gff_converter tool from Galaxy to convert the BED file to GFF.
I finally used gffread to extract FASTA sequences as you also suggested.
./gffread my_report.gff -g my_ref_genome.fa -w my_transcripts.fa
But there is no sequence in the final output file.
Do you know what may be wrong? I would also like to ask you how you convert BED to GTF or GFF?
Hey Thomas,
Have you looked at the contents of your output files at each step in order to see if the format is okay? The BED-to-GTF format conversion is difficult, but you may want to take a look here: How To Convert Bed Format To Gtf?
Here too: http://onetipperday.sterding.com/2012/08/convert-bed-to-gtf.html
Kevin
Oh wait, there is a nice
--exons=genomic
option in GMAP that does what I want. However the output should be processed (something likecat my_report.txt | grep -v '<' > my_report.fasta
) in order to obtain a true FASTA file. It seems also that if there are multiple matches for a same query, the default behaviour is to keep only the best one.Now back to what you suggested. It is also possible to convert a BED file directly into FASTA using bedtools: http://bedtools.readthedocs.io/en/latest/
My command was
bedtools getfasta -fi reference_genome.fasta -bed my_report.bed -split -fo my_report.fasta
. The-split
option aims to extract and concatenate the sequences from the BED blocks. Here each match is retained, even multiple matches for a same query.Thanks for your help!
Hi Thomas, great that you got it all sorted out.