How to recompute coding sequences from a psl format ?
0
0
Entering edit mode
7.1 years ago
Thomas B. ▴ 30

Hi everyone,

I mapped coding sequences (CDSs) onto a genome using GMAP and got a report in psl format.

Is there a readily available tool / way to extract the genome sequences that matched - I mean those that correspond to blocks - then recompute coding sequences ?

genome alignment sequence GMAP CDS • 2.3k views
ADD COMMENT
1
Entering edit mode

You could convert your PSL file to BED using BEDOPS: http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/psl2bed.html

Then, convert the BED to GTF or GFF and extract FASTA sequence over these regions using gffread: http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread

ADD REPLY
0
Entering edit mode

Hi Kevin,

Thanks for your reply!

I converted my PSL file to BED using psl2bed as you suggested.

psl2bed < my_report.psl > my_report.bed

I then used the online bed_to_gff_converter tool from Galaxy to convert the BED file to GFF.

I finally used gffread to extract FASTA sequences as you also suggested.

./gffread my_report.gff -g my_ref_genome.fa -w my_transcripts.fa

But there is no sequence in the final output file.

Do you know what may be wrong? I would also like to ask you how you convert BED to GTF or GFF?

ADD REPLY
0
Entering edit mode

Hey Thomas,

Have you looked at the contents of your output files at each step in order to see if the format is okay? The BED-to-GTF format conversion is difficult, but you may want to take a look here: How To Convert Bed Format To Gtf?

Here too: http://onetipperday.sterding.com/2012/08/convert-bed-to-gtf.html

Kevin

ADD REPLY
1
Entering edit mode

Oh wait, there is a nice --exons=genomic option in GMAP that does what I want. However the output should be processed (something like cat my_report.txt | grep -v '<' > my_report.fasta) in order to obtain a true FASTA file. It seems also that if there are multiple matches for a same query, the default behaviour is to keep only the best one.

Now back to what you suggested. It is also possible to convert a BED file directly into FASTA using bedtools: http://bedtools.readthedocs.io/en/latest/

My command was bedtools getfasta -fi reference_genome.fasta -bed my_report.bed -split -fo my_report.fasta. The -split option aims to extract and concatenate the sequences from the BED blocks. Here each match is retained, even multiple matches for a same query.

Thanks for your help!

ADD REPLY
0
Entering edit mode

Hi Thomas, great that you got it all sorted out.

ADD REPLY

Login before adding your answer.

Traffic: 2532 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6