Hello Biostars,
My question concerns the processing of quite a few files in fasta or bed format in order to do some conservation analysis.
Basically, I have about 50 human sequence files that span the entire genome, grouped by certain features. My plan is to run a comparative genomics analysis against other vertebrates (for now I have dog and fugu in mind).
My general idea was to lift my intervals over to those species using the UCSC LiftOver tool, but I was warned that this may cause problems since it was designed to transfer coordinates between human assemblies. For now, while taking the results with a grain of salt, this is my first idea. After running liftOver, I planned to use Clustal Omega to build multiple alignments of the sequences.
The UCSC Genome Browser actually provides the information I want for all my BED intervals / FASTA files when I enter an interval "by hand", but since I have quite a lot of data this is of course not practical.
My question therefore is: is my LiftOver approach actually useful, or is there a pipeline or hands-on tutorial somebody could point me to for analyzing this kind of data? I am somewhat stuck at this point. Any help would be greatly appreciated.
Thank you,
Have you tried downloading the relevant conservation tracks from the UCSC Table Browser? You can download them for the entire genome and intersect them with your regions of interest.
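A minimal sketch of that intersection step, assuming the conservation track was exported as bedGraph-style lines (chrom, start, end, score; 0-based, half-open) and the regions are in BED format. The function name `mean_scores` and the toy data are hypothetical, made up for illustration. For genome-scale files a dedicated interval tool would likely be a better fit than this in-memory loop.

```python
from collections import defaultdict

SKIP = ("track", "browser", "#")  # header/comment prefixes to ignore


def mean_scores(bed_lines, bedgraph_lines):
    """Return (chrom, start, end, mean) for each BED region.

    The mean is weighted by overlap length. Bases of a region that no
    score interval covers are ignored; a region without any covered
    base gets None.
    """
    scores = defaultdict(list)
    for line in bedgraph_lines:
        if not line.strip() or line.startswith(SKIP):
            continue
        chrom, start, end, value = line.split()[:4]
        scores[chrom].append((int(start), int(end), float(value)))

    results = []
    for line in bed_lines:
        if not line.strip() or line.startswith(SKIP):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        total = covered = 0
        for s, e, value in scores[chrom]:
            overlap = min(end, e) - max(start, s)
            if overlap > 0:
                total += overlap * value
                covered += overlap
        results.append((chrom, start, end, total / covered if covered else None))
    return results


# Toy data (invented): two regions, two score intervals on chr1.
bed = ["chr1\t100\t200", "chr2\t0\t50"]
bedgraph = ["chr1\t90\t150\t0.5", "chr1\t150\t250\t1.0"]
for row in mean_scores(bed, bedgraph):
    print(row)
```

The loop compares every region against every score interval on the same chromosome, which is fine for a quick check but slow for whole-genome tracks; sorting both inputs or using an interval index would be the usual next step.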
I just downloaded them in *.maf format and will try your proposed approach. I am just not quite sure how to handle the *.maf format together with BED/FASTA files. Maybe you have another tip? I already have the entire genome downloaded for hg19.
Thanks for your help!
The easiest way would probably be to convert MAF to VCF and then use vcftools to intersect it with the BED files. See this thread for conversion options: converting maf to vcf.
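If working with the alignment blocks themselves is preferable, they can also be matched against BED intervals directly. Note that "MAF" names two unrelated formats (Multiple Alignment Format and Mutation Annotation Format); the sketch below assumes the UCSC multiple alignment format, where each `s` line holds source, 0-based start, size, strand, source size, and aligned text. The function names, the assembly labels `canFam3` and `fr3`, and the toy alignment are all assumptions made up for illustration.

```python
def read_maf_blocks(lines):
    """Yield each alignment block as a list of (src, start, size, strand, text)."""
    block = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "s":
            src, start, size, strand, _src_size, text = fields[1:7]
            block.append((src, int(start), int(size), strand, text))
        elif block and (not fields or fields[0] == "a"):
            # A blank line or a new "a" line closes the current block.
            yield block
            block = []
    if block:
        yield block


def blocks_overlapping(blocks, ref_src, start, end):
    """Keep blocks whose first (reference) row overlaps [start, end).

    Assumes the reference row is on the + strand, so its start
    coordinate can be compared with BED coordinates directly.
    """
    hits = []
    for block in blocks:
        src, b_start, size, strand, _text = block[0]
        if (src == ref_src and strand == "+"
                and b_start < end and start < b_start + size):
            hits.append(block)
    return hits


# Toy alignment (invented sequences and sizes).
maf = """##maf version=1
a score=100.0
s hg19.chr1     100 10 + 1000 ACGTACGTAC
s canFam3.chr2  500 10 + 1000 ACGTACGAAC

a score=50.0
s hg19.chr1     300 5 + 1000 GGGTT
s fr3.chr5      40  5 - 1000 GGCTT
""".splitlines()

blocks = list(read_maf_blocks(maf))
hits = blocks_overlapping(blocks, "hg19.chr1", 105, 120)
for src, start, size, strand, text in hits[0]:
    print(src, start, size, strand, text)
```

This returns whole blocks rather than trimming them to the interval, and it ignores the `i`, `e`, and `q` annotation lines that UCSC files may contain. Trimming would need to account for gap characters in the aligned text when mapping reference coordinates to alignment columns.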