What is the recommended reference genome sequence ID format for generating BAM files to be attached remotely to UCSC and Ensembl?
From what I've seen so far, it seems that one (UCSC) likes to have 'chr'N sequence IDs, and the other (Ensembl) doesn't want 'chr' in front of the sequence ID.
I like having the 'chr' prefix because it makes it easier for me to clearly identify the chromosome column, to join the information with the UCSC data, to 'grep' a chromosome, etc.
As far as I know, there is no official recommendation for any particular contig ID format; it's just a matter of being consistent throughout your entire pipeline.
Take our own experience, for instance: we work with BED files from Agilent that define our resequencing regions using UCSC's nomenclature (chrN), and afterwards we use annotation tools that rely on the same internal coding, so when we built our reference multi-FASTA sequence from UCSC's FASTA contigs we kept their nomenclature.
I guess you just have to decide which convention helps you the most, weighing the advantages and disadvantages of each, as Pierre did.
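To make the "be consistent" advice a bit more concrete, here is a minimal sketch of a check that every sequence name used in a BED file also exists in the reference FASTA. The function names (`fasta_names`, `bed_names`, `missing_from_reference`) and the toy data are made up for illustration; it is plain Python working on lines of text, not part of any existing tool.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines (text after '>' up to the first space)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def bed_names(lines):
    """Collect the names in the first column of BED lines, skipping comments and track/browser lines."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split()[0])
    return names


def missing_from_reference(fasta_lines, bed_lines):
    """Return the BED sequence names that do not appear in the FASTA, sorted."""
    return sorted(bed_names(bed_lines) - fasta_names(fasta_lines))


if __name__ == "__main__":
    # Hypothetical toy data: the FASTA uses chrN, one BED line uses a bare number.
    fasta = [">chr1 some description", "ACGT", ">chr2", "GGCC"]
    bed = ["track name=regions", "chr1\t10\t20", "2\t5\t15"]
    print(missing_from_reference(fasta, bed))  # prints ['2']
```

With real files you would pass open file handles instead of lists, since both functions only iterate over lines.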
We should probably ask both UCSC and Ensembl to give the option to use both. Or perhaps someone could come up with a way of adding synonyms when indexing the sequence, so that you could use several synonyms for the seq_id when using tabix or custom tracks in a genome browser.
I am using BAM files produced by different labs, and the seq_ids of their references come as chr[0-9XY], [0-9XY] and even Chr[0-9XY], so I have to keep three different references around for use with tabix and other tools.
I also prefer to have a 'chr' prefix in the FASTA, as long as I don't need to sort the multi-FASTA entries by chromosome ;-)
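Until browsers or indexers support synonyms, one workaround for the three naming styles mentioned above is to normalize the names yourself. The following is only a rough sketch: `strip_chr`, `add_chr` and `convert_line` are hypothetical helpers, it only handles the chrN / ChrN / N styles, and the M/MT mapping is my own addition based on the usual UCSC (chrM) versus Ensembl (MT) naming.

```python
import re

# Matches a leading 'chr' in any capitalisation (chr, Chr, CHR).
_CHR_PREFIX = re.compile(r"^chr", re.IGNORECASE)


def strip_chr(name):
    """Return the name without a 'chr' prefix, e.g. 'Chr1' -> '1'."""
    bare = _CHR_PREFIX.sub("", name)
    # Assumption: the mitochondrial sequence is 'M' with a prefix and 'MT' without.
    return "MT" if bare.upper() in ("M", "MT") else bare


def add_chr(name):
    """Return the name with a lowercase 'chr' prefix, e.g. '1' or 'Chr1' -> 'chr1'."""
    bare = strip_chr(name)
    return "chrM" if bare == "MT" else "chr" + bare


def convert_line(line, style="ucsc"):
    """Rewrite the first tab-separated column of a line to the chosen style.

    style="ucsc" gives chrN names; any other value gives bare names.
    """
    fields = line.rstrip("\n").split("\t")
    fields[0] = add_chr(fields[0]) if style == "ucsc" else strip_chr(fields[0])
    return "\t".join(fields)


if __name__ == "__main__":
    for name in ("chr1", "1", "Chr1", "chrX", "MT"):
        print(name, add_chr(name), strip_chr(name))
    print(convert_line("Chr2\t5\t15\n", style="ensembl"))
```

This only renames text columns (BED, VCF-like tables); the sequence names inside a BAM live in its header, so those files would need to be reheadered separately.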