Question

bam file coordinate system change

0

Entering edit mode

7.1 years ago

Gregor Rot ▴ 540

Dear all,

i am mapping some long reads to the human transcriptome:

ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

Each transcript in this reference is one sequence, and the ID contains the genomic location of the transcript.

After i map my reads to this reference, i would like to visualize the resulting bam file in IGV. For that to happen, i would probably need to convert the coordinates in the bam file to the hg38 coordinate system (which is given in the id of each transcript).

What would be the best approach to change the bam file? Bam -> Bed + convert coordinates -> Bam?

Or is there a more elegant way to parse the bam with pysam directly, change the coordinates on the fly (by looking at the reference transcript id and computing the new coordinates of the aligned read in the hg38 space) and write a resulting (new) bam file?

Thanks, Gregor

bam RNA-Seq • 2.7k views

ADD COMMENT • link 7.1 years ago by Gregor Rot ▴ 540

2

Entering edit mode

What do you need to visualize ? In IGV you could just create a new REF genome using your Homo_sapiens.GRCh38.cdna.all.fa

to convert the BAM, the cdna file wouldn't be enough, because the structure of the cDNA (exon,intron) is missing in the fasta header. you don't have the information to split the cigar string at the intron-exon junctions.

ADD REPLY • link 7.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hi Pierre, thanks for the answer. I think there is enough information to convert the coordinates, for example, a transcript has this info: 14:22449113:22449125 in the ID, so where in the genome the sequence is spanning. Based on where on the transcript my read is aligning, i can then convert the coordinates to the genomic position by changing to chr 14 and changing the alignment position (0 = transcript start) by adding 22449113.

I would like to have genomic coordinates for my alignments since then i would like to retrieve some additional sequences around my reads to do further analysis. This makes sense? Thanks :)

ADD REPLY • link 7.1 years ago by Gregor Rot ▴ 540

0

Entering edit mode

Of course, i see, there are no introns in the cDNA file, the spanning region coordinates in the transcript IDs are then of course not enough to convert to genomic coordinates. Another approach would be to mask the non-coding genome fasta and map to that perhaps.

ADD REPLY • link 7.1 years ago by Gregor Rot ▴ 540