Splitting CRAM files
24 months ago
langziv ▴ 70

Hello.
It looks like the CRAM files I have consist of multiple genomes' data. If that's even possible, is there a way to split each file into separate ones so that each will include data from a single genome?

CRAM-files sequence-alignment • 2.0k views

I need to do variant calling, and I need to associate variants with their respective genome.


Most SV callers accept a BED file or a genomic range, so you can restrict calling to a specific interval.


I noticed the problem after I did the variant calling. I got VCF files with no associations between variants and genomes.


If this is related to "Getting information on CRAM files from headers inside the files", then we don't know that there is actually more than one genome in the files you have.

My suspicion is that you don't have multiple genomes. Examine the read headers and see whether multiple flowcells, lanes, or flowcell serial numbers are present.
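
A complementary check, sketched here under assumptions: if the files carry `@RG` (read group) header lines, their `SM` tags name the sample each read group belongs to. The helper name `rg_samples` is made up for illustration, and the files may have no `@RG` lines at all, in which case it prints nothing.

```shell
# Hypothetical helper: list the distinct sample names (SM tags)
# found in @RG lines of a SAM/BAM/CRAM header read from stdin.
rg_samples() {
  awk -F'\t' '/^@RG/ {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^SM:/) print substr($i, 4)   # drop the "SM:" prefix
  }' | sort -u
}
```

Typical use would be `samtools view -H sample.cram | rg_samples`, where `sample.cram` is a placeholder name. More than one name in the output suggests the file really does hold several samples, and `samtools split` can then write one file per read group.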


Thanks @genomax.
I'm not sure how to identify flowcells, lanes, or flowcell serial numbers in CRAM files. Can you give an example?


You will need to examine the read IDs in column 1 of the alignments.

Sequence identifiers are explained in this Wikipedia section.
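
As a rough illustration of that layout, assuming the common Illumina convention of `instrument:run:flowcell:lane:tile:x:y`: the sketch below labels the first four fields of a read ID. The function name `describe_read_id` and the tile and coordinate values in the sample ID are invented for the example.

```shell
# Hypothetical helper: label the first four colon-separated fields
# of an Illumina-style read ID passed as the first argument.
describe_read_id() {
  printf '%s\n' "$1" | awk -F: '{
    printf "instrument=%s run=%s flowcell=%s lane=%s\n", $1, $2, $3, $4
  }'
}
```

For example, `describe_read_id 'NS500177:19:H2HLYAFXX:1:11101:10000:2000'` reports instrument `NS500177`, run `19`, flowcell `H2HLYAFXX`, and lane `1`. Read IDs that were renamed during processing will not follow this pattern.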


Thanks, but this link explains the structure of FASTA files. I don't have FASTA files. My initial data are in CRAM files.


"Thanks, but this link explains the structure of FASTA files."

It doesn't. The link is about FASTQ, not FASTA.

Read carefully what genomax said:

"You will need to examine the read IDs in column 1 of the alignments."


So I need to convert the CRAM files to FASTQ files in order to get that information?


Yes, in effect, but you can do it on the fly without writing FASTQ files, because the read IDs are column 1 of each alignment record.

$ samtools view new.bam | cut -f1 -d$'\t' | cut -f1-4 -d$':' | sort | uniq 
NS500177:19:H2HLYAFXX:1
NS500177:19:H2HLYAFXX:2
NS500177:19:H2HLYAFXX:3
NS500177:19:H2HLYAFXX:4

All of these reads come from the same flowcell (FC), sequenced across 4 lanes.
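
To see how the reads are distributed rather than only which prefixes exist, the same pipeline can count records per flowcell and lane. This is a minimal sketch; the function name `count_by_lane` is made up, and it counts alignment records, so secondary and supplementary alignments are included.

```shell
# Hypothetical helper: count SAM records per instrument:run:flowcell:lane
# prefix. Expects headerless SAM records on stdin.
count_by_lane() {
  cut -f1 \
    | cut -d: -f1-4 \
    | sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'   # prefix, then count
}
```

Usage would look like `samtools view new.bam | count_by_lane`. Keep in mind that several samples can be multiplexed on one lane, so a single flowcell is a hint rather than proof of a single genome.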


Thanks.
So does that mean it's a single genome?
