Splitting CRAM files

24 months ago

langziv ▴ 70

Hello.
It looks like the CRAM files I have consist of multiple genomes' data. If that's even possible, is there a way to split each file into separate ones so that each will include data from a single genome?

CRAM-files sequence-alignment • 2.0k views

ADD COMMENT • link 23 months ago by langziv ▴ 70

Why would you want to do that?

Anyway, see: How To Split A Bam File By Chromosome; How Can I Split Bam Into Chromosome (In A Loop) Using Samtools?; split sorted bam file chromosome wise; etc.
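A minimal sketch of the per-chromosome approach those threads describe, assuming a coordinate-sorted, indexed input and samtools on the PATH. The function name `list_contigs` and the file names in the comment are hypothetical, not taken from the linked threads.

```shell
# Print the reference sequence names declared in the @SQ header lines.
# Reads SAM header text on stdin (for example from `samtools view -H`).
list_contigs() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }'
}

# Hypothetical usage (in.cram must be indexed, and the reference must be
# resolvable for CRAM decoding and encoding):
#   samtools view -H in.cram | list_contigs | while read -r chrom; do
#       samtools view -C -o "in.${chrom}.cram" in.cram "$chrom"
#   done
```

Note that this splits by reference sequence, not by sample, so it only helps if "genome" here means chromosome.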

ADD REPLY • link 24 months ago by Pierre Lindenbaum 164k

I need to do variant calling, and I need to associate variants with their respective genome.

ADD REPLY • link 24 months ago by langziv ▴ 70

Most SV callers will accept a BED file or a range, so you can restrict calling to a specific interval.

ADD REPLY • link 24 months ago by Pierre Lindenbaum 164k

I noticed the problem after I did the variant calling. I got VCF files with no associations between variants and genomes.

ADD REPLY • link 24 months ago by langziv ▴ 70

If this is related to "Getting information on CRAM files from headers inside the files", then we don't know that there is actually more than one genome in the files you have.

My suspicion is that you don't have multiple genomes. Examine the read headers and see if you have multiple flowcells/lanes/flowcell serial numbers present.
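Besides the read names, the @RG header lines are another place to look: in the SAM specification each read group can carry an SM (sample) tag, so more than one distinct SM value would suggest more than one sample. A minimal sketch, assuming the header is piped in as text; the function name `list_samples` is hypothetical.

```shell
# Print the distinct SM (sample) values found in @RG header lines.
# Reads SAM header text on stdin (for example from `samtools view -H`).
list_samples() {
    awk -F'\t' '$1 == "@RG" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SM:/) print substr($i, 4)
    }' | sort -u
}

# Hypothetical usage:
#   samtools view -H file.cram | list_samples
```

If this prints a single name (or nothing, when the file has no read groups), the file most likely holds one sample. If several names do turn up, `samtools split` can write one output file per read group.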

ADD REPLY • link 24 months ago by GenoMax 147k

Thanks @genomax.
I'm not sure how to identify flowcells/lanes/flowcell serial numbers in CRAM files. Can you give an example?

ADD REPLY • link 24 months ago by langziv ▴ 70

You will need to examine the read IDs in column 1 of the alignments.

Sequence identifiers are explained in this Wikipedia section.
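As a rough illustration, here is a sketch that labels the leading fields of a read name, assuming the newer Illumina layout (instrument:run:flowcell:lane:tile:x:y). Older Illumina names use a different layout and would be mislabeled. The function name `describe_read_name` is hypothetical.

```shell
# Label the first four colon-separated fields of Illumina-style read names.
# Reads one read name per line on stdin; names with fewer than four
# fields are skipped.
describe_read_name() {
    awk -F':' 'NF >= 4 {
        printf "instrument=%s run=%s flowcell=%s lane=%s\n", $1, $2, $3, $4
    }'
}

# Hypothetical usage:
#   samtools view file.cram | cut -f1 | head | describe_read_name
```

More than one distinct flowcell or lane only tells you that the reads came from several runs or lanes. It does not by itself show that several genomes are present.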

ADD REPLY • link 24 months ago by GenoMax 147k

Thanks, but this link explains the structure of FASTA files. I don't have FASTA files. My initial data are in CRAM files.

ADD REPLY • link 24 months ago by langziv ▴ 70

"Thanks, but this link explains the structure of FASTA files."

It's not. It's about FASTQ.

Read carefully what GenoMax said:

"You will need to examine the read IDs in column 1 of the alignments."

ADD REPLY • link 24 months ago by Pierre Lindenbaum 164k

So I need to convert the CRAM files to FASTQ files in order to get that information?

ADD REPLY • link 24 months ago by langziv ▴ 70

Yes. You could do this on the fly.

$ samtools view new.bam | cut -f1 -d$'\t' | cut -f1-4 -d$':' | sort | uniq 
NS500177:19:H2HLYAFXX:1
NS500177:19:H2HLYAFXX:2
NS500177:19:H2HLYAFXX:3
NS500177:19:H2HLYAFXX:4
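A variation on the pipeline above that also counts reads per instrument:run:flowcell:lane, sketched under the same read-name assumption. It tallies in awk, so only the short summary is sorted rather than every read name. The function name `count_reads_per_lane` is hypothetical.

```shell
# Count alignments per instrument:run:flowcell:lane prefix.
# Reads SAM records (tab-separated, read name in column 1) on stdin and
# prints "prefix<TAB>count", sorted by prefix.
count_reads_per_lane() {
    awk -F'\t' '{
        split($1, f, ":")
        key = f[1] ":" f[2] ":" f[3] ":" f[4]
        n[key]++
    }
    END { for (k in n) printf "%s\t%d\n", k, n[k] }' | sort
}

# Hypothetical usage (samtools view reads CRAM directly; if the reference
# cannot be located automatically, pass it with -T ref.fa):
#   samtools view file.cram | count_reads_per_lane
```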