Any ways of extracting and filtering read sequences from cram files with samtools?
1
0
Entering edit mode
9 months ago
sacryt • 0

I am trying to extract read sequences from a cram file. It would be easiest with pysam but this is not working for me on my work environment in DNAnexus RAP for some reason.

Is there an alternative way of printing and saving just the read sequences from a cram file with samtools e.g. to an array or text file? I want just the sequence, not any other information associated like cigar strings

Additionally, is there any way to filter the reads, e.g. according to mapping quality before displaying them?

Thank you

samtools cram reads sequence • 1.2k views
ADD COMMENT
4
Entering edit mode
9 months ago
Huiyang ▴ 190

You can use "samtools fastq" or "samtools fasta" to convert CRAM to FASTQ or FASTA. If your CRAM is sorted by position, you need to use "samtools sort -n" to convert your CRAM to a read name sorted format before output.

ADD COMMENT
0
Entering edit mode

Thank you, I've converted them to FASTQ files, though they are now much larger.

I was planning on trying to use the read mapping quality field in the cram files to specifically extract the reads which were of low quality - is there any way for me to do that with samtools, and then I could convert them to FASTQ?

ADD REPLY
0
Entering edit mode

You can use the '-F' parameter in sambamba view for filtering, for example, -F "mapping_quality < 10" to obtain reads with mapping quality lower than 10. It is worth noting that versions of sambamba >= 0.8 no longer support the CRAM format, while versions < 0.8 do support the CRAM format.

ADD REPLY
0
Entering edit mode

Try: samtools view -u -e 'mapq<10' in.cram | samtools fasta -

The main samtools.1 man page has a section "FILTER EXPRESSIONS" describing the syntax. http://www.htslib.org/doc/samtools.html#FILTER_EXPRESSIONS.

It's worth noting that if your read data is paired, then any filtering is problematic as it'll break pairs up.

ADD REPLY
0
Entering edit mode

Thank you! Sorry I just saw this, is there an alternative for paired read data? I ran this

samtools view -u -e 'mapq<10' "path/in.cram" > "path/out.cram"

but it is running indefinitely for several hours, much longer than just viewing the cram file itself. I'm guessing it's because my data might be paired? I wasn't sure

ADD REPLY

Login before adding your answer.

Traffic: 1595 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6