Paired-end to Single-end
2
0
Entering edit mode
3.3 years ago
kstangline ▴ 80

I have a bam file that is paired ended.

I'm told our pipeline only uses single-end (R1 only). How could I pull only the R1 reads from the bam?

bam • 1.5k views
ADD COMMENT
1
Entering edit mode
3.3 years ago

You seem unsure on which processing was applied to your original data. You know, you can view the commands used by looking at the BAM [SAM] header:

samtools view -H myalignment.bam ;

The last command used will appear at the bottom of the header (or should appear, at least). Please share here, if you wish.

Otherwise, there are many commands available to convert a BAM to FASTQ. Please see:

With these commands, one can control output of R1 and R2. In your case, we seem to not have 100% certainty about which was aligned, though. This will be revealed by observing the header, as mentioned above.

Kevin

ADD COMMENT
0
Entering edit mode

Hi Kevin,

The last command is below:

@SQ SN:ERCC-00168 LN:1024 @SQ SN:ERCC-00170 LN:1023 @SQ
SN:ERCC-00171 LN:505 @SQ SN:phiX174 LN:5386 @PG ID:STAR PN:STAR VN:STAR_2.5.1b CL:STAR --runThreadN 32 --genomeDir out
--genomeLoad NoSharedMemory --readFilesIn /cromwell_root/encode-processing/caper_out_v04_05/.caper_tmp/encode-public/2015/02/21/8582f524-e4a5-457f-a24c-d715b657e9fe/ENCFF549JWM.fastq.gz /cromwell_root/encode-processing/caper_out_v04_05/.caper_tmp/encode-public/2015/02/21/661e08d1-d0a1-450b-9a75-bac022d8e5a7/ENCFF923HGD.fastq.gz --readFilesCommand zcat --limitBAMsortRAM 120000000000 --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI AS NM MD --outSAMunmapped Within --outSAMheaderHD @HD
VN:1.4 SO:coordinate --outSAMheaderCommentFile COfile.txt
--outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1
--sjdbScore 1 --quantMode TranscriptomeSAM @CO user command line: STAR --genomeDir out --readFilesIn /cromwell_root/encode-processing/caper_out_v04_05/.caper_tmp/encode-public/2015/02/21/8582f524-e4a5-457f-a24c-d715b657e9fe/ENCFF549JWM.fastq.gz /cromwell_root/encode-processing/caper_out_v04_05/.caper_tmp/encode-public/2015/02/21/661e08d1-d0a1-450b-9a75-bac022d8e5a7/ENCFF923HGD.fastq.gz --readFilesCommand zcat --runThreadN 32 --genomeLoad NoSharedMemory --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outSAMheaderCommentFile COfile.txt --outSAMheaderHD @HD VN:1.4 SO:coordinate --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM --sjdbScore 1 --limitBAMsortRAM 120000000000

Could I use the following to separate into R1 and R2? I need to stay in bam format if possible.

samtools view -hbf 64 mydata.bam > R1.bam
samtools view -hbf 128 mydata.bam > R2.bam
ADD REPLY
0
Entering edit mode

It looks like the reads that were aligned are:

/cromwell_root/encode-processing/caper_out_v04_05/.caper_tmp/encode-public/2015/02/21/8582f524-e4a5-457f-a24c-d715b657e9fe/ENCFF549JWM.fastq.gz

/cromwell_root/encode-processing/caper_out_v04_05/.caper_tmp/encode-public/2015/02/21/661e08d1-d0a1-450b-9a75-bac022d8e5a7/ENCFF923HGD.fastq.gz

I trust that you know to what these relate? Are these paired or just 2 single-end files? Based on the STAR command, it can inferred that they are paired:

For paired-end reads, use comma separated list for read1 /space/ comma separated list for read2, e.g.: --readFilesIn sample1read1.fq,sample2read1.fq,sample3read1.fq sample1read2.fq,sample2read2.fq,sample3read2.fq.

[source: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf]

Why not get 100% clarification from the group that did the work? - if they cannot tell you with 100% clarity what they did, then, perhaps, they should not be doing this type of work.

SAM flags 64 and 128? - sure, if that's what you need, but there are other flags that can indicate read pairing, and other commands, like those to whose web-pages I linked (above).

64  First in pair
128 Second in pair
ADD REPLY
0
Entering edit mode
3.3 years ago
Divon ▴ 230

For the sake of completeness, with my Genozip software you could:

genozip mydata.bam
genocat mydata.bam.genozip --FLAG +0x40 --output mydata-R1.bam

As a side benefit, since Genozip is a highly compressed format, your .bam.genozip files will be be 2x-5x times smaller that .bam.

Documentation: https://genozip.com

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor

ADD COMMENT

Login before adding your answer.

Traffic: 2877 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6