Hi,
I have done some alignments using bowtie2
and I am having trouble with the processing of the .sam
file.
As explained in the bowtie2 Manual "When Bowtie 2 prints a SAM alignment for a pair, it prints two records (i.e. two lines of output), one for each mate. The first record describes the alignment for mate 1 and the second record describes the alignment for mate 2". I therefore need to "maintain pairs" to continue with the analysis, but the problem is that not all pairs are valid. For me, a pair is valid when the two records are 76M
and have the flags XM:i:0
and XN:i:0
. I have done the processing with Excel but the problem is that by doing so, I cannot use samtools sort to convert the .sam
file to .bam
file and continue with the analysis.
I explain it better with an example:
OUTPUT 1
NB501214:74:H2MLWAFXY:1:11101:9943:2048 97 CHRVI 7025 39 76M CHRXIV 9092 0 GTTTAATTCTGAGAAATCATGGAGCCCTGTTGGCCTGGAAGATGCTAAACTTCCCAAGGAAGCTTACCGATTTAAG AAAAAEEEEEEEEEEEEEEEEEEEEAAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEAEEEEEE AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UP
NB501214:74:H2MLWAFXY:1:11101:9943:2048 145 CHRXIV 9092 42 76M CHRVI 7025 0 ATGTGCCTTCTATTACGCACCTTTTATCTCGGGTGGATTCTTTTCATGTTGGTACAAGGTTTCCAAAACATGAGGA EEEEEE/EEEEEAEEAEAEEEEEEEE/EE/EAEAEEEEAEEEEEEEEEEEAEEEAEEEEEEEEEE/EEEEEAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UP
Valid pair because both records are 76M, XM:i:0 and XN:i:0.
OUTPUT 2
NB501214:74:H2MLWAFXY:1:11101:11467:2768 65 CHRXII 1076743 39 76M CHRII 1044 0 ATTCCATCCAGTCACCGACATTAACAAAGAGTCGTATAAGCGGAAAGGGAGTCAAATGGTTTTGCTAGAGAGAAAG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEE AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UP
NB501214:74:H2MLWAFXY:1:11101:11467:2768 129 CHRII 1044 42 75M CHRXII 1076743 0 CTACAGGGTCCCCATGATATGGCTCGATGTCTTCCAAGTATTCTTTGTATTCCTCATCATTTCGCAGCATTCTCT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAEEEEEEEE/EEEEEEAAEEAEEEE/EEEEAEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:75 YT:Z:UP
Not valid because one record is 76M and the other one is 75M.
Is there any way to do this processing without Excel (i.e. command line)? I know how to extract lines that have the flags that I want, but I do not know how to maintain pairs and do the processing taking into account that when one of the records of a pair is not valid, the whole pair has to be deleted.
Thank you very much.
hum.. what is "the analysis" ? why to you need to keep all the "valid" pairs ?
I need to get a coverage file but I am interested in discordant reads (only those that have no mismatch), so I need to filter out everything that doesn't fit with it.
That 75M read is likely a result of trimming. Maybe you should not trim if it's causing you to throw things out that you would otherwise keep.