Hello, I used samtools to get the data for chromosome 22. The command is:
samtools view input.bam 'chr22' >raw_chr22.txt
Then I extracted the 3rd, 4th and 10th columns from it. A partial sample of the data, for demo purposes, looks like this:
chr22 14430092 NNNNCNAGCNGAGCGNNTCTGGGNACCTCGAAGGCAGACATG
chr22 14430092 NNNNNNNNNNNTNNNNNNNNANNGNNTNNNNNNNNNNNNNNN
chr22 14430092 CCTCGCGGGACTGGTATGGGGACGGTCATGCAATCTGGACAA
The data has three columns: chromosome, position and sequence. My question is that the rows all have the same position but different sequences. So what is the real sequence at this specific position?
Thanks
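For readers following along, the column extraction described above can be sketched in Python. This is only a sketch: it assumes each line of the `samtools view` output is a tab-separated SAM record with the 11 mandatory fields, so that columns 3, 4 and 10 are the reference name, the 1-based leftmost position and the read sequence. The function name `extract_columns` and the demo record (read name, flag, mapping quality, CIGAR and qualities) are made up for illustration and are not taken from the actual BAM file.

```python
import io


def extract_columns(sam_lines):
    """Yield (chromosome, position, sequence) for each SAM alignment line.

    Assumes tab-separated SAM records: column 3 = RNAME, column 4 = POS
    (1-based), column 10 = SEQ.
    """
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A SAM alignment record has at least 11 mandatory fields.
        if len(fields) < 11:
            continue
        yield fields[2], int(fields[3]), fields[9]


# Hypothetical record for illustration; only chr22 and the position
# come from the post, everything else is invented.
demo = io.StringIO(
    "read1\t0\tchr22\t14430092\t37\t10M\t*\t0\t0\tNNNNCNAGCN\tIIIIIIIIII\n"
)
rows = list(extract_columns(demo))
print(rows)
```

With a real file, `open("raw_chr22.txt")` could be passed in place of the `io.StringIO` demo object.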
The file (test_chr22.txt) is here. Thanks.
Out of curiosity, if you down-vote, could you also let me know why? I'm happy to be enlightened.
Sorry, it was a mistaken click. I meant to hit up-vote, but my brain is not working well today. I will correct it. So can you get the real sequence within the region? Any language is fine.
zhs... my answer already shows how to get the real sequence with samtools faidx.
But that is from the hg18/hg19 reference, and my BAM file is from a patient. Can I get the patient's sequence based on my output file? Maybe it is the wrong question because of my ignorance of genomics?
In that case, ask the patient if they have the reference file that they used to do the alignment.
I still don't understand it. I want to write C# code to extract the sequence by "FLAG" and start position from the output file. Suppose a region is given: can I use the substring method in C#/Java from
Such as
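A minimal sketch of the substring idea, in Python rather than C#/Java. It rests on a strong assumption: the read aligns to the reference without insertions, deletions or clipping, so that base `i` of the sequence sits at position `POS + i`. Real reads often break that assumption, and the CIGAR string (column 6) would then be needed to map read bases to reference positions. The function name `bases_in_region` and its parameters are hypothetical; the only FLAG bit checked is 0x4, which marks an unmapped read.

```python
def bases_in_region(pos, seq, region_start, region_end, flag=0):
    """Return the part of a read that overlaps a region.

    pos          -- 1-based leftmost position of the read (SAM column 4)
    seq          -- read sequence (SAM column 10)
    region_start -- 1-based, inclusive
    region_end   -- 1-based, inclusive
    flag         -- SAM FLAG (column 2); bit 0x4 means the read is unmapped

    ASSUMPTION: ungapped alignment (no indels, no clipping), so read
    offset i corresponds to reference position pos + i.
    """
    if flag & 0x4:
        return ""  # unmapped read: POS is not meaningful
    read_end = pos + len(seq) - 1
    start = max(pos, region_start)
    end = min(read_end, region_end)
    if start > end:
        return ""  # read does not overlap the region
    # Convert 1-based reference coordinates to 0-based string offsets.
    return seq[start - pos:end - pos + 1]


# Third demo read from the question, queried for positions 14430095-14430099.
read_seq = "CCTCGCGGGACTGGTATGGGGACGGTCATGCAATCTGGACAA"
print(bases_in_region(14430092, read_seq, 14430095, 14430099))  # CGCGG
```

Note that this returns what one read reports for the region; different reads covering the same region can disagree, as the three demo rows do.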