Entering edit mode
8.7 years ago
fufuyou
▴
110
Hi, I want to use bam or sam file get such as following format file. one is map file:
read tag Chr start position end position strand
t0000035 nscaf1690 4798998 4799024 +
t0000035 nscaf1690 4805385 4805411 +
t0000035 nscaf1690 7588502 7588528 +
t0000072 nscaf1690 2923961 2923988 -
t0000093 nscaf1690 784585 784612 +
t0000093 nscaf1690 1539278 1539305 +
t0000093 nscaf1690 2223484 2223511 +
t0000093 nscaf1690 5848415 5848442 +
t0000093 nscaf1690 7501339 7501366 +
t0000093 nscaf1690 2400901 2400928 -
one is readgroup file:
>t0000035 3234
GAATGGATAAGGATTAGCGATGATACA
>t0000072 1909
TTGCAGTATGTAGGAAATCAAAACGTTC
>t0000093 1477
AAGTACATACTTGTTGAACGTCAAAAGA
>t0000146 868
TAAGGATTAGCGATGATACACAACTTTA
>t0000275 515
TATCTCAGGAATGTGGACAAGATAGGAC
>t0000397 380
AACAAACGACTAACAAATATATGAAAAT
>t0000437 353
TAGGAATAAATACATCAGTAAAGACCAA
>t0000479 331
TATCTCAGGAATGTGGACAAGATAGGA
>t0000620 273
GAATGGATAAGGATTAGCGATGATAC
>t0000788 223
TTGCAGTATGTAGGAAATCAAAACGTT
>t0000813 219
TGATAACTGCAACTCTAATTATAGGCAA
>t0000861 210
TCTAATCGGAAATTGAATACGCTTAAAT
>t0001003 188
AACAAACGACTAACAAATATATGAAAATA.
In the readgroup, first is the read tag, then is gourp how many times, finally is read sequence. Could you tell me how to get these files? Thanks, Fuyou
So, let me see if I understand this correctly: with only a BAM file as input, you want to aggregate the read group identifiers (@RG-ID) from the BAM header according to their presence in the reads of a BAM file. In the "map file," you want to display every region of the genome on which each read group is found. In the "readgroup file", you want to show the total number of times each read group identifier was found in the BAM file, along with the consensus sequence of the reads containing that read group identifier.
Here's what's not clear: some identifiers in your example are spread out across the contig. How then does that affect what should be in the read sequence line of your readgroup file?
Hi Dan, Thanks for your reply. I want to get unique read number in my short read data. Each read group include the group name and how many reads are in these group. The map is the information of the group map to reference genome. Thanks, Fuyou
So many questions.
The first - readgroups dont usually all have the same SEQ dna. Are you sure this is the case for you? Or do you expect multiple rows from the same read group in the readgroup file?
Second - what is the read tag? The name of the read (QNAME?)
Third - where is your read strand information coming from?
Hi John, Thank you very much. Readgroup is unique read in my short read data. I want to find all unique read in my short read data. The number is how many short read in this unique read. Read tag is the name of readgroup. Read strand information is from the read mapping information. Fuyou
Ahhh, so you have already made sure that all reads with the DNA "TCTAATCGGAAATTGAATACGCTTAAAT" are always tagged with RGID "t0000861" for example? Is that correct? If so I can tell you how to make the second table very easily. However, if you have a table will a billion reads then my method will take up too much RAM :(
The first table I can also make for you, but im not totally sure how you want to calculate end position and strand. Can you program in Python? My tool will do all of this for you, but since "end position" and "strand" can be encoded in a BAM file in different ways, you would have to specify exactly what stat you would want as a module for this program.
Also, i'm currently finishing off version 0.2 of this program, so this would be a nice and easy test for me :)
Hi John, Thanks, I can do The first table use BEDTOOLS. But I need first step firstly to do second table. I need get unique reads from my all reads. The read tag is different with my short read sequence tag. There I have a perl read can get the sequence and read numbers. I can not get a name about sequence.