Question

BAM to map and read group file

0

Entering edit mode

8.7 years ago

fufuyou ▴ 110

Hi, I want to use bam or sam file get such as following format file. one is map file:

read tag            Chr                start position end position   strand
t0000035        nscaf1690       4798998 4799024 +
t0000035        nscaf1690       4805385 4805411 +
t0000035        nscaf1690       7588502 7588528 +
t0000072        nscaf1690       2923961 2923988 -
t0000093        nscaf1690       784585  784612  +
t0000093        nscaf1690       1539278 1539305 +
t0000093        nscaf1690       2223484 2223511 +
t0000093        nscaf1690       5848415 5848442 +
t0000093        nscaf1690       7501339 7501366 +
t0000093        nscaf1690       2400901 2400928 -

one is readgroup file:

>t0000035 3234
GAATGGATAAGGATTAGCGATGATACA
>t0000072 1909
TTGCAGTATGTAGGAAATCAAAACGTTC
>t0000093 1477
AAGTACATACTTGTTGAACGTCAAAAGA
>t0000146 868
TAAGGATTAGCGATGATACACAACTTTA
>t0000275 515
TATCTCAGGAATGTGGACAAGATAGGAC
>t0000397 380
AACAAACGACTAACAAATATATGAAAAT
>t0000437 353
TAGGAATAAATACATCAGTAAAGACCAA
>t0000479 331
TATCTCAGGAATGTGGACAAGATAGGA
>t0000620 273
GAATGGATAAGGATTAGCGATGATAC
>t0000788 223
TTGCAGTATGTAGGAAATCAAAACGTT
>t0000813 219
TGATAACTGCAACTCTAATTATAGGCAA
>t0000861 210
TCTAATCGGAAATTGAATACGCTTAAAT
>t0001003 188
AACAAACGACTAACAAATATATGAAAATA.

In the readgroup, first is the read tag, then is gourp how many times, finally is read sequence. Could you tell me how to get these files? Thanks, Fuyou

RNA-Seq • 1.9k views

ADD COMMENT • link updated 5.9 years ago by Biostar 20 • written 8.7 years ago by fufuyou ▴ 110

2

Entering edit mode

So, let me see if I understand this correctly: with only a BAM file as input, you want to aggregate the read group identifiers (@RG-ID) from the BAM header according to their presence in the reads of a BAM file. In the "map file," you want to display every region of the genome on which each read group is found. In the "readgroup file", you want to show the total number of times each read group identifier was found in the BAM file, along with the consensus sequence of the reads containing that read group identifier.

Here's what's not clear: some identifiers in your example are spread out across the contig. How then does that affect what should be in the read sequence line of your readgroup file?

ADD REPLY • link 8.7 years ago by Dan D 7.4k

0

Entering edit mode

Hi Dan, Thanks for your reply. I want to get unique read number in my short read data. Each read group include the group name and how many reads are in these group. The map is the information of the group map to reference genome. Thanks, Fuyou

ADD REPLY • link 8.7 years ago by fufuyou ▴ 110

2

Entering edit mode

So many questions.

The first - readgroups dont usually all have the same SEQ dna. Are you sure this is the case for you? Or do you expect multiple rows from the same read group in the readgroup file?

Second - what is the read tag? The name of the read (QNAME?)

Third - where is your read strand information coming from?

ADD REPLY • link 8.7 years ago by John 13k

1

Entering edit mode

Hi John, Thank you very much. Readgroup is unique read in my short read data. I want to find all unique read in my short read data. The number is how many short read in this unique read. Read tag is the name of readgroup. Read strand information is from the read mapping information. Fuyou

ADD REPLY • link 8.7 years ago by fufuyou ▴ 110

1

Entering edit mode

Ahhh, so you have already made sure that all reads with the DNA "TCTAATCGGAAATTGAATACGCTTAAAT" are always tagged with RGID "t0000861" for example? Is that correct? If so I can tell you how to make the second table very easily. However, if you have a table will a billion reads then my method will take up too much RAM :(

The first table I can also make for you, but im not totally sure how you want to calculate end position and strand. Can you program in Python? My tool will do all of this for you, but since "end position" and "strand" can be encoded in a BAM file in different ways, you would have to specify exactly what stat you would want as a module for this program.

Also, i'm currently finishing off version 0.2 of this program, so this would be a nice and easy test for me :)

ADD REPLY • link 8.7 years ago by John 13k

0

Entering edit mode

Hi John, Thanks, I can do The first table use BEDTOOLS. But I need first step firstly to do second table. I need get unique reads from my all reads. The read tag is different with my short read sequence tag. There I have a perl read can get the sequence and read numbers. I can not get a name about sequence.

ADD REPLY • link 8.7 years ago by fufuyou ▴ 110