Question

TSV count file containing the corrected number of reads for each assumed fragment length and autosome


4.3 years ago

Lucas ▴ 20

Hi

I've been searching for tools for trisomy detection and found this GitHub repo, but the script uses TSV files for training and analysis:

TSV file with the corrected number or reads for each assumed fragment length (50-220, organised in columns) and autosomes (chr1..chr22, organised in rows).

But the author didn't provide any tools to produce such files.

How can I generate such files from FASTQ/BAM files?

The files look like this:

           50  51  52  53  54  55  56  57  58  59  ...  219
   chr1    12  11  30  23  19  17  38  45  40  61  ...
   chr2    12  16  18  12  23  37  44  38  59  73  ...
    .
    .
   chr22    4   5   2   8   5   2   4   6  10  10  ...

Thanks.

trisomy down • 1.5k views



The README indicates that such a .tsv file should be in the examples directory (e.g. example/test/t21.tsv).

ADD REPLY • link 4.3 years ago by GenoMax 147k


Thanks. Yes, I know. I'm asking about creating such files from raw NGS FASTQ/BAM/SAM files.

ADD REPLY • link 4.3 years ago by Lucas ▴ 20


Ah I see. Then you may want to post an example snippet in your original question so people know what format you need.

ADD REPLY • link 4.3 years ago by GenoMax 147k


I added an example.

ADD REPLY • link 4.3 years ago by Lucas ▴ 20


It looks like this is the TLEN value (the signed observed template length; SAM spec, page 8) from your aligned BAM files, with each read pair counted in the matching length bin for its chromosome.

ADD REPLY • link 4.3 years ago by GenoMax 147k
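
As a rough illustration of the TLEN binning described above, here is a minimal sketch in Python that works on SAM text lines (for example, the output of `samtools view`). The function name `count_fragment_lengths`, the 50..219 range, and the flag filters are assumptions made for illustration, not something the repository documents. It produces raw counts only and does not apply whatever correction the tool expects.

```python
from collections import defaultdict

MIN_LEN, MAX_LEN = 50, 219          # assumed bin range, taken from the example header
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}

def count_fragment_lengths(sam_lines):
    """Count read pairs per autosome and per template length (TLEN)."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue                      # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, chrom, tlen = int(fields[1]), fields[2], int(fields[8])
        # Assumed filters: properly paired (0x2), not secondary (0x100),
        # not supplementary (0x800)
        if not flag & 0x2 or flag & 0x900:
            continue
        # Only the leftmost mate has TLEN > 0, so each pair is counted once
        if chrom in AUTOSOMES and MIN_LEN <= tlen <= MAX_LEN:
            counts[chrom][tlen] += 1
    return counts

# Hypothetical records, shortened to the 11 mandatory SAM columns
demo = [
    "r1\t99\tchr1\t100\t60\t50M\t=\t217\t167\t*\t*",
    "r1\t147\tchr1\t217\t60\t50M\t=\t100\t-167\t*\t*",
    "r2\t99\tchr21\t500\t60\t50M\t=\t620\t170\t*\t*",
]
counts = count_fragment_lengths(demo)
print(counts["chr1"][167], counts["chr21"][170])
```

Counting only records with a positive TLEN avoids counting both mates of a pair; a mapping-quality cutoff could be added in the same place if needed.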


Would you please check the answer I added and tell me your opinion? Thanks.

ADD REPLY • link 4.3 years ago by Lucas ▴ 20

score 0 · Answer 1 · 2020-08-13

Is the file produced using an algorithm like this?

sam = '''
chr1_22009_22554_0:0:0_0:0:0_94d2    99    chr1    22009    60    100M    =    22455    546    CCTCTCAAAATCTGGGGATTGGAGGCCTAGTAGTAATGGCCTCATTTTGAAGGAGTTGGGAGAAGGAGTGGCCAGCAACCTGGAAGTGATGTTCTCTGAG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_22009_22554_0:0:0_0:0:0_94d2    147    chr1    22455    60    100M    =    22009    -546    TTCTGAACGCCGTTCTTATTGCTAACGAAACCCTTGATTCTAGATTGAAAGACAACAAACCGGGTCTCCTTCTCAAGATGGACATTGAGAAAGCTTTTAA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
'''

# Fragment lengths used as columns (50..219, as in the example header)
lengths = range(50, 220)
main_dic = {}
for line in sam.split('\n'):
    splitted = line.split()
    if splitted == [] or line.startswith('@'):
        continue  # skip blank lines and SAM header lines
    chr_name = splitted[2]
    # Use TLEN (column 9, signed template length) as the fragment length;
    # len(splitted[9]) would only give the read length (100 here)
    length = int(splitted[8])
    # Each chromosome needs its own counter dict; sharing one dict would
    # give every chromosome identical counts
    values_dic = main_dic.setdefault(chr_name, {n: 0 for n in lengths})
    # The leftmost mate has TLEN > 0, so each pair is counted once;
    # lengths outside 50..219 (such as 546 in the sample pair) are skipped
    if length in values_dic:
        values_dic[length] += 1

# Raw counts only; any correction the tool expects is not applied here
print(main_dic)
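
Once per-chromosome counts exist, laying them out like the example table in the question is fairly mechanical. The sketch below is one possible way to do it with Python's `csv` module. The helper name `write_counts_tsv` and the tiny `demo_counts` dict are made up for illustration, and the empty top-left header cell is a guess based on the example table rather than a documented requirement.

```python
import csv
import io

def write_counts_tsv(counts, handle, min_len=50, max_len=219):
    """Write {chrom: {length: count}} as rows chr1..chr22, columns min_len..max_len."""
    lengths = range(min_len, max_len + 1)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([""] + list(lengths))          # header row of fragment lengths
    for i in range(1, 23):
        chrom = f"chr{i}"
        row = counts.get(chrom, {})
        # Missing chromosomes or lengths are written as 0
        writer.writerow([chrom] + [row.get(n, 0) for n in lengths])

# Hypothetical counts, only to show the layout
demo_counts = {"chr1": {50: 12, 51: 11}, "chr22": {50: 4}}
buffer = io.StringIO()
write_counts_tsv(demo_counts, buffer)
tsv_lines = buffer.getvalue().splitlines()
print(tsv_lines[1].split("\t")[:3])   # ['chr1', '12', '11']
```

Passing an open file (for example `open("sample.tsv", "w", newline="")`) instead of the `StringIO` buffer would write the table to disk.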