TSV count file containing the corrected number of reads for each assumed fragment length and autosome
4.3 years ago
Lucas ▴ 20

Hi

I've been searching for tools for trisomy detection and found this GitHub repo, but its script uses TSV files for training and analysis:

TSV file with the corrected number of reads for each assumed fragment length (50-220, organised in columns) and autosome (chr1..chr22, organised in rows).

But the author didn't provide any tools to produce such files.

How can I generate such files from FASTQ/BAM files?

The files look like this:

         50 51  52  53  54  55  56  57  58  59 ... 219
   chr1  12 11  30  23  19  17  38  45  40  61 ...
   chr2  12 16  18  12  23  37  44  38  59  73 ...
    .
    .
  chr22  4   5  2   8   5   2   4   6   10  10 ...

Thanks.

trisomy down • 1.5k views

This README page indicates that the .tsv file should be in the examples directory (e.g. example/test/t21.tsv).


Thanks. Yes, I know. I'm asking about creating such files from raw NGS FASTQ/BAM/SAM files.


Ah I see. Then you may want to post an example snippet in your original question so people know what format you need.


I added an example.


It looks like each cell is a count of reads from your aligned BAM file whose TLEN value falls into that length bin, tallied separately for each chromosome (SAM spec, page 8).
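As a rough sketch of that binning (not the repo author's method, which is not documented here), the snippet below tallies fragments per autosome from tab-separated SAM text. The function name `count_fragment_lengths` is hypothetical, the 50-219 range is taken from the example table in the question, and whatever correction the README means by "corrected" is not reproduced.

```python
MIN_LEN, MAX_LEN = 50, 219  # assumed range, copied from the example table
AUTOSOMES = [f"chr{i}" for i in range(1, 23)]


def count_fragment_lengths(sam_lines):
    """Tally fragments per autosome and fragment length from SAM text lines."""
    # One independent, zero-filled length histogram per autosome.
    counts = {
        chrom: dict.fromkeys(range(MIN_LEN, MAX_LEN + 1), 0)
        for chrom in AUTOSOMES
    }
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        chrom = fields[2]      # RNAME, column 3
        tlen = int(fields[8])  # TLEN, column 9 (signed)
        # Only the leftmost mate has a positive TLEN, so requiring a positive
        # value in range counts each properly aligned pair once.
        if chrom in counts and MIN_LEN <= tlen <= MAX_LEN:
            counts[chrom][tlen] += 1
    return counts


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.6",
        "\t".join(["r1", "99", "chr1", "1000", "60", "50M", "=", "1100", "150", "*", "*"]),
        "\t".join(["r1", "147", "chr1", "1100", "60", "50M", "=", "1000", "-150", "*", "*"]),
    ]
    print(count_fragment_lengths(example)["chr1"][150])
```

The input lines could come from `samtools view` run on the aligned BAM and piped into the script, in which case `sys.stdin` can be passed directly as `sam_lines`.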


Would you please check the answer I added and tell me your opinion? Thanks.

4.3 years ago
Lucas ▴ 20

Is the file produced using an algorithm like this?

sam = '''
chr1_22009_22554_0:0:0_0:0:0_94d2    99    chr1    22009    60    100M    =    22455    546    CCTCTCAAAATCTGGGGATTGGAGGCCTAGTAGTAATGGCCTCATTTTGAAGGAGTTGGGAGAAGGAGTGGCCAGCAACCTGGAAGTGATGTTCTCTGAG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
chr1_22009_22554_0:0:0_0:0:0_94d2    147    chr1    22455    60    100M    =    22009    -546    TTCTGAACGCCGTTCTTATTGCTAACGAAACCCTTGATTCTAGATTGAAAGACAACAAACCGGGTCTCCTTCTCAAGATGGACATTGAGAAAGCTTTTAA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:delPN01_deb_read1    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100
'''

lines = sam.split('\n')
main_dic = {}
# Zero-filled template with one entry per fragment length, 50 through 219.
values_dic = {length: 0 for length in range(50, 220)}
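for line in lines:
    fields = line.split()
    if not fields:
        continue  # skip the empty lines around the sample records
    chr_name = fields[2]
    # Column 9 (TLEN) is the signed fragment length; len(fields[9]) would
    # only give the read length (100 for every read here).
    tlen = int(fields[8])
    # Give each chromosome its own copy of the zeroed template, so counts
    # are not shared between chromosomes.
    chrom_counts = main_dic.setdefault(chr_name, dict(values_dic))
    # Only the leftmost mate has a positive TLEN, so each pair is counted
    # once; lengths outside 50-219 (such as 546 above) are skipped.
    if tlen in chrom_counts:
        chrom_counts[tlen] += 1

print(main_dic)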
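
To go from a nested dictionary like the one printed above to the table layout shown in the question, something along these lines might work. It is only a sketch: `format_count_table` is a hypothetical helper, and the leading empty header cell above the chromosome column is an assumption, since the example does not show whether the real files have one.

```python
def format_count_table(counts, min_len=50, max_len=219):
    """Render {chromosome: {length: count}} as TSV text, one row per autosome."""
    lengths = range(min_len, max_len + 1)
    chroms = [f"chr{i}" for i in range(1, 23)]
    # Header row: an empty first cell (assumed), then one column per length.
    rows = ["\t".join([""] + [str(n) for n in lengths])]
    for chrom in chroms:
        per_length = counts.get(chrom, {})
        # Chromosomes or lengths missing from the input are written as 0.
        cells = [str(per_length.get(n, 0)) for n in lengths]
        rows.append("\t".join([chrom] + cells))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    table = format_count_table({"chr1": {50: 12, 51: 11, 52: 30}})
    # Show the first few columns of the header and of the chr1 row.
    for row in table.splitlines()[:2]:
        print(row.split("\t")[:4])
```

The returned string can be written to a `.tsv` file with an ordinary `open(path, "w")` call.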