Question

PennCNV: how to remove multiple CNV calls from a .cnv file at once

2.9 years ago

Viktor Messi • 0

I am doing CNV analysis on 250 samples with PennCNV.

What I want: a .cnv file containing only the chromosome regions that pass my SNP density threshold, i.e. remove the regions that did not pass the threshold from the .cnv file.

My scenario: I merged any CNV calls that overlap one or more other CNV calls by at least 20% of their lengths, using this command:

./clean_cnv.pl combineseg --fraction 0.2 --signalfile ./lib/cohort.pfb cohort_filtered.annotated.cnv > cohort_gapmerged.cnv

So the file is currently in .cnv format. I want to check the SNP density by dividing numsnp by length. I paste the values into Excel and filter according to my threshold. The problem is that once I know which chromosome regions didn't pass and have the list, I have to delete them one by one from the .cnv file.

I would really appreciate it if someone could tell me how to delete these regions using other tools while keeping the original .cnv format, or suggest a better method.

PennCNV • 2.8k views

ADD COMMENT • link 2.8 years ago by Viktor Messi • 0

Hi. To get better help, please paste the output of one of the following commands if possible:

bash(linux): "head -n 20 ./yourfile.cnv"

powershell(win) : "type .\yourfile.cnv | select -First 20"

ADD REPLY • link 2.9 years ago by mohammadhassanj ▴ 260

chr16:77201441-77224685       numsnp=10     length=23,245      state5,cn=3 sample.split220 startsnp=kgp5597583 endsnp=kgp11117565 conf=24.089
chr4:69401231-69493479        numsnp=145    length=92,249      state5,cn=3 sample.split171 startsnp=4:69401231_CNV_UGT2B17 endsnp=rs17663945 conf=118.433
chr17:41267632-41276134       numsnp=85     length=8,503       state5,cn=3 sample.split207 startsnp=rs8176100 endsnp=rs886040902 conf=29.833
chr19:1220367-1234676         numsnp=84     length=14,310      state5,cn=3 sample.split90 startsnp=rs567202367 endsnp=19:1234676-GA conf=88.991
chr19:41385767-41385775       numsnp=6      length=9           state1,cn=0 sample.split90 startsnp=19:41385767_CNV_CYP2A7 endsnp=19:41385775_CNV_CYP2A7_Ilmndup2 conf=21.121
chr17:44171207-44237068       numsnp=14     length=65,862      state5,cn=3 sample.split166 startsnp=JHU_17.44171206 endsnp=rs2532292 conf=21.762
chr1:1266841-1269724          numsnp=27     length=2,884       state5,cn=3 sample.split60 startsnp=rs577691125.1 endsnp=rs536836467.2 conf=20.617
chr1:2381164-2389702          numsnp=14     length=8,539       state5,cn=3 sample.split60 startsnp=1:2381164 endsnp=1:2389702 conf=20.634
chr1:228497228-228612939      numsnp=71     length=115,712     state5,cn=3 sample.split60 startsnp=rs761649292 endsnp=rs531385963.3 conf=23.547
chr13:32914976-32919384       numsnp=92     length=4,409       state5,cn=3 sample.split78 startsnp=rs886040659 endsnp=rs191253965 conf=55.675
chr22:24301643-24301695       numsnp=16     length=53          state1,cn=0 sample.split209 startsnp=22:24301643_CNV_GSTT2B_Ilmndup1 endsnp=22:24301695_CNV_GSTT2B_Ilmndup1 conf=66.493
chr22:24302601-24302603       numsnp=7      length=3           state1,cn=0 sample.split209 startsnp=22:24302601_CNV_GSTT2B endsnp=22:24302603_CNV_GSTT2B_Ilmndup3 conf=25.047
chr22:42522134-42522313       numsnp=12     length=180         state1,cn=0 sample.split209 startsnp=22:42522134_CNV_CYP2D6 endsnp=22:42522313_CNV_CYP2D6_Ilmndup2 conf=43.298
chr13:32900191-32900291       numsnp=31     length=101         state5,cn=3 sample.split147 startsnp=rs81002842 endsnp=rs276174848 conf=25.818
chr13:32903511-32906684       numsnp=101    length=3,174       state5,cn=3 sample.split147 startsnp=rs61948377 endsnp=rs886040343 conf=49.625
chr13:32913339-32913896       numsnp=122    length=558         state5,cn=3 sample.split147 startsnp=rs398122786 endsnp=rs80358763 conf=65.213
chr13:32914891-32921037       numsnp=153    length=6,147       state5,cn=3 sample.split147 startsnp=rs80359581 endsnp=rs876661201 conf=143.004
chr22:19137233-19152094       numsnp=13     length=14,862      state2,cn=1 sample.split26 startsnp=rs540621015 endsnp=kgp5062455 conf=20.778
chr22:24301643-24302227       numsnp=36     length=585         state5,cn=3 sample.split26 startsnp=22:24301643_CNV_GSTT2B_Ilmndup1 endsnp=22:24302227_CNV_GSTT2B_Ilmndup1 conf=35.287
chr22:24374342-24386612       numsnp=126    length=12,271      state1,cn=0 sample.split26 startsnp=22:24374342_CNV_GSTT1 endsnp=22:24386612_CNV_GSTT1 conf=317.105

ADD REPLY • link updated 2.9 years ago by Pierre Lindenbaum 166k • written 2.9 years ago by Viktor Messi • 0

If I understand correctly, you want to delete rows of this file based on a list of "chr:start-end" keys like the ones in the first column of this file?

ADD REPLY • link 2.9 years ago by mohammadhassanj ▴ 260

Yes. Thank you for replying by the way.

ADD REPLY • link 2.9 years ago by Viktor Messi • 0

This is exactly what you want: https://stackoverflow.com/questions/35728766/awk-to-filter-file-by-specific-field-in-another

ADD REPLY • link 2.9 years ago by mohammadhassanj ▴ 260

I tried the code suggested by karakfa on that page:

awk 'NR==FNR{a[$1];next} FNR==1 || ($7 in a)' file1 file2

I set file1 = the file containing the desired list and file2 = the file containing all the chromosome regions.

Do I need to change anything else? It doesn't seem to work.

ADD REPLY • link 2.9 years ago by Viktor Messi • 0
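
A likely reason the attempt above prints almost nothing: it tests `$7`, which in the .cnv layout shown earlier is the `endsnp=...` field, so it never equals a region key, and `FNR==1 ||` simply prints the first line of the second file (useful only when that file has a header row). A minimal sketch that matches on the first field instead, assuming whitespace-separated files; the file names `keep.txt`, `calls.cnv`, `kept.cnv` and `removed.cnv` are placeholders, and the carriage-return stripping is only a guess that the list may have been saved from Excel with Windows line endings:

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

# Hypothetical list of regions to keep (only the first column is used).
cat > keep.txt <<'EOF'
chr4:69401231-69493479
chr13:32914976-32919384
EOF

# Three calls copied from the sample posted above.
cat > calls.cnv <<'EOF'
chr16:77201441-77224685       numsnp=10     length=23,245      state5,cn=3 sample.split220 startsnp=kgp5597583 endsnp=kgp11117565 conf=24.089
chr4:69401231-69493479        numsnp=145    length=92,249      state5,cn=3 sample.split171 startsnp=4:69401231_CNV_UGT2B17 endsnp=rs17663945 conf=118.433
chr13:32914976-32919384       numsnp=92     length=4,409       state5,cn=3 sample.split78 startsnp=rs886040659 endsnp=rs191253965 conf=55.675
EOF

# First file (NR==FNR): remember each region key.
# Second file: print a line only if its first field is a remembered key.
# sub(/\r$/, "") strips a trailing carriage return, if any, before matching.
awk '{ sub(/\r$/, "") } NR==FNR { keep[$1]; next } ($1 in keep)' keep.txt calls.cnv > kept.cnv

# Inverse: treat the list as regions to delete and print everything else.
awk '{ sub(/\r$/, "") } NR==FNR { drop[$1]; next } !($1 in drop)' keep.txt calls.cnv > removed.cnv
```

Because matching lines are printed unchanged, the output stays in the original .cnv layout.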

score 1 · Answer 1 · 2023-01-17

If file1 looks exactly like this (tab-delimited, with a row number in the second column):

chr13:32914976-32919384 1
chr13:32900191-32900291 2
chr22:19137233-19152094 3
chr4:69401231-69493479  4
chr17:44171207-44237068 5

and file2 looks exactly like this (tab-delimited):

chr16:77201441-77224685 numsnp=10   length=23,245   state5,cn=3 sample.split220 startsnp=kgp5597583 endsnp=kgp11117565 conf=24.089
chr4:69401231-69493479  numsnp=145  length=92,249   state5,cn=3 sample.split171 startsnp=4:69401231_CNV_UGT2B17 endsnp=rs17663945 conf=118.433
chr17:41267632-41276134 numsnp=85   length=8,503    state5,cn=3 sample.split207 startsnp=rs8176100 endsnp=rs886040902 conf=29.833
chr19:1220367-1234676   numsnp=84   length=14,310   state5,cn=3 sample.split90 startsnp=rs567202367 endsnp=19:1234676-GA conf=88.991
chr19:41385767-41385775 numsnp=6    length=9    state1,cn=0 sample.split90 startsnp=19:41385767_CNV_CYP2A7 endsnp=19:41385775_CNV_CYP2A7_Ilmndup2 conf=21.121
chr17:44171207-44237068 numsnp=14   length=65,862   state5,cn=3 sample.split166 startsnp=JHU_17.44171206 endsnp=rs2532292 conf=21.762
chr1:1266841-1269724    numsnp=27   length=2,884    state5,cn=3 sample.split60 startsnp=rs577691125.1 endsnp=rs536836467.2 conf=20.617
chr1:2381164-2389702    numsnp=14   length=8,539    state5,cn=3 sample.split60 startsnp=1:2381164 endsnp=1:2389702 conf=20.634
chr1:228497228-228612939    numsnp=71   length=115,712  state5,cn=3 sample.split60 startsnp=rs761649292 endsnp=rs531385963.3 conf=23.547
chr13:32914976-32919384 numsnp=92   length=4,409    state5,cn=3 sample.split78 startsnp=rs886040659 endsnp=rs191253965 conf=55.675
chr22:24301643-24301695 numsnp=16   length=53   state1,cn=0 sample.split209 startsnp=22:24301643_CNV_GSTT2B_Ilmndup1 endsnp=22:24301695_CNV_GSTT2B_Ilmndup1 conf=66.493
chr22:24302601-24302603 numsnp=7    length=3    state1,cn=0 sample.split209 startsnp=22:24302601_CNV_GSTT2B endsnp=22:24302603_CNV_GSTT2B_Ilmndup3 conf=25.047
chr22:42522134-42522313 numsnp=12   length=180  state1,cn=0 sample.split209 startsnp=22:42522134_CNV_CYP2D6 endsnp=22:42522313_CNV_CYP2D6_Ilmndup2 conf=43.298
chr13:32900191-32900291 numsnp=31   length=101  state5,cn=3 sample.split147 startsnp=rs81002842 endsnp=rs276174848 conf=25.818
chr13:32903511-32906684 numsnp=101  length=3,174    state5,cn=3 sample.split147 startsnp=rs61948377 endsnp=rs886040343 conf=49.625
chr13:32913339-32913896 numsnp=122  length=558  state5,cn=3 sample.split147 startsnp=rs398122786 endsnp=rs80358763 conf=65.213
chr13:32914891-32921037 numsnp=153  length=6,147    state5,cn=3 sample.split147 startsnp=rs80359581 endsnp=rs876661201 conf=143.004
chr22:19137233-19152094 numsnp=13   length=14,862   state2,cn=1 sample.split26 startsnp=rs540621015 endsnp=kgp5062455 conf=20.778
chr22:24301643-24302227 numsnp=36   length=585  state5,cn=3 sample.split26 startsnp=22:24301643_CNV_GSTT2B_Ilmndup1 endsnp=22:24302227_CNV_GSTT2B_Ilmndup1 conf=35.287
chr22:24374342-24386612 numsnp=126  length=12,271   state1,cn=0 sample.split26 startsnp=22:24374342_CNV_GSTT1 endsnp=22:24386612_CNV_GSTT1 conf=317.105

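This command probably works (it compares everything up to the first tab on each line, so both files must really be tab-delimited):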

awk '{str=substr($0,1,index($0,"\t"))} FNR==NR{a[str];next} (str in a)' file1 file2

result:

chr4:69401231-69493479  numsnp=145      length=92,249   state5,cn=3 sample.split171 startsnp=4:69401231_CNV_UGT2B17 endsnp=rs17663945 conf=118.433
chr17:44171207-44237068 numsnp=14       length=65,862   state5,cn=3 sample.split166 startsnp=JHU_17.44171206 endsnp=rs2532292 conf=21.762
chr13:32914976-32919384 numsnp=92       length=4,409    state5,cn=3 sample.split78 startsnp=rs886040659 endsnp=rs191253965 conf=55.675
chr13:32900191-32900291 numsnp=31       length=101      state5,cn=3 sample.split147 startsnp=rs81002842 endsnp=rs276174848 conf=25.818
chr22:19137233-19152094 numsnp=13       length=14,862   state2,cn=1 sample.split26 startsnp=rs540621015 endsnp=kgp5062455 conf=20.778
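
The density check itself (numsnp divided by length) could also be done on the command line, which would avoid the Excel round trip and the separate list file. A minimal sketch, assuming the whitespace-separated layout from the sample posted in the thread, where field 2 is `numsnp=...` and field 3 is `length=...` with thousands separators. The threshold value and the output name `cohort_density_pass.cnv` are placeholders, not recommendations:

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

# Three calls copied from the sample posted in the thread.
cat > cohort_gapmerged.cnv <<'EOF'
chr16:77201441-77224685       numsnp=10     length=23,245      state5,cn=3 sample.split220 startsnp=kgp5597583 endsnp=kgp11117565 conf=24.089
chr4:69401231-69493479        numsnp=145    length=92,249      state5,cn=3 sample.split171 startsnp=4:69401231_CNV_UGT2B17 endsnp=rs17663945 conf=118.433
chr17:41267632-41276134       numsnp=85     length=8,503       state5,cn=3 sample.split207 startsnp=rs8176100 endsnp=rs886040902 conf=29.833
EOF

# Hypothetical threshold: minimum SNPs per base pair.
min_density=0.001

awk -v min="$min_density" '
{
    # Work on copies so the printed line keeps its original formatting.
    numsnp = $2; sub(/^numsnp=/, "", numsnp)
    len    = $3; sub(/^length=/, "", len); gsub(/,/, "", len)

    # Force numeric values before comparing.
    numsnp += 0; len += 0

    # Keep the call only if its SNP density reaches the threshold.
    if (len > 0 && numsnp / len >= min) print
}' cohort_gapmerged.cnv > cohort_density_pass.cnv
```

With these three calls, the chr16 call (10 SNPs over 23,245 bp) falls below 0.001 and is dropped, while the other two are written out unchanged.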