Question

Read pairs in discoSnp - how to analyse them together?

0

Entering edit mode

10.2 years ago

hugo.v ▴ 10

The command to enter multiple files into discoSnp just requires a set of file names... but each file is then treated as an separate individual/organism. I have sequence output files from paired reads on Illumina... is there a way to tell discoSnp to keep the files with forward and reverse read sequences together for each individual? I can of course first concatenate the files, then enter them into the command, but I wonder if there would be a way to "concatenate" the files while entering them into the command line.

Thanks for suggestions.

SNP discoSnp • 2.4k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by hugo.v ▴ 10

Ram · Answer 1 · 2014-09-08

0

Entering edit mode

10.2 years ago

pierre.peterlongo ▴ 900

Hello, thanks for your question Hugo.

Indeed there currently no way to use a "virtual concatenation" of read files. This is a planned improvement

Note that, as the length of the discoSnp output does not span insert size, there is no specific need to use the pair information.

Thus, for now there are two more or less good solutions:

Don't concatenate files but use all of them: discoSnp -r "setA_1.fq setA_2.fq setB_1.fq setB_2.fq setC_1.fq setC2_.fq" ... Then you should parse the output to sum the coverages couples of paired read sets. This as the advantage not to need to concatenate read files, but this requires to analyse the output. Another drawback (or not) is that the k-mer solidity filtration is done independently on each set of each pair instead of on each couple (you thus should adapt the -c option).
concatenate the couples. (I agree, it's a dirty solution).

Best,
Pierre

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by pierre.peterlongo ▴ 900

1

Entering edit mode

Hello Pierre:

I tried the following file conversion:

cat coherent.fa | paste -d@ - - - - | sed 's$|C[0-9]\+_$\t$g' | sed 's$|Q[0-9]\+_$\t$g' | sed 's$|rank$\t|rank$g' > test.txt

paste -d@ - - - - will combine 4 lines of text into one, separated by @. I will use the @ as a placeholder to reconstruct the original coherent.fa format.

The sed commands will search for the count data and the quality data and only keep the numbers, separated by a tab. The resulting file can be opened in a spreadsheet programme (as long as the number of rows does not exceeed the programme's capabilities), with some manipulation I can add the values from two, three, four columns as required by the number of files belonging to one individual and add columns with required |C#_ and |Q#_ values (create a set of new columns, get the sum of values from individual columns). To be able to delete the original columns of C and Q values, I copy the file, paste it in a new sheet keeping only the numerical values (thus not the formulas that refer to other cells). Then the original columns can be deleted without disruptinig the new C and Q values.

After that I can delete the tabs to get one uninterrupted line of data and use the "@" to separate it back into 4 lines with "upper-path" and "lower-path" and the corresponding allele sequences. (sed 's/\t//g' test.csv | sed 's/@/\n/g' > test2.csv) Then do the genotyping.

I am not sure whether taking a weighted average of the Q values would be the best thing to do (INT((C1Q1)+(C2Q2)++...) / total count) but since the genotyping script does only use the count values, it does not really matter for my purpose.

This went quite well, except for one entry where the contig length was 8800 on one side and about 7500 on the other... there the sequence had been wrapped over two lines and the genotyping script did not accept that. Luckily it was only one occurrence... so I could correct that in a text editor.

With Perl or Python script it would probably be easier to sum the counts and calculate qualities without going to a spreadsheet programme... but I am essentially Perl and Python agnostic.

I still think though that "virtually concatenating" the files would be better as the Q values and "Rank" will be calculated more properly.

run_discoSnp.sh -g ######### -p output_file_prefix -r1 "file1 file2" -r2 "file3 file4 file5" -r3 "file6 fle7" ...

Internally the programme somehow renames the sequence entries for each file and only outputs "C1_", "C2_" with corresponding "Q1_" and "Q2_"? So reading through from file1 into file2 might not cause problems for the calculations, I suppose?

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by hugo.v ▴ 10

0

Entering edit mode

Hello Pierre:

The discoSnp program does indeed not need the pair-end information... it is just that using the two read files will increase the coverage of a potential SNP and thus in the later analysis / genotyping of the output, one could use a higher treshold value. Since the reads in the files are analysed independently, it is indeed just a matter of "summing up" the numbers obtained from each file. I am not sure if I would be able to do that without altering the format of the output "coherent.fa" file. If the output file has a (slightly) different format, the genotyping script might not accept the data.

There is the script to convert the "coherent.fa" file into a comma delimited file... one can sum up the columns in a spreadsheet programme... but to do the genotyping I would have to convert the new file back into the proper format! I am afraid there would be some difficulties doing that properly. I'll try anyway.

Thanks for developing this program and making it available

Hugo

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by hugo.v ▴ 10

0

Entering edit mode

Hi Hugo,

Thanks again for all this constructive comments and your suggestion scripts. They are indeed useful.

I do agree with your suggestion: "run_discoSnp.sh -g ######### -p output_file_prefix -r1 "file1 file2" -r2 "file3 file4 file5" -r3 "file6 fle7" ..." This is definitely something we have to do as soon as possible.

Also, have you tried the "./discoSnp_to_csv.py" python script provided in the "output_analyses" directory?

the programme somehow renames the sequence entries for each file and only outputs "C1_", "C2_" with corresponding "Q1_" and "Q2_"?

Yes

reading through from file1 into file2 might not cause problems for the calculations

Yes again :)

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by pierre.peterlongo ▴ 900

0

Entering edit mode

Hi Hugo,

It's been a while, but the current discoSnp++ version may fix the concerns you raised.

This new version http://gatb-tools.gforge.inria.fr/versions/src/DiscoSNP++-2.1.2-Source.tar.gz provides a VCF and takes into account the pair of reads.

The choose solution is a bit different from what you proposed, and all datasets need to be paired.

The format is

run_discoSnp.sh -r "file1_1.fa *file1_2.fa file2_1.fa file2_2.fa file3_1.fa file3_2.fa ..." -P

With the P option each pair of read set is considered as representing paired read sets.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 9.7 years ago by pierre.peterlongo ▴ 900