Hi,
I need to convert my pair-ended libraries into tab (only headers) with "insert size" information in a separate column. Basically, something like this: "read1 read2 interread-distance". Are you aware of a perl/python script to do such a job? I am starting in python world and it is a simple and it will be a good exercise however some help would be appreciate.
Basically, I will have two datasets (multifasta files format txt-like, *.fasta) like this:
DATASET #1
>header_1/1 (this can have different txt)
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>header_2/1
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>header_3/1
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>header_4/1
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG
DATASET #2
>header_1/2
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>header_2/2
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>header_3/2
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>header_4/2
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG
The only conserved notation is /1 or /2 depending on being dataset1 or 2 (mate-pair reads). The size of the datasets are aroung 5,000,000 header entries but can be more
Additionally, I will have a numeric insert-value input
So, in total I have 3 inputs:
dataset 1
dataset 2
insert value (number)
e.g. perl/python headerextract dataset1 dataset2 [#insert value] output
I want to:
1.extract ">" description from each DATASET like:
header_1/1
header_2/1
header_3/1
header_4/1
and
header_1/2
header_2/2
header_3/2
header_4/2
[to easily extract the headers it can be used grep -e ">" my.fasta since ">" is only in the header line]
2.
print a file (format can be .csv or .txt) with both list of headers (each row with identical header designation only varying in /1 or /2) plus an additional column with insert-value (eg. 300):
header_1/1 header_1/2 300
header_2/2 header_2/2 300
header_3/2 header_3/2 300
header_4/2 header_4/2 300
A real dataset may looks like this:
>ILLUMINA-52179E_0009:5:1:1059:13539#CTTGTA/1
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>ILLUMINA-52179E_0009:5:1:1094:12872#CTTGTA/1
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>ILLUMINA-52179E_0009:5:1:1106:7745#CTTGTA/1
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>ILLUMINA-52179E_0009:5:1:1108:16460#CTTGTA/1
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG
What do the headers look like?
Hi, my read libraries are from illumina, e.g.:
ideally, such script should be something like: perl/python "fa2csv" reads1 reads2 insertsize output. Currently I am looking on how to convert multifasta to csv in a way that in the same row I do have both pairs...
how do you compute/extract the "interread-distance"? If you mean the distance between both pairs, the distance is expected to be some value close to the library preparation (i.e. Illumina protocol selects fragments ~400b, if your are sequencing 100b in both ends, your insert size is ~200). The real distance can be known only after you map your reads to some reference.
Actually, I do not intend to compute the insert value using such script. That I already know, by, as you referred, mapping the reads against a reference. The issue here is that one of the tools I would like to use to analyze my data requires a tab-like format with each pair header in a row and with an approx. insert value.
You'll have to get the insert length from the mapping data (not raw reads) - so what file format is that in?
Please give me an example of files and of a desired output so I could write the script for you.
So you are interested in headers, right? Doesnt need any sequence information?
No, I do not need sequence information in the output. The software I want to use requires: - dataset1.fasta - dataset2.fasta - file with header lines plus insert size [the file described in this thread]