Hi everyone,
I have paired-end sequencing data from Brassica. I preprocessed the raw FASTQ files with Trimmomatic (removed adapter sequences and low-quality bases). Then I aligned the trimmed reads to the reference genome with BWA-MEM and extracted the unmapped reads with SAMtools (flag 0x4). I converted the BAM file containing the unmapped reads to FASTQ with SAMtools bam2fq, which gave me a single FASTQ file.
I want to assemble these unmapped reads into contigs with MaSuRCA, but MaSuRCA needs a configuration file to run. The tutorial clearly explains how to specify paired-end data, but it does not say how to specify single-end Illumina data in the config file. The attached image shows an example MaSuRCA configuration file. It requires the insert size and the standard deviation of the insert size for paired-end reads, but my extracted unmapped reads are single-end. How can I get these values for the configuration file?
Since you have single-end data, there is no way to accurately determine the insert size from those reads alone.
Update: Since these reads came from the same library, you can reuse the numbers from the PE data.
If your initial dataset was paired-end, how come you ended up with all single-end data after the process you describe above?
The MaSuRCA config file says the following, so you may need to use another assembler if you have single-end reads.
I am focusing on the unmapped reads, and I want to assemble them de novo into contigs. When I aligned my trimmed paired-end reads to the reference genome, I got a SAM file.
Then, I used samtools to extract all unmapped reads as a single group (
samtools view -b -f 4 SRR4289357_mapped.sorted.bam > SRR4289357_unmapped.bam
) instead of extracting the paired and singleton unmapped reads separately (like this:). This means that after this step I got only one BAM file containing all unmapped reads (not R1 and R2 of the paired-end reads). I then converted this BAM file to FASTQ.
I read a comment on this post saying that in the MaSuRCA configuration file you can just replace PE => SE (single-end), like this: PE= se 500 50 /../ummapped.fastq
My question is: how do I get the numbers "500" and "50" in this line of the configuration file: "PE= se 500 50 /../ummapped.fastq"? I can calculate them for paired-end reads, but in this case I have only one FASTQ file.
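On the extraction step described above: the pairing does not have to be lost. A minimal sketch, assuming a samtools 1.x install that provides collate and fastq; the function name and the output file names are made up for illustration. It keeps only pairs where both mates are unmapped, so reads whose mate did map are not in the R1/R2 output.

```shell
# Hypothetical helper: write pairs where BOTH mates are unmapped as R1/R2 FASTQ.
# Output file names are placeholders chosen for this example.
extract_unmapped_pairs() {
    bam="$1"      # BAM produced from the bwa mem alignment
    prefix="$2"   # prefix for the output FASTQ files

    # -f 12 = read unmapped (0x4) AND mate unmapped (0x8).
    # collate puts mates next to each other so fastq can split them into R1/R2.
    samtools view -u -f 12 "$bam" \
        | samtools collate -u -O - \
        | samtools fastq -1 "${prefix}_R1.fastq" -2 "${prefix}_R2.fastq" \
                         -s "${prefix}_singletons.fastq" -0 /dev/null -n -
}

# Usage (not run here):
# extract_unmapped_pairs SRR4289357_mapped.sorted.bam SRR4289357_unmapped
```

Unmapped reads whose mate mapped could be pulled separately with -f 4 -F 8 if they are wanted as true single-end reads.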
Since these unmapped reads came from the same library, you can use the same numbers as you are using for PE.
In this post, "How to specify Illumina Single End data in the MaSuRCA Assembler config file", the answer says: "Just specifying one file should work and replacing pe by se."
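If the PE numbers are not already known, they can be estimated from the original paired-end alignment. A rough sketch, assuming the template length in SAM column 9 (TLEN) is a good enough proxy for insert size; the function name and the 2000 bp outlier cutoff are arbitrary choices for this example, not anything MaSuRCA prescribes.

```shell
# Hypothetical helper: read SAM records on stdin and print "mean stdev" of TLEN
# (column 9). Only positive TLEN values are used so each pair is counted once;
# max_len is an arbitrary cutoff that drops extreme outliers.
insert_size_stats() {
    max_len="${1:-2000}"
    awk -F'\t' -v max="$max_len" '
        /^@/ { next }                        # skip SAM header lines
        $9 > 0 && $9 <= max {                # leftmost mate of each pair
            n++; sum += $9; sumsq += $9 * $9
        }
        END {
            if (n < 2) { print "NA NA"; exit }
            mean = sum / n
            var  = (sumsq - n * mean * mean) / (n - 1)   # sample variance
            if (var < 0) var = 0
            printf "%.0f %.0f\n", mean, sqrt(var)
        }'
}

# Usage (not run here), restricted to properly paired reads (-f 2):
# samtools view -f 2 SRR4289357_mapped.sorted.bam | insert_size_stats
```

The two numbers printed would go where "500" and "50" sit in the PE= line.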
This post is 10 years old, and the configuration example currently shown on the GitHub page clearly says that Illumina reads must be paired-end: https://github.com/alekseyzimin/masurca
That said, if you are able to get it to work by following the directions in the old thread, there is no harm in trying.
I don't know how many single-end reads you have, but unless it is a large number you are likely chasing diminishing returns.