SSPACE:一款专门做scaffolding的软件

楼主  收藏   举报   帖子创建时间:  2018-10-30 00:00 回复:10 关注量:111

在做基因组拼接的时候,有时候已经获得了一堆contig或者scaffold,然后需要利用另外的一个pair-end或者mate-pair文库进一步做scaffolding的话。用什么软件好呢?

最近使用了一款SSPACE貌似不错,这款软件可以利用Pair-End(PE)和Mate-Pair(MP)的二代数据将已有的contig或者scaffold连接成更大的scaffold。同时这个软件还能够利用PE或者MP的数据对已有的contig或者scaffold进行延伸(或者填补部分的gap)。

这个软件做的不错,但是还是要小小地complain一下:这个软件分basic版和premium版,前者免费使用,后者则要求用户捐 € 250给他们才能使用。基于GNU的软件还有这么做的?!!!

关于软件的相关论文:

SSPACE has shown excellent performance on various datasets. The results have been published in the high-impact journal Bioinformatics (2011, vol. 27(4), pag. 578-9).

这个软件使用方法很简单,大家可以下载basic版,然后阅读里面的User_Manual 和 Tutorial。前者告诉你每个参数的意义后者是软件使用的简单教程。另外里面还有一个README文件是介绍软件算法过程的。如果由国内用户无法下载的可以联系我。我这现在有最新的1.2版本。
下面是附上SSPACE的参数介绍、参数含义、算法过程(工作原理):

SSPACE参数


Usage: perl SSPACE_v1-2.pl options

-l Library file containing two mate pate files with insert size, error and either mate pair or paired end indication.
-s Fasta file containing contig sequences used for extension. Inserted pairs are mapped to extended and non-extended contigs (REQUIRED)
-x Indicate whether to extend the contigs of -s using paired reads in -l. (-x 1=extension, -x 0=no extension, default -x 1)
-m Minimum number of overlapping bases with the seed/contig during overhang consensus build up (default -m 32)
-o Minimum number of reads needed to call a base during an extension (default -o 20)
-t Trim up to -t base(s) on the contig end when all possibilities have been exhausted for an extension (default -t 0, optional)
-g Maximum number of allowed gaps during mapping with Bowtie. Corresponds to the -v option in Bowtie. *higher number of allowed gaps can lead to least accurate scaffolding* (default -v 0, optional)
-b Base name for your output files (optional)
-v Runs in verbose mode (-v 1=yes, -v 0=no, default -v 0, optional)
============ Options below only considered for scaffolding ============
-k Minimum number of links (read pairs) to compute scaffold (default -k 5, optional)
-a Maximum link ratio between two best contig pairs *higher values lead to least accurate scaffolding* (default -a 0.7, optional)
-n Minimum overlap required between contigs to merge adjacent contigs in a scaffold (default -n 15, optional)
-u Fasta/fastq file containing unpaired sequence reads (optional)
-p Make .dot file for visualisation (-p 1=yes, -p 0=no, default -p 0, optional)

SSPCAE的参数含义:


The ‘-l’ library file:
---------------
The library file contains information about each library. The library file contains six columns, each separated by a space. An example of a library file is;

Lib1 file1.fasta file2.fasta 400 0.5 0
Lib1 file2.fasta file2.fasta 400 0.5 0

Lib2 file3.fastq file3.fastq 4000 0.75 1

Each column is explained in more detail below; Column 1:
---------
Name of the library. A short name to keep track of the names of the libraries. All temporary files and summary statistics are named by this library name. Libraries having same name are considered to be of same distance and devaition (column 4 and 5). Libraries should be sorted on distance, the first library is scaffolded first, followed by next libraries.

Column 2 & 3:
---------
Fasta or fastq files for both ends. For each paired read, one of the reads should be in the first file, and the other one in the second file. The paired reads are required to be on the same line. No naming convention of the reads is required, because names of the headers are not used in the protocol. Thus names of the headers shouldn’t be the same and do not require any overlap of names like (…).x and (…).y, which is commonly used in assembly programs.

During the reading step, the sequences of both pairs are merged together and filtered for;
-Mapping. the filtered read pairs with only ACGT characters and no duplicates are used. The remaining merged read pairs are split to single reads, and are mapped to contigs.
-Extension. only unmapped single reads are used for extension of the pre-assembled contigs.
-Scaffolding. Besides the filtering of mate pairs containing “N” and non- ACGT character, duplicate mate pair sequences are also removed.

Concluding, each read should be larger than 16 (or the ‘–m’ parameter if
-x 1). If they are shorter, the program will simply omit them from the assembly.

Column 4 & 5:
---------
The fourth column represents the expected/observed inserted size between paired reads. The fifth column represents the minimum allowed error. A combination of both means e.g. that with an expected insert size of 4000 and 0.5 error, the distance can have an error of 4000 * 0.5 =
2000 in either direction. Thus pairs between 2000 and 6000 distance are valid pairs.

Column 6:
---------
The final column indicates that the reads in both inserted fasta files should be reverse complemented or not. This column should either be set to 0 (no reverse complement) or 1 (reverse complement). SSPACE requires an A--><--B orientation of the reads. If the library has A--><--B orientation of the reads, typical paired-end library, no reverse complement should be used. For a library with A<---->B orientation, which is typically used for mate pair libraries, one should set this column to 1, to reverse complement both reads to A--><--B orientation

The ‘-s’ contigs fasta file
---------------
The ‘–s’ contigs file should be in a .fasta format. The headers are used to trace back the original contigs on the final scaffold fasta file. Therefore, names of the headers should not be too complex. A naming of “>contig11” or “>11”, should be fine. Otherwise, headers of the final scaffold fasta file will be too large and hard to read.
Contigs having a non-ACGT character like “.” or “N” are not discarded. They are used for extension, mapping and building scaffolds. However, contigs having such character at either end of the sequence, could fail for proper contig extension and alignment of the reads.

The ‘-x’ contig extension option
---------------
Indicate whether to do extension or not. If set to 1, contigs are tried to be extended using the unmapped sequences. If set to 0, no extension is performed.

The ‘–m’ minimum overlap
---------------
Minimum number of overlapping bases of the reads with the contig during overhang consensus build up. Higher ‘-m’ values lead to more accurate contigs at the cost of decreased contiguity. We suggest to take a value close to the largest read length. For example, for a library with
36bp reads, we suggest to use a -m value between 32 and 35 for reliable contig extension. For more information, see the SSPACE README file
or the SSAKE paper/poster.

The -o number of reads
---------------
Minimum number of reads needed to call a base during an extension, also known as base coverage. The higher the ‘-o’, the more reads are considered for an extension, increasing the reliability of the extension.

The ‘-t’ trimming option
---------------
Trims up to ‘-t’ base(s) on the contig end when all possibilities have been exhausted for an extension. See SSAKE help files for information.

The ‘-k’ minimal links and ‘-a’ maximum link ratio
---------------
Two parameters control scaffolding (-k and -a). The -k option specifies the minimum number of links (read pairs) a valid contig pair must have to be considered. The -a option specifies the maximum ratio between the best two contig pairs for a given contig being extended. For more information see the .readme file or the poster of SSAKE.

The ‘-n’ contig overlap
---------------
Minimum overlap required between contigs to merge adjacent contigs in a scaffold. Overlaps in the final output are shown in lower-case characters.

The '-g' maximum gaps
---------------
Maximum allowed gaps for Bowtie, this parameter is used both at mapping during extension and mapping during scaffolding. This option

corresponds to the -v option in Bowtie. We strongly recommend using no gaps, since this will slow down the process and can decrease the reliability of the scaffolds. We only suggest to increase this parameter when large reads are used, e.g. Roche 454 data or Illumina 100bp.

The ‘-u’ unpaired single reads
---------------
Unpaired reads can be inserted for contig extension. These reads are not used for scaffolding.

The ‘-p’ plot option
---------------
Indicate whether to generate a .dot file for visualisation of the produced scaffolds.

The ‘-b’ prefix base name
---------------
All files start with the ‘-b’ prefix to allow for multiple runs on the same folder without overwriting the results.

The ‘-v’ verbose option
---------------
Indicate whether to run in verbose mode or not. If set, information about the process is printed on the screen.

SSPACE算法原理


The program consists of several steps, a short overview;

The first steps are reading the data and filter them. The protocol is slightly different when -x is set to either 0 or 1. We treat them separately here;

With -x 0 the steps are;
1) Read -l library file;
A) For each library in the -l library file;
-merge both reads of a pair in to a single string and store them in a file. Filter this data on reads containing N's.
2) Convert the inserted contig file to appropriate format.

With -x 1 the steps are;

1) Read -l library file;
A) For each library separately
- merge both reads of a pair in to a single file for mapping. Filter this data on reads containing N's.
B) For all libraries
- store the single reads to a new file. Only reads containing only ACGT characters are stored.
2) Extend the pre-assembled contigs
A) Map single reads of step 1B to (-s) contig file with Bowtie.
B) Read unmapped reads into memory.
C) Go through each contig in the (-s) contig file, and try to extend the contig. The new contigs are stored in a new file.

After producing either a formatted or an extended contig file, the next step is to go through each library in the -l library file and map the filtered paired reads of step 1A to the new contigs;

3) Use Bowtie to map single reads of 1A to either the formatted or extended contigs. Map only reads that are on the edges of the contigs. Only reads that map to only one contig are stored in a file. Position and orientation of each read is stored in the file.
4) Store the position information of each found read on a contig to an hash.
5) Go through the paired reads and use only reads for contig pairing if both pairs are found in the hash. If a pair is already used for contig pairing, it is not used again.
6) Pair contigs based on the number of links (-k) and link ratio (-a)
7) Merge, orient and order the contigs to produce scaffolds.

8) If multiple libraries are in -l file, the produced scaffolds in fasta format are the input for the new library. Steps 3 till 8 are repeated for each library.

A more detailed view of the six main steps are given below.

Detailed view
------------

1. Reading and filtering libraries
Both fasta/fastq files inserted at the -l library file are read and merged together, forming a single string separated by a semicolon;

>read1_A of file 1
ACGATGCTAT

>read1_B of file 2
ACCGCGCCCC

>merged_read
ACGATGCTAT:ACCGCGCCCC

If the merged read contains only ACGT characters, it is stored in a new file. Filtered pairs are used for contig pairing, see step 4.

If -x 1 is set, for contig extension, single reads containing only ACGT characters are stored in a new file. The single reads are mapped to contigs at the next step.

2. Mapping when -x 1
To extend contigs, only reads that are not already present on the contigs should be used. Otherwise, reads are re-used and cause erroneous contigs. To filter these reads out, Bowtie is used. Bowtie maps the produced single reads at step 1 to the (-s) pre-assembled contigs. A file is generated with reads that did not map to the contigs. The unmapped read file is read in memory, populating a hash table keyed by unique sequence reads with pairing values representing the number of sequence occurrences. The hash is used for contig extension at the next section.

3. Extending when -x 1
Contigs are extended, when -x set to 1, using the unmapped reads with a method developed by SSAKE. With SSAKE, contigs extension is initiated by generating the longest 3'-most word (k-mer) from the unassembled read u that is shorter than the sequence read length l. Every possible 3' most k-mers will be generated from u and used in turn for the search until the word length is smaller than a user-defined minimum, m. Meanwhile, all perfectly overlapping reads will be collected in an array and further considered for 3' extension once the k-mer search is done. At the same time, a hash table c will store every base along with a coverage count for every position of the overhang (or stretches of bases hanging off the seed sequence u).

Once the search complete, a consensus sequence is derived from the hash table c, taking the most represented base at each position of the overhang. To be considered for the consensus, each base has to be covered by user-defined -o (set to 2 by default). If there's a tie (two bases at a specific position have the same coverage count), the prominent base is below a user-defined ratio r, the coverage -o is to low or the end of the overhang is reached, the consensus extension terminates and the consensus overhang joined to the contig. All reads overlapping are searched against the newly formed sequence and, if found, are removed from the hash table and prefix tree. If they are not part of the consensus, they will be used to extend other contigs, if applicable. If no overlapping reads match the newly formed contig, the extension is terminated from that end and SSAKE resumes with a new contig. That prevents infinite looping through low complexity DNA sequences. In the former case, the extension resumes using the new [l-m] space to search for joining k-mers.

The process of progressively cycling through 3'-most k-mer is repeated after every contig extension until nothing else can be done on that side. Since only left-most searches are possible with a prefix tree, when all possibilities have been exhausted for the 3' extension, the complementary strand of the contiguous sequence generated is used to extend the contig on the 5' end. The DNA prefix tree is used to limit the search space by segregating sequence reads and their reverse-complemented counterparts by their first eleven 5' end bases.

There are two ways to control the stringency in SSPACE:
1) Disallow contig extension if the coverage is too low (-o). Higher -o values lead to shorter contigs, but minimizes sequence misassemblies.
2) Adjust the minimum overlap -m allowed between the contig and short sequence reads. Higher m values lead to more accurate contigs at the cost of decreased contiguity.

After the sequence assembly, a file is generated with .extendedcontigs.fasta extension in the 'intermediate_results' folder. This file contains both extended and non-extended contigs.

The next steps are looped through each library, present in the (-l) library file.

4. Mapping unique paired reads

At step 1, pairs of each library were filtered. Pairs containing N's are unable to correctly map to the contigs, therefore they are removed to save time. Of the remaining pairs, only single reads are extracted. Bowtie maps the single reads to the contigs, produced either after extending (if -x 1), or after formatting (if -x 0), or after step 5 if multiple libraries are inserted on -l.

Before mapping, contigs are shortened, reducing the search space for Bowtie. Only edges of the contigs are considered for mapping. Cutting of edges is determined by taking the maximal allowed distance inserted by the user in the library file (insert size and insert standard deviation). The maximal distance is insert_size + (insert_size * insert_stdev). For example, with a insert size of 500 and a deviation of 0.5, the maximal distance is 750. First 750 bases and last 750 bases are subtracted from the contig sequence, in this case.

------------------------------------------
| |
------------ ------------
750bp 750bp

This step reduces the search space by merging the two sequences, divided by a <N> character.

The algorithm of mapping goes through each pair and checks its occurrence on the edges of the contigs. If both reads are found, the reads of the pair is stored and contigs could be paired in the next step. Otherwise, it is not stored and the read pair is not used for contig pairing. If a pair is previously found and used for contig pairing, the pair is not considered again. Otherwise same links between contigs are found based on same read pair, which can generate misleading results.

If either of the two reads of a read pair occur on multiple contigs, one can not tell which contig should be paired. For example, the left read occurs at contigs 1 and 3, and the right read at contig 2. For this situation it is impossible to tell if contigs 1 and 2 should be paired, or contigs 1 and 3. Therefore, reads that occur multiple times on contigs are not considered for contig pairing.

5. Building scaffolds
The final step is scaffolding. SSPACE uses an updated version of the SSAKE scaffolder for this. For each read pairs, putative contig pairs (pre-scaffolding stage) are tallied based on the position/location of the paired reads on different contigs. Contig pairs are only considered if the calculated distance between them satisfy the mean distance specified (fourth column in -l file) while allowing for a deviation (fifth column in -l file), also defined by the user. Only contig pairs having a valid gap or overlap are allowed to proceed to the scaffolding stage.
Please note that this stage accepts redundancy of contig pairs (i.e. a given contig may link to multiple contigs, and the number of links (spanning pairs) between any given contig pair is recorded, along with a mean putative gap or overlap(-)).

Once pairing between contigs is complete, the scaffolds are built using contigs as seeds. Every contig is used in turn until all have been incorporated into a scaffold.

Consider the following contig pairs (AB, AC and rAD):

A B
========= ========
-> <-
-> <-
-> <-
-> <-

A C
========= ======
-> <-
-> <-

rA D equivalent to rDA, in this order
========= =======
-> <-
-> <-
-> <-

Two parameters control scaffolding (-k and -a). The -k option specifies the minimum number of links (read pairs) a valid contig pair MUST have to be considered. The -a option specifies the maximum ratio between the best two contig pairs for a given contig being extended. For example, contig A shares 4 links with B and 2 links with C, in this orientation. contig rA (reverse) also shares 3 links with D. When it's time to extend contig A (with the options -k and -a set to 2 and 0.7, respectively), both contig pairs AB and AC are considered. Since C (second-best) has 2 links and B (best) has 4 (2/4) = 0.5 below the maximum ratio of 0.7, A will be linked with B in the scaffold and C will be kept for another extension. If AC had 3 links the resulting ratio (0.75), above the user-defined maximum 0.7 would have caused the extension to terminate at A, with both B and C considered for a different scaffold. A maximum links ratio of 1 (not recommended) means that the best two candidate contig pairs have the same number of links -- SSPACE will accept the first one since both have a valid gap/overlap. The above method was adopted from SSAKE. The SSPACE improved this method by introduing another method if a contig can link to more than one alternative. Both methods (original SSAKE method and our method) for handling alternatives are explained below;

If a contig can be linked to more than one alternative, connections between these alternatives are searched and linked together if a connection is found. Otherwise a ratio is calculated between the two best alternatives. If this ratio is below a threshold (-a) a connection with the best scoring alternative is established. The two methods are shown below;

The first method;
A has 10 links with B
A has 5 links with C
B has 10 links with C;

Result is a scaffold containing A-B-C

The second method adopted from SSAKE (only used if first method did not produce a scaffold);
A has 10 links with B
A has 5 links with C
B has 0 links with C;

Since no connection is found between B and C, a ratio of the links is estimated. Here, this is 5/10 (C/B) = 0.5. If this is below the -a ratio threshold, the scaffold is A-B.

When a scaffold extension is terminated on one side, the scaffold is extended on the "left", by looking for contig pairs that involve the reverse of the seed (in this example, rAD). With AB and AC having 4 and 2 links, respectively and rAD being the only pair on the left, the final scaffolds outputted by SSPACE would be:

1) rD-A-B
2) C

SSPACE outputs a .scaffolds file with linkage information between contigs (see "Understanding the .scaffolds csv file" below)
Accurate scaffolding depends on many factors. Number and nature of repeats in your target sequence, optimum adjustments of insert size, error, parameters -k and -a and data quality/size of sequence set (more doesn't mean better) will all affect SSPACE's ability to build scaffolds.

6. Merging contigs
SSAKE scaffolder produces links between contigs and determines the possible gap between them. For a positive gap, m number of N's will be placed between them if a gap of size m is predicted to occur. When a negative gap is generated, a putative overlap is predicted to occur. The adjacent contigs are searched for overlap within a window given at -n option till 50 bp. If an overlap was found, contigs are merged and the region is marked with lowercase nucleotides. Otherwise, if no overlap was detected, a single "n" will be placed between the contigs. A short overview of this step with three examples;

>contig_1
AGCTAGTCGTAGCTTGTAC
>contig_2
ACGTAGTGATATTATTGTC

Example 1:
A link between contig_1 and contig_2 is found, with a putative gap of 10. In the final output, the gaps is indicated by 10 N's between the two contigs.

Link = contig_1 with contig_2. Gap = 10;
AGCTAGTCGTAGCTTGTACNNNNNNNNNNACGTAGTGATATTATTGTC

Example 2;
A link between contig_1 and contig_2 is found, with a putative gap of -10. When using the -n 10 option, no overlap was found and a small <n> is inserted between the two contigs.

Link = contig_1 with contig_2. Gap = -10. -n = 10;
AGCTAGTCGTAGCTTGTACnACGTAGTGATATTATTGTC

Example 3;
A link between contig_3 and contig_4 is found, with a putative gap of -10. When using the -n 10 option, an overlap of 13 nucleotides was found, indicated in lower case in the final output.

>contig_3
AGTGTTAGATAGTTATAGA
>contig_4
AGATAGTTATAGAAGTAGT

Link = contig_3 with contig_4. Gap = -10. -n = 10;
AGTGTTagatagttatagaAGTAGT

  • ***2ybzhao 1970-01-01 08:00
    #1

    已经发送,请查收邮件。

  • ***3echoduan 2013-09-25 09:22
    #2

    你好,请问你能给我发一份这个软件吗?我也在做拼接,我的邮箱duanshoufu@126.com,不好意思,真的麻烦你一下,谢谢!

  • ***1cwc111071 2012-11-28 14:48
    #3

    你好,我对SSPACE也有兴趣,想研究一下,能麻烦把软件以及使用说明发我邮箱吗?多谢!我邮箱weicheng1612@126.com

  • ***1xylym 2013-10-18 11:27
    #4

    不好意思,可以请问一下关于SSPACE的问题么:请教各位,在SSPACE生成了一个关于contigs gaps的文件,里面记录了可能存在的contig link,我想问一下,这是里面的一条记录:f3545 has 23 links with r3245 and gap of -68 bases,这里说有68 bases大小的gap,然后前面那个“ – ”表示是什么意思呢?我看了README,推测可能是negative gaps,可能可以通过contig之间的overlap消除的gap么?另外,我这有一条组装成scaffold的记录:scaffold6|size68841|tigs7f_tig3325|size6927|links7|gaps-694|merged25f_tig3146|size3331|links6|gaps-623r_tig3405|size10398|links5|gaps-621f_tig3266|size5457|links15|gaps-649f_tig3358|size8089|links8|gaps383r_tig3042|size2074|links5|gaps-615|merged15f_tig3539|size32219这里有f_tig3325和f_tig3146之间存在gaps-694,后面说merged25,也就是说,这条contig之间存在gaps而且有overlap但是只能通过overlap merge 25 bases,是这个意思么?有些困惑~非常感谢……

  • ***1lichen 2014-03-25 16:37
    #5

    很管用

  • ***1lichen 2014-03-25 16:40
    #6

    你好,我最近做了一项细菌全基因序列测序,有7个scaffold,据说是未能测到的基因,看到这个SSPACE,不妨一试,发一个给我,好吗,谢谢!

  • ***1lichen 2014-03-25 16:49
    #7

    你好,我最近做了一项细菌全基因序列测序,有7个scaffold和10个contig,这个SSPACE刚好可以用上,能发给我吗,谢谢!

  • ***1晨辰 2016-03-23 11:44
    #8

    你好,我是中科院的一名学生,现在在做基因组scaffolding的工作,能够将SSPACE最新版本传给我吗,我的邮箱是306437856@qq.com,谢谢,以及你这篇分享的文章!祝生活愉快

  • ***1晨辰 2016-03-23 11:52
    #9

    你好,我是中科院的一名学生,现在在做基因组scaffolding的工作,能够将SSPACE最新版本传给我吗,我的邮箱是306437856@qq.com谢谢,以及你这篇分享的文章!祝生活愉快。

  • ***2Quan 2016-04-09 21:49
    #10

    最新版本免费的,申请就可以下载使用