SAGE Bioinformatics

After the concatemers have been sequenced, the bioinformatics work starts. The computational steps are, in a sense, the reverse of the molecular cloning steps performed in the laboratory. First, the concatemer sequences are split into ditags. Then ditags that occur more than once are removed. Because tags are ligated into ditags at random before PCR amplification, two ditags with the same sequence are regarded as an artefact of preferential PCR amplification. Removing these ditags, which make up a negligible percentage of all ditags, eliminates such PCR artefacts. Hence SAGE, although it benefits from PCR amplification, avoids the general pitfalls of PCR-based methods such as differential display. In the next step, the ditags are split into the individual SAGE tags. The process is completed by assigning the tags to genes and counting them.

This can be done using one of several publicly available SAGE analysis software packages (some are online, e.g., at http://www.ncbi.nlm.nih.gov/SAGE). At Memorec we have developed our own software for accurate SAGE tag mapping using an extensive proprietary tag database. The database includes automatic annotation derived from EST/genomic data and takes into account several hundred manually annotated tags that elude automatic annotation. SAGE artefacts and uninformative tags, which derive from polymorphic tags, ribosomal RNA, mitochondrial RNA, linker tags, LINE/SINE tags, and sequencing errors, are removed by proprietary filtering algorithms. The software allows the comparison of two (or more) libraries by providing tools for normalization, calculation of significance levels, and interactive graphical output. Already at this stage, SAGE results can be used to define the complete, unrestricted transcriptome of a given tissue.
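
The tag-extraction steps described above (splitting concatemers into ditags, discarding repeated ditags, splitting ditags into tags, and counting) can be sketched in a few lines of Python. This is only an illustration and not the Memorec software or any published package. It assumes the anchoring-enzyme site CATG (NlaIII), 10-bp tags, and a ditag length window of 20 to 26 bp, none of which are stated in the text; all function names are hypothetical. It also assumes that one copy of a repeated ditag is kept and the further copies are dropped.

```python
from collections import Counter

ANCHOR = "CATG"   # assumption: NlaIII anchoring-enzyme site between ditags
TAG_LEN = 10      # assumption: classic 10-bp SAGE tag

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_ditags(concatemer, min_len=20, max_len=26):
    """Cut a concatemer at every anchor site and keep plausible ditags.

    The first and last pieces are flanking sequence, not ditags.
    The length window is an assumption used to discard odd fragments.
    """
    pieces = concatemer.upper().split(ANCHOR)
    return [p for p in pieces[1:-1] if min_len <= len(p) <= max_len]


def count_tags(concatemers):
    """Count tags over all concatemers, using each distinct ditag once."""
    seen = set()
    counts = Counter()
    for concatemer in concatemers:
        for ditag in split_ditags(concatemer):
            # The same ditag may be read from either strand.
            key = min(ditag, revcomp(ditag))
            if key in seen:
                continue  # repeated ditag: treated as a PCR artefact
            seen.add(key)
            counts[ditag[:TAG_LEN]] += 1             # left-hand tag
            counts[revcomp(ditag[-TAG_LEN:])] += 1   # right-hand tag
    return counts
```

Calling `count_tags` with a list of concatemer sequences returns a `Counter` that maps each tag to its count; assigning tags to genes would then require a separate tag-to-gene reference table.
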
Sensitivity is determined not primarily by a labelling or detection system but by the amount of sequencing that can be carried out cost-effectively. A typical SAGE analysis of 50 000 tags will identify 10 000-15 000 different transcripts in most cell types. It has been shown that, at sufficient sequencing depth, SAGE can identify the entire set of genes expressed in a cell type. A typical SAGE library contains many more tags than can practically be sequenced, so it is a virtually inexhaustible resource when additional data are required for statistical purposes.
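
The link between sequencing depth and sensitivity can be illustrated with a simple sampling model. The sketch below assumes that every sequenced tag is an independent draw and that a transcript contributes a fixed fraction f of all tags, so the chance of seeing it at least once among n tags is 1 - (1 - f)^n. The model, the function names, and the example abundance are illustrative assumptions, not part of the procedure described here.

```python
import math


def detection_probability(fraction, n_tags):
    """Probability of sampling a transcript at least once in n_tags draws.

    fraction: assumed share of all tags that belong to the transcript.
    """
    # 1 - (1 - f)^n, computed in a numerically stable way
    return -math.expm1(n_tags * math.log1p(-fraction))


def tags_needed(fraction, target=0.95):
    """Smallest number of tags giving at least `target` detection probability."""
    return math.ceil(math.log1p(-target) / math.log1p(-fraction))


if __name__ == "__main__":
    f = 1 / 50_000  # hypothetical rare transcript: one tag in 50 000
    print(detection_probability(f, 50_000))
    print(tags_needed(f, 0.95))
```

Under this model, a transcript that accounts for one tag in 50 000 is seen at least once in a 50 000-tag analysis with a probability of roughly 63%, which is why deeper sequencing of the same library raises sensitivity.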

However, most applications of SAGE aim to identify genes that are differentially expressed between two samples. This is achieved by simply comparing the tag counts in the two samples. Because SAGE data are digital, differential expression can be assessed by statistical methods, which is not true when two microarray experiments are compared [3]:
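
Assuming that the statistic meant here is the Audic and Claverie test, which is commonly applied to SAGE tag counts and uses exactly these symbols, the probability of observing y tags in the second library given x tags in the first can be written as:

```latex
p(y \mid x) = \left(\frac{N_2}{N_1}\right)^{y}
              \frac{(x+y)!}{x!\, y!\, \left(1 + \frac{N_2}{N_1}\right)^{x+y+1}}
```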

where N1 and N2 represent the total number of tags counted in each of the two libraries, x and y are the numbers of tags for a given gene in the first and second library, respectively, and p is the significance level of regulated gene expression.

As one can see, the degree of significance of differentially expressed genes depends on the number of tags counted for a given gene. The differential expression of genes with tag counts of, for example, 200:100 is much more significant than that of genes with tag counts of 20:10 or even 2:1, although the ratio is the same in all three cases. Finally, sensitivity is improved by clustering multiple tags belonging to one gene and by biological pathway analysis.
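
To make the dependence on absolute tag counts concrete, the following sketch evaluates the three count pairs mentioned above. It assumes that the test in question is the Audic and Claverie statistic, p(y|x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)), and sums it into a one-sided tail probability. The library size of 50 000 tags each and the function names are illustrative assumptions, not taken from any SAGE package.

```python
import math


def ac_prob(x, y, n1, n2):
    """Audic-Claverie probability of y tags in library 2 given x in library 1."""
    ratio = n2 / n1
    log_p = (
        y * math.log(ratio)
        + math.lgamma(x + y + 1)      # log (x+y)!
        - math.lgamma(x + 1)          # log x!
        - math.lgamma(y + 1)          # log y!
        - (x + y + 1) * math.log1p(ratio)
    )
    return math.exp(log_p)


def ac_lower_tail(x, y, n1, n2):
    """One-sided probability of seeing y or fewer tags in library 2."""
    return min(1.0, sum(ac_prob(x, k, n1, n2) for k in range(y + 1)))


if __name__ == "__main__":
    n1 = n2 = 50_000  # assumed library sizes
    for x, y in [(200, 100), (20, 10), (2, 1)]:
        print(f"{x}:{y}  tail probability = {ac_lower_tail(x, y, n1, n2):.3g}")
```

With equal library sizes, the 2:1 case gives a tail probability of about 0.31, far from significant, whereas the same twofold ratio at 200:100 yields a value many orders of magnitude smaller.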
