Tuesday, February 24, 2009

Fusion of Multiple Data Sources for Genomic Data Mining

Abstract:

Diverse genomic data sources are available for identifying the function of cellular molecules such as genes or proteins. For example, the following knowledge sources may be accessible for any specific gene: the gene's expression data; the gene sequences; conserved motifs in the upstream region of that gene; the gene's interactions; the gene's hydrophobicity profile; as well as the protein the gene encodes and the proteins that interact with the given gene's protein product. Each of these distinct sources provides its own viewpoint on the gene and its cellular machinery. Naturally, the best decision can be reached when most (or all) of these sources are available and the different knowledge sources are properly combined.

The first part of this talk will highlight the following categories of fusion strategies:

(a) Secondary Source as Validation Tool.

(b) Linear and Nonlinear Score Fusion. Each data source is processed independently and assigned a score, and the independently assessed scores are then considered jointly to reach a final decision. Linear and nonlinear score fusion are each promising for the applications that suit them.

(c) Direct Data Fusion. In some important application scenarios, all the data sources are represented by vectors over the real field. In this case, the most direct fusion method is to create an expanded vector comprising all the individual vectors: vector = [vector 1, vector 2, vector 3].

(d) Kernel or Correlation-Based Data Fusion. This last category has recently drawn a lot of research attention.

The second part of this talk will focus on the last category. Each type of data is independently represented as a matrix of kernel similarity values, and the matrices are then combined to make overall predictions. Following the SVM framework, Lanckriet et al. proposed a weighted linear combination of kernels and demonstrated how to estimate the kernel weights from the data. More precisely, the problem is formulated as a convex optimization problem that can be solved with semidefinite programming techniques. The approach yields predictions that reflect the proper (or optimal) contribution from each data source.
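
To make fusion categories (b) through (d) from the abstract concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the toy data, and the hand-picked weights are all hypothetical, the kernels are plain linear kernels, and the kernel weights are fixed by hand rather than learned by semidefinite programming as in the approach of Lanckriet et al.

```python
def linear_kernel(X):
    """Gram matrix K[i][j] = <X[i], X[j]> for a list of feature vectors."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def fuse_scores(scores, weights):
    """(b) Linear score fusion: weighted sum of per-source scores for each gene.

    scores[m][i] is the score that source m assigns to gene i.
    """
    return [sum(w * s for w, s in zip(weights, per_gene))
            for per_gene in zip(*scores)]


def concat_features(*sources):
    """(c) Direct data fusion: join each gene's vectors from all sources."""
    return [sum((list(v) for v in rows), []) for rows in zip(*sources)]


def combine_kernels(kernels, weights):
    """(d) Kernel fusion: K = sum_m weights[m] * kernels[m].

    Assumes non-negative weights, so that a weighted sum of positive
    semidefinite kernel matrices is again positive semidefinite.
    """
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be non-negative")
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]


# Hypothetical toy data: 3 genes described by two sources.
expression = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
hydrophobicity = [[2.0], [1.0], [3.0]]

fused_vectors = concat_features(expression, hydrophobicity)
K_expression = linear_kernel(expression)
K_hydrophobicity = linear_kernel(hydrophobicity)

# Hand-picked weights (an assumption); the talk's approach learns them from data.
K_fused = combine_kernels([K_expression, K_hydrophobicity], [0.75, 0.25])
```

One design point worth noting: with linear kernels and all weights equal to 1, the combined kernel matrix is the same as the linear kernel computed on the concatenated vectors, so direct data fusion is a special case of kernel fusion. Unequal weights let one source count more than another, which is where learning the weights becomes useful.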
