Real Intelligence for Personalized Medicine
Manny October 3rd, 2007
Saffron enjoyed another successful symposium in July with its government customers and partners. This was co-sponsored by Choicepoint i2, highlighting our integration with Analyst's Notebook, the dominant desktop tool for intelligence analysis and related domains such as financial fraud and crime investigation. The symposium also featured one of our customers, who is using Saffron for "associative targeting" of bad guy networks.
The maturity of the entity analytic market within national security is becoming clear. The government has been a leader in the adoption of entity extractors. Although all this rich information about people, places, and things and how they are linked has been made available, new ways are required to exploit this information. I heard a wonderful phrase to describe the current problem: Information overload has led to "write once, read never". Data can be easily stored but is never exploited. With entity extraction, entity overload rather than document overload now overwhelms the analyst. As argued in prior postings, associative memories provide the "real intelligence" needed to properly represent and support inferential arguments over massive, complex entity networks.
Personalized medicine shares many of these same problems in data and entity analytics. I will be speaking on this topic in October at the 1st Annual Total Cancer Care Summit: the Future of Personalized Medicine, sponsored by the Moffitt Cancer Center. My presentation is entitled "From Bad Guys to Bad Genes: Associative Targeting of Terrorist Networks Applies to Gene Networks for Personalized Medicine". This is another critical mission that affects us all, and new methods developed by Saffron are required to uncover the signals that are clinically important for similar life-and-death challenges.
As highlighted in a recent article in the New England Journal of Medicine entitled "Drinking from the Fire Hose - Statistical Issues in Genomewide Association Studies", the science of genomics is well developed and data is available in unprecedented volume. However, appropriate methods of data analysis are lacking:
"But as we delve further into the genome in the search for networks of interacting gene variants and interactions between these networks and environment factors, much more sophisticated methods of statistical analysis are likely to be required."
This article highlights the statistical problems with SNP data, where 500,000 data points are collected per individual case. Gene expression microarrays are an order of magnitude smaller, ranging from 10,000 to 30,000 simultaneous gene expressions, but statistical methods are still inadequate for such scale. "Association studies" (as they are called) hope to map genes to diseases and treatments, but they consider only linear associations, assuming that each gene maps simply to each disease or treatment. Even if the association studies are "genome wide", it is probably wrong to assume that a single gene will represent the key. For example, BRCA1 is a gene associated with breast cancer, the presence of which can help the physician in selecting the appropriate treatment. However, not all patients with BRCA1 treated by accepted guidelines respond uniformly. Not all patients with BRCA1 are the same, and yet these simplistic linear assumptions treat them all as a single, homogeneous bucket. Clearly, other significant individual differences remain.
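To make the scale problem concrete, here is a rough back-of-the-envelope sketch in Python. The 500,000 figure comes from the article above; the 0.05 significance level and the assumption that every test is independent are mine, chosen only for illustration (real SNPs are correlated, so this overstates things somewhat). The function names are hypothetical.

```python
# Multiple-testing arithmetic for a genome-wide scan.
# Assumptions (not from the article): a per-test significance level of 0.05
# and fully independent tests, which real, correlated SNPs are not.

N_SNPS = 500_000   # data points per individual case, as cited above
ALPHA = 0.05       # conventional per-test significance level (assumed)


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests that pass by chance when no true effect exists."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test threshold that holds the family-wise error rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    print(expected_false_positives(N_SNPS, ALPHA))
    print(bonferroni_threshold(N_SNPS, ALPHA))
```

Under these assumptions, testing each SNP on its own at 0.05 would flag roughly 25,000 SNPs by chance alone, while a Bonferroni correction pushes the per-test threshold down to about one in ten million. And that is before considering pairs or networks of SNPs, where the number of candidate combinations grows far faster.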
Increasingly, data analytic efforts are being criticized as having little or nothing to do with the complexity of known biology. See Joseph Terwilliger's "utter refutation" for a particularly scathing review. Linear association studies or even tree-based predictive models do not capture the complete, context-dependent network structure of genomic interactions. We know the principles of biology to be complex and non-linear, and yet we still base actions on naive assumptions because we lack the mathematical and software tools to address the complexity and scale of the problem.
Since Terwilliger's paper in 2006, a paper by the ENCODE consortium received a great deal of press this summer. This extensive four-year study challenges the notion of discrete genes. DNA transcription is an interwoven complex of gene networks. One report of this effort, "Change to Gene Theory Raises New Challenges for Biotech", concludes that this new understanding will undermine the entire industry based on gene patents, with claims for how DNA sequences map to a "specific functional product". They do not.
Much like the challenges of national security, functional genomics presents another extreme case of large and complex networks. Like "bad guys", bad genes are responsible for rare, adverse events. Like taking out one bad guy, taking out only one bad gene might have very little lasting effect, if any effect at all. Moreover, other lessons learned from the targeting of bad guys will be relevant to bad genes: While linear associations of single genes to specific functions are giving way to understanding genes within networks, even the modeling of non-linear, pair-wise co-regulations of genes remains a shallow model. Accounting for gene-gene interactions will also need to capture the context-dependency of network dynamics. For example, associative targeting of bad guys cannot rely on mere "guilt by association". The context of interaction becomes a critical factor in determining which associations are relevant to a problem and which are mere coincidences of other, more normal interaction. When do environmental factors and many other gene expressions regulate the functional interactions? The underlying physics and biochemistry affect if and when entities effectively meet to produce a clinically relevant consequence.
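A toy example may help show why single-gene associations can miss what matters. The sketch below is purely illustrative: the two gene variants, the cohort, and the rule that an adverse event occurs only when exactly one variant is present are all invented assumptions, not real biology and not Saffron's method.

```python
from itertools import product

# Hypothetical toy cohort: every combination of two binary gene variants,
# with an adverse outcome only when exactly one variant is present.
# The gene names and this rule are illustrative assumptions.
patients = [
    {"gene_a": a, "gene_b": b, "adverse": a != b}
    for a, b in product([False, True], repeat=2)
    for _ in range(25)  # 25 patients per combination
]


def adverse_rate(rows):
    """Fraction of the given patients who had the adverse outcome."""
    rows = list(rows)
    return sum(r["adverse"] for r in rows) / len(rows)


def marginal_effect(gene):
    """Difference in adverse rate between carriers and non-carriers of one gene."""
    carriers = [r for r in patients if r[gene]]
    others = [r for r in patients if not r[gene]]
    return adverse_rate(carriers) - adverse_rate(others)


def joint_rates():
    """Adverse rate for each (gene_a, gene_b) combination."""
    return {
        (a, b): adverse_rate(
            r for r in patients if r["gene_a"] == a and r["gene_b"] == b
        )
        for a, b in product([False, True], repeat=2)
    }


if __name__ == "__main__":
    print(marginal_effect("gene_a"), marginal_effect("gene_b"))
    print(joint_rates())
```

In this invented cohort, each gene taken alone shows no association with the outcome at all, yet the pair taken together determines it completely. Whether one gene is "bad" depends entirely on the context supplied by the other, which is the kind of structure a one-gene-at-a-time study cannot see.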
As argued in my prior postings, associative memory representation and reasoning are emerging as separate from and more powerful than traditional statistics and neural networks, which are also commonly used in bioinformatics. Traditional methods have tried to address "the curse of dimensionality" in massive and complex bio-data by assuming data-to-model reductionism - fitting complex raw data into simpler abstract models. By definition, such reductionism loses information and accuracy. Association studies have relied on one such reductionism, assuming a linear model between gene and function. These models are easy to compute but do not reflect real world complexities. Associative memories do.
Traditional data mining also suffers from the notorious costs and delays of knowledge engineering. Models tend to be highly parametric and non-incremental, requiring the "black arts" of model selection, feature selection, dataset training, parameter tweaking, and other management methods to ensure that such model fitting does not lead to over-fitting, causing further reduction in accuracy. As such, models are slow to be developed and deployed. Their slow construction speed and limited predictive accuracy must be replaced by new methods to provide a more scalable, rapid, and robust industry of personalized decision support. Now that genomic and other "omic" data has become available and there is a stronger scientific understanding of genetic complexity, methods of data analysis remain a roadblock to the widespread clinical availability of personalized genomic medicine.
Clinical testing of single genes and SNPs is a step in the right direction, but these first-generation "personalized" medical systems are poor and do not represent our scientific understanding of genes and the complex signature of the individual patient with a complex disease. Along with the broader measurement of gene networks rather than single genes in the clinic, it is imperative that new data analytic methods rapidly learn and deliver accurate intelligence and decision support to the clinician. As with associative targeting of bad guy networks for operational commanders, associative memories address the required scale and complexity of bad gene networks to also support the physician.
Come join me in the Bahamas this month for the Future of Personalized Medicine!