Visualizing Multivariate Analysis of Cancer Data


Dick Kreisberg
Institute for Systems Biology

http://bit.ly/regulome

In breast cancer, which inter-chromosomal molecular features are most associated with BRCA1 mRNA expression?

Outline

  • Introduction
  • The Cancer Genome Atlas
  • Heterogeneous Cancer Data
  • Visualizing Multivariate Analysis
  • Challenges in Cancer Genomics Visualization

Who am I?

A software engineer in the Shmulevich Lab at the Institute for Systems Biology in Seattle, Washington, USA.

Who is the Shmulevich Lab?

We are a group of bioinformaticians and software engineers focused on the challenges of computational biology.

What is the Institute for Systems Biology?

It is a non-profit research institute dedicated to understanding biological complexity.



www.systemsbiology.org

The Cancer Genome Atlas

cancergenome.nih.gov

TCGA Tumor Data

Clinical information (age, gender, vital status, tumor grade, histology)
Tumor sample high-throughput molecular data

mRNA expression (Agilent arrays and/or Illumina GA/HiSeq)
microRNA expression (Illumina GA/HiSeq)
protein expression (Reverse Phase Protein Array)

DNA methylation (Illumina 27k/450k)
DNA copy-number segmentation (Affymetrix SNP6)
DNA mutations (Illumina)

Downstream Information:

Subtype / Cluster assignments (single-platform analysis)
GISTIC regions of interest (copy-number)
Tumor Sample purity / ploidy
Somatic Mutation rates
Micro-Satellite Instability

The Reality of Research Data:


messy
noisy
nuanced

contradictory

Heterogeneous Feature Matrix

Association Analysis

Pairwise Analysis


Pairwise Associations

Significantly associated features (according to corrected p-value) form a pair of connected nodes.
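
A minimal sketch of how corrected p-values can be turned into graph edges. The slide does not say which correction is used, so the Benjamini-Hochberg procedure here is an assumption, as are the function names, the 0.05 cutoff, and the toy p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_edges(pair_pvalues, alpha=0.05):
    """Keep feature pairs whose corrected p-value is at most alpha."""
    pairs = list(pair_pvalues)
    adjusted = benjamini_hochberg([pair_pvalues[p] for p in pairs])
    return [(a, b, q) for (a, b), q in zip(pairs, adjusted) if q <= alpha]


# Hypothetical raw p-values for three feature pairs (not real TCGA results).
raw = {
    ("BRCA1_mRNA", "BRCA1_methylation"): 0.001,
    ("BRCA1_mRNA", "age"): 0.04,
    ("BRCA1_mRNA", "tumor_grade"): 0.60,
}
edges = significant_edges(raw)
```

Each tuple in `edges` is one pair of connected nodes plus its corrected p-value. Pairs that do not survive correction produce no edge.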

Why do we need tools that explore cancer data?

  • Quickly ask a question of the data
  • Lower the barrier to the data
  • Provide supporting information and context
  • Distribute the effort across the research community

Multivariate Analysis

Many multivariate methods are used in cancer research. They fall broadly into three types of approach:

  1. Statistical
  2. Information Theoretic
  3. Machine Learning

Colorectal Cancer Aggressiveness

A combined p-value approach to identifying molecular features associated with tumor aggressiveness.
Vesteinn Thorsson

Clinical variables contributing to CRC tumor aggressiveness

Tumor aggressiveness as a composite of six p-value associations

Fisher’s Product Method for combining statistical tests (Fisher, 1948)

Follows a χ² distribution with 2 × 6 = 12 degrees of freedom
Weights w_i used to equalize contributions of the 6 clinical variables
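
A minimal sketch of Fisher's method, assuming the standard statistic X = -2 Σ ln p_i over k independent tests, compared against a χ² distribution with 2k degrees of freedom. The exact weighting scheme from the slide is not recoverable, so the optional `weights` argument (giving X = -2 Σ w_i ln p_i) is an assumption, and the χ² reference is exact only in the unweighted case. The function names are illustrative.

```python
import math


def chi2_sf_even_df(x, df):
    """Survival function of a chi-squared variable with even df (closed form)."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum (x/2)^j / j! for j = 0 .. df/2 - 1.
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_combined(pvalues, weights=None):
    """Combine p-values in (0, 1]; returns (statistic, combined p-value)."""
    if weights is None:
        weights = [1.0] * len(pvalues)
    statistic = -2.0 * sum(w * math.log(p) for w, p in zip(weights, pvalues))
    return statistic, chi2_sf_even_df(statistic, 2 * len(pvalues))


# Hypothetical p-values for one molecular feature against six clinical variables.
six_pvalues = [0.03, 0.20, 0.01, 0.50, 0.08, 0.15]
statistic, combined_p = fisher_combined(six_pvalues)
```

With six p-values the reference distribution has 2 × 6 = 12 degrees of freedom, matching the slide.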

Random Forest

Decision Tree Ensemble Learning


RF-ACE: Random Forest with Artificial Contrast Ensembles
developed by Timo Erkkila
http://rf-ace.googlecode.com



Breiman, Leo. "Random Forests." Machine Learning 45.1 (2001): 5-32.
Tuv, Eugene, et al. "Feature selection with ensembles, artificial variables, and redundancy elimination." The Journal of Machine Learning Research 10 (2009): 1341-1366.

A Decision Tree


Learning the data for the Histological Feature "HER2 Status"

Randomness - Bootstrapping


Randomness - Bagging

A Forest of Voting Trees
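
The bootstrapping, bagging, and voting steps named on the preceding slides can be sketched as follows. This is a toy illustration, not RF-ACE: each "tree" is a single split on one randomly chosen feature, and the feature values and HER2 labels are made up for the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows, rng):
    """Fit a one-split tree on a randomly chosen feature.

    Each row is (features, label). The split threshold is the feature mean,
    and each side predicts its majority label.
    """
    f = rng.randrange(len(rows[0][0]))
    threshold = sum(x[f] for x, _ in rows) / len(rows)
    left = Counter(y for x, y in rows if x[f] <= threshold)
    right = Counter(y for x, y in rows if x[f] > threshold)
    overall = Counter(y for _, y in rows).most_common(1)[0][0]
    left_label = left.most_common(1)[0][0] if left else overall
    right_label = right.most_common(1)[0][0] if right else overall
    return f, threshold, left_label, right_label


def predict_stump(stump, x):
    f, threshold, left_label, right_label = stump
    return left_label if x[f] <= threshold else right_label


def fit_forest(rows, n_trees=25, seed=0):
    """Bagging: train every tree on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng), rng) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Majority vote across all trees."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical (expression, copy-number) values with a HER2 status label.
training = [
    ((1.0, 2.0), "HER2-"), ((1.2, 1.9), "HER2-"), ((0.8, 2.1), "HER2-"),
    ((5.0, 6.0), "HER2+"), ((5.5, 5.8), "HER2+"), ((4.8, 6.2), "HER2+"),
]
forest = fit_forest(training)
prediction = predict_forest(forest, (5.1, 6.0))
```

A real random forest grows full trees and samples candidate features at every split. The single split per tree here only keeps the example short.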

Multi-scale Association Explorer

A tool for exploring associations among genomic and non-genomic features.

Features for Exploring the Data

  • Edges Between Genomic Features
  • Edges With Non-Genomic Features
  • Managing Scale
  • Analytical Insight and Oversight
    (scatterplot, violinplot, cubbyhole)


Go to the tool!

PubCrawl

Incorporating Semantic and Interaction Associations
Andrea Eakin, Brady Bernard

Normalized Google Distance

A measure of semantic similarity.
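
A minimal sketch of the Normalized Google Distance formula, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f counts documents containing a term and N is the size of the indexed corpus. The function name and the counts below are hypothetical, and the source of the counts (for example, a literature search index) is an assumption.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts: fx and fy for each term, fxy for both, n documents."""
    if fxy == 0:
        # Terms that never co-occur are treated as infinitely distant.
        return math.inf
    log_fx, log_fy = math.log(fx), math.log(fy)
    numerator = max(log_fx, log_fy) - math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Hypothetical counts: 1000 and 100 single-term hits, 10 joint hits, 1e6 documents.
distance = normalized_google_distance(1000, 100, 10, 10**6)
```

Values near 0 mean the two terms almost always appear together. Larger values mean they rarely co-occur.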

Protein Domain Interactions

Raghavachari, Balaji, et al. "DOMINE: a database of protein domain interactions." Nucleic acids research 36.suppl 1 (2008): D656-D661.

Semantic and Domain Associations for MYC combined with GBM statistical and multivariate analysis.

Challenges in (Interactive) Cancer Visualization

There are many challenges left!

Associations

Problems:
  • Comparison and grouping of multiple cancer types (and subtypes) across all types of features
  • Inclusion of functional, semantic, analytical and physical associations
  • Highly connected (and confusing) networks

Possible Solutions:
  • Multigraphs
  • Managed layout
  • Automatic grouping based on scale
  • Topological Data Analysis (identify the simplicial complexes)


Lum, P. Y., et al. "Extracting insights from the shape of complex data using topology." Scientific Reports 3 (2013).

Scale

Problems:
  • Tying together information at many scales
    Protein -> Pathway -> Hallmark -> Tumor -> Patient
  • Portraying feature data per patient with thousands of samples across multiple cancer types
  • A seemingly endless number of interesting results

Possible Solutions:
  • ???
  • Collaborative tools that enable scientific effort to be distributed.

Interpretation

How best to convey the information to the cancer biologist?

Which visual abstractions are most useful to reason on?

When is an exploratory tool called for? When is it not called for?

Scared?


Don't be

High-Throughput Computation


600,000 cores running Random Forest on Google Compute Engine

The Center for Systems Analysis of the Cancer Regulome

Ilya Shmulevich (ISB) + Wei Zhang (MD Anderson Cancer Center)

www.cancerregulome.org

Projects @ ISB

Regulome Explorer
explorer.cancerregulome.org

Pubcrawl
explorer.cancerregulome.org/pubcrawl/

Genespot
www.genespot.org

Transcriptional Regulation & Epigenetic Landscape
trel.systemsbiology.net


Search 'codefor@systemsbiology.org' at code.google.com

Acknowledgements

 
Ilya Shmulevich, Wei Zhang

Andrea Eakin, Hector Rovira, Da Yang
Jake Lin, Ryan Bressler, Yuexin Lin
Brady Bernard, Timo Erkkila, Yan Sun
Sheila Reynolds, Vesteinn Thorsson
Lisa Iype, Kalle Leinonen
Patrick May, Lesley Wilkerson




dick.kreisberg@systemsbiology.org

Any questions?

Resources

d3.js d3js.org
Science.js github.com/jasondavies/science.js/
CytoscapeWeb cytoscapeweb.cytoscape.org/
Cytoscape.js cytoscape.github.com/cytoscape.js/
Circos circos.ca

Reveal.JS

lab.hakim.se/reveal-js/

JSFiddle

jsfiddle.net