UW Information Technology

December 23, 2015

__cloud content__ Case Studies


Case Study: Genomics Imputation

Timothy Durham is a postdoc in Bill Noble’s genomics lab. He is imputing data that relate human cell types and proteins to positions in the human genome. The computation parallelizes well within each iteration; generally 50 iterations are required to ‘more or less’ approach convergence, but around 350 iterations are needed for good asymptotic behavior.

Tim’s first implementation was not cloud-based, using the Hadoop architecture on a single machine: 120 minutes per iteration.

Second implementation: Hadoop on a Microsoft Azure HDInsight cluster. 40 minutes per iteration.

Third implementation: mapper/reducer work rebalanced, and the algorithm improved by sorting map outputs. 13 minutes per iteration.

Fourth implementation: Apache Spark on AWS. 15 seconds per iteration.

That is a speed-up factor of roughly 480 (120 minutes = 7,200 seconds, versus 15 seconds).


As of October 2015, the Microsoft Azure HDInsight implementation of Apache Spark exhibited an RDD persistence bug, which prompted Tim to try AWS as an alternative cloud platform.

Tim is using one m3.xlarge head node and five m3.xlarge worker nodes (4 cores each) to analyze 1.8 million base pairs, roughly 0.06% of the total genome.
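A cluster of this shape can be requested with the AWS CLI. The sketch below is illustrative, not Tim’s actual command: it assumes EMR’s pre-built Spark application and default IAM roles, and the cluster name and release label are assumptions.

```shell
# Illustrative sketch: request 1 master + 5 core m3.xlarge nodes
# with EMR's pre-built Apache Spark (release label is an assumption;
# EMR 4.x was current in late 2015).
aws emr create-cluster \
  --name "imputation-test" \
  --release-label emr-4.2.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 6 \
  --use-default-roles
```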

A trial against 30 million base pairs (1% of the genome) required 12.4 hours to run at a cost of $25.

The cost works out as 6 machines × 12.4 hours × ($0.266 + $0.07) per machine-hour ≈ $25. The second rate is an Elastic MapReduce (EMR) surcharge in addition to the base per-node cost, i.e. the cost of using a pre-built Apache Spark framework.
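The arithmetic above can be checked directly, using the rates and hours quoted in the text:

```python
# Cost check for the 1%-genome trial: 6 m3.xlarge nodes for 12.4 hours.
nodes = 6
hours = 12.4
ec2_rate = 0.266   # base m3.xlarge price per node-hour (USD)
emr_rate = 0.07    # EMR surcharge per node-hour (USD)

cost = nodes * hours * (ec2_rate + emr_rate)
print(round(cost, 2))  # → 25.0
```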

The implication is that the entire genome could be processed in under a day by scaling out, at a cost of perhaps $2,500 (the $25 trial covered 1% of the genome). This is certainly ‘to be demonstrated’, and some preliminary work is still needed prior to the attempt. On Tim’s starting-point processor, this job would have required 87 years.
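A rough sketch of the extrapolation, using only figures quoted above. The linear-scaling assumption is mine; it yields the $2,500 estimate and indicates how far the cluster would need to scale out to finish in under a day.

```python
# Extrapolating the 1%-genome trial (12.4 h, $25 on 6 nodes) to the
# full genome, assuming near-linear scaling (an assumption, not measured).
trial_hours, trial_cost, trial_nodes = 12.4, 25.0, 6
scale = 100  # 1% of the genome -> 100%

full_cost = trial_cost * scale            # total cost, USD
hours_same_cluster = trial_hours * scale  # wall-clock hours on the same 6 nodes

# To finish in under 24 hours at the same total node-hours,
# roughly this many nodes would be needed:
nodes_needed = trial_nodes * hours_same_cluster / 24
print(full_cost, round(hours_same_cluster), round(nodes_needed))  # → 2500.0 1240 310
```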

Open Question: How do Spot Market task nodes differ from core nodes in a Spark implementation? We know that task nodes do not contribute their disks to HDFS, which in turn taxes the core nodes and produced a crash in one test. Is there a fundamental incompatibility between Spark and Spot instances?


Case Study: Aral DIF

Case Study: Biogeochem Data System

Case Study: LiveOcean

Case Study: Ice2Ocean