Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure

From AcaWiki

Citation: Shantenu Jha, Daniel S. Katz, Andre Luckow, Neil Chue Hong, Omer Rana, Yogesh Simmhan (2017/02/02) Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure. Concurrency and Computation: Practice and Experience
DOI (original publisher): 10.1002/cpe.4032
arXiv (preprint): arXiv:1609.03647
Semantic Scholar (metadata): 10.1002/cpe.4032
Sci-Hub (fulltext): 10.1002/cpe.4032
Internet Archive Scholar (search for fulltext): Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure
Download: https://onlinelibrary.wiley.com/doi/10.1002/cpe.4032
Tagged: Computer Science

Summary

  • Traditional application := a program written and run by one group to answer a scientific question.
  • Infrastructure application := a program organized in multiple stages that are run by different groups.
  • Big data
  • Distributed := data resides in different physical or logical locations. This could be because the data comes from different sensors, because it is too big to be processed by a single node in a timely manner, because redundancy and load-balancing give more reliability, or for privacy or policy reasons.
    • Replicated
    • Partitioned
    • Streaming
  • Dynamic := an application with spatiotemporal variability.
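
The three distribution modes listed above (replicated, partitioned, streaming) can be illustrated with a toy sketch. The summary does not define them, so the interpretations below are assumptions: replication gives every node a full copy, partitioning assigns each record to exactly one node, and streaming delivers records in arrival order in fixed-size windows. The function names, node names, and the round-robin policy are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def replicate(records: List[int], nodes: List[str]) -> Dict[str, List[int]]:
    """Assumed meaning of 'replicated': every node holds a full copy."""
    return {node: list(records) for node in nodes}


def partition(records: List[int], nodes: List[str]) -> Dict[str, List[int]]:
    """Assumed meaning of 'partitioned': each record lives on exactly one node.

    Round-robin assignment is an arbitrary illustrative policy.
    """
    shards: Dict[str, List[int]] = {node: [] for node in nodes}
    for i, record in enumerate(records):
        shards[nodes[i % len(nodes)]].append(record)
    return shards


def stream(records: Iterable[int], window: int) -> Iterator[List[int]]:
    """Assumed meaning of 'streaming': consume records as they arrive,
    handing them on in windows of at most `window` items."""
    batch: List[int] = []
    for record in records:
        batch.append(record)
        if len(batch) == window:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, window
        yield batch


records = list(range(6))
nodes = ["site_a", "site_b", "site_c"]  # hypothetical site names

replicas = replicate(records, nodes)
shards = partition(records, nodes)
windows = list(stream(records, window=4))
```

Replication trades storage for reliability and load-balancing, partitioning lets a dataset exceed the capacity of any single node, and streaming fits data that is produced continuously, for example by sensors.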

Examples

  • Next Generation Sequencing (NGS) := map/align short reads to a reference genome.
    • Application type: traditional
    • Data: terabyte-scale data of DNA sequences
    • Distribution: the problem can be distributed, but it is unclear how to get optimal performance. Few workflow systems natively manage distribution.
    • Dynamic: the data itself is not dynamic, but properties of the running program are (e.g., when tasks complete).
  • ATLAS := Analyze experimental physics data (pleasingly parallel)
    • Application type: infrastructure; data generation and processing are controlled by different people. Scientists submit requests to run certain analyses on the data.
    • Data: 20Tb per day of serialized C++ objects
    • Distribution: 250,000 cores over 140 sites.
    • Dynamic: data streams in continuously, and the application is run 2 or 3 times per year.
  • Large Synoptic Survey Telescope (LSST) := find and study moving objects using a telescope.
    • Application type: infrastructure; the data gets used by others downstream.
    • Data: tens of TB per day of FITS images
    • Distribution: talks to other telescopes, compute resources, and storage resources
    • Dynamic: Data streams in. The system has to decide whether or not to interrupt its existing observing program to get another look at an anomalous object.
  • SOA Astronomy := uses the data from LSST
    • Application type: infrastructure
    • Data: 1Gb images
    • Distribution: Data exists on different servers and is processed in a distributed cluster.
    • Dynamic: Source data is constantly in flux.
  • ... others
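
Each example above is characterized along the same four axes (application type, data, distribution, dynamism). As a minimal sketch, the characterization can be written as a record type so that examples can be filtered and compared. The class name `D3Profile`, its field names, and the helper `names_by_type` are hypothetical; the field values are copied from the list above, abridged in places.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class D3Profile:
    """One application described along the four axes used in the summary."""
    name: str
    application_type: str  # "traditional" or "infrastructure"
    data: str
    distribution: str
    dynamism: str


profiles = [
    D3Profile(
        name="NGS",
        application_type="traditional",
        data="terabyte-scale DNA sequences",
        distribution="can be distributed; optimal performance unclear",
        dynamism="data is static; properties of the running program vary",
    ),
    D3Profile(
        name="ATLAS",
        application_type="infrastructure",
        data="20Tb per day of serialized C++ objects",
        distribution="250,000 cores over 140 sites",
        dynamism="data streams in continuously",
    ),
    D3Profile(
        name="LSST",
        application_type="infrastructure",
        data="tens of TB per day of FITS images",
        distribution="other telescopes, compute and storage resources",
        dynamism="data streams in; may interrupt the observing program",
    ),
    D3Profile(
        name="SOA Astronomy",
        application_type="infrastructure",
        data="1Gb images",
        distribution="different servers; processed in a distributed cluster",
        dynamism="source data is constantly in flux",
    ),
]


def names_by_type(items: List[D3Profile], application_type: str) -> List[str]:
    """Return the names of all profiles with the given application type."""
    return [p.name for p in items if p.application_type == application_type]


infrastructure_apps = names_by_type(profiles, "infrastructure")
traditional_apps = names_by_type(profiles, "traditional")
```

Of the four examples summarized here, only NGS is a traditional application; the other three are infrastructure applications whose data is produced and consumed by different groups.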