Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure
Citation: Shantenu Jha, Daniel S. Katz, Andre Luckow, Neil Chue Hong, Omer Rana, Yogesh Simmhan (2017/02/02) Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure. Concurrency and Computation: Practice and Experience
DOI (original publisher): 10.1002/cpe.4032
arXiv (preprint): arXiv:1609.03647
Download: https://onlinelibrary.wiley.com/doi/10.1002/cpe.4032
Tagged: Computer Science
Summary
- Traditional application := a program written and run by a single group to answer a scientific question.
- Infrastructure application := a program organized in multiple stages, with different stages run by different groups.
- Big data
- Distributed := the data resides in different physical or logical locations. This could be because the data comes from different sensors, because it is too big for a single node to process in a timely manner, because redundancy and load balancing give more reliability, or for privacy or policy reasons.
  - Replicated
  - Partitioning
  - Streaming
- Dynamic := an application with spatiotemporal variability.
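The summary lists replication, partitioning, and streaming without defining them. The sketch below illustrates the common meanings of the three terms on a toy list of records; the function names, the round-robin partitioning rule, and the sample readings are all assumptions made for illustration and do not come from the paper.

```python
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def replicate(records: Sequence[T], n_nodes: int) -> List[List[T]]:
    """Replication: every node holds a full copy of the data."""
    return [list(records) for _ in range(n_nodes)]


def partition(records: Sequence[T], n_nodes: int) -> List[List[T]]:
    """Partitioning: each record lives on exactly one node (round-robin here)."""
    return [list(records[i::n_nodes]) for i in range(n_nodes)]


def stream(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Streaming: records arrive over time and are passed on in small batches."""
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch


# Hypothetical sensor readings, used only to exercise the three helpers.
readings = ["r0", "r1", "r2", "r3", "r4"]
replicas = replicate(readings, n_nodes=2)
shards = partition(readings, n_nodes=2)
batches = list(stream(readings, batch_size=2))
```

Replication trades storage for redundancy, partitioning spreads data that is too large for one node, and streaming fits data that keeps arriving, which roughly matches the reasons for distribution given in the list above.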
Examples
- Next Generation Sequencing (NGS) := map/align short reads to a reference genome.
  - Application type: traditional
  - Data: terabyte-scale DNA sequence data
  - Distribution: the problem can be distributed, but it is unclear how to get optimal performance. Few workflow systems natively manage distribution.
  - Dynamic: the data itself is not dynamic, but properties of the running program are (for example, when tasks complete).
- ATLAS := analyze experimental physics data (pleasingly parallel)
  - Application type: infrastructure; data generation and processing are controlled by different people. Scientists submit requests to run certain analyses on the data.
  - Data: 20Tb per day of serialized C++ objects
  - Distribution: 250,000 cores over 140 sites.
  - Dynamic: data streams in continuously, and the application is run 2 or 3 times per year.
- Large Synoptic Survey Telescope (LSST) := find and study moving objects using a telescope.
  - Application type: infrastructure; the data gets used by others downstream.
  - Data: tens of TB per day of FITS images
  - Distribution: talks to other telescopes, compute resources, and storage resources
  - Dynamic: data streams in. The system has to decide whether to interrupt its existing observing program to get another look at an anomalous object.
- SOA Astronomy := uses the data from LSST
  - Application type: infrastructure
  - Data: 1Gb images
  - Distribution: data exists on different servers and is processed in a distributed cluster.
  - Dynamic: source data is constantly in flux.
- ... others
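Each example above is described along the same four axes (application type, data, distribution, dynamic), so the list can be treated as a small table of records. The following is a minimal sketch of that idea; the class name `D3Profile`, the helper `by_type`, and the shortened field texts are assumptions made for illustration, and every value is copied from the list above rather than measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class D3Profile:
    """One example application described along the summary's four axes."""
    name: str
    application_type: str  # "traditional" or "infrastructure"
    data: str
    distribution: str
    dynamic: str


# Values are paraphrased from the example list; nothing here is new data.
PROFILES = [
    D3Profile("NGS", "traditional",
              "terabyte-scale DNA sequences",
              "distributable, optimal performance unclear",
              "task completion varies at run time"),
    D3Profile("ATLAS", "infrastructure",
              "20Tb per day of serialized C++ objects",
              "250,000 cores over 140 sites",
              "data streams in continuously"),
    D3Profile("LSST", "infrastructure",
              "tens of TB per day of FITS images",
              "other telescopes, compute and storage resources",
              "data streams in; observing program may be interrupted"),
    D3Profile("SOA Astronomy", "infrastructure",
              "1Gb images",
              "data on different servers, processed in a cluster",
              "source data constantly in flux"),
]


def by_type(profiles: List[D3Profile], application_type: str) -> List[str]:
    """Return the names of all profiles with the given application type."""
    return [p.name for p in profiles if p.application_type == application_type]


traditional = by_type(PROFILES, "traditional")
infrastructure = by_type(PROFILES, "infrastructure")

# Back-of-the-envelope figure from the ATLAS entry: average cores per site.
atlas_cores_per_site = 250_000 / 140
```

Keeping the four axes as explicit fields makes it easy to compare applications side by side. The ATLAS figure, for instance, works out to an average of roughly 1,800 cores per site, though the list does not say how evenly the cores are actually spread.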