Mining Development Data to Understand and Improve Software Engineering Processes in HPC Projects
Citation: Boyana Norris (2021/07) Mining Development Data to Understand and Improve Software Engineering Processes in HPC Projects. IDEAS-ECP Webinar
Download: http://ideas-productivity.org/wordpress/wp-content/uploads/2021/07/hpcbp054-miningdevdata.pdf
Tagged: Computer Science, computational science, high-performance computing
See also
- What Predicts Software Developers' Productivity?
- 20 Patterns to Watch for in Your Engineering Team (Pluralsight)
- Interactive notebook example
Data mining
Data sources
- Git metadata: commits, forks, branches, developers
- Issues and associated discussions
- Pull requests (GitHub, GitLab) and associated discussions
- Mailing list archives
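As a minimal sketch of collecting the Git metadata listed above, the snippet below parses text produced by `git log --numstat` into per-commit records. The `--pretty` format string, the `@@` marker, the `Commit` record, and the `parse_git_log` helper are illustrative assumptions, not something taken from the webinar; the sample text stands in for real repository output.

```python
from dataclasses import dataclass, field

# Assumed invocation (run inside a repository):
#   git log --numstat --date=iso-strict --pretty=format:'@@%H|%an|%ad'
# The '@@' prefix is an arbitrary marker chosen here to spot header lines.

@dataclass
class Commit:
    sha: str
    author: str
    date: str
    # file path -> (lines added, lines deleted)
    files: dict = field(default_factory=dict)

def _to_int(value):
    # --numstat reports '-' for binary files; count those as 0 lines.
    return int(value) if value.isdigit() else 0

def parse_git_log(text):
    commits = []
    for line in text.splitlines():
        if line.startswith("@@"):
            sha, rest = line[2:].split("|", 1)
            # Split from the right so a '|' inside an author name survives.
            author, date = rest.rsplit("|", 1)
            commits.append(Commit(sha, author, date))
        elif line.strip() and commits:
            added, deleted, path = line.split("\t", 2)
            commits[-1].files[path] = (_to_int(added), _to_int(deleted))
    return commits

# Hypothetical sample output, used instead of calling git.
sample = (
    "@@a1b2c3|Alice|2021-07-01T10:00:00+00:00\n"
    "10\t2\tsrc/solver.c\n"
    "-\t-\tdocs/logo.png\n"
    "\n"
    "@@d4e5f6|Bob|2021-07-02T23:30:00+00:00\n"
    "3\t3\tsrc/solver.c\n"
)
commits = parse_git_log(sample)
for c in commits:
    print(c.sha, c.author, c.date, c.files)
```

Issues, pull requests, and mailing lists would need their own collectors (for example through the hosting service's API); they are left out of this sketch.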
Aggregates
- Bug-fix rate
- Feature-request rate
- Number of issues
- Issue categories
- For each issue, number of followers and watchers
- Number of contributors
- Code complexity
- Proportion of commits by most active developer
- Churn (LoC, cosine distance, commits, PRs, versions, files)
- Group developers into sub-teams
- Timestamp of commits
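A few of the aggregates above can be sketched directly over commit records. The example below computes the proportion of commits by the most active developer, LoC churn per file (taken here as lines added plus lines deleted, which is one possible definition), and a histogram of commit hours. The record layout, the sample data, and the function names are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical commit records: (author, ISO timestamp, {path: (added, deleted)})
commits = [
    ("Alice", "2021-07-01T10:00:00+00:00", {"src/solver.c": (10, 2)}),
    ("Alice", "2021-07-02T11:00:00+00:00", {"src/solver.c": (5, 5), "src/io.c": (1, 0)}),
    ("Bob", "2021-07-02T23:30:00+00:00", {"src/io.c": (3, 3)}),
    ("Alice", "2021-07-03T09:15:00+00:00", {"README": (2, 0)}),
]

def top_committer_share(commits):
    """Return (author, fraction of all commits) for the most active developer."""
    counts = Counter(author for author, _, _ in commits)
    author, n = counts.most_common(1)[0]
    return author, n / len(commits)

def loc_churn_by_file(commits):
    """Sum lines added + deleted per file across all commits."""
    churn = Counter()
    for _, _, files in commits:
        for path, (added, deleted) in files.items():
            churn[path] += added + deleted
    return churn

def commits_by_hour(commits):
    """Count commits per hour of day, using the hour recorded in the timestamp."""
    return Counter(datetime.fromisoformat(ts).hour for _, ts, _ in commits)

print(top_committer_share(commits))
print(loc_churn_by_file(commits).most_common())
print(sorted(commits_by_hour(commits).items()))
```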
Example queries
- Identify domain champions (many changes over a small number of files)
- Identify areas and people with high churn
- Identify when someone is at risk of burning out
- Impact of change estimates
- How do projects weather interesting times?
- Where is development effort going?
- Does mood affect productivity?
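The first query above, finding domain champions, can be sketched as a filter over (author, file) change pairs: an author qualifies when they have made many changes spread over only a few distinct files. The thresholds `min_changes` and `max_files`, the sample data, and the function name are assumptions chosen for illustration; a real analysis would need to tune them per project.

```python
from collections import defaultdict

# Hypothetical (author, path) pairs, one per file touched by a commit.
changes = [
    ("Alice", "src/solver.c"), ("Alice", "src/solver.c"), ("Alice", "src/solver.c"),
    ("Alice", "src/precond.c"), ("Alice", "src/solver.c"),
    ("Bob", "src/io.c"), ("Bob", "docs/index.md"), ("Bob", "src/mesh.c"),
    ("Bob", "src/solver.c"), ("Bob", "tests/t1.c"),
]

def domain_champions(changes, min_changes=5, max_files=2):
    """Map each champion to the sorted list of files they concentrate on."""
    total = defaultdict(int)   # author -> number of file changes
    files = defaultdict(set)   # author -> distinct files touched
    for author, path in changes:
        total[author] += 1
        files[author].add(path)
    return {
        author: sorted(files[author])
        for author in total
        if total[author] >= min_changes and len(files[author]) <= max_files
    }

print(domain_champions(changes))
```

Here Alice and Bob both have five changes, but only Alice's are concentrated in two files, so only she is reported.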
Program analysis
Examples
- Security: Buffer overruns, improperly validated input.
- Memory safety: Null dereference, uninitialized data.
- Resource leaks: Memory, OS resources.
- API Protocols: improper use of APIs, incomplete/incorrect implementations
- Exceptions: Arithmetic/library/user-defined
- Encapsulation: Accessing internal data, calling private functions.
- Data races: Two threads access the same data without synchronization
Tools
- clang-tidy
- Clang Static Analyzer (clang --analyze)
- scan-build (wraps the Clang Static Analyzer)
- flang (Fortran to LLVM)
- fortran-linter
Workflow
- clang-format
- clang-tidy
- Clang Static Analyzer (clang --analyze)
- xSDK specific analysis
- compilation
- tests × {Valgrind, ASan, MSan, TSan, UBSan}
  - I think Valgrind is overkill if you already have ASan and MSan
- code coverage
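The "tests × tools" step in the workflow above amounts to a job matrix: every test is built and run once per dynamic-analysis tool. The sketch below only generates the build and run commands and executes nothing. The `-fsanitize=...` values and Valgrind's `--error-exitcode` option are real; the test names, the one-C-file-per-test layout, and the `build_matrix` helper are assumptions for illustration.

```python
# Sanitizer name -> compiler flag. MSan is available in Clang but not GCC.
SANITIZER_FLAGS = {
    "ASan": "-fsanitize=address",
    "MSan": "-fsanitize=memory",
    "TSan": "-fsanitize=thread",
    "UBSan": "-fsanitize=undefined",
}

def build_matrix(tests, compiler="clang"):
    """Return one job per (test, tool) pair; nothing is executed here."""
    jobs = []
    for test in tests:
        source = f"{test}.c"  # hypothetical layout: one C file per test
        # Valgrind instruments an ordinary binary at run time,
        # so it needs no special build flag.
        jobs.append({
            "test": test,
            "tool": "Valgrind",
            "build": [compiler, "-g", source, "-o", test],
            "run": ["valgrind", "--error-exitcode=1", f"./{test}"],
        })
        # Each sanitizer needs its own instrumented build.
        for tool, flag in SANITIZER_FLAGS.items():
            binary = f"{test}-{tool.lower()}"
            jobs.append({
                "test": test,
                "tool": tool,
                "build": [compiler, "-g", flag, source, "-o", binary],
                "run": [f"./{binary}"],
            })
    return jobs

jobs = build_matrix(["test_solver", "test_io"])
for job in jobs:
    print(job["test"], job["tool"], " ".join(job["build"]))
```

Separate builds per sanitizer are used here because ASan, MSan, and TSan cannot be combined in a single binary.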
Goals
- Integrate static and dynamic program analysis into dev process
- Make it easy to follow for others