Change Genealogies
Change dependency graphs

Lehrstuhl für Softwaretechnik (Prof. Zeller)
Universität des Saarlandes – Informatik
Informatik Campus des Saarlandes
Campus E9 1 (CISPA)
66123 Saarbrücken
E-mail: zeller @
Phone: +49 681 302-70970


The Software Evolution project at the Software Engineering Chair, Saarland University, analyzes version and bug databases to predict failure-prone modules, related changes, and future development activities.

Change Genealogies

Change genealogies are graph structures that model dependencies between code changes. With a change genealogy, one can trace which decisions caused which code changes, which provides fundamental information for judging the quality of a change. Change genealogies model dependencies between code changes that were applied at different times and affect different code artifacts; they rely on structural code dependencies that standard mining techniques cannot detect.
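
As an illustration only, the following Python sketch builds a tiny change genealogy from method definitions and usages. The `Change` record, its fields, and the rule "a change depends on the latest earlier change that defined a method it uses" are simplifying assumptions made for this example; they are not the data model or algorithm of the project's own tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """One code change (hypothetical, simplified model)."""
    change_id: str
    timestamp: int                               # any increasing commit order
    defines: set = field(default_factory=set)    # methods added or modified
    uses: set = field(default_factory=set)       # methods called by the changed code


def build_genealogy(changes):
    """Map each change id to the ids of the earlier changes it depends on."""
    last_definer = {}  # method name -> id of the latest change that defined it
    genealogy = {}
    for change in sorted(changes, key=lambda c: c.timestamp):
        # A change depends on whichever earlier change last defined a method it uses.
        genealogy[change.change_id] = {
            last_definer[m] for m in change.uses if m in last_definer
        }
        for method in change.defines:
            last_definer[method] = change.change_id
    return genealogy


# Invented example history, not taken from any studied project.
changes = [
    Change("c1", 1, defines={"Parser.parse"}),
    Change("c2", 2, defines={"Lexer.next"}),
    Change("c3", 3, defines={"Main.run"}, uses={"Parser.parse", "Lexer.next"}),
    Change("c4", 4, defines={"Parser.parse"}),
    Change("c5", 5, defines={"Test.run"}, uses={"Parser.parse"}),
]
genealogy = build_genealogy(changes)
```

In this toy history, `c3` depends on `c1` and `c2`, while `c5` depends on `c4`, because `c4` was the most recent change to `Parser.parse` when `c5` was applied. The dependency spans different points in time and different artifacts, which is the kind of relation a per-file history alone would not show.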


  • Classifying Code Changes and Predicting Defects Using Change Genealogies. K. Herzig, S. Just, A. Rau, A. Zeller. Saarland University, November 2012.
    Abstract. Identifying bug fixes and using them to estimate or even predict software quality is a frequent task when mining version archives. The number of applied bug fixes serves as a code quality metric identifying defect-prone and non-defect-prone code artifacts. But when is a set of applied code changes considered a bug fix, and which metrics should be used to build high-quality defect prediction models? In this work, we make use of change genealogy graphs to define a set of change genealogy network metrics describing the structural dependencies of change sets. We further investigate whether change genealogy metrics can be used to identify bug-fixing change sets (without using commit messages and bug databases) and whether change genealogy metrics are expressive enough to build effective defect prediction models classifying source files as defect-prone or not. The results show that change genealogy metrics can be used to separate bug-fixing from feature-implementing change sets with an average precision of 72% and an average recall of 89%. Our results also show that defect prediction models based on change genealogy metrics can predict defect-prone source files with precision and recall values of up to 80%. On average, the precision for change genealogy models lies at 69% and the average recall at 81%. Compared to prediction models based on code dependency network metrics, change-genealogy-based prediction models achieve better precision and comparable recall values.
  • Mining Cause-Effect-Chains from Version Histories. K. Herzig, A. Zeller. Saarland University, November 2011. Proc. of the 2011 IEEE 22nd International Symposium on Software Reliability Engineering, Pages 60-69, IEEE Computer Society, Washington, DC, USA.
    Abstract. Software reliability is heavily impacted by software changes. How do these changes relate to each other? By analyzing the impacted method definitions and usages, we determine dependencies between changes, resulting in a change genealogy that captures how earlier changes enable and cause later ones. Model checking this genealogy reveals temporal process patterns that encode key features of the software process such as pending development activities: whenever class A is changed, its test case is later updated as well. Such patterns can be validated automatically: in an evaluation of four open source histories, our prototype would recommend pending activities with a precision of 60-72%.
  • Capturing the Long-Term Impact of Changes. K. Herzig. Saarland University, May 2010. Proc. 32nd ACM/IEEE International Conference on Software Engineering, Pages 393-396, ACM, New York, NY, USA, May 2010.
    Abstract. Developers change source code to add new functionality, fix bugs, or refactor their code. Many of these changes have immediate impact on quality or stability. However, some impact of changes may become evident only in the long term. The goal of this thesis is to explore the long-term impact of changes by detecting dependencies between code changes and by measuring their influence on software quality, software maintainability, and development effort. Being able to identify the changes with the greatest long-term impact will strengthen our understanding of a project's history and thus shape future code changes and decisions.
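
The abstracts above mention network metrics computed on change genealogy graphs without listing them here. As a rough sketch, the simplest such metrics are node degrees. The choice of in-degree and out-degree, the function name `degree_metrics`, and the dictionary-based graph representation are assumptions made for this example and are not claimed to match the metric set used in the publications.

```python
def degree_metrics(genealogy):
    """Compute per-change degree metrics.

    genealogy: dict mapping a change id to the set of earlier change ids
    it depends on (an assumed, simplified representation).
    """
    metrics = {
        cid: {"out_degree": len(deps), "in_degree": 0}
        for cid, deps in genealogy.items()
    }
    for deps in genealogy.values():
        for dep in deps:
            # Changes that only appear as dependencies still get an entry.
            entry = metrics.setdefault(dep, {"out_degree": 0, "in_degree": 0})
            entry["in_degree"] += 1
    return metrics


# Invented example graph: c3 builds on c1 and c2, c5 builds on c4.
example = {
    "c1": set(),
    "c2": set(),
    "c3": {"c1", "c2"},
    "c4": set(),
    "c5": {"c4"},
}
metrics = degree_metrics(example)
```

Here `out_degree` counts the earlier changes a change set builds on, and `in_degree` counts the later change sets that build on it. Values like these could serve as features for a classifier, but which features actually separate bug fixes from feature implementations is a question for the cited work, not this sketch.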

Data Sets

Publicly available data sets are stored and distributed in a public Git repository.
To retrieve the data sets, please clone the public (read-only) Git repository or browse the contents of the data repository online.

The repository structure

The data repository contains three main directories:
  • bug_mappings: CSV files containing associations between issue reports and source files or change sets (transactions).
  • transaction_metrics: CSV files containing metric values collected per change set (transaction).
  • file_metrics: CSV files containing metric values collected per source file. The complexity.tar.xz archives contain the complexity diff metrics for all change sets (transactions) per project.
Each of these directories in turn contains a series of subdirectories, one per subject project.
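
Once the repository is cloned, the layout described above can be explored with a few lines of Python. This is a minimal sketch: the three directory names come from the list above, whereas the helper names, the clone path, and the assumptions that CSV files sit directly inside each project subdirectory, are comma-separated, and start with a header row are hypothetical and should be checked against the actual files.

```python
import csv
from pathlib import Path

# Top-level directories named in the repository description.
TOP_LEVEL_DIRS = ("bug_mappings", "transaction_metrics", "file_metrics")


def list_project_csvs(repo_root):
    """Return {top-level directory: {project: [CSV file names]}} for a local clone."""
    root = Path(repo_root)
    layout = {}
    for top in TOP_LEVEL_DIRS:
        top_dir = root / top
        if not top_dir.is_dir():
            continue  # tolerate partial or differently structured clones
        layout[top] = {
            project.name: sorted(p.name for p in project.glob("*.csv"))
            for project in sorted(top_dir.iterdir())
            if project.is_dir()
        }
    return layout


def read_rows(csv_path):
    """Read one CSV file into a list of dicts (assumes a header row)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `list_project_csvs("path/to/clone")` (a placeholder path) would return a nested dictionary from which individual files can be passed to `read_rows`. The complexity.tar.xz archives are not handled here and would need to be unpacked first.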

Data Generation

The data sets were generated using Mozkito, our open-source, general-purpose mining tool.


Last updated: 2017-01-03 21:10