Eclipse Burst Data!
Release 1.2, 2010-03-23

Lehrstuhl für Softwaretechnik (Prof. Zeller)
Universität des Saarlandes – Informatik
Informatik Campus des Saarlandes
Campus E9 1 (CISPA)
66123 Saarbrücken
E-mail: zeller @
Phone: +49 681 302-70970



We have mined the Eclipse version databases to compute so-called change bursts. As our experiments demonstrate, change bursts can be used for defect prediction. The dataset is publicly available for download and use.


There was a mismatch in further column names, which meant that the R script in the paper could not be used without modification. We have changed the column names to match those referenced by the R script in the paper. We thank Nguyen Tri Linh from the University of Bolzano in Italy for pointing out these issues and helping us to fix them. We apologize for any inconvenience caused. We have verified that the results reported in the paper remain valid.
There is a mismatch between the paper and the data set. This mismatch does not affect the data quality but might raise questions about how to use the data set. In the paper, we stated that the data set column holding the number of pre-release defects is named NumberOfDefects. The R script listed in the paper also references this column. The data set made available on this page, however, does not contain a column of this name. Instead, the column is named pre. Sorry for the confusion. Replacing the references to NumberOfDefects with pre resolves the problem. Thanks to Derek M. Jones from Knowledge Software Ltd for reporting this inconsistency. The data set remains unchanged.


Frequently Asked Questions (FAQ)

I cannot find the NumberOfDefects column in the data set

True. There is no such column in the data set. References to NumberOfDefects must be replaced by references to column pre. This holds for the paper itself but also for the R-script referenced in the paper.
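
If you load the CSV files with your own tooling rather than the R script, one way to deal with the rename is to expose pre under the old name as well. The following is a minimal Python sketch, assuming a comma-separated file with a header row. The function name load_with_legacy_name, the file column, and the sample rows are hypothetical and only serve as illustration.

```python
import csv
import io


def load_with_legacy_name(handle, old="NumberOfDefects", new="pre"):
    """Read CSV rows and expose column `new` under the name `old` as well."""
    rows = []
    for row in csv.DictReader(handle):
        if new not in row:
            raise KeyError(f"expected a column named {new!r}")
        row[old] = row[new]  # alias only; the original column is kept
        rows.append(row)
    return rows


# Made-up two-row sample; the real files contain many more metric columns.
sample = io.StringIO("file,pre\nA.java,3\nB.java,0\n")
rows = load_with_legacy_name(sample)
print([r["NumberOfDefects"] for r in rows])
```

Values are returned as strings, as read by the csv module; convert them to integers before fitting a model.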

What is this all about?

We have published data that identifies change bursts for each component of Eclipse. A change burst is a sequence of consecutive changes.
For a more detailed and formal definition of change bursts, we refer to Section 2 of our paper Change Bursts as Defect Predictors. The paper is also included in the zipped files available above.

Figure 1: How gap size and burst size determine change burst detection from a sequence of changes.
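
The interplay of gap size and burst size can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used to produce the data set: it assumes that two consecutive changes belong to the same burst when their time indices differ by at most gap_size, and that a group counts as a burst only if it contains at least burst_size changes. The authoritative definition is the one in Section 2 of the paper. The function name detect_bursts and the sample data are hypothetical.

```python
def detect_bursts(change_times, gap_size, burst_size):
    """Group change times into bursts.

    Assumption: consecutive changes belong to the same burst when their
    time indices differ by at most `gap_size`; a group counts as a burst
    only if it contains at least `burst_size` changes.
    """
    bursts, current = [], []
    for t in sorted(change_times):
        # A gap larger than gap_size closes the current group.
        if current and t - current[-1] > gap_size:
            if len(current) >= burst_size:
                bursts.append(current)
            current = []
        current.append(t)
    if len(current) >= burst_size:
        bursts.append(current)
    return bursts


# Made-up change days for a single component.
days = [1, 2, 3, 7, 10, 11, 12, 13, 20]
print(detect_bursts(days, gap_size=1, burst_size=3))
```

Raising gap_size merges neighbouring groups into longer bursts, while raising burst_size discards the shorter ones.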

What can I do with this data?

In our paper Change Bursts as Defect Predictors we showed that change bursts can be used to build very accurate defect prediction models.

Where do I get these Eclipse versions?

All the versions of Eclipse we analyzed (2.0, 2.1, and 3.0) can be accessed at the Eclipse project archived downloads site.

Why do you share this data?

Finding out where defects come from is a creative effort and is hence better addressed by a community than by individuals. This is why we want to share this data with the research community. To our knowledge, this is the first time change bursts have been used to build defect prediction models.

What is the copyright for this data?

In general, facts are free (as in freedom), and are not copyrightable. Hence, users of this archive can use the factual information contained in the bug data archives without any restriction.

How can I reference this work?

If you publish something based on this data, we would be happy if you could attribute its source. The appropriate citation is our paper, Change Bursts as Defect Predictors:
Nachiappan Nagappan, Andreas Zeller, Thomas Zimmermann, Kim Herzig, and Brendan Murphy: Change Bursts as Defect Predictors. In Proceedings of the 2010 IEEE 21st International Symposium on Software Reliability Engineering (ISSRE '10), pages 309-318.
Here's the citation in BibTeX format:
@inproceedings{Nagappan2010ChangeBursts,
 author = {Nagappan, Nachiappan and Zeller, Andreas and Zimmermann, Thomas and Herzig, Kim and Murphy, Brendan},
 title = {Change Bursts as Defect Predictors},
 booktitle = {Proceedings of the 2010 IEEE 21st International Symposium on Software Reliability Engineering},
 series = {ISSRE '10},
 year = {2010},
 isbn = {978-0-7695-4255-3},
 pages = {309--318},
 numpages = {10},
 url = {},
 doi = {10.1109/ISSRE.2010.25},
 acmid = {1914387},
 publisher = {IEEE Computer Society},
 address = {Washington, DC, USA},
}
We would also be happy to hear about your results using our dataset and to cite your papers on this page.

What's in the Data?

What is in the packages?

Unzip one of the archives. The resulting folder contains three subfolders: hourly, daily, weekly. To compute change bursts, we split the development history into a series of events, at each of which a change may or may not have occurred. The subfolders represent data sets for different definitions of these events:
  • hourly: changes applied within the same hour will be collected. Hours in which no changes have been applied will be ignored.
  • daily: changes applied within the same day will be collected. Days for which no changes have been found will be ignored.
  • weekly: changes applied within the same week will be collected. Weeks for which no changes have been found will be ignored.
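
The three event definitions above can be pictured as bucketing change timestamps by hour, day, or week. The Python sketch below shows the idea under stated assumptions: weeks are taken to be ISO calendar weeks, which the source does not specify, and the function name bucket_changes and the sample timestamps are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def bucket_changes(timestamps, frequency):
    """Group change timestamps into hourly, daily, or weekly slots.

    Only slots that contain at least one change appear in the result,
    so hours, days, or weeks without changes are ignored automatically.
    """
    def key(ts):
        if frequency == "hourly":
            return (ts.year, ts.month, ts.day, ts.hour)
        if frequency == "daily":
            return (ts.year, ts.month, ts.day)
        if frequency == "weekly":
            iso = ts.isocalendar()  # assumption: ISO calendar weeks
            return (iso[0], iso[1])
        raise ValueError(f"unknown frequency: {frequency}")

    slots = defaultdict(list)
    for ts in timestamps:
        slots[key(ts)].append(ts)
    return dict(sorted(slots.items()))


# Made-up commit times.
changes = [
    datetime(2004, 3, 1, 9, 15),
    datetime(2004, 3, 1, 9, 40),
    datetime(2004, 3, 1, 14, 5),
    datetime(2004, 3, 9, 11, 0),
]
for freq in ("hourly", "daily", "weekly"):
    print(freq, len(bucket_changes(changes, freq)))
```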
Within each of these frequency folders, you find sub-folders determining the granularity level you want to work on: classes or packages.

On the next level you can choose the Eclipse versions we investigated: Eclipse 2.0, Eclipse 2.1 and Eclipse 3.0.

Within each granularity folder you will find the actual change burst data sets. Each filename is of the form


Thus, each filename is specific to an Eclipse version <VERSION>, a gap size <GAP_SIZE>, and a burst size <BURST_SIZE> (for definitions of gap size and burst size, please read our paper). For each Eclipse version, we computed all combinations of gap and burst sizes between 1 and 10. The ZIP package also contains a copy of our paper.
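
To get a feel for the size of the package, the combinations and the folder contents can be enumerated as in the Python sketch below. It assumes only that the data files carry a .csv extension and live somewhere below the hourly, daily, and weekly folders; it makes no assumption about the filename pattern. The folder name burst-data and the function name count_csv_files are hypothetical.

```python
from itertools import product
from pathlib import Path

FREQUENCIES = ("hourly", "daily", "weekly")

# Every (gap size, burst size) pair with both values between 1 and 10.
combinations = list(product(range(1, 11), repeat=2))


def count_csv_files(root):
    """Count the CSV files found anywhere below each frequency folder."""
    root = Path(root)
    return {
        freq: sum(1 for _ in (root / freq).rglob("*.csv"))
        for freq in FREQUENCIES
        if (root / freq).is_dir()
    }


print(len(combinations))
# "burst-data" is a hypothetical name for the unzipped folder.
print(count_csv_files("burst-data"))
```

Frequency folders that do not exist are left out of the result, so the call is safe to run before the archive has been unzipped.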

What is the data format?

The provided CSV files contain change burst metrics for each Java source file of the Eclipse project. Each column in a CSV file represents one metric. A list of the metrics, with descriptions, is given in the paper.


Last updated: 2017-01-03 21:10