Dies ist ein Archiv des alten Softwaretechnik Lehrstuhls der Universität des Saarlandes. Es ist nicht länger aktuell.

Für die aktuelle Arbeit von Andreas Zeller und seiner Gruppe besuchen Sie andreas-zeller.info.
Für den aktuellen Softwaretechnik Lehrstuhl besuchen Sie www.se.cs.uni-saarland.de.

Eclipse Bug Data!
Release 2.0a, 2007-12-28

Lehrstuhl für Softwaretechnik (Prof. Zeller)
Universität des Saarlandes – Informatik
Informatik Campus des Saarlandes
Campus E9 1 (CISPA)
66123 Saarbrücken
E-mail: zeller @ cs.uni-saarland.de
Telefon: +49 681 302-70970

[ News | FAQ | Usage | People ]

We have mined the Eclipse bug and version databases to map failures to Eclipse components. The resulting data set lists the defect density of all Eclipse components. As we demonstrate in three simple experiments, the bug data set can be easily used to relate code, process, and developers to defects. The dataset is publicly available for download and use.

News

2010-03-25

An extension to the bug data sets is available. The change burst set can be found here. The data set is described in detail in our new publication Change Bursts as Defect Predictors.

2009-12-10

A revised version of the PROMISE 2007 paper "Predicting Defects for Eclipse" is now available. It supersedes the original, published version, being based on the revised 2.0a data (see below).

2009-12-09

A R file for all experiments is now available.

2007-12-28

We removed duplicates entries from the dataset that were inserted by a bug in the Perl API. Here's the new data (labelled as Release 2.0a), which extends the original data by complexity metrics and counts from abstract syntax trees.

as ARRF relations:
for file level (promise-2_0a-files-arff.zip, 4.48M) and
for package level (promise-2_0a-packages-arff.zip, 770K)
as comma-separated values:
for file level (promise-2_0a-files-csv.zip, 4.47M) and
for package level (promise-2_0a-packages-csv.zip, 768K)

If you publish material using the extended data (ARFF or CSV format), please cite our PROMISE 2007 paper and follow the acknowledgment guidelines posted on the PROMISE Repository web page.

In addition, we provide data as XML, which does not contain the metrics but references to bug reports and transactions that fixed the bugs (not included in the ARFF ad CSV files).

as XML:
a single file for both, file and package levels (promise-2_0a-xml.zip, 895K)

If you publish material based on the XML format, please cite our ISESE 2006 paper.

2007-12-12

We have fixed the clerical error in data acquisition and now provide a thoroughly revised and validated data set. The fixed version is labelled as Release 2.0.

2007-08-23

Regrettably, due to a clerical error, we have temporarily withdrawn all versions of the Eclipse bug data. Please come back to this page in the first week of December for the fixed files. We apologize for any inconvenience caused.

2006-11-06

Eclipse bug data now contains references to Bugzilla bug reports and CVS transactions.
See also: What's new in Release 1.1?

2006-06-16

Eclipse bug data is now publicly available for download.

Frequently Answered Questions (FAQ)

What is this all about?

We have published data that tells for each component of Eclipse how many defects had to be fixed in the first six months after release. More information on the data is available on our ISESE 2006 paper. The paper is also included in the zipped files available above.

What can I do with this data?

The typical use of this data is to validate hypotheses on the nature and cause of errors as they occur during software development.

Here's a recipe for research based on this data:

Take your favorite hypothesis on where errors come from.
Example: I think that side effects increase the chance of bugs.
Measure the amount of each factor in the Eclipse source code. To obtain the source code, see "Where do I get these Eclipse versions?", below.
Example: In package org.eclipse.core.launcher, 55% of all methods have side effects. (This is a hypothetical result.)
Check whether there is a correlation between the factor you measured and the bug data as you find it here.
Example: The more side-effects a component has, the more defects it has. (Again, this is a hypothetical result.)

One main result of our group so far is that specific imports correlate with the number of defects; in particular, importing "internal" components results in a higher defect rate. But as software developers, we still would like to know more about what makes software fail — in particular, whether there are any domain-independent ways to predict defects.

There are several analysis methods in the field that determine software properties. The question is whether any such properties correlate with software defects. To answer this question, one first needs defect data — and this is what we provide.

Two companion papers, If Your Bug Database Could Talk... and Predicting Defects for Eclipse.pdf, are available which contains all the details, including three simple and one large experiment. The next step is yours.

Where do I get these Eclipse versions?

All the versions of Eclipse we analyzed (2.0, 2.1, and 3.0) can be accessed at the Eclipse project archived downloads site.

Why do you share this data?

Finding out where defects come from is a creative effort, and hence, better addressed by a community rather than individuals. This is why we want to share this data with the research community. To our knowledge, this is the first time such defect data is available from a non-trivial industrial project.

What is the copyright for this data?

In general, facts are free (as in freedom), and are not copyrightable. Hence, users of this archive can use the factual information contained in the bug data archives without any restriction.

The XML representation of the data is copyrighted, though, as are the texts enclosed in the archive. Since the companion paper is submitted for publication, we currently do not allow redistribution of the archive.

How can I reference this work?

If you publish something based on this data, we would be happy if you could attribute its source. Appropriate citation is our PROMISE 2007 paper (acknowledgment guidelines for usage are posted on the PROMISE Repository web page)

Thomas Zimmermann, Rahul Premraj, and Andreas Zeller: Predicting Defects for Eclipse, In Proceedings of the Third International Workshop on Predictor Models in Software Engineering, May 2007.

Here's the citation in BibTeX format:

@inproceedings{zimmermann-promise-2007,
    title = "Predicting Defects for Eclipse",
    author = "Thomas Zimmermann and Rahul Premraj and Andreas Zeller",
    year = "2007",
    month = "May",
    booktitle = "Proceedings of the Third International Workshop on Predictor Models in Software Engineering",
    location = "Minneapolis, MN, USA",
}

Also, we will be happy to hear about your results using our dataset and cite your papers on this page.

I need more data!

We plan to periodically update the dataset to include newer versions of Eclipse. But if you wish to have additional data for the available release, please drop us a note. Our contact information is available at the end of this page.

What's in the Data?

What is in the packages?

Unzip one of the archives. As an example, the resulting folder from the XML archive contains the following files:

If Your Bug Database Could Talk.pdf
Predicting Defects for Eclipse.pdf (note that there is an updated version of this paper!)
eclipse-defects-version-2.0.xml
eclipse-defects-version-2.1.xml
eclipse-defects-version-3.0.xml
README.html (this file)

What is the data format?

The provided XML files contain the defect data collected from the eclipse bug database and version archive and are separated according to the eclipse versions.

The coarse structure of the XML files is described in the companion paper "If Your Bug Database Could Talk...":

Packages are structured hierarchically, which means subpackages are nodes with their super packages as parent node in the tree. The following example describes the tree structure:

            <package name="org.eclipse">
              <counts> ... </counts>
              <compilationunit ...> ... </compilationunit>
              <package name="org.eclipse.core">
                ...
              </package>
            </package>

Here's an extract from the file eclipse-defects-version-2.0.xml:

          <?xml version="1.0" encoding="UTF-8"?>
          <!-- comments -->
          <defects project="eclipse" release="2.0" dataversion="1.0">
          <plug-in name="platform-launcher">
            <counts>
              <count id="pre" value="0" avg="0.0" compilationunits="1" max="0"/>
              <count id="post" value="0" avg="0.0" compilationunits="1" max="0"/>
            </counts>
            <package name="org.eclipse">
              <counts>
                <count id="pre" value="0" avg="0.0" compilationunits="1" max="0"/>
                <count id="post" value="0" avg="0.0" compilationunits="1" max="0"/>
              </counts>
              <package name="org.eclipse.core">
                <counts>
                  <count id="pre" value="0" avg="0.0" compilationunits="1" max="0"/>
                  <count id="post" value="0" avg="0.0" compilationunits="1" max="0"/>
                </counts>
                <package name="org.eclipse.core.launcher">
                  <counts>
                    <count id="pre" value="0" avg="0.0" compilationunits="1" max="0"/>
                    <count id="post" value="0" avg="0.0" compilationunits="1" max="0"/>
                  </counts>
                  <compilationunit dir="/platform-launcher/library/" base="Main.java">
                    <counts>
                      <count id="pre" value="0"/>
                      <count id="post" value="0"/>
                    </counts>
                  </compilationunit>
                </package>
              </package>
            </package>
          </plug-in>

Defect counts are listed as count at the plug-in, package and compilationunit levels.
The value field contains the actual number of pre- ("pre") and post-release defects ("post"). The average ("avg") and maximum ("max") values refer to the defects found in the compilation units ("compilationunits").
Each compilation unit is listed separately ("compilationunit") within the enclosing package.
The average ("avg") is the average number of defects per compilation unit.

What's new in Release 1.1?

Compilation units now have <fix> children that reference the Bugzilla bug report (bug_id) and and the CVS revision (revision_id) for a fixed bug. Additionally, we provide the committer author and the log message that was provided (<message>) with the change. The kind field distinguished between pre-release defects ("pre") and post-release defects ("post").

          <fix kind="pre" bug_id="16191" revision_id="1.6" author="mvalenta">
            <message>16191: Sharing project already in repo, picking Base, results in tag</message>
          </fix>
          <fix kind="pre" bug_id="14737" revision_id="1.5" author="mvalenta">
            <message>14737: Add capability to move tags</message>
          </fix>

Why is the defect count for the parent different from the sum over all children?

The number of defects listed for a parent is not necessarily the sum over all children. This is because a single defect can be distributed over several children, but each defect is counted only once per parent. This is illustrated in the following diagram:

This figure exemplifies how the defects were counted in our data. In the figure, Package X comprises two sub-packages X.Y and X.Z. Package X.Y consists of two compilation units; package X.Z has three. There are four defects, each indicated with a different colour. The blue defect affects just one compilation unit in package X.Y, whereas each of the other defects affects two compilation units. In package X.Z, each compilation unit has between one and two defects. The sum over all compilation units is five. However, the total defect count for package X.Z is only three, since three distinct defects occurred overall. For package X.Y, the sum is identical to the overall defect count, because each defect affected just one compilation unit.

How was the data collected?

The data was obtained from the Eclipse bug and version databases; in essence, we automatically determined for each bug report in the bug database the associated fix in the version database and hence could determine for each bug where it was fixed — and likewise, for each component, we could tell the defects that occurred.

The major challenge for this task was to map bug reports from the bug database to compilation units in the version archive. To this end, we used text analysis of the commit messages, we identified fixes in version archives (in contrast to other changes). Typically, fix messages contain links to the bug reports in the bug database by stating a the identification number of a bug report.

Furthermore, we needed to obtain the version of each defect. We did so by using the version field provided by the bug database. Note that the first reported version was used.