Eclipse Bug Data!
Lehrstuhl für Softwaretechnik (Prof. Zeller)
Universität des Saarlandes – Informatik
Informatik Campus des Saarlandes
Campus E9 1 (CISPA)
66123 Saarbrücken
E-mail: zeller @ cs.uni-saarland.de
Phone: +49 681 302-70970
We have mined the Eclipse bug and version databases to map failures to Eclipse components. The resulting data set lists the defect density of all Eclipse components. As we demonstrate in three simple experiments, the bug data set can easily be used to relate code, process, and developers to defects. The dataset is publicly available for download and use.

News
Frequently Answered Questions (FAQ)

What is this all about?
We have published data that tells, for each component of Eclipse, how many defects had to be fixed in the first six months after release. More information on the data is available in our ISESE 2006 paper. The paper is also included in the zipped files available above.

What can I do with this data?
The typical use of this data is to validate hypotheses on the nature and cause of errors as they occur during software development.

Here's a recipe for research based on this data:
One main result of our group so far is that specific imports correlate with the number of defects; in particular, importing "internal" components results in a higher defect rate. But as software developers, we would still like to know more about what makes software fail, and in particular whether there are any domain-independent ways to predict defects. There are several analysis methods in the field that determine software properties. The question is whether any such properties correlate with software defects. To answer this question, one first needs defect data, and this is what we provide. Two companion papers, "If Your Bug Database Could Talk..." and "Predicting Defects for Eclipse", are available; they contain all the details, including three simple experiments and one large experiment. The next step is yours.

Where do I get these Eclipse versions?
All the versions of Eclipse we analyzed (2.0, 2.1, and 3.0) can be accessed at the Eclipse project archived downloads site.

Why do you share this data?
Finding out where defects come from is a creative effort, and hence better addressed by a community than by individuals. This is why we want to share this data with the research community. To our knowledge, this is the first time such defect data has been made available from a non-trivial industrial project.

What is the copyright for this data?
In general, facts are free (as in freedom) and are not copyrightable. Hence, users of this archive can use the factual information contained in the bug data archives without any restriction. The XML representation of the data is copyrighted, though, as are the texts enclosed in the archive. Since the companion paper has been submitted for publication, we currently do not allow redistribution of the archive.

How can I reference this work?
If you publish something based on this data, we would be happy if you could attribute its source.
An appropriate citation is our PROMISE 2007 paper (acknowledgment guidelines for usage are posted on the PROMISE Repository web page):

Thomas Zimmermann, Rahul Premraj, and Andreas Zeller: Predicting Defects for Eclipse. In Proceedings of the Third International Workshop on Predictor Models in Software Engineering, May 2007.

Here's the citation in BibTeX format:

    @inproceedings{zimmermann-promise-2007,
      title = "Predicting Defects for Eclipse",
      author = "Thomas Zimmermann and Rahul Premraj and Andreas Zeller",
      year = "2007",
      month = "May",
      booktitle = "Proceedings of the Third International Workshop on Predictor Models in Software Engineering",
      location = "Minneapolis, MN, USA",
    }

Also, we will be happy to hear about your results using our dataset and cite your papers on this page.

I need more data!
We plan to periodically update the dataset to include newer versions of Eclipse. But if you wish to have additional data for the available releases, please drop us a note. Our contact information is available at the end of this page.

What's in the Data?

What is in the packages?
Unzip one of the archives. As an example, the resulting folder from the XML archive contains the following files:
What is the data format?
The provided XML files contain the defect data collected from the Eclipse bug database and version archive and are separated according to the Eclipse versions.

The coarse structure of the XML files is described in the companion paper "If Your Bug Database Could Talk...":
What's new in Release 1.1?
Compilation units now have <fix> children that reference the Bugzilla bug report (bug_id) and the CVS revision (revision_id) for a fixed bug. Additionally, we provide the author of the commit (author) and the log message that was provided with the change (<message>). The kind attribute distinguishes between pre-release defects ("pre") and post-release defects ("post").

    <fix kind="pre" bug_id="16191" revision_id="1.6" author="mvalenta">
      <message>16191: Sharing project already in repo, picking Base, results in tag</message>
    </fix>
    <fix kind="pre" bug_id="14737" revision_id="1.5" author="mvalenta">
      <message>14737: Add capability to move tags</message>
    </fix>

Why is the defect count for the parent different from the sum over all children?
The number of defects listed for a parent is not necessarily the sum over all children. This is because a single defect can be distributed over several children, but each defect is counted only once per parent. This is illustrated in the following diagram:

This figure exemplifies how the defects were counted in our data. In the figure, Package X comprises two sub-packages, X.Y and X.Z. Package X.Y consists of two compilation units; package X.Z has three. There are four defects, each indicated with a different colour. The blue defect affects just one compilation unit in package X.Y, whereas each of the other defects affects two compilation units. In package X.Z, each compilation unit has either one or two defects. The sum over all compilation units in X.Z is five. However, the total defect count for package X.Z is only three, since three distinct defects occurred overall. For package X.Y, the sum is identical to the overall defect count, because each defect affected just one compilation unit of that package.
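The counting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the tooling used to build the dataset: the compilation unit names and all defect colours other than blue are assumptions, chosen so that the numbers match the description of the figure.

```python
# Hypothetical reconstruction of the figure: for each sub-package, map every
# compilation unit to the set of defects that affected it. Unit names and the
# colours "red", "green", and "yellow" are made up for illustration.
packages = {
    "X.Y": {
        "A.java": {"blue"},
        "B.java": {"red"},
    },
    "X.Z": {
        "C.java": {"red", "green"},
        "D.java": {"green", "yellow"},
        "E.java": {"yellow"},
    },
}


def sum_over_units(units):
    """Add up the per-unit defect counts (a shared defect is counted repeatedly)."""
    return sum(len(defects) for defects in units.values())


def distinct_defects(units):
    """Count each defect once for the package, however many units it touched."""
    return len(set().union(*units.values()))


def distinct_defects_in_parent(subpackages):
    """Count each defect once for the parent, across all sub-packages."""
    all_defects = set()
    for units in subpackages.values():
        for defects in units.values():
            all_defects |= defects
    return len(all_defects)


if __name__ == "__main__":
    for name, units in packages.items():
        print(name, "sum over units:", sum_over_units(units),
              "distinct defects:", distinct_defects(units))
    print("X distinct defects:", distinct_defects_in_parent(packages))
```

With this layout, X.Z has a per-unit sum of five but only three distinct defects, X.Y has two in both cases, and the parent package X has four distinct defects, which is less than the sum of the two sub-package counts.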
How was the data collected?
The data was obtained from the Eclipse bug and version databases. In essence, we automatically determined, for each bug report in the bug database, the associated fix in the version database. Hence we could determine for each bug where it was fixed and, likewise, for each component which defects occurred in it.

The major challenge for this task was to map bug reports from the bug database to compilation units in the version archive. To this end, we used text analysis of the commit messages to identify fixes in the version archive (in contrast to other changes). Typically, fix messages link to bug reports in the bug database by stating the identification number of a bug report. Furthermore, we needed to obtain the version affected by each defect. We did so by using the version field provided by the bug database. Note that the first reported version was used.

Interesting. Where can I learn more about this work?
For more on this work, have a look at our web site: http://www.st.cs.uni-saarland.de/softevo/

Keep me posted
People
Imprint ● Privacy Policy <webmaster@st.cs.uni-saarland.de> · http://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/?lang=fr · Last updated: 2018-04-05 13:41