Back in 1987, when David Aha was still a Ph.D. student in UCI’s Department of Computer Science, he had an idea. “My plan was to provide a location where datasets — and descriptions of them — could be shared with researchers studying supervised learning,” recalls Aha, now the director of the Navy Center for Applied Research in AI (NCARAI) at the Naval Research Laboratory. He started with a small number of datasets gathered by fellow Ph.D. student Jeff Schlimmer and then waited to publicize the repository until it had at least 25 datasets. “Once it caught on,” he says, “it became clear that the collection had to live on with the dedicated help of subsequent UCI student librarians, and they’ve been outstanding.”

Indeed, the collection has lived on, with various faculty and Ph.D. students passing the baton to make sure the collection has remained up and running. By the time the current librarians, Ph.D. students Casey Graff and Dheeru Dua, took over, the UCI Machine Learning Repository had 469 datasets, representing a variety of application domains, from physical and social sciences to business and engineering. This publicly accessible archive has been a tremendous resource for empirical and methodological research in machine learning for decades. In fact, it has had more than 38,000 citations since 1998, rendering it one of the most highly cited “references” across all of computer science.

Yet with the growing number of machine learning (ML) research papers, algorithms and datasets, it is becoming increasingly difficult to track the latest performance numbers for a particular dataset, identify suitable datasets for a given task, or replicate the results of an algorithm run on a particular dataset. To address this issue, Computer Science Professors Sameer Singh and Padhraic Smyth in the Donald Bren School of Information and Computer Sciences (ICS), along with Philip Papadopoulos, Director of UCI’s Research Cyberinfrastructure Center (RCIC), have planned a “next-generation” upgrade. The trio was recently awarded a $1.8 million NSF grant, “Machine Learning Democratization via a Linked, Annotated Repository of Datasets.”

“It is quite an important grant, combining research, computational infrastructure and community outreach for machine learning,” says Singh, the principal investigator. The goal is to enhance the current Repository with rich metadata, links to research papers, and automated extraction and presentation of metadata and performance data. The new version will also provide systematic support for reproducible science by letting users validate empirical ML results on testbed datasets.

“ICS is known for its work in AI and emphasis on ML,” says Smyth, “and I don’t think I am boasting when I say that pretty much everyone in AI, all over the world, knows about the UCI Repository.” As outlined in the grant proposal, the Repository had an estimated 24 million visitors in 2018, with 2 million dataset downloads from 723 unique web addresses and from 119 different countries and territories, ranging from Botswana to Fiji to Greenland. Smyth himself started using the Repository long before he came to UCI. “I was a researcher in machine learning at JPL at the time [it first started] and remember being very happy to find the Repository and be able to download datasets — and documentation — for my research,” he says.

As noted in the grant abstract, the existing Repository “directly impacts tens of thousands of ML researchers and students by providing a standard and widely cited set of testbed datasets to support both research and education.” The proposed improvements will support broader, more systematic and more reproducible evaluations of ML algorithms, leading to robust advances that are better calibrated for success in real-world environments, in areas ranging from climate science to personalized medicine.

“I wish UCI well on continuing to provide this important service,” says Aha, “and encourage its continued growth, not just in the number of datasets, but in broad support of empirical research in ML.”

Shani Murray