UCI ML Repository Highlights Four Impactful Projects at 2022 ML Hackathon
The UCI Machine Learning (ML) Repository hosted the 2022 Machine Learning Hackathon from May 18 to May 29. Throughout the hackathon, participants engaged with members of the UCI ML Repository and its datasets to build creative and meaningful projects. On June 3, hackathon organizers held an awards ceremony to review project submissions and recognize four winning hacks.
Overall Best: Personalizing Recommendations Without User Activity
The UCI ML Repository is home to over 600 datasets. While there are plenty of ways to filter for datasets, how about using ML to do the work for you? Angel Vilchis, a junior computer science major specializing in intelligent systems, built a model that recommends datasets related to a selected dataset. Vilchis says the model’s recommendations are highly accurate and that it benefits all users browsing the UCI ML Repository.
You first select a dataset in the UCI ML Repository you’re interested in. Then, you specify how many related datasets you would like to be recommended. Datasets are recommended according to how similar they are to the selected dataset on three measures: characteristics, context and popularity. You can also customize the model to prefer one similarity measure over the others.
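As a rough illustration of the idea described above, the sketch below ranks candidate datasets by a weighted blend of the three similarity measures. It is not Vilchis's implementation: the `DatasetScores` class, the `recommend` function, the weighting scheme and the example scores are all hypothetical, and the per-measure similarities are assumed to be precomputed values between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class DatasetScores:
    """Hypothetical precomputed similarities to the selected dataset (0 to 1)."""
    name: str
    characteristics: float
    context: float
    popularity: float


def recommend(candidates, k, weights=(1.0, 1.0, 1.0)):
    """Return the names of the k candidates with the highest blended similarity.

    weights gives the relative preference for (characteristics, context,
    popularity); it is normalized so that the three weights sum to 1.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    w_char, w_ctx, w_pop = (w / total for w in weights)

    def blended(c):
        return (w_char * c.characteristics
                + w_ctx * c.context
                + w_pop * c.popularity)

    ranked = sorted(candidates, key=blended, reverse=True)
    return [c.name for c in ranked[:k]]


# Made-up scores for three placeholder datasets.
candidates = [
    DatasetScores("dataset_a", characteristics=0.9, context=0.1, popularity=0.1),
    DatasetScores("dataset_b", characteristics=0.1, context=0.8, popularity=0.1),
    DatasetScores("dataset_c", characteristics=0.5, context=0.5, popularity=0.5),
]

balanced = recommend(candidates, k=2)
by_characteristics = recommend(candidates, k=2, weights=(3.0, 0.0, 0.0))
```

With equal weights, the dataset that is moderately similar on every measure ranks first; shifting all the weight to characteristics moves the dataset with the closest characteristics to the top.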
Overall Runner-Up: SEW.NLP – NLP for Dataset Parsing
Knowing the context in which data was collected is important and can help determine whether it’s suitable for your needs. To better understand datasets, the SEW.NLP team created a question-answering NLP model. They used SciBERT and XLNet models and the QASPER dataset to extract information from scientific papers about datasets in the UCI ML Repository.
SEW.NLP was created by:
- Edoardo Botta – senior economics and computer science major, Università Bocconi
- William Han – senior psychological science major, UCI
- Sanay Talsania – senior business information management major, UCI
Most Creative: UnlimitedMonsterLearning – Automatic Statistician
What does it mean when a dataset is “good”? UnlimitedMonsterLearning strives to answer that question and address related concerns and ethics about the quality of datasets. The team evaluated the quality of datasets using statistical parameters and analyzed the pattern of dataset popularity with respect to a variety of statistical qualities.
UnlimitedMonsterLearning was created by:
- Yiqin Chen – junior business information management and data science major, UCI
- Hao Li – junior mathematics major and statistics minor, UCI
Most Impactful: Team Untitled – Search
The goal of Team Untitled is to improve and refine the process of searching for information in datasets using NLP. Team Untitled combined two topic-modeling techniques, latent Dirichlet allocation and latent semantic analysis, to find the most relevant words in datasets. These words are used to expand search queries and surface the most relevant datasets.
Team Untitled was created by:
- John Lorenzini – freshman computer science major, UCI
- Meera Jagota – freshman computer science major, UCI
- John Daniel Norombaba – freshman software engineering major, UCI
- Neel Ramesh – freshman computer science and engineering major, UCI
— Karen Phan