Computational Science Research Opportunities in Wayne Hayes's Group


Reviews

I've supervised and mentored literally hundreds of young researchers (grads, undergrads, high school students) over the years. It's fun and productive for all --- you, me, other students, and other scientific collaborators. It's hard work, but rewarding. Here's what some have said about working in my research group:

Selection Criteria

To work in my group, you need to demonstrate that you are significantly above average among your peers: undertake one of the following challenges and submit the results to me. If they're not correct... no problem, I give second chances. You're also allowed to ask questions, but try to keep them to a minimum. The primary criterion by which you will be judged is how well you can perform, and then write up, these tasks independently, without much help.

In all cases you need to tell me how much longer you'll be at UCI (i.e., if you're an undergrad senior and this is your last quarter... we need to find something short for you to do--not always easy), and submit a PDF write-up with histograms or figures plotted. I don't need your code. I want to see your write-up, including a description of what you did and why, and your results with commentary. THIS IS JUST AS MUCH A TEST OF YOUR COMMUNICATION SKILLS AS CODING. Doing research requires critical thinking and the ability to explain your rationale for what you did AND WHY. Without that, your code or blind results are worthless.

I have several projects running. You only need to do ONE of the below challenges, depending upon which project you're interested in working on. You need to do the task, and then write it up nicely, with graphs or plots to illustrate your answer. You should be able to do the task within about a week at most, but the answer has to be GOOD. If you hand in a GOOD solution later, that's better than a crappy solution earlier. In other words, a good solution is required, but faster is better than slower.

Let me know if there are any ambiguities, but your job is to do this task with as little supervision from me as is possible.

Unless otherwise specified below, please direct all questions to me at whayes@uci.edu.



The Projects

Quick links:

NEW! Non-scientific projects (Web Development / Computer Systems / Software Engineering)!
Using AlphaFold to predict Protein Interactions
Human Pose Estimation
Graph theory / network analysis
Molecular dynamics of the retina
Project: Haplotype assembly
Project: Predicting evolutionary fitness using genes, traits, and environments
Project: Physics-based tracking of moving and growing cells in a colony
Project: Image analysis of spiral galaxies
Project: Non-standard ("weird") Machine Learning, theory and scientific applications

You can get a brief idea of my active projects by browsing my GitHub Repos. NOTE: Although many of the projects below are biology-related, you don't need to know much biology to participate. The biology can mostly be abstracted away into mathematical, geometrical, statistical, or algorithmic concepts.


Project: non-scientific programming (Web dev, scripting, management of Big Data & Builds, Parallelism, etc.)

If you'd like to get exposure to research but none of the scientific projects below interest you, I will have no problem finding something for you to do. Possible non-scientific projects are:

Non-scientific sub-project: "adding parallelism"

Even if you are not interested in the scientific aspects of any of these projects, many of them can benefit from parallelism, and it's usually nontrivial to implement. So, for this challenge you are to create a thread-safe version of this C program so that it uses as many threads as are available on the hardware. The results of the threaded and non-threaded versions should be the same (or very, very close--slight differences are allowed; e.g., having the threaded answer be off by 1% or less from the non-threaded version is acceptable).
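The challenge program itself is in C, but here's a minimal sketch of the overall pattern in Python (using processes rather than threads, since Python threads don't give true CPU parallelism): partition the loop, compute partial results in parallel, combine them, and check the parallel answer against the serial one. The per-element computation here is a made-up stand-in, not the actual program:

```python
# Hypothetical sketch: the real challenge is in C, but the pattern is the same
# in any language: split the work, run chunks in parallel, combine, then check
# that the parallel answer matches the serial one to within tolerance.
import os
from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    # Stand-in for the real per-element computation in the C program.
    return sum(1.0 / (i * i) for i in range(lo, hi))

def serial(n):
    return partial_sum((1, n))

def parallel(n, workers=os.cpu_count()):
    step = n // workers
    chunks = [(max(1, k * step), (k + 1) * step if k < workers - 1 else n)
              for k in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    n = 10_000_000
    s, p = serial(n), parallel(n)
    # Floating-point reassociation means the two answers may differ slightly.
    assert abs(s - p) / abs(s) < 0.01, "parallel answer off by more than 1%"
    print(s, p)
```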

Non-scientific sub-project: "web front-ends, submission management, web drives "

Each of the projects below is built to run on a Unix/Linux-type system: either Linux itself, or WSL or Cygwin under Windows, or the MacOS command line. This makes the tools useful for "computer nerds" like you and me... but difficult for non-CS scientists to use. While I already have a rudimentary web interface to two of the projects (for Global Network Alignment, and the Spiral Galaxy Analysis), they are very basic and we need something more general. These two sites have a similar structure: the user uploads a file, is given a token representing their job, the job runs, and they either wait or come back later using the token to see if the job is done.

Challenge question for the web development project: This ZIP file contains:

a C program, which YOU need to compile into an executable called "checksum" and make available to your web server;
a BASH shell script called "backend.sh", which your web server should call with the user's input file on its command line;
a sample input file (an edge list like those used on the network alignment project);
and the correct output for that input file.

You are to create a website that allows any user (like me) to upload any input file smaller than 1MB; you then run the Bash script "backend.sh" on that input file, and display the resulting output (including any error messages). Send me the link to your webpage so I can try it using my own input files. (You'll need to compile the C file and place the executable into whatever directory you put the other files.) The actual project (if you're accepted into it) will also have some sort of management system so I can watch / review / check the status of running jobs.
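For a sense of scale, here's a minimal sketch of the upload-and-run part in Python using Flask. The framework choice is entirely up to you, and the assumption here is that "backend.sh" sits next to this script; the real challenge also needs the token / job-status behavior described above:

```python
# Hypothetical minimal sketch using Flask; the challenge doesn't mandate any
# particular framework, and "backend.sh" is assumed to sit next to this script.
import os
import subprocess
import tempfile
from flask import Flask, request

app = Flask(__name__)
app.config["MAX_CONTENT_LENGTH"] = 1024 * 1024  # reject uploads over 1MB

UPLOAD_FORM = """<form method=post enctype=multipart/form-data>
<input type=file name=input><input type=submit value=Run></form>"""

@app.route("/", methods=["GET", "POST"])
def run_job():
    if request.method == "GET":
        return UPLOAD_FORM
    f = request.files["input"]
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        f.save(tmp.name)
    # Run backend.sh with the uploaded file as its command-line argument,
    # capturing stdout AND stderr so error messages get displayed too.
    result = subprocess.run(["bash", "backend.sh", tmp.name],
                            capture_output=True, text=True)
    os.unlink(tmp.name)
    return "<pre>" + result.stdout + result.stderr + "</pre>"
```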

Non-scientific sub-project: "build management"

I'm in desperate need of somebody with general systems / build management expertise. Why? Just count the number of projects below. :-) Each of them has a complex software system with complex Makefiles and various other version dependencies (including Python, which is a bitch to keep up-to-date). If you want to learn anything about managing one (or many) software systems... then feel free to contact me to sign up for a 199 (undergrad independent study class) with me. The "challenge" problem for this position: clone the BLANT repo, install BLANT and run the regression tests (see the README.md on GitHub, and note that it may take an hour or so), and then report to me the mean and standard deviation that's output from the test labeled "testing Graphlet (not orbit) Degree Vectors" for the k=3 case.


Project: Physics-based tracking of moving and growing cells in a colony: YouTube video

The left side shows frames from a real video of a growing bacterial colony; the right frame shows our algorithm tracking the growth and motion of each individual bacterium during its whole life cycle: being born, moving, growing, and splitting into two daughter cells. Biologists need to track cells in video frames for many purposes, including tracking the growth of cancer cells, learning about the growth of embryos, learning how bacteria move, and learning how genetic changes to a cell result in functional changes during its lifetime... it's a huge research area. Although several cell-tracking algorithms already exist, we are working on a novel approach that seems to have several advantages. In order to join this project, your task is to take the above animated GIF and automatically estimate the number of bacteria in each of the frames, producing a text file whose only output is one integer per line, representing the count; the number of lines should equal the number of frames. You only need to use one of the two sides. You can use any language you want, and any method you want, as long as it's automatic and you wrote it yourself (cite any references you use). Describe your algorithm and the output, and send your PDF write-up to me by email. Extra work for those who already have an undergrad degree: you must create two algorithms, one that can handle each side of the above image. Compare the results and explain any differences.
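One plausible starting point (NOT the required method; the GIF filename, the choice of side, and the thresholds here are all my assumptions) is to threshold each frame and count connected blobs:

```python
# Hypothetical sketch: threshold each GIF frame and count connected blobs.
# The filename, the cropped side, and the thresholds are assumptions; real
# frames will likely need smarter segmentation (touching cells merge blobs).
import numpy as np
from PIL import Image, ImageSequence
from scipy import ndimage

gif = Image.open("colony.gif")  # assumed filename
counts = []
for frame in ImageSequence.Iterator(gif):
    gray = np.asarray(frame.convert("L"), dtype=float)
    half = gray[:, : gray.shape[1] // 2]        # use only the left side
    mask = half > half.mean() + half.std()      # crude brightness threshold
    labels, n = ndimage.label(mask)             # 4-connected components
    # Drop tiny specks that are probably noise, not bacteria.
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    counts.append(int((sizes >= 20).sum()))

with open("counts.txt", "w") as out:
    out.write("\n".join(map(str, counts)) + "\n")
```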

Alternatively, if you prefer "library" type (reading and writing) research, use Google Scholar to look up and describe everything you can find out about cell tracking that has been published in the last 2 calendar years. (For example, if it's now 2018, then click "since 2017" on the left-hand side toolbar in Google Scholar and see what you can find.) There is no set number of words or pages, but you should describe all relevant work that has happened in cell tracking in the designated time period, devoting at least one paragraph to each paper you find.


Project: Graph theory applied to network database engines and biological network alignment: YouTube Video

Sub-project: Network modeling and network database queries

We are working on a database engine to allow complex network-based queries. This basically means we are trying to solve the subgraph isomorphism problem in such a way as to allow lightning-fast graph-based database queries of a large network corpus. Additionally, we are working on modeling networks theoretically so as to be able to create synthetic databases of graphs that "model" real graphs.

Sub-project: Network Alignment

Network alignment is the task of finding large subgraphs, or near-subgraphs, that are common between two or more large networks. The difference between this and network modeling / database queries is that in queries, we are generally looking for very small (10-20 nodes at most) exact or near-exact matches inside a large database of networks; in alignment, we are given a small number of input graphs (maybe half a dozen), and we're looking for much larger (hundreds or thousands of nodes) near-matches between them.

Sub-project: Graphlet analysis, e.g., brain connectivity

Our BLANT repo on GitHub was used to perform a graphlet analysis on a brain dataset called the NKI Brain Initiative, and we found the first network-based evidence of degradation of connectivity as a function of age, as seen below.

Challenge Question

If you want to do either the graph database and network modeling project, or the biological network alignment project, you need to know what a graph is and how to work with them, especially how to code with them. Your task is the following: you're given a text file representing a network. The first line of the file is N, the number of nodes. You will name the nodes 0 through N-1. The remaining lines will have two integers per line, representing an edge. You don't know in advance how many edges there are; you just keep reading until you reach end-of-file. Your task is to compute and output the number of connected components (you can ignore duplicate edges, but note that a node with degree zero counts as a connected component). If you already have any sort of degree (e.g., a B.Sc.), then additionally you must compute the number of strongly connected components. Below I provide some sample inputs. I don't care what language you use.

In addition, in your write-up, include a histogram of the distribution of DEGREES of nodes; in the undirected case, just count all degrees; in the strongly connected (directed) case, provide histograms for both the in-degree and out-degree. (That means you'll have one histogram per graph if you're an undergrad, and three histograms per graph if you've finished your undergrad degree.) The histogram shows how many nodes have degree zero, degree 1, etc., up to the max degree. If you don't know any of these terms, look them up; don't ask me. The data for this project is here. Analyze all the graphs in the zip file.
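To make the input format concrete, here's a minimal sketch of the undirected part (the filename is an assumption), using union-find to count components and a Counter for the degree histogram:

```python
# Hypothetical sketch for the undirected part: count connected components with
# union-find and tally the degree histogram. The filename is an assumption.
from collections import Counter

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

with open("graph.txt") as f:
    n = int(f.readline())
    parent = list(range(n))
    degree = [0] * n
    seen = set()
    for line in f:
        u, v = map(int, line.split())
        if (u, v) in seen or (v, u) in seen:
            continue  # ignore duplicate edges
        seen.add((u, v))
        degree[u] += 1
        degree[v] += 1
        parent[find(parent, u)] = find(parent, v)

# Isolated (degree-zero) nodes are their own roots, so they're counted too.
components = len({find(parent, x) for x in range(n)})
print("connected components:", components)
hist = Counter(degree)
for d in range(max(degree) + 1):
    print(d, hist.get(d, 0))  # one row per degree value, for the histogram
```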

Extra work for those who already have an undergrad degree. Read the GRAAL paper and count the number of graphlets of size 3 in all the networks.
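For reference, there are only two connected graphlets on 3 nodes: the triangle and the 2-edge path. A minimal counting sketch (same assumed file format as above; the GRAAL paper is still required reading) could look like:

```python
# Hypothetical sketch: count the two connected 3-node graphlets.
# Every triangle appears 3 times as a "center + two neighbors" triple,
# so paths = (all connected triples) - 3 * triangles.
from itertools import combinations

adj = {}  # node -> set of neighbors
with open("graph.txt") as f:
    n = int(f.readline())
    for line in f:
        u, v = map(int, line.split())
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

triples = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
# Each triangle is found once per vertex acting as the "center", hence // 3.
triangles = sum(1 for u in adj for v, w in combinations(sorted(adj[u]), 2)
                if w in adj[v]) // 3
print("triangles:", triangles, "paths:", triples - 3 * triangles)
```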

Alternatively, if you prefer "library" type (reading and writing) research, use Google Scholar to look up and describe everything you can find out about biological network alignment that has been published in the last 2 years (i.e., if it's 2021 now, then include 2020 and 2021). There is no set number of words or pages, but you should describe all relevant work that has happened in biological network alignment in the designated time period, devoting at least one paragraph to each paper you find.


Project: Molecular dynamics of proteins in the human eye

This work is in collaboration with UCI professor Krzysztof Palczewski of the Gavin Herbert Eye Institute; he is a member of the United States National Academy of Sciences and works on treatments for various forms of blindness. One aspect is determining the precise nature of how various sets of proteins interact in the retina. The static structure and locations of these proteins are well understood, as outlined in the figure below, taken from one of his recent papers. Click here, or on the image, to see the paper.

The project involves initializing a molecular dynamics simulation with these known protein structures and locations within the surrounding tissue, and then modeling their movement using the popular open source Amber molecular dynamics system.

Challenge: To start working on this project, read up on the Amber molecular dynamics system and any one of the tutorials. Then attempt to install Amber on a system of your choice, describe the difficulties you encounter, and suggest some ideas on how to overcome them. (Installing Amber is challenging, so a successful installation isn't required for the challenge... though if you succeed, that would be really impressive!)


Project: Haplotype assembly

Most of us are familiar with the "letters", or nucleotides, that make up our genomes: A, C, G, and T. Mammalian genomes (including human) have about 3 billion (3 x 10^9) letters. However, each of these 3 billion locations actually has two letters: one from the mother, and the other from the father. A haplotype is the sum total of genetic code (alleles) inherited together from a single parent. Thus, humans and other mammals have two haplotypes (also called a diploid genome). However, when your genome is sequenced, it is nontrivial to separate the two haplotypes from each other on a letter-by-letter basis. Solving this problem has many applications in medicine and biology.

Haplotype assembly is the act of cleanly separating the haplotypes on a letter-by-letter basis. It's a hard problem, technically NP-complete, and algorithms to solve it are currently at the cutting edge of genomic analysis. My group is working on one such algorithm, which is among the best that exist.

Challenge: clone, install, and run our prototype assembler SAHap, and report the MEC values and whether they were the BEST or not.
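For context when reading SAHap's output: MEC ("Minimum Error Correction") scores a proposed pair of haplotypes by how many read letters would have to be flipped to make every read consistent with one haplotype or the other (lower is better). A minimal sketch of that definition, with an illustrative data layout that is NOT SAHap's actual format:

```python
# Hypothetical sketch of the MEC score: each read is assigned to whichever
# haplotype it disagrees with least, and MEC is the total disagreement.
# Reads are dicts mapping SNP-site index -> allele (0 or 1); uncovered sites
# are simply omitted from the dict.
def mec(reads, hap1, hap2):
    total = 0
    for read in reads:
        d1 = sum(allele != hap1[site] for site, allele in read.items())
        d2 = sum(allele != hap2[site] for site, allele in read.items())
        total += min(d1, d2)
    return total

reads = [{0: 0, 1: 0, 2: 1}, {1: 1, 2: 0}, {0: 1, 2: 0}]
print(mec(reads, hap1=[0, 0, 1], hap2=[1, 1, 0]))  # 0: every read fits cleanly
```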


Project: Predicting evolutionary fitness using genes, traits, and environments

We have come up with a very simple representation of how genetic traits are expressed and influence an organism's ability to respond and survive as a function of various environmental traits (temperature, food supply, etc.). We are working with a biology lab to test our predictions on real bacterial colonies. Contact me if you are interested in this project.


Using AlphaFold to Predict Protein-Protein Interactions

AlphaFold does a good job of predicting the shape of folded proteins; there exist ways to measure shape similarity between proteins; and protein shape dictates which proteins interact with which. So, using AlphaFold shapes in combination with existing dense human PPI networks, can we "predict" interactions between proteins in other species? Contact me for details.

Project: Human Pose Estimation via AI/ML

Estimating a person's "pose" (i.e., the relative location of their limbs, represented as a "skeleton") from a 2D image has many applications and is an active research area. There are many distinct datasets: each uses a different camera, different people, and a different location; there are even slightly different skeletal representations between datasets. Unfortunately, the vast majority of published papers pick just one dataset (or create a new one), and then perform both training and testing of their method on just that one dataset; virtually nobody has tried training on one dataset and then testing on another--which is an absolute requirement if these systems are to be leveraged in the "real world", where every user has a different webcam or phone camera. In this project we have a preliminary method that works across datasets; our problem is testing it across the largest possible set of input datasets, due to the above differences between datasets. Your task in this project is to help us gather all the existing datasets and algorithms and prepare them for testing under our system.

CHALLENGE QUESTION: Although the actual project uses AI/ML, this challenge has no such element and is a purely geometrical problem. The ZIP file Pose.zip contains 20 images (in the subdirectory "frames") along with some text files. These files are: focal.txt, which contains the focal length of the camera (in mm); and joint-names.txt, which contains the list of joints (second column) and their integer IDs (first column). You'll note that there are 14 joints, numbered zero through 13. In the real world, joints have 3D positions, so an actual pose must have 3 coordinates for each joint, for a total of 42 values. Finally, the file poses.txt contains a list of 20 poses, in the order of the images in the frames directory. Each line contains 45 columns: the first 3 are the camera position in world space; the remaining 42 columns are 14 triplets of (x,y,z) coordinates of the joints, in the order listed in joint-names.txt.

Your goal is to write a program (language of your choice) that, for each of the 20 images: (a) finds a suitable camera orientation that points at the subject from the camera position in the first 3 columns of poses.txt; (b) projects the 3D skeleton onto a 2D image resembling the right side of the sample pose above. NOTE: In the image displayed on this page, the right side is a 3D pose in 3D coordinates, and on the left the 2D projection of the limbs is overlaid onto the image. Your 2D projection should look like that as well, except you can plot it on a white background rather than overlaying it onto the photograph. In addition to providing the 2D projection images, please accumulate all of the 2D coordinates used into one table so we can easily verify them.

NOTE: if you already have a degree (e.g., B.Sc.), then you must also (c) superimpose your skeleton onto the image in the frames directory, as we've done in the left half of the sample image above. (Note that this step is nontrivial, and an exact match is not required, since the camera position, scale, and center of your image and the input frame may not be exactly the same---look closely at the sample image and you'll see that even though the pose is correct, the camera viewpoints in the real and reconstructed images are not exactly the same.) Please direct questions for this challenge to my Ph.D. student Saad Manzur.
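Here's a minimal sketch of the geometry behind parts (a) and (b): build a "look-at" rotation from the camera position toward the skeleton, transform the joints into camera coordinates, then do a pinhole projection. All file handling is elided, and aiming at the joint centroid with a fixed world "up" axis is just one reasonable choice of "suitable orientation", not a prescribed one:

```python
# Hypothetical sketch of parts (a) and (b): look-at orientation + pinhole
# projection. Aiming the camera at the joint centroid is an assumption; any
# orientation that frames the subject would do.
import numpy as np

def project_pose(cam, joints, focal):
    """cam: (3,) camera position; joints: (14,3) world coords; focal: scalar."""
    target = joints.mean(axis=0)             # aim at the skeleton's centroid
    fwd = target - cam
    fwd /= np.linalg.norm(fwd)
    up = np.array([0.0, 1.0, 0.0])           # assumed world "up" axis
    right = np.cross(fwd, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, fwd)
    R = np.stack([right, true_up, fwd])      # rows: camera axes in world space
    pc = (joints - cam) @ R.T                # joints in camera coordinates
    return focal * pc[:, :2] / pc[:, 2:3]    # pinhole projection to 2D

# Toy usage with one made-up pose row (45 numbers: camera xyz + 14 joints):
row = np.arange(45, dtype=float)
xy = project_pose(row[:3], row[3:].reshape(14, 3), focal=35.0)
print(xy.shape)  # (14, 2) -- one (x, y) pair per joint
```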


Project: Image analysis of spiral galaxies: YouTube Video

If you want to work on the Galaxy Image Analysis project, then you should start by playing around with any galaxy images you find on the web and putting them into the SpArcFiRe webpage. Once you get the hang of it, you have two choices:

(A) find an image of NGC5054, or take the one from my paper with Darren Davis (cited on the above web page), and try to find a set of SpArcFiRe options on the website that can find the "dim" arm on the right hand side of the image of that galaxy in the above paper.

(B) Go get the following file: here. Each row is some data about a galaxy, and the columns have names in the top row. You don't need to know what all of the columns mean, but pay attention to these: P_CS, the probability that this galaxy is a spiral; and numDcoArcsGEXXX for various values of XXX, the number of discovered arms in that galaxy that are longer than XXX. Your task is to plot a histogram of the number of spiral galaxies with N or more arms of length XXX, for each of the XXX values in the file. It would be best to plot all the histograms on one figure, to easily compare them to each other. What value of P_CS did you choose?
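A minimal sketch of one way to produce that figure, assuming the file is CSV-like. The filename and the P_CS cutoff of 0.8 are my assumptions; justifying the cutoff is exactly what the question above asks of you:

```python
# Hypothetical sketch: overlay one "N or more arms" curve per numDcoArcsGEXXX
# column. The filename and the P_CS cutoff are assumptions to be justified.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("galaxies.csv")
spirals = df[df["P_CS"] >= 0.8]          # keep likely spirals; justify this!

for col in df.columns:
    if col.startswith("numDcoArcsGE"):
        counts = spirals[col].value_counts().sort_index()
        # Cumulate from the top so the y-value at N means "N or more arms".
        n_or_more = counts[::-1].cumsum()[::-1]
        plt.plot(n_or_more.index, n_or_more.values, marker="o",
                 label=col.replace("numDcoArcsGE", "length >= "))

plt.xlabel("number of arms N")
plt.ylabel("galaxies with N or more arms")
plt.legend()
plt.savefig("arm-histograms.png")
```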

Alternatively, if you prefer "library" type (reading and writing) research, use Google Scholar to look up and describe everything you can find out about automated galaxy classification that has been published in the last 2 calendar years. (For example, if it's now 2018, then click "since 2017" on the left-hand side toolbar in Google Scholar and see what you can find.) There is no set number of words or pages, but you should describe all relevant work that has happened in automated galaxy classification in the designated time period, devoting at least one paragraph to each paper you find.

Extra work for those who already have a degree: Tell me about your astronomy and/or physics background.


Project: Non-standard ("weird") Machine Learning, theory and scientific applications

We are pursuing two different aspects of machine learning: (1) applying traditional, mainstream ML to scientific problems, and (2) developing entirely new, non-traditional ML methods. The former is mostly applied to astronomy applications, and the latter won't teach you anything about what is normally called AI/ML. So if you want to do what everybody means when they say AI/ML... this probably isn't for you, because I don't do what everybody else means when they say AI/ML. See here for the challenge; also specify which type of research you're interested in. If it's the application side, you'll also need to specify which of the problems above you're interested in applying it to.