30 Days in the Life of a Machine Learning Researcher
If you’re thinking about going for a PhD, there are lots of excellent guides I would recommend reading. However, most guides I’ve read focus on the high-level picture of pursuing a doctorate, not how it feels day-to-day to be sitting in your lab, doing research.
When I started my PhD at Stanford three years ago, I had no idea how much time I would spend banging my head against a difficult problem and staring into a monitor with no apparent solution in sight. I also could not have known how amazing it would feel when an unexpected epiphany hit me as I biked through campus on my way home. I wanted to share some of my raw experiences with people who are deciding whether they want to go into research, particularly in my field of machine learning.
So every day for the month of January 2019, I took notes on the main technical tasks (bolded throughout the article) that I worked on that day, as well as my emotional state (italics-bolded), to give you a window into the heart and mind of a machine learning PhD student.
The month of January was a psychological roller coaster: on one hand, I started off working night and day preparing two new papers to submit to the ICML conference. On the other hand, I finished the month attending the AAAI conference in Hawaii, where I drafted parts of this article sitting in the sun at the beautiful Hanauma Bay. In the weeks between, I attended a great journal club talk, mentored an undergrad student from Turkey, and tried to save our computing cluster from failing the day before the ICML deadline.
But let me start from the beginning of the month…
January 2: Leaving Family and Returning to Stanford
I had spent most of winter break with my family in Arkansas, but on the 2nd of the month, I returned to Stanford. Technically, there was another week of vacation listed on Stanford’s academic calendar, but PhD students do not follow the typical academic schedule — any breaks are decided between you and your advisor.
That means your advisor may let you take a week off in the middle of October if you feel burned out. In my case, it meant that, with the ICML deadline coming up in the middle of January, I figured I should return to Stanford and start working on my conference submissions. It was a wistful decision, but almost certainly a wise one, as I realized when I began working on my submissions for ICML…
January 3–4: Implementing a VAE for Paper 1
My first order of business was to clearly define the problem that I had been working on in bits and pieces before winter break, as well as code up a baseline that I could use to compare to my proposed algorithm.
For my first paper, I was working on a new way to learn latent features in data. Latent features are a way to represent complex data by identifying a small number of underlying variables that explain the high-level variation in the data. For example, if you have a bunch of images of celebrity faces, each one is going to look very different from the rest. But you may be able to approximate the images by just changing a few “dimensions,” such as skin color, face angle, and hair length.
A very common algorithm for identifying such latent dimensions or features is known as the variational autoencoder (VAE). Here’s an example of a VAE that has been trained to identify 2 latent features in images of celebrities (taken from the CelebA dataset). I used the VAE to generate new images, by changing the value of exactly one latent feature at a time as you move along this grid of images horizontally or vertically:
You’ll notice that the VAE seems to be primarily changing two things across the images: the skin color, and the background color / hair length. It’s important to mention that the VAE is unsupervised: I didn’t tell the algorithm to search specifically for hair length or skin color. So you might be wondering: why did the VAE identify these two latent features?
The reason is that adjusting these two features allowed the VAE to reconstruct the celebrity images in the original dataset with minimum error. In other words, if you had to pick only two factors that you could change from a generic face to approximate any arbitrary celebrity, these two factors would get you pretty far.
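In case it helps to see the mechanics, here is roughly how a traversal grid like the one above gets generated. The `decoder` model, the 2-dimensional latent space, and the sweep range are stand-ins for whatever trained VAE you happen to have; this is a sketch, not my actual plotting code.

```python
import numpy as np

# Assumes `decoder` is a trained Keras model mapping a 2-D latent vector to an
# image (e.g. a 64x64x3 array), and that the latent prior is a standard normal.
grid_size = 7
values = np.linspace(-2.0, 2.0, grid_size)   # sweep each latent dimension

rows = []
for z1 in values:
    row = []
    for z2 in values:
        z = np.array([[z1, z2]], dtype=np.float32)
        row.append(decoder.predict(z)[0])     # decode one point in latent space
    rows.append(np.concatenate(row, axis=1))  # stitch images horizontally
grid = np.concatenate(rows, axis=0)           # stack the rows vertically

# `grid` can then be displayed with matplotlib, e.g. plt.imshow(grid).
```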
I wanted to develop a new algorithm that would allow the user to have more control over which latent features were learned. Surely, in some cases you may want to discover and tweak a different set of latent features: the amount of lipstick or the color of the celebrity’s cap, for example. For reasons I will get into later, I called my algorithm the contrastive variational autoencoder (cVAE).
But if I wanted to show that my cVAE was working, I needed a baseline to compare it with. The natural choice would be the standard VAE. So I spent a couple of days working on the following:
- Downloading and preprocessing the CelebA image dataset
- Writing the code for a VAE in TensorFlow (Keras), and training the VAE on the CelebA image dataset.
I found the experience pretty straightforward and fun because I was mostly following existing tutorials and code, only varying certain hyperparameters related to the architecture of the neural network. And it felt quite satisfying when the trained VAE was able to produce the images above.
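For the curious, here is a stripped-down sketch of the kind of VAE training code involved. It is not my actual code: the convolutional architecture, the 64×64 crop size, and the `celeba_batches` dataset are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 2  # two latent features, to match the traversal grid above

# Encoder: 64x64x3 image -> mean and log-variance of a 2-D Gaussian
enc_in = layers.Input(shape=(64, 64, 3))
h = layers.Conv2D(32, 4, strides=2, padding="same", activation="relu")(enc_in)
h = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(h)
h = layers.Flatten()(h)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
encoder = tf.keras.Model(enc_in, [z_mean, z_log_var])

# Decoder: 2-D latent vector -> reconstructed 64x64x3 image
dec_in = layers.Input(shape=(latent_dim,))
h = layers.Dense(16 * 16 * 64, activation="relu")(dec_in)
h = layers.Reshape((16, 16, 64))(h)
h = layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu")(h)
dec_out = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="sigmoid")(h)
decoder = tf.keras.Model(dec_in, dec_out)

optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(images):
    with tf.GradientTape() as tape:
        mean, log_var = encoder(images)
        eps = tf.random.normal(tf.shape(mean))
        z = mean + tf.exp(0.5 * log_var) * eps        # reparameterization trick
        recon = decoder(z)
        recon_loss = tf.reduce_sum(tf.square(images - recon), axis=[1, 2, 3])
        kl_loss = -0.5 * tf.reduce_sum(
            1 + log_var - tf.square(mean) - tf.exp(log_var), axis=1)
        loss = tf.reduce_mean(recon_loss + kl_loss)
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

# Training loop (celeba_batches is a hypothetical tf.data.Dataset of 64x64 crops):
# for batch in celeba_batches:
#     train_step(batch)
```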
Jan 5–6: Working Through the Weekend
I’ve told myself many times that I will not work on Saturdays and Sundays, and that I need to find a healthier work-life balance. But when the going gets tough, I immediately default to hitting the lab on weekends. Realizing that I had about three weeks until ICML, and with a meeting with my advisor coming up on Tuesday, I bit through my frustration and spent most of the weekend in my office, skipping a ski trip that my friends had planned to Tahoe.
What did I do on the weekend? As I mentioned before, I was working on a method to identify latent factors with more specificity than the ones that were simply dominant in the data. The idea was to use contrastive analysis (hence the name, contrastive VAE), a method in which a secondary background dataset is introduced that does not contain the features of interest. I designed the contrastive VAE to explicitly model latent features that were present in both the primary dataset and the background dataset, as well as those that were only present in the primary dataset.
Here’s an example to illustrate the idea. Suppose you have a bunch of images of hand-written digits superimposed on a complex background, like images of grass. We’ll call this the target dataset. You also have images consisting only of grass (not necessarily the same ones used in the target dataset, but roughly similar). We’ll call this the background dataset. Examples of each kind of image are below:
You’d like to train a VAE on the target dataset to identify latent features related to the handwritten digits: the images of 0s, 1s, and 2s should each be far apart in such a latent space. However, a standard VAE trained on the target data would identify the dominant sources of variation as those related to the grass, such as its texture and density, because these dominate the image (in the sense that more pixels are devoted to grass than to the hand-written digits), and it would completely ignore the digit-related features.
What if we instead encouraged the VAE to identify features that were present in the target dataset, but not in the background? One would hope that this contrast would be sufficient to encourage the algorithm to learn the digit-related features. I spent Saturday and most of Sunday trying various ways to adapt the loss function of a VAE to get the right kind of results on synthetic data. For the results to be meaningful, I needed to show the results on a real dataset, like that of the celebrity images, so I began a simulation from my office, and then biked home.
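The loss function went through many variations that weekend, but the basic two-encoder setup looked schematically like the sketch below. The `shared_encoder`, `salient_encoder`, and `decoder` models here are hypothetical placeholders rather than my actual implementation.

```python
import tensorflow as tf

def contrastive_vae_loss(target_batch, background_batch,
                         shared_encoder, salient_encoder, decoder):
    """Schematic loss for the two-encoder setup described above.

    shared_encoder / salient_encoder / decoder are hypothetical Keras models:
    each encoder returns (mean, log_var) for its block of latent features, and
    the decoder maps the concatenated [shared, salient] latents to an image.
    Assumes the two batches have the same size.
    """
    def sample(mean, log_var):
        eps = tf.random.normal(tf.shape(mean))
        return mean + tf.exp(0.5 * log_var) * eps   # reparameterization trick

    def kl(mean, log_var):
        return -0.5 * tf.reduce_sum(
            1 + log_var - tf.square(mean) - tf.exp(log_var), axis=1)

    # Target images get both blocks of latent features.
    zt_mean, zt_lv = shared_encoder(target_batch)
    st_mean, st_lv = salient_encoder(target_batch)
    z_t, s_t = sample(zt_mean, zt_lv), sample(st_mean, st_lv)
    target_recon = decoder(tf.concat([z_t, s_t], axis=1))

    # Background images only get the shared block; their salient latents are
    # fixed to zero, which pushes target-only structure (the digits) into the
    # salient features.
    zb_mean, zb_lv = shared_encoder(background_batch)
    z_b = sample(zb_mean, zb_lv)
    background_recon = decoder(tf.concat([z_b, tf.zeros_like(s_t)], axis=1))

    recon_loss = (
        tf.reduce_sum(tf.square(target_batch - target_recon), axis=[1, 2, 3]) +
        tf.reduce_sum(tf.square(background_batch - background_recon), axis=[1, 2, 3]))
    kl_loss = kl(zt_mean, zt_lv) + kl(st_mean, st_lv) + kl(zb_mean, zb_lv)
    return tf.reduce_mean(recon_loss + kl_loss)
```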
I’ve told myself many times that I will not work on Saturdays and Sundays… but when the going gets tough, I immediately default to going back to the lab on weekends.
Jan 7, 2019: Troubleshooting our GPU Cluster, Babbage 3
The next morning, I came back to find that my simulation hadn’t produced great results — in fact, it hadn’t produced results at all! Shortly after I had started the simulation, the lab cluster that we use to run our simulations, Babbage 3, had crashed, leaving me staring at the stack trace:
ResourceExhaustedError: OOM when allocating tensor
Normally, I would see this error when running multiple training scripts on a single GPU — this was the GPU’s way of complaining that I was feeding too much data into its memory, but I knew that shouldn’t have been the case that day as I was running a single script with a manageable dataset size. I spent some time debugging my code, restarting various scripts, and then eventually, the entire machine, but no luck.
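For anyone who runs into the same error with multiple jobs sharing one GPU, the usual culprit is that TensorFlow 1.x claims nearly all of the GPU's memory for each process by default. The standard workaround at the time looked something like this (a small sketch, not what ultimately fixed our cluster):

```python
import tensorflow as tf

# In TF 1.x, a process grabs nearly all of the GPU's memory by default.
# allow_growth makes it allocate memory only as needed, which helps when
# several training scripts share a single GPU.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.keras.backend.set_session(tf.Session(config=config))
```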
I was somewhat irritated, because I didn’t have a long continuous period of time to debug; it was also the start of the quarter, and I was shopping a few different classes. I spent the day in and out of classes, debugging Babbage 3 in between. After a restart later in the evening, the cluster spontaneously began working again. I hoped and prayed for the best, restarted the simulation, and waited half an hour to keep an eye on the progress of the script. When it seemed to be running without any signs of aborting, I headed home.
Jan 8: Meeting with my Advisor, Richard
By the next morning, I finally had some results using the contrastive VAE architecture on the celebrity image dataset. With generative models, evaluating the results can be tricky, because image quality and latent feature selection can be subjective. It seemed to me that the results were not that great, but were a definite improvement over the standard VAE. I quickly copied and pasted the figures into a set of slides so that I would have them ready to present to my advisor, Richard.
Richard looked over my figures closely, with interest, asking questions about how I had defined the loss. After a while, he asked me whether it may be possible to improve the results by encouraging independence between the dominant latent variables (which I was trying to remove) and the relevant latent variables (which I was explicitly trying to learn). I thought that it was a great idea, and I was particularly excited as I had recently read a paper that had proposed a method for encouraging independence between latent features. I could use a similar method for my contrastive VAE, I thought.
Richard also gave me the green light to start the contrastive VAE paper. He said that with a few more experiments, I should have enough results for a paper, and if I could have them in time for the ICML deadline, it would be a good place to send the paper.
At the time, I had also been mentoring a talented undergrad student, Melih, who lived in Turkey. Melih was a friend of a friend who had reached out in September of the previous year to ask for advice on applying to PhD programs in the United States. He was a brilliant undergraduate who had participated in international computer science competitions, but when I asked him about research experience, he confided that he didn’t really have any. Since research experience is the most important factor when applying to competitive PhD programs, I had told him he should do some research before the application deadline. He then audaciously asked me if I could mentor him; I agreed, and suggested a collaboration on a project related to unsupervised feature selection.
The idea was to determine which features (pixels in an image, genes in a transcriptomics dataset, etc.) were the most important, and which ones were redundant. In biological datasets in particular, features tend to be correlated, so it is often possible to select a small subset of independent features and reconstruct the rest with high accuracy. If we could do this in a systematic manner, it would save experimental cost and time, since not all genes would have to be measured: they could simply be predicted (or imputed) from a small subset of measured genes. I had reviewed the literature on feature selection and found most of the techniques to be pretty old, so I suggested to Melih a feature selection technique based on deep learning, which we called the “Concrete Autoencoder.”
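To give a flavor of the idea (this is my own paraphrase, not the code from our project), the selection layer can be sketched as a Keras layer whose weights are logits over the input features:

```python
import tensorflow as tf
from tensorflow.keras import layers

class ConcreteSelector(layers.Layer):
    """Rough sketch of a 'concrete' feature-selection layer.

    Each of the k selector units holds logits over the d input features.
    During training, a unit outputs a Gumbel-softmax (Concrete) mixture of
    the inputs; as the temperature is annealed toward zero, each unit
    concentrates on a single feature, i.e. does discrete feature selection.
    """
    def __init__(self, k, temperature=10.0, **kwargs):
        super().__init__(**kwargs)
        self.k = k
        self.temperature = tf.Variable(temperature, trainable=False)

    def build(self, input_shape):
        self.d = int(input_shape[-1])
        self.logits = self.add_weight(
            name="logits", shape=(self.k, self.d), initializer="glorot_normal")

    def call(self, x, training=None):
        if training:
            # Relaxed sample: softmax of (logits + Gumbel noise) / temperature.
            uniform = tf.random.uniform((self.k, self.d), 1e-7, 1.0)
            gumbel = -tf.math.log(-tf.math.log(uniform))
            weights = tf.nn.softmax((self.logits + gumbel) / self.temperature)
        else:
            # At test time, each unit simply picks its highest-scoring feature.
            weights = tf.one_hot(tf.argmax(self.logits, axis=-1), depth=self.d)
        return tf.matmul(x, weights, transpose_b=True)  # -> (batch, k)
```

In this sketch, the k selected features then feed a small dense decoder that tries to reconstruct all d original features, and the temperature is annealed over the course of training.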
Our collaboration had been going well. I found the experience of mentoring Melih very fulfilling, and I particularly enjoyed reviewing his results and giving him suggestions for new approaches to try. We had recently gotten some positive results, and my advisor Richard suggested that if we could get some more results by the ICML deadline, it would be a good idea to write up a second paper and submit that to ICML as well. I gulped as I realized how much work we would have to do over the next couple of weeks.
Jan 9: Figures for Paper 1
The first thing that I do when preparing a paper to submit to a conference is prepare the figures. The figures provide the structure around which I write the rest of the paper. So on Wednesday, the day after I met with Richard, I began preparing the figures for my contrastive VAE paper. One great piece of advice that I’ve read from Zachary Lipton is that “A reader should understand your paper just from looking at the figures, or without looking at the figures.”
For the contrastive VAE paper, I was still working on getting better results, so rather than creating a figure of results, I started with a figure that described the methodology, specifically the architecture, of the contrastive VAE. The figure looked something like the one on the left below:
When I created the figure, I didn’t have to worry too much about making it picture perfect. It was meant to be a preliminary version of the figure, which served mostly as a placeholder that my advisor could take a look at and use to give me substantive feedback. I still probably spent an unnecessarily long time on the figure — but I personally really enjoy making figures (I know not all grad students do), and when I start drawing diagrams (usually in Powerpoint, but sometimes in LaTeX using the tikz
package), I really get into it. I designed the figure on the top left as the preliminary version, and the one on the top right as the final figure that appeared in the paper. I also created a figure for the second paper showing the architecture of the concrete autoencoder, which I was particularly proud of, as it was entirely in LaTeX:
“A reader should understand your paper just from looking at the figures, or without looking at the figures” — Zachary Lipton
Jan 10: Working with an Undergraduate Student on Paper 2
After I created my figure showing the architecture of the concrete autoencoder, I sent it to Melih as motivation, along with a message saying that we should start generating figures for the concrete autoencoder paper. I also passed along some advice (it’s gratifying to pass along advice, I’ve learned!) that I had heard from my advisor: the first figure of a paper should be a graphic that shows the results of the method on a well-known dataset, both to demonstrate the power of the method and to lure the reader into reading the whole paper, since many readers only read the first few pages of a paper or skim down to the first figure.
We decided, as the first figure, to show the results of using the concrete autoencoder on the MNIST handwritten digits dataset. The dataset is well-known in the machine learning community. Although it doesn’t really make sense to do feature selection on the dataset, we figured that if we could illustrate the 10 or 20 most “important” pixels out of the total 784 pixels in each image, along with the reconstructions that these pixels alone would allow, that would immediately communicate the power of the method. We created an image that looked like this:
I realized, over the course of my communication with Melih, that I am more of a micromanager than I thought. On an intellectual level, I knew that it would be better for Melih’s growth and my own sanity if I didn’t specify, down to the littlest stylistic detail, how the figures should look. But I couldn’t help commenting and suggesting a different color scheme or arrangement whenever Melih sent me a version of the graphic. The figures above were produced after a lot of back-and-forth between Melih and me.
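For anyone wondering how a panel like that gets assembled, here is a rough matplotlib sketch; the selected pixel indices and the original and reconstructed images are hypothetical stand-ins for the outputs of a trained concrete autoencoder.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_selection(selected_pixels, original, reconstructed):
    """selected_pixels: indices in 0..783; original / reconstructed: (28, 28) arrays."""
    mask = np.zeros(784)
    mask[selected_pixels] = 1.0  # highlight the chosen pixels on the 28x28 grid

    panels = [original, mask.reshape(28, 28), reconstructed]
    titles = ["original", "selected pixels", "reconstruction"]
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, img, title in zip(axes, panels, titles):
        ax.imshow(img, cmap="gray")
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```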
Jan 11: Journal Club
It was Friday again, the day our lab holds its weekly journal club. I really enjoy journal club, as it forces me to read at least one paper a week: the paper that is the focus of the discussion. In general, as a PhD student, it’s important to keep up with research happening in the field, but I often forget to read papers.
Part of that is due to the sheer number of papers that are produced every day. On arXiv, the pre-print server, roughly 100 machine learning papers are posted each day, so it’s impossible to keep up with all of them. In our lab, we have a system in which each person, for one day of the week, reads the abstracts of all of the ML papers posted that day and shares the relevant ones with the rest of the group. Even then, it’s hard to keep up, so it’s useful to have a journal club to force you to read papers.
In our journal club, we discussed a paper that I found very interesting, “Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks”. The paper proposed modifying the widely-used LSTM architecture in a minor way to encourage the network to learn various levels of nested dependencies. While the generic LSTM architecture allowed the neural network to learn arbitrary kinds of dependencies in sequential data, the paper proposed a simple tweak that required the LSTM to learn a hierarchy of dependencies. If the outer dependencies finished, the inner ones would be finished as well. This tweak was designed to model the common hierarchical characteristic of spoken and written language (e.g. when a prepositional phrase finishes, it usually signals that a larger clause has ended as well).
I found the paper interesting, because it allowed a simple way to encode prior domain information (the domain here being natural language), while still minimizing any sort of feature engineering. What was particularly impressive was that the authors went back and found that the different levels of dependencies learned by the cell states corresponded to the tree structure that is actually present in the English language, as illustrated in the schematic from their paper below.
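From what I remember of the paper, the tweak boils down to a “cumax” activation that turns ordinary gates into monotone, nested ones. A tiny sketch of the idea (my summary, not the authors’ code):

```python
import tensorflow as tf

def cumax(logits):
    """The 'cumax' activation: a cumulative sum over a softmax, giving a
    monotonically increasing gate in [0, 1] across the neurons."""
    return tf.cumsum(tf.nn.softmax(logits, axis=-1), axis=-1)

# In the ON-LSTM cell, master forget and input gates built from cumax decide
# where the 'split point' falls: neurons above it keep long-lived, high-level
# information, and neurons below it get overwritten together, which is what
# creates the nesting behaviour described above.
```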
I personally relish reading these papers with simple, novel ideas — sometimes, more than doing research myself!
Jan 12–13: Not Working Through the Weekend
With a firm resolve, I again told myself that I would not do research over the weekend. This weekend, I was mostly able to pull that off. On Saturday, I went hiking with a few friends from my undergraduate university who had relocated to Silicon Valley. On Sunday, I went to Fremont for my weekly in-person Arabic class.
Jan 14: Teaching Myself the Density-Ratio Trick
I returned to the question of disentanglement, which my advisor Richard had raised in our last meeting. There were two popular approaches to disentanglement that I had read about in the literature: that of the FactorVAE and that of the TCVAE. Initially, I planned to use the TCVAE, because the FactorVAE required a separate discriminator neural network, but after looking through the algorithm in the TCVAE paper, I found it difficult to understand what exactly the authors were doing. I also didn’t think their approach would work well in my setting; in fact, the authors of the FactorVAE paper claimed that they tried the TCVAE algorithm in the original setting for which it was proposed, but couldn’t replicate the good results. I wasn’t sure if I could believe the claims of a rival paper, or whether it was a case of academic politics, but because the FactorVAE authors had described their method more clearly, I figured I should try that first.
In order to use the FactorVAE technique, I needed to understand something called the “density-ratio trick.” Surprisingly, I couldn’t find a good scientific article explaining the trick, but I was fortunate to find a very readable blog post by Shakir Mohamed, who explained the trick in clear language and crisp LaTeX math (clarity really makes all the difference!). I carefully read through the post, teaching myself the mathematics in order to understand how the method worked.
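In case it’s useful, here is the core of the trick as I wrote it down for myself (my own summary, not a quote from the post):

```latex
% The density-ratio trick: train a classifier D(x) to distinguish samples
% from p(x) (label y = 1) and q(x) (label y = 0), using the same number of
% samples from each distribution. For the optimal classifier, Bayes' rule gives
\[
  \frac{p(x)}{q(x)}
  = \frac{P(x \mid y = 1)}{P(x \mid y = 0)}
  = \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}
  \approx \frac{D(x)}{1 - D(x)} .
\]
% FactorVAE applies this with p = q(z), the aggregate posterior over the
% latents, and q = \prod_j q(z_j), the product of its marginals, so the
% estimated ratio measures how far the latent features are from independence.
```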
I really savor working out the math to get a better understanding of machine learning methods. In fact, working out the math behind existing methods and coding up simple prototypes is usually more interesting to me than doing large-scale experiments with my own experimental methods. I wish I had more time to read papers, do math, and understand things from first principles. But the fast pace of machine learning research usually doesn’t give you the luxury of time.
Jan 15: Presenting to Richard, Again
And then it was Tuesday, and my weekly meeting with Richard was that afternoon. I implemented a contrastive VAE with disentanglement using the density-ratio trick, gathered the results of my experiments, and put them in a series of slides. I looked at the figures; they looked pretty decent, actually, I thought to myself. Here was one example, where I had trained a contrastive VAE on celebrities with hats, contrasted against a background dataset of celebrities without hats:
A standard VAE would have learned skin color or background color as the latent features of interest, but the contrastive VAE actually seemed to learn latent features related to the hats, such as hat color and hat shape.
Happily, Richard was content with the results as well, and he mentioned that he would be happy to look over the paper when I had a draft ready. With the ICML deadline just a week away, I doubled down and began to write.
And that was the first two weeks of January! I hope that has given you a sense of what it’s like to do machine learning research as a PhD student. I’ll end the first part here, since I’ve already gone on long enough. But if you’d like to know what the rest of my month was like, including the rest of the steps that I took to write the paper, let me know in the comments and I’d be happy to write a Part 2!
Author’s note: the events have been slightly edited and rearranged for coherency. Names of people and computing clusters have been modified. Caveat: these experiences are only my own, and may not represent the experiences of other PhD students, particularly those with a better work-life balance!
This article was inspired by a question from my dear friend Ali Abdalla.