What is Reproducibility All About?

As a computer scientist, or as an academician, you do experiments with a bunch of algorithms. Let’s say you have coded a machine learning algorithm, like an Artificial Neural Network, and after doing your experiments with different datasets, you have found out that the best type of neural network has the following structure, in terms of the architecture:

So, you know that the best performance happens with this exact architecture and you have proof of that, as you have been a good programmer and saved all your results. The problem arises when you decide to reproduce the same results on your same very machine . You choose the same architecture, with the same number of convolutional layers, the same kernel size for your convolutional operations, the same pooling in your pooling layers, but your algorithm performs differently! If you are so unlucky, the results could be quite poor! This is what happened to me and made me struggle with it for 3 straight weeks, until I found a solid solution, which I am going to share with you 😉 

This is when you say, an experiment is NOT reproducible!

In this post, I will address one such notorious problem when coding with PyTorch and Python. Let’s just say that I struggled with my script for over 3 weeks and I learned about some very nasty tricks! I will share them all with you, so you will not experience the pain ;-).

The Guidelines for Reproducibility in PyTorch

  • Seed Every Random Number Generator anywhere in your code. The hiding places for random number generators in Python scripts are:

    • Numpy’s random number generator: If you are using anywhere in your code, any type of sampling, or random number generation methods of Numpy, you will need to seed Numpy’s random number generator.

    • Python’s random number generator: Python has functionalities such as shuffling a list/array, and if in your code there is any such functionalities of Python, you will need to make sure that you have seeded Python’s random number generator.

    • The behavior of your environment variable: The way the processes/threads are generated in your environment, is subject to randomness and you need to seed it.

  • Now, PyTorch has its own devilment and you need to pay attention now, as this is the real part that you are probably struggling with.

    • The general PyTorch source of randomness: Seeding it controls the general randomness in PyTorch. Especially operations related to the CPU operations!

    • The specific Cuda seeding: This has 2 steps, depending whether the code is running on only 1 or multiple GPUs. This has nothing to do with CPU operations!

    • The Cudnn’s behavior: This needs to be deterministic and this is done in 2 steps as well

  • CRAZYYYY Issues with the Dataloader: On a multi-processing platform such as a server (Whether you work with GPUs or CPUs), the dataloader can behave in a very strange manner.

    • The craziest thing ever is the behavior of the dataloader when one runs their code on a Server where we have a multi-processing, strong platform. This tricky case, that I am going to talk to you about, does not happen on your local laptop or desktop but only on a server, regardless of whether we are using GPUs or CPUs, and this is what makes this case of reproducibility indeed a nasty one! The problem is that if you are defining a batch size for your dataloader, chances are that the last batch of your data might not be fully filled with data! For instance, if your batch size is 32, then the last batch might have less number of data, say 10! The dataloader, only on a multi-processing platform like a server, fills the remaining space with some random junk! I have no idea why! But this will make your code NOT reproducible, as there is no way that I know of that you could seed this particular source of randomness! Now, as a solution, you could switch to a FULL BATCH learning method, or if you insist on a mini-batch learning method, then you could ask the dataloader to LITERALLY drop the last batch, with turning the drop_last flag to True!!

  • Some more rare Issues with the Dataloader: In some rare cases, the Dataloader itself needs to be seeded! (And No! Seeding it, did not fix the last previous problem with the last batch) You see, the dataloader has this idea of the “workers” behind the scene, that handle the entire shuffling operation in your dataloader. So, you MIGHT have to seed these workers as well! The way to seed them is a little strange! The dataloader module has a variable called “worker_init_fn” and it needs to be fed with a function (yeah! I know! Weird!) and in that function you would have to seed these workers. Here is how you do it:

  • Seeding dependent scripts everywhere: Make sure, IF your main function is calling other functions and scripts that you have coded in a separate file, that you have seeded EVERYTHING in those scripts and files as well.

  • Seeding any algorithm in any packages that you are using: Make sure if you are using any packages, that you have seeded any source of randomness in them. For example, if you are using Scikit-learn’s K-means clustering algorithm, you MUST know that this package initializes the cluster centers randomly! So, you need to seed that randomness for the sake of reproducibility. As shown below:

I do hope that this post has been concise and useful,

Take care guys 😉

Leave a Comment