One of our Program Committee members, Reshama Shaikh, recently interviewed Andreas Mueller, a Lecturer in Data Science at Columbia University and core developer of the Python library scikit-learn, on some of his recent work with the scikit-learn open source community. Andreas and Reshama (an organizer of the meetup group Women in Machine Learning and Data Science) are co-organizing a scikit-learn sprint on March 4th in NYC to increase women’s participation in open source contribution. Check it out here.
RS) Tell us briefly about yourself.
AM) I’m currently a lecturer in Data Science at Columbia University, where I teach applied machine learning. I have been a core developer of the Python library scikit-learn for the past 6 years. I recently published the book Introduction to Machine Learning with Python.
RS) How did you get involved in scikit-learn and open source in general?
AM) While working on my Ph.D. in computer vision and learning, the scikit-learn library became an essential part of my toolkit. I was an ardent user of the library, and I wanted to take part in its advancement. My participation in open source began in 2011 at the NIPS conference in Granada, Spain, where I attended a scikit-learn sprint. The scikit-learn release manager at the time had to leave, and the project leads asked me to become release manager; that’s how it all got started.
RS) Last year, you reached out to me, as an organizer of the meetup group Women in Machine Learning and Data Science, and asked about our group’s interest in doing a sprint. You were working on a grant proposal to the NSF to fund the sprint for my meetup group. Where did you get the idea to submit a grant to increase women’s participation in open source?
AM) It was part of a bigger grant submission to the NSF. It is very obvious that in academia, and in computer science in particular, there are very few women and there is gender bias. This is apparent at conferences, where there are noticeably few women. Unfortunately, in open source, the gender imbalance is even worse, and in academic open source, women’s participation is even lower. There is only one woman among the top 100 contributors to the scikit-learn library. Fortunately, there are lots of funding agencies that are happy to fund diversity and research.
RS) What are your long-term goals for increasing women’s participation in open source?
AM) My goal is to have more women actively involved in scikit-learn. Right now, there are 1 or 2, so any number greater than that is progress. Ultimately, we would like to have more women involved in other central open source Python projects such as numpy, matplotlib and jupyter.
RS) What do you think women bring to open source that is missing?
AM) This is a complicated question, and I want to avoid generalizations, such as statements that one gender does something that another doesn’t. My ultimate goal is to make sure that everyone in the community participates. Since both men and women use open source, it would be beneficial for the entire ecosystem if both men and women were contributors.
RS) Why do you think women are not as involved in open source?
AM) There could be a number of hypotheses. Maybe we are just so unfriendly to women that they start to drop out, though I don’t think that’s it. The gender disparity is a substantial problem in other places in tech. It’s possible it is a funnel problem, where women do not have the opportunity to start being involved. A female friend of mine, a high-profile machine learning researcher, told me she was anxious about posting on the scikit-learn issue tracker. We need to find barriers and remove them.
RS) Why is contributing to open source so important?
AM) This is an easier question. Many tech applications and a lot of research are built on open source software. Basically the whole internet runs on Linux, and that is open source.
Some software projects receive corporate funding, either in terms of money or time. This is very true for the Apache ecosystem. In contrast, the scientific Python ecosystem, as well as other scientific programming languages such as R and Julia, is mostly the product of volunteer labor. Most scientific packages don’t have support from industry at all. There are so many people (including students and self-learners) who would not be able to do their work without open source. Accessibility to open source is fundamental for education and research. This accessibility creates opportunities for users and has led to profound advances in many sectors of society. The startup community has flourished as a result of this access.
RS) How does one get involved in contributing to open source?
AM) People can reach out to a project on its mailing list, and they can also sign up for it. Projects have guidelines on how to contribute and how to get started. There is an issue tracker on GitHub that lists things people can work on, such as fixing a bug or making a small addition. The entire process of submitting a contribution might be complicated, so my advice is to start small and then move on to more interesting stuff. Small contributions really help. Details here: http://scikit-learn.org/dev/developers/contributing.html
RS) What are other open source projects?
AM) Other open source Python data science projects are: numpy, matplotlib, jupyter, pandas and scipy. More details can be found at: scikit-learn.org.
*Both Reshama and Andreas will be attending MLconf NYC on Friday, March 24th. Andreas will be discussing scikit-learn and his O’Reilly book at a table in the networking space during the conference. Mention “Andreas18” and save 18% on a ticket to the event!
Andreas Mueller is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python”, describing a practical approach to machine learning with Python and scikit-learn. Dr. Mueller is one of the core developers of the scikit-learn machine learning library, and has been co-maintaining it for several years. Dr. Mueller is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon.
Reshama Shaikh is a data scientist/statistician and MBA with skills in Python, R and SAS. She worked for over 10 years as a biostatistician in the pharmaceutical industry. She currently teaches data science at Metis. She is also an organizer of the meetup group NYC Women in Machine Learning and Data Science http://wimlds.org/chapters/about-nyc/. She received her M.S. in statistics from Rutgers University and her M.B.A. from NYU Stern School of Business. Twitter: @reshamas