Artificial intelligence (AI) and machine learning (ML) have become a driving force of innovation in recent years. In 2022 alone, large language models like OpenAI’s GPT-3 and text-to-image diffusion models like Stable Diffusion made great progress toward AI systems that can hold full conversations and create realistic images from simple text prompts. AI/ML-powered products are also now indispensable parts of many e-commerce and digital media platforms, and chances are, you’ve interacted with one recently without even being aware of it.
But if you have ever tried to build a machine learning product, you were likely amazed by the complexity you found. A single project can require several different tools: You might use pandas for your data, scikit-learn for training, and a mish-mash of obscure libraries and implementations for other functions (some of which have not been updated in years). While open source has made a huge amount of creativity and tooling possible, this complexity is an unfortunate side effect. If we take a look at the history of this space and the background of the various contributors, we can better understand how to get the most out of this amazing open source ecosystem, despite these difficulties.
In this Guide you will learn:
- How machine learning evolved into a space dominated by open source software and the various factors that accelerated this union. 
- Who contributes to the open source machine learning space and why. 
- How free and open source software (FOSS) has helped democratize machine learning. 
The academic roots of open source in machine learning
Machine learning was mostly an academic area of research until the early 2000s, when many people still wrote their algorithms from scratch in languages like C++ for performance. MATLAB and R provided a more high-level approach, with MATLAB offering support for linear algebra computations and a host of numerical algorithms built in. This, alongside MATLAB’s interactive environment and visualization tools, made it a good fit for ML research, but it was a commercial product with licenses that cost a few thousand dollars per year.
In 2005, MATLAB discontinued education discounts for publicly funded research institutes that weren’t also teaching, where many ML researchers worked at the time. This led to a renewed interest in more open platforms. Octave, for example, is an open source MATLAB clone that has been around since the early 1990s. But people had become dissatisfied with MATLAB’s scripting language and were interested in using more general purpose languages. One such language, Python, already had the NumPy and SciPy libraries to provide linear algebra data types and numerical algorithms, putting it a step ahead of the rest.
That same year, I was part of a group at the Conference and Workshop on Neural Information Processing Systems (NeurIPS), where key community members decided to begin using Python for machine learning. Two years later, we published a joint paper on the need for open source in machine learning and launched a special track at the Journal of Machine Learning Research (JMLR) where researchers could publish papers for their open source projects.
Open source was a natural fit for researchers for several reasons. First, the open source ethos of collaboration and publicly publishing work for others to use and build upon is very close to that of scientific research. As a scientist, you also publicly publish your work so that others can build on it, and your performance is evaluated based on citations as a measure of your impact. There was a gap here, and the JMLR special track closed it so that you could get citations for software as well, which helped to incentivize researchers to invest in open source.
Open source licenses also removed barriers to publication of software by giving researchers a legal framework with which to publish their work. While scientists know how to publish scientific research papers, publishing software was a new challenge with legal implications. With its assortment of well-designed and tested licenses, open source offered researchers and their institutions a free and easy alternative. In fact, many of the original machine learning libraries, such as NumPy and scikit-learn, came out of the research environment as open source projects, a trend that took root early on. NumPy dates back to 1995, and has become the de facto library for dealing with matrices in the ML community.
Finally, the reproducibility of research results has always been a challenge in the ML field. Much like an experiment in physics, an ML algorithm is not just mathematics, but also the data on which it is tested and evaluated. Scientific papers, however, have size constraints that limit which implementation details can be included. Open source provided a way for researchers to publish the software and data alongside the paper, giving third parties the ability to verify results. But problems remain. Some in the field have suggested that including software should be mandatory for conference submissions, but preparing it is often labor-intensive and could introduce a new barrier to entry. Another issue lies in the immense amount of data and computational resources required to re-run experiments. For example, training larger language models requires weeks or even months running on a cluster of machines on data sets like The Pile, an 800 GB data set of text scraped from the internet. Even if the source code is available, few have the resources to reproduce the results. Still, for those who do, open source now provides more possibilities for reproducibility than before.
Big data offers an open source business model
While open source opened avenues for research publication, collaboration, and reproducibility, there remained an obstacle: the money and resources necessary to build, maintain, and grow a project beyond what is generally possible in academia. This changed with the arrival of Big Data from the mid-2000s onward, with projects like Apache Hadoop, Apache Spark, Apache Cassandra, Apache Storm, and many others. Inspired by technologies like Google’s MapReduce, these projects provided tools for storing and processing large amounts of data on a cluster of machines, which enabled ML use cases like clickstream processing for recommendations, churn prediction, and ad optimization. Perhaps equally important, they also helped birth a business model around open source that led to companies investing time, money, and effort into the open source projects at their core.
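To make the MapReduce idea a little more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps applied to a toy clickstream. The records and the function names (map_phase, shuffle, reduce_phase) are hypothetical and chosen only for illustration; this is not the API of Hadoop or any other project mentioned above. On a real cluster, the framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

# Hypothetical clickstream records: (user_id, item_id) pairs.
clicks = [
    ("u1", "shoes"), ("u2", "shoes"), ("u1", "hat"),
    ("u3", "shoes"), ("u2", "hat"), ("u3", "scarf"),
]


def map_phase(record):
    """Emit (key, value) pairs; here, one count per clicked item."""
    _user, item = record
    yield item, 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Run the three steps in sequence on one machine.
mapped = (pair for record in clicks for pair in map_phase(record))
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Per-item click counts like these are the kind of aggregate a simple popularity-based recommender could start from.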
The Apache Software Foundation (ASF) played an important role in shaping the open source software community by offering member projects a relatively open and explicit governance model for how to run an open source project, including guidance around contribution management and collaborative software development. Becoming a top-level ASF project was a stamp of approval that showed a project was committed to taking the open source approach seriously.
At the same time, one of the ASF’s core beliefs was the use of a “commercial-friendly standard license,” and many of these projects saw companies founded around them to sell support. This was still early in the general adoption of open source software in business, and companies were often wary of using open source software because it came with no support. Cloudera, which sold support for Hadoop, is a prominent example. Later, Cloudera added features around Hadoop, adopting an open source business model called “open core.” The often significant resources derived from this approach allowed these companies to grow and mature the open source projects at their core beyond what was possible in a purely academic setting, and funded efforts around marketing, documentation, training, and community conferences. Cloudera may have been the first, but many such companies followed its lead: Databricks built around Apache Spark, Confluent around Apache Kafka, and DataStax around Cassandra, to name just a few.
Deep learning, GPUs, and the Cloud advance ML
In the early 2010s, we saw the next big change: The acceleration of neural networks on graphics processing units (GPUs) led to a resurgence of deep learning models. Neural networks are among the oldest algorithms in machine learning, but large models took a lot of time to compute. Repurposing GPUs, which had been designed for rendering video games, led to speedups by a factor of 100. In 2012, Alex Krizhevsky et al. won the yearly ImageNet competition with a GPU-accelerated convolutional neural network, the first case where deep learning models beat existing rule-based or support vector machine (SVM) models on community benchmarks like ImageNet or CIFAR. It was just the beginning: GPU-accelerated deep learning models continued to beat existing models in other realms as well, and led to the many powerful AI systems we see today, such as GPT-3, DALL-E, and Stable Diffusion.
A few years later, powerful open source projects for deep learning like Google’s TensorFlow and Facebook’s PyTorch arrived, accelerating progress in the field. At their core, these libraries are similar to classical linear algebra libraries, but they add GPU acceleration and the ability to compute gradients automatically, even for complex network architectures. Computing these derivatives by hand is technically straightforward but very tedious, and while algorithms to compute the derivatives automatically already existed, integrating them into these libraries meant that people no longer had to work them out manually. Instead, they could fully focus on specifying the structure of the networks, paving the way for the increasingly complex network architectures we use today.
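To illustrate what computing gradients automatically means, here is a toy sketch of reverse-mode automatic differentiation in plain Python. The Value class and the neuron function are hypothetical names invented for this example; this is not the API of TensorFlow or PyTorch, which apply the same idea to whole tensors and run it on GPUs. Notice that only the forward computation is written out, and the derivatives fall out of the recorded graph.

```python
import math


class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Order the graph topologically, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


def neuron(w, x, b):
    """A single tanh neuron: only the forward structure is specified."""
    return (w * x + b).tanh()


w, x, b = Value(0.5), Value(2.0), Value(-0.3)
out = neuron(w, x, b)
out.backward()  # fills in w.grad, x.grad, and b.grad
```

After the call to backward(), w.grad holds the derivative of the output with respect to w, which by hand would be (1 - tanh(w*x + b)**2) * x. For one neuron that is easy to verify, but for a deep network the bookkeeping quickly becomes the tedious part that these libraries took over.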
During this same period, cloud computing made computing resources more readily available, giving rise to the software-as-a-service (SaaS) business model, which provided another great way to build companies on open source software. The software would remain free and open source, while the company would sell services to operate the software, for example to deploy machine learning models. This changed the relationship between the business and open source: While open source projects at the core of “open core” businesses were sometimes less complete (lacking, for example, enterprise features like single sign-on or tools to ease deployment), the projects at the center of SaaS businesses saw more investment and became more feature-complete.
Modern ML: Open source in its many forms
As you can see, today we deal with a rich and diverse landscape of open source software in the AI/ML field. Some projects are research projects that have stood the test of time, while others are open source software projects in the more classical sense, having been developed by a community over the years. Others still, while open source by definition, are firmly under the control of a company, and sometimes act as little more than an appetizer for a full-featured paid product.
Regardless of the form it takes, open source has helped drive AI/ML to where we are today: We have countless tools available to try out and often even use for free, surrounded by communities where you can find support. And as with any open source software, you can change the software to better fit your needs, contribute back to the original project, and even fork a project entirely if your changes don’t fit the original project or it has become inactive (as can often happen with projects that started in academia). In this respect, ML isn’t that different from other areas of tech.
Open source ML differs from open source in general, however, due to its academic roots. There exist a large number of essential tools that provide immense value, but the creators have no interest in commercial development; they want to focus on their academic career, not building a software business. Similarly, projects fall by the wayside as researchers move on and nobody steps in to maintain the projects, leaving them with little to no support, no product development, unfixed bugs, and missing features. I don’t mean to criticize how people run their projects, but instead to highlight that being aware of a project’s background can help to set expectations and reduce frustration on both sides.
I hope I have shed some light on the history of open source in machine learning and helped you better understand the different kinds of projects that exist today. Armed with this knowledge, you should be able to better decide which libraries to use when, and understand the incentives of the people, communities, and companies behind these projects. The ability of everyone to contribute to an open source project, whether or not they are a big, well-funded corporation, is an amazing feature of the open source model, and the marriage of open source and machine learning has led to otherwise improbable advancements. Machine learning is still a fast-moving field with many startups trying to build innovative products. The MLOps movement of the past few years, for example, is currently focused on productionizing ML workloads and further democratizing ML, and I’m excited to see what other areas of the ML toolchain might also be improved in the future.