Two-thirds of data practitioners publicly share their data analytics or machine learning applications, according to The New Stack’s analysis of Kaggle’s latest annual survey of machine learning and data science.

Of those who collaborate publicly, 76% said they do so using GitHub. Despite the criticism it draws, the platform continues to be one of the most critical pieces of the tech stack for developers and non-developers building data and AI applications.

In 2021, more than 25,000 people responded to the survey. Because many participants use the Google-owned Kaggle platform to learn how to become data scientists, The New Stack’s analysis focused only on the 17,182 respondents who said they were employed.

Of the 840 machine learning engineers in the study, 61% said they use GitHub for sharing, the highest percentage of any profession in the report. Only 40 people in developer relations/advocacy roles participated in the study, but it’s worth noting that just 45% of them said they use GitHub to share their apps or analytics.
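As a rough illustration of how a per-profession figure like this can be derived, here is a minimal Python sketch. It assumes the denominator is every employed respondent in a role, as the "of the 840 machine learning engineers" phrasing suggests. The rows, field names, and the `github_share_by_role` helper are all hypothetical; they are not the Kaggle survey's actual columns or The New Stack's actual code.

```python
from collections import defaultdict

# Hypothetical, hand-made rows standing in for survey responses.
# Field names are illustrative, not the Kaggle survey's column names.
responses = [
    {"role": "Machine Learning Engineer", "employed": True, "shares_on": {"GitHub", "Kaggle"}},
    {"role": "Machine Learning Engineer", "employed": True, "shares_on": set()},
    {"role": "Data Analyst", "employed": True, "shares_on": {"Colab"}},
    {"role": "Data Analyst", "employed": True, "shares_on": {"GitHub"}},
    {"role": "Student", "employed": False, "shares_on": {"GitHub"}},
]


def github_share_by_role(rows):
    """Return {role: percent of employed respondents in that role who share via GitHub}."""
    totals = defaultdict(int)
    github_users = defaultdict(int)
    for row in rows:
        if not row["employed"]:
            continue  # keep employed respondents only, as the analysis did
        totals[row["role"]] += 1
        if "GitHub" in row["shares_on"]:
            github_users[row["role"]] += 1
    return {role: 100 * github_users[role] / totals[role] for role in totals}


shares = github_share_by_role(responses)
for role, pct in sorted(shares.items()):
    print(f"{role}: {pct:.0f}%")
```

With a denominator of only those who share publicly instead, the same counts would yield higher percentages, which is why the 76% and 61% figures above are not directly comparable.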

Data scientists, software developers, and data analysts made up the largest portion of the study participants. Here are some other takeaways from the study:

  • Collaboration tools designed for data science, machine learning, and artificial intelligence use cases were not widely adopted among Kaggle survey respondents. Of the study participants who said they collaborated publicly, a third used Kaggle itself and 20% used Colab, which is also a Google product. Since these offerings are affiliated with the survey itself, we don’t believe those figures are representative of the larger market.
  • Streamlit, which was purchased by Snowflake earlier this year, was cited as a favorite collaboration tool by 4%. In May, the former CEO of Streamlit described the rise of data-driven applications in The New Stack.
  • Open source Nbviewer and Plotly Dash, which transformed a popular open source visualization tool into a low-code platform, were two other ways to share data analysis and ML applications.

IDEs and collaboration

Collaboration also takes place within and between notebooks, which have taken on a life of their own as integrated development environments (IDEs). Like most developers, the average data practitioner uses more than one IDE, but some flavor of Jupyter or JupyterLab is the most common, with Visual Studio Code coming second. Yet many types of hosted notebooks struggle to make their mark in a crowded field:

  • More than a third of study participants said they use Kaggle and Colab notebooks. Google appears to be successful in turning these users into paying customers for its other notebook and cloud offerings.
  • Eight percent use Binder, which turns a Git repository of Jupyter notebooks into a live interactive environment.
  • Overall, 7% of study participants said they use notebook offerings from Amazon Web Services or Microsoft Azure. However, more than 15% of AWS and Microsoft Azure cloud computing customers also use a notebook or other AI-related solution from their cloud provider.
  • Databricks and IBM offerings received more than passing mentions, but niche products Deepnote, Code Ocean, Gradient and Observable were each used by only 1% of study participants.

We are still in the early days of data-driven applications. Most data analysts aren’t interested in software licenses or which code repository they use. They want to go where the data is and where people are most likely to share their models. According to Meltano, a company that GitLab itself created, that place is GitHub.

I could provide a huge list of low-code platforms, DataOps pipeline integrations, collaboration tools, and next-gen Airtables, many of which have strong followings. But few, if any, are really close to mass adoption. Some have achieved viability as niche products in niche industries, but only variants of the Jupyter notebook and GitHub seem familiar enough to non-technical audiences, data professionals, and developers to become breakout hits.

What do you think? How can the modern data stack break out of the schema without stifling collaboration? You can join here.

