This is a snapshot of the AMA and some of the key topics that were discussed. If you’re interested in reading the full discussion, check out Evidently’s blog.
Q: Do you believe data scientists should be end-to-end and be able to do everything from data prep and problem definition to deployment and monitoring?
A: Yes, from an organizational standpoint, since handoffs between the different personas taking on different parts of the end-to-end workflow are high-friction. At the same time, however, it’s a tall order to expect data scientists to split their attention across so many tasks. So, ‘no’, from an individual productivity standpoint.
Q: What big holes do you see as missing in the tooling that exists for a relatively straightforward use case in ML? Do you think we have all the tools covered and it’s just a question of smoothing over the edges, or are there more fundamental reshapings that you think are to come?
A: If anything, we have too many tools! The unbundling of the data science tooling landscape has been driven by the desire to adopt best-in-breed point solutions. However, this trend is not sustainable. Data scientists need to glue together many different tools in the end-to-end lifecycle of ML/DS, and the overhead is arguably taking away from the benefits of best-in-breed solutions. It feels that we’re undergoing a time of “reunification” of the data science ecosystem.
Q: What uniqueness does LineaPy bring to this tooling ecosystem? That is, what distinguishes LineaPy from other existing tools?
A: LineaPy is leading the “reunification” effort. We have found a balance between creating general-purpose tools that support a lot of things (suboptimally) and forcing data scientists to manage too many tools all at once: LineaPy is a “glue layer” that allows many different tools to easily integrate with the same workflow abstraction at different points in the end-to-end lifecycle.
Q: There is a lot of discussion on LinkedIn about the impending death of the modern data stack, but when I talk to companies, it seems that the modern data stack has not even arrived yet. They have yet to deploy Snowflake and work with dbt and the like. What is your feeling from talking with the market? Are we in the early stages of the modern data stack? Is it something that was already tried, and people want to re-bundle again? Or will it remain a niche that only pertains to some well-funded Internet companies?
A: This is a fun one! It feels like we are actually quite early in the modern data stack journey. We have a lot of great raw ingredients floating around in the market, especially great OSS tools. What’s different about big internet companies is that they have teams dedicated to evaluating the options and figuring out integrations. For the rest of the world, limited resources often mean having to work with a small sample of the available tools and having to fit together lego pieces that might not be compatible with each other. The “stack” is more like a Jenga tower, precariously balanced on a few key choices.
Q: I see similarities between LineaPy and tools for experiment tracking such as CometML or MLFlow in that all aim to increase productivity and document work. In LineaPy, there is also the pipelining component, but otherwise, how would you set apart LineaPy from tools in the data science space aiming at making DS teams more effective in going to production?
A: We’re similar in that, like MLFlow and CometML, we understand that the journey to production starts in development, and it’s crucial to provide functionalities that give us visibility into the development side of things. We’re different in two important ways: 1) LineaPy is framework agnostic. There’s a *logical* workflow implicitly expressed in the development code that these frameworks require data scientists to explicitly extract and define. 2) LineaPy eagerly captures everything rather than relying on data scientists to deliberately identify what’s important to capture during experimentation, which often only becomes clear in hindsight. These two factors combined allow LineaPy to automatically extract data pipelines from messy development code to be run in many different environments, such as Airflow, Ray, and someday MLFlow and CometML, too.
Q: What are your tips for those launching open-source libraries? When should one release the library? How much time should one invest on making the code elegant vs writing documentation? How do we go about finding a community of contributors?
A: If it’s “perfect,” you’re releasing it too late. Invite your users to be part of the chaos in the early days to create a partnership rather than a “dictatorship” from the tool developers to the users. Documentation is extremely important. If a tree falls in a forest and nobody is there to document it, all the development effort will have a very hard path to real user impact. As for finding community contributors, what we’ve found effective is to let contributions happen organically, and really celebrate them when they happen. This ties back to the previous point of forming a partnership with the users and looping them in early to co-develop the solution.
Q: When during the DS process do you advise running LineaPy? Should we run it each time we commit to a repository? Or should we wait until the code is stable and we want to refactor it for production? Where is the sweet spot, especially thinking more broadly in terms of software engineering for data science?
A: LineaPy should be the first thing data scientists import into their development environment! We anticipate and support the stream-of-thought data analytics workflows because it is what a data scientist needs to arrive at high-quality insights quickly.
LineaPy can capture everything during the development process to allow users to extract any relevant parts of the development workflow post hoc in a reproducible fashion. We believe it’s unreasonable to expect data scientists to figure out exactly what parts to save during development. LineaPy combats the messiness by automatically analyzing the program to extract only the parts necessary to reproduce the results of interest.
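To make the “extract only the necessary parts” idea concrete, here is a toy backward-slicing sketch in pure Python. It is emphatically *not* LineaPy’s actual implementation (LineaPy’s analysis handles far more than top-level assignments); the function name `slice_for` and the sample `messy` script are illustrative assumptions. Given a target variable, it keeps only the assignments needed to recompute it and drops the dead scratch work:

```python
import ast

def slice_for(source: str, target: str) -> str:
    """Toy backward slice: keep only the top-level assignments
    needed to recompute `target` (illustrative sketch only)."""
    tree = ast.parse(source)
    # Map each assigned name to the names its defining statement reads.
    defs = {}
    for stmt in tree.body:
        if isinstance(stmt, ast.Assign):
            reads = {n.id for n in ast.walk(stmt.value)
                     if isinstance(n, ast.Name)}
            for t in stmt.targets:
                if isinstance(t, ast.Name):
                    defs[t.id] = reads
    # Walk backwards from the target, collecting every name it depends on.
    needed, queue = set(), [target]
    while queue:
        name = queue.pop()
        if name in defs and name not in needed:
            needed.add(name)
            queue.extend(defs[name])
    # Emit the surviving statements in their original order.
    kept = [stmt for stmt in tree.body
            if isinstance(stmt, ast.Assign)
            and any(isinstance(t, ast.Name) and t.id in needed
                    for t in stmt.targets)]
    return "\n".join(ast.unparse(s) for s in kept)

# A "messy notebook" with debugging and scratch work mixed in.
messy = """
raw = [1, 2, 3, 4]
debug = print
total = sum(raw)
scratch = [x * 0 for x in raw]
mean = total / len(raw)
"""
print(slice_for(messy, "mean"))
# Keeps only: raw = [1, 2, 3, 4]; total = sum(raw); mean = total / len(raw)
```

The point of the sketch is the workflow it mirrors: the data scientist never decides up front what to keep; everything is captured, and the slice for the result of interest is computed afterwards.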
It’s always fun to connect with folks interested in MLOps, data science, and data engineering. If you’d like to connect with us, follow us on Twitter or join our Slack channel. You should also try out LineaPy and see how it can be useful in your workflow; we’d love your feedback for improvements. Lastly, check out our GitHub!