Getting data science projects from a development notebook to production is full of friction. LineaPy is an open-source tool that streamlines the data science productionization process by addressing common problems data practitioners run into along the way.
You can access the demo setup scripts here.
By default, LineaPy works locally in a development environment and serves a single user. In this post, we will show you how to host LineaPy's major components for access across your team. Hosting LineaPy enables everyone involved in the data science process, from data scientists to data engineers and software engineers, to use LineaPy's capabilities to collaborate seamlessly and build on each other's work as projects move from development to production.
You can learn more about LineaPy here.
As a starting point, it’s worth noting that LineaPy runs in each user’s local development environment to capture the development process, including both the code and runtime states.
In capturing the development process, LineaPy can save any result of interest generated during a data scientist's development work, including models, charts, statistics, and dataframes. These results are called Artifacts and can be stored in and retrieved from an Artifact Store. Importantly, an Artifact contains both the value and the code for the result. The Artifact Store can be hosted to facilitate collaboration and sharing of Artifacts between team members.
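Saving and retrieving an Artifact looks roughly like this (a minimal sketch run inside a LineaPy-enabled session, such as a notebook with the lineapy extension loaded; the toy model and artifact name are placeholders):

```python
import lineapy
from sklearn.linear_model import LinearRegression

# Train a toy model as a stand-in for real development work.
model = LinearRegression().fit([[0], [1], [2]], [0, 1, 2])

# Save it as an Artifact; LineaPy stores both the value and the
# code that produced it in the Artifact Store.
lineapy.save(model, "demo_model")

# Later, from any environment pointed at the same Artifact Store,
# retrieve the Artifact by name.
artifact = lineapy.get("demo_model")
print(artifact.get_value())  # the saved object itself
print(artifact.get_code())   # the cleaned-up code that produced it
```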
In production, artifacts need to be continuously updated as new data is ingested. LineaPy can automatically create data pipelines from artifacts by leveraging the code stored with them in the Artifact Store. This generates workflow files that can run on common orchestration engines such as Airflow. LineaPy can even generate a self-contained Dockerfile that sets up a local orchestration engine and runs the pipeline locally. Orchestration engines can also be hosted, which allows users to schedule pipeline runs and track their execution status as a team.
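Generating an Airflow pipeline from saved Artifacts looks roughly like this (a sketch; the artifact names and output directory are placeholders, and parameter details may vary across LineaPy versions):

```python
import lineapy

# Build an Airflow DAG (plus supporting module and requirements
# files) from the code captured in the named Artifacts.
lineapy.to_pipeline(
    artifacts=["cleaned_data", "demo_model"],
    framework="AIRFLOW",
    pipeline_name="demo_pipeline",
    output_dir="/opt/airflow/dags",
)
```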
Together, the dev environment, Artifact Store, and orchestration engine give a complete view of the major tools that data science and engineering teams work with daily to organize development work and run it as production pipelines. This demo environment illustrates how LineaPy integrates with these components to make the transition to formal production pipelines as seamless as possible.
Hosting the LineaPy Artifact Store gives team members the ability to share artifacts and their associated metadata.
Under the hood, the Artifact Store is composed of two important components: a data store that holds each artifact's value and a DB that holds its metadata. The Artifact Store is quite flexible and can be hosted using any SQL DB and file storage system, whether locally or remotely. To connect, users simply configure their local LineaPy dev environment to point at an existing Artifact Store.
Our demo environment provides a quick example of what hosting these components looks like. If at any point you run into issues, post your question in the #support channel on our community Slack and one of our developers will get back to you right away!
This demo environment specifically uses a Postgres DB, a MinIO S3-based data store, and a standalone Airflow orchestrator.
To spin up the demo environment of remotely hosted LineaPy, simply use Docker Compose. This spins up four main containers of note.
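From the directory containing the demo's compose file, the standard command should bring everything up (assuming a recent Docker installation with the Compose plugin):

```sh
docker compose up
```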
Two setup notes are important for understanding how the dev environment is connected to each hosted component.
First, explore how lineapy_config.json is configured to connect the hosted Jupyter notebook dev environment to the DB and the MinIO-hosted S3 bucket. Any LineaPy environment configured similarly will be able to access the same Artifact Store and use Artifacts as if they were created locally. Simply set storage_options, artifact_storage_dir, and database_url.
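A configuration along these lines should work (a sketch with placeholder bucket names, hostnames, and credentials; the demo's actual values live in its lineapy_config.json, and the storage_options keys follow the s3fs conventions for pointing at a MinIO endpoint):

```json
{
  "artifact_storage_dir": "s3://lineapy-artifact-store/",
  "database_url": "postgresql://lineapy:lineapy@postgres:5432/lineapy",
  "storage_options": {
    "key": "minioadmin",
    "secret": "minioadmin",
    "client_kwargs": { "endpoint_url": "http://minio:9000" }
  }
}
```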
Second, notice the shared-airflow-dags-volume mounted on both the dev environment and the hosted orchestrator. This allows any generated DAGs to be picked up by the orchestrator immediately, without the need to copy any files over.
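In docker-compose terms, the arrangement looks roughly like this (an illustrative fragment with hypothetical service names, not the demo's exact file):

```yaml
services:
  dev-environment:
    # Jupyter + LineaPy; lineapy.to_pipeline writes DAG files here
    volumes:
      - shared-airflow-dags-volume:/opt/airflow/dags
  airflow:
    # Standalone Airflow scans the same directory and picks up
    # the generated DAGs automatically
    volumes:
      - shared-airflow-dags-volume:/opt/airflow/dags

volumes:
  shared-airflow-dags-volume:
```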
Run our demo notebooks to see how pipelines are created and picked up automatically by the hosted Airflow orchestrator! Then have a coworker connect with their own dev environment and load a few of the Artifacts you created.
The #support channel on our Slack can be used to troubleshoot any issues you run into.
Setting up a hosted LineaPy Artifact Store is as simple as setting up a remote object storage bucket and a SQL database. Any local development environment with LineaPy installed and connectivity to those two resources can be configured to use the pair as an Artifact Store, allowing data scientists and engineers to access the same Artifacts and collaborate as a team. We hope this demo illustrates how easy it is to set up LineaPy environments and configure them for any type of team environment.
The demo can be found here. See our video walkthrough of it.
If you are interested in an enterprise platform solution that provides hosting for LineaPy with features for pipeline orchestration, team collaboration, and lineage tracking, make sure to also sign up here for product announcements.