MLflow is a great tool for managing the entire ML lifecycle. We often use MLflow for the following purposes:
Compared to MLflow, which focuses on ML models, LineaPy treats all serializable Python objects as artifacts. Here’s how:
LineaPy and MLflow intersect when an artifact is also an ML model. In this case, it would be great if we could leverage the functionality from both libraries.
Currently, in order to achieve the above goal, we need to save the artifact (model) in both MLflow and LineaPy. However, manually maintaining two copies of the same data in two locations is usually not a good practice. Thus, we were trying to find an easy way for existing MLflow users to leverage the functionalities of LineaPy or vice versa.
Both LineaPy and MLflow require users to save the artifact (ML model) to the artifact store in order to use the library's full feature set. In LineaPy, this is done through `lineapy.save`, and in MLflow, through `mlflow.flavor.log_model` or its equivalent.
Manually writing two save statements for the same object can create problems down the line. For instance, if one of the two statements is accidentally omitted, the versions tracked in the two stores drift apart and become hard to reconcile. It would be great if we could write a single statement that registers the artifact in both LineaPy and MLflow, so that we no longer need to worry about keeping the two in sync. This is exactly what the rest of the post is about.
Note that MLflow supports several model flavors such as `sklearn`, `tensorflow`, `statsmodels`, and more. In the rest of the post, we will use `flavor` to represent various MLflow-supported flavors.
LineaPy supports using MLflow as the storage backend for ML models. Instead of writing both `mlflow.flavor.log_model` and `lineapy.save` for the same artifact (model) and saving it in both locations, we can now write a single `lineapy.save`. The model is then saved only in MLflow but registered in both the MLflow and LineaPy stores.
When we run `lineapy.save(model, 'model_name')`, LineaPy detects the object type of `model`. If the model belongs to an MLflow-supported flavor, then instead of using LineaPy as the storage backend to serialize/deserialize the model (artifact), LineaPy calls `mlflow.flavor.log_model` to log the model into the MLflow artifact store.
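The dispatch described above can be pictured with a small toy model. This is only a sketch of the idea, not LineaPy's actual implementation: the names `SUPPORTED_FLAVORS`, `detect_flavor`, and `choose_backend` are hypothetical, and detecting the flavor from the top-level package of the object's class is an assumption made for illustration.

```python
from typing import Any, Optional

# Hypothetical subset of flavors, keyed by the top-level package of the model class.
SUPPORTED_FLAVORS = {"sklearn", "prophet", "statsmodels", "xgboost"}


def detect_flavor(obj: Any) -> Optional[str]:
    """Return a flavor name if obj's class comes from a supported package."""
    top_level_package = type(obj).__module__.split(".")[0]
    return top_level_package if top_level_package in SUPPORTED_FLAVORS else None


def choose_backend(obj: Any, default_ml_backend: str = "mlflow") -> str:
    """Decide where an object would be stored in this toy model of the dispatch."""
    if detect_flavor(obj) is not None and default_ml_backend == "mlflow":
        return "mlflow"
    return "lineapy"


# Stand-in for a real estimator: only the module path of the class matters here.
FakeForest = type(
    "RandomForestClassifier", (), {"__module__": "sklearn.ensemble._forest"}
)

print(choose_backend(FakeForest()))             # mlflow
print(choose_backend({"a": 1}))                 # lineapy
print(choose_backend(FakeForest(), "lineapy"))  # lineapy
```

Objects that are not ML models, such as the plain dictionary here, keep using the regular LineaPy storage path in this sketch.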
Once we save an artifact from LineaPy using MLflow as the backend storage, we can retrieve the model from both LineaPy and MLflow depending on our preference.
We can retrieve the artifact value (ML model) with the same API used for all other LineaPy artifacts, even when MLflow is the storage backend:
artifact = lineapy.get('model_name', version=artifact_version)
model = artifact.get_value()
Since ML models use MLflow as the backend storage, they are also registered in the MLflow model store. Thus, we can retrieve the same model with the MLflow API:
model = mlflow.flavor.load_model('model_uri')
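The `'model_uri'` placeholder above stands for an MLflow model URI. Two common forms are `models:/<name>/<version>` for a model in the Model Registry and `runs:/<run_id>/<artifact_path>` for a model logged under a run. The helpers below are hypothetical conveniences that only build these strings; they are not part of MLflow or LineaPy.

```python
from typing import Union


def registered_model_uri(name: str, version: Union[int, str]) -> str:
    """URI for a specific version of a model in the MLflow Model Registry."""
    return f"models:/{name}/{version}"


def run_model_uri(run_id: str, artifact_path: str) -> str:
    """URI for a model logged under a run, relative to that run's artifact root."""
    return f"runs:/{run_id}/{artifact_path}"


print(registered_model_uri("lineapy_clf", 1))  # models:/lineapy_clf/1
print(run_model_uri("<run_id>", "model"))      # runs:/<run_id>/model
```

Either string can then be passed to the flavor's `load_model` call.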
All the LineaPy features, such as code slicing (`lineapy.get('model_name').get_code()`) and pipeline generation (`lineapy.to_pipeline(['model_name'])`), should work as usual.
All the MLflow features should work as usual. There is no need to change the way we interact with MLflow because of LineaPy.
The following MLflow-related configuration items need to be set so that LineaPy can use MLflow as the storage backend for ML models.
- `mlflow_tracking_uri`: where the MLflow model is tracked (see MLflow Tracking).
- `mlflow_registry_uri`: (optional) depends on how our MLflow is configured (see MLflow Model Registry).
- `default_ml_models_storage_backend`: which storage backend (`lineapy` or `mlflow`) to use by default for an MLflow-supported model (pre-configured as `mlflow`).

These configuration items can be set like all other LineaPy configuration items.
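As a small configuration sketch, the snippet below sets two of these items with `lineapy.options.set`. It assumes `default_ml_models_storage_backend` is accepted through the same call as the other items, and the tracking URI is only a placeholder value.

```python
import lineapy

# Placeholder URI: point this at wherever your MLflow tracking data lives.
lineapy.options.set('mlflow_tracking_uri', 'file:///tmp/mlruns')

# Assumption: this item is set like the others. Choosing 'lineapy' here would
# keep MLflow-supported models in the LineaPy store instead of MLflow.
lineapy.options.set('default_ml_models_storage_backend', 'lineapy')
```

Leaving `default_ml_models_storage_backend` untouched keeps the pre-configured `mlflow` behavior.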
Here is a basic example that configures MLflow within LineaPy, uses `lineapy.save` to register the ML model in both the LineaPy and MLflow stores, and retrieves the model with both the LineaPy API and the MLflow API.
import lineapy
import mlflow
# Configure MLflow within LineaPy
lineapy.options.set('mlflow_tracking_uri','file:///tmp/mlruns')
lineapy.options.set('mlflow_registry_uri','sqlite://')
# Train a sklearn model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1, 2, 3], [11, 12, 13]] # 2 samples, 3 features
y = [0, 1]
clf.fit(X, y)
# Only one save statement is needed; the model should be registered in both LineaPy and MLflow
lineapy.save(clf, 'clf', registered_model_name='lineapy_clf')
# Retrieve Model from LineaPy
art = lineapy.get('clf')
lineapy_model = art.get_value()
metadata = art.get_metadata()
# Retrieve Model from MLflow
client = mlflow.MlflowClient()
latest_version = client.search_model_versions("name='lineapy_clf'")[0].version
mlflow_model = mlflow.sklearn.load_model(f'models:/lineapy_clf/{latest_version}')
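The example above takes the first search result as the latest version. If the result order is not guaranteed in your setup, one option is to pick the highest version number explicitly. The sketch below uses `SimpleNamespace` stand-ins for the objects returned by `client.search_model_versions(...)`, and `latest_model_version` is a hypothetical helper. It assumes each result has a `version` attribute holding an integer or a numeric string.

```python
from types import SimpleNamespace


def latest_model_version(versions):
    """Return the entry with the highest numeric version, or None if empty."""
    return max(versions, key=lambda mv: int(mv.version), default=None)


# Stand-ins for model version objects, deliberately out of order.
versions = [
    SimpleNamespace(version="2"),
    SimpleNamespace(version="10"),
    SimpleNamespace(version="1"),
]

latest = latest_model_version(versions)
print(f"models:/lineapy_clf/{latest.version}")  # models:/lineapy_clf/10
```

Comparing with `int(...)` matters because a string comparison would rank `"2"` above `"10"`.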
Currently, we support the following model flavors: prophet, sklearn, statsmodels, xgboost, and more. We plan to support all MLflow-supported flavors soon.
As we can see from the above example, using LineaPy with MLflow is easy. With a minimal code change (using `lineapy.save` instead of `mlflow.flavor.log_model`), we can enjoy the benefits of both libraries.
However, we believe what we have achieved so far is just the tip of the iceberg regarding the integration of LineaPy with other tools. One potential direction related to this post is whether we should let LineaPy detect `mlflow.flavor.log_model` and register the model as a LineaPy artifact automatically. We would love to hear your thoughts on this.