Kubeflow Pipeline + MLflow Integration
This guide shows how Kubeflow Pipelines (KFP) components log parameters, metrics, and models to MLflow on Kubeflow with the MLflow Python client. Authentication and workspace/RBAC behavior follow the guide "Using the MLflow Python SDK with Authentication and RBAC" (the SDK guide below): each component authenticates with a user identity token, and the server records the run under that user.
TOC
- Scope
- Prerequisites
- How components reach MLflow
- Complete example: training pipeline with MLflow
- Upload and run
  - Via the KFP UI
  - Via the KFP SDK
- Using MLflow in Trainer v2 pipelines
- Best practices
  - Use the pipeline job ID in MLflow
  - Keep credentials in a Secret and refresh tokens
  - Log metrics inside a run
  - Artifact storage for production
- Troubleshooting
Scope
- Alauda AI 2.5 and later.
- Kubeflow Pipelines and the MLflow cluster plugin are installed.
- The MLflow workspace is a namespace labelled mlflow-enabled=true.
- The MLflow OAuth proxy accepts bearer tokens and a Dex client permits the password grant (see Platform setup in the SDK guide).
Prerequisites
- kfp and kfp-kubernetes Python SDKs (pip install kfp kfp-kubernetes).
- Access to a KFP endpoint (see Use Kubeflow Pipelines).
- A Dex id token for a dedicated service account, minted with the OAuth2 password grant (see the SDK guide). Store it in a Kubernetes Secret and inject it into the component.
- An MLflow workspace (a namespace with mlflow-enabled=true) the account can access.
How components reach MLflow
A pipeline component runs inside the cluster, so it talks to MLflow through the in-cluster Service http://mlflow-tracking-server.kubeflow:5000 (which is fronted by the OAuth proxy — components never use the MLflow container port directly). It authenticates exactly like any other MLflow client:
- MLFLOW_TRACKING_TOKEN: a Dex id token; the MLflow client sends it as Authorization: Bearer ….
- mlflow.set_workspace(...): selects the workspace (X-MLFLOW-WORKSPACE).
The server reads the identity from the token and records the run under that user. See the SDK guide for how the token is obtained and how authorization works.
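As a rough illustration of the two settings above, the following sketch spells out the request headers this guide describes and wraps the client setup in a helper. The function names and the sample workspace team-a are hypothetical; mlflow.set_workspace is used as named in this guide, and mlflow is imported lazily so the header helper runs without it.

```python
import os

# In-cluster Service named in this guide (fronted by the OAuth proxy).
TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"


def expected_headers(token: str, workspace: str) -> dict:
    """Headers this guide says accompany each MLflow request."""
    return {
        "Authorization": f"Bearer {token}",
        "X-MLFLOW-WORKSPACE": workspace,
    }


def configure_mlflow(workspace: str) -> None:
    """Point the MLflow client at the in-cluster Service and select a workspace."""
    # The MLflow client reads the bearer token from this environment variable.
    if not os.environ.get("MLFLOW_TRACKING_TOKEN"):
        raise RuntimeError("MLFLOW_TRACKING_TOKEN is not set")
    import mlflow  # imported lazily; requires the mlflow package

    mlflow.set_tracking_uri(TRACKING_URI)
    mlflow.set_workspace(workspace)  # as described in this guide
```

Calling configure_mlflow("team-a") once at the start of a component would be enough; the client then attaches the token on every request by itself.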
Complete example: training pipeline with MLflow
The component uses the MLflow client and reads MLFLOW_TRACKING_TOKEN from a Secret injected with kfp-kubernetes. KFP v2 packages each component from its own source, so import mlflow lives inside the function.
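A minimal sketch of such a component and pipeline is given here; it is an illustration under stated assumptions, not the platform's reference pipeline. The function names, the python:3.11 base image, the pipeline name, the placeholder training loop, and the Secret key token are all assumptions. The KFP wiring sits in compile_pipeline so that kfp and kfp-kubernetes are only needed when compiling.

```python
def train_with_mlflow(workspace: str, model_name: str, epochs: int) -> float:
    """Component body: logs parameters and a per-epoch loss to MLflow."""
    import mlflow  # KFP v2 packages only this function's source, so import here

    mlflow.set_tracking_uri("http://mlflow-tracking-server.kubeflow:5000")
    mlflow.set_workspace(workspace)  # as described in this guide
    mlflow.set_experiment(model_name)

    loss = 1.0
    with mlflow.start_run(run_name=f"{model_name}-train"):
        mlflow.log_params({"model_name": model_name, "epochs": epochs})
        for epoch in range(epochs):
            loss = 1.0 / (epoch + 2)  # placeholder for a real training step
            mlflow.log_metric("loss", loss, step=epoch)
    return loss


def compile_pipeline(package_path: str = "pipeline.yaml") -> None:
    """Wrap the function as a KFP component, inject the Secret, and compile."""
    from kfp import compiler, dsl, kubernetes  # needs kfp and kfp-kubernetes

    train_op = dsl.component(
        train_with_mlflow,
        base_image="python:3.11",  # assumed image
        packages_to_install=["mlflow"],
    )

    @dsl.pipeline(name="mlflow-training")  # assumed pipeline name
    def pipeline(workspace: str, model_name: str, epochs: int = 3):
        task = train_op(workspace=workspace, model_name=model_name, epochs=epochs)
        # Expose the Secret key "token" (assumed key name) as MLFLOW_TRACKING_TOKEN.
        kubernetes.use_secret_as_env(
            task,
            secret_name="mlflow-token",
            secret_key_to_env={"token": "MLFLOW_TRACKING_TOKEN"},
        )

    compiler.Compiler().compile(pipeline, package_path=package_path)
```

Running compile_pipeline() should write pipeline.yaml, which contains only the Secret name and key, never the token value.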
Create the mlflow-token Secret with an id token from the password grant (see the SDK guide):
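One way to produce the Secret, sketched in Python so it can be scripted next to the pipeline code. The helper names, the key name token, and the choice of namespace (assumed to be the one where the pipeline pods run) are assumptions; the id token itself still comes from the password grant described in the SDK guide.

```python
import json


def mlflow_token_secret(id_token: str, namespace: str) -> dict:
    """Build a Secret manifest that holds the id token under the key "token"."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "mlflow-token", "namespace": namespace},
        "type": "Opaque",
        # stringData takes plain text; the API server stores it base64-encoded.
        "stringData": {"token": id_token},  # key name is an assumption
    }


def render_secret(id_token: str, namespace: str) -> str:
    """Serialize the manifest as JSON, which kubectl accepts like YAML."""
    return json.dumps(mlflow_token_secret(id_token, namespace), indent=2)
```

Printing render_secret(...) from a small script and piping the output to kubectl apply -f - creates or updates the Secret without writing the token to disk.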
The id token expires (24 h by default), so refresh the mlflow-token Secret before submitting long pipelines. Alternatively, mint the token inside the component from service-account credentials kept in a Secret (the password-grant call shown in the SDK guide), so that each run gets a fresh token.
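A small sketch of a freshness check that could run before submitting a pipeline. It relies only on the fact that an OIDC id token is a JWT carrying an exp claim; the function names and the one-hour threshold are assumptions, and the payload is decoded without signature verification, which is fine for a local expiry check but not for trusting the token.

```python
import base64
import json
import time
from typing import Optional


def seconds_until_expiry(id_token: str, now: Optional[float] = None) -> float:
    """Return seconds until the token's exp claim; negative if already expired."""
    payload_b64 = id_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    current = time.time() if now is None else now
    return claims["exp"] - current


def needs_refresh(
    id_token: str,
    min_remaining_s: float = 3600.0,  # assumed safety margin
    now: Optional[float] = None,
) -> bool:
    """True when the token has less than min_remaining_s left."""
    return seconds_until_expiry(id_token, now) < min_remaining_s
```

If needs_refresh returns True, mint a new token and update the mlflow-token Secret before creating the run. For pipelines that may outlive the token, pass the expected pipeline duration as min_remaining_s.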
Upload and run
Via the KFP UI
- Go to Kubeflow Dashboard → Pipelines → Upload Pipeline and select pipeline.yaml.
- Click Create Run and fill in the parameters (workspace, model name, epochs).
- After the run starts, check the MLflow UI under Alauda AI → Tools → MLFlow — the run owner is the token's user.
Via the KFP SDK
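A hedged sketch of submitting the compiled package with the KFP client. The helper names, the snake_case parameter names, and the run name are assumptions; how the client authenticates to the KFP endpoint is covered in Use Kubeflow Pipelines and is not shown here.

```python
def run_arguments(workspace: str, model_name: str, epochs: int) -> dict:
    """Pipeline parameters, matching the fields filled in on the Create Run form."""
    return {"workspace": workspace, "model_name": model_name, "epochs": epochs}


def submit_run(host: str, namespace: str, arguments: dict) -> str:
    """Submit pipeline.yaml to the KFP endpoint and return the run ID."""
    import kfp  # imported lazily; requires the kfp package

    client = kfp.Client(host=host, namespace=namespace)
    result = client.create_run_from_pipeline_package(
        "pipeline.yaml",
        arguments=arguments,
        run_name=f"{arguments['model_name']}-mlflow",  # assumed naming scheme
    )
    return result.run_id
```

For example, submit_run(host, "team-a", run_arguments("team-a", "demo", 3)) would start a run; as with the UI path, the MLflow run owner is the token's user, not the identity that submitted the pipeline.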
Using MLflow in Trainer v2 pipelines
If you fine-tune with Kubeflow Trainer v2, the framework's MLflow integration (for example report_to: mlflow in LLaMA-Factory) authenticates the same way. Trainer v2 uses apiVersion: trainer.kubeflow.org/v1alpha1, kind: TrainJob, and a spec.runtimeRef + spec.trainer shape. Point it at the in-cluster Service and inject the id token from a Secret:
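The shape described above can be sketched as a manifest built in Python and serialized to JSON for kubectl. The function name, the runtime name, the image, and the Secret key token are hypothetical, and workspace selection for the framework integration is not covered here.

```python
import json

TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"


def trainjob_manifest(name: str, namespace: str, runtime: str, image: str) -> dict:
    """Build a Trainer v2 TrainJob that reaches MLflow with a token from a Secret."""
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "runtimeRef": {"name": runtime},  # hypothetical runtime name
            "trainer": {
                "image": image,
                "env": [
                    {"name": "MLFLOW_TRACKING_URI", "value": TRACKING_URI},
                    {
                        "name": "MLFLOW_TRACKING_TOKEN",
                        "valueFrom": {
                            # Secret key name "token" is an assumption.
                            "secretKeyRef": {"name": "mlflow-token", "key": "token"}
                        },
                    },
                ],
            },
        },
    }


def render_trainjob(name: str, namespace: str, runtime: str, image: str) -> str:
    """Serialize the manifest as JSON for kubectl apply -f -."""
    return json.dumps(trainjob_manifest(name, namespace, runtime, image), indent=2)
```

Because the token is referenced through secretKeyRef, refreshing the Secret is enough for the next TrainJob to pick up a new token.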
See Fine-tuning LLMs using Workbench for a full Trainer v2 + MLflow example.
Best practices
Use the pipeline job ID in MLflow
KFP v2 provides dsl.PIPELINE_JOB_ID_PLACEHOLDER (the v1 dsl.RUN_ID_PLACEHOLDER was removed). It is a pipeline-level placeholder, so pass it into the component as an argument — a component cannot reference dsl.* from inside its own body. Use the received string in the run name to keep runs distinct per pipeline execution.
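A minimal sketch of that wiring, with hypothetical function, pipeline, and image names. The component receives the job ID as an ordinary string parameter; only the pipeline function refers to the placeholder.

```python
def run_name_for(model_name: str, pipeline_job_id: str) -> str:
    """Component body: build an MLflow run name unique to this pipeline execution."""
    return f"{model_name}-{pipeline_job_id}"


def build_pipeline():
    """Pass the pipeline-level placeholder into the component as an argument."""
    from kfp import dsl  # imported lazily; requires the kfp package

    name_op = dsl.component(run_name_for, base_image="python:3.11")  # assumed image

    @dsl.pipeline(name="job-id-demo")  # assumed pipeline name
    def pipeline(model_name: str = "demo"):
        # The backend substitutes the real job ID when the run starts.
        name_op(
            model_name=model_name,
            pipeline_job_id=dsl.PIPELINE_JOB_ID_PLACEHOLDER,
        )

    return pipeline
```

In a real component the returned string would be passed to mlflow.start_run(run_name=...), so that reruns of the same pipeline do not collide on run names.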
Keep credentials in a Secret and refresh tokens
Never hardcode the token or service-account credentials in pipeline.yaml — compiled pipelines are stored and shared. Inject them from a Secret, and refresh the id token (or mint it inside the component) before it expires.
Log metrics inside a run
Log every metric inside an mlflow.start_run() block. If a component has multiple logical stages, open a run per stage rather than logging outside a run context.
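One possible shape for the run-per-stage pattern, assuming the tracking URI and workspace have already been configured in the component. The function name, the stage names, and the naming scheme are hypothetical.

```python
def log_stages(stage_metrics: dict, base_name: str) -> list:
    """Open one MLflow run per stage and log that stage's metrics inside it."""
    import mlflow  # imported lazily; requires the mlflow package

    run_names = []
    for stage, metrics in stage_metrics.items():
        run_name = f"{base_name}-{stage}"
        # Every log_metric call happens while this stage's run is active.
        with mlflow.start_run(run_name=run_name):
            for key, value in metrics.items():
                mlflow.log_metric(key, value)
        run_names.append(run_name)
    return run_names
```

For example, log_stages({"train": {"loss": 0.4}, "eval": {"accuracy": 0.9}}, "demo") opens the runs demo-train and demo-eval in turn, each closed before the next begins.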
Artifact storage for production
Logging large model artifacts requires durable object storage. Configure S3-compatible storage in the MLflow plugin settings (see MLflow Tracking Server → High Availability And Storage) so artifact uploads do not hit pod disk limits.