MLflow MLOps Platform (User Guide)
Prerequisites
Before deploying the CloudFormation stack, ensure you have the following resources set up in your AWS account:
- Amazon RDS (PostgreSQL) Instance: MLflow uses a database to store metadata. You need an RDS PostgreSQL instance for the BackendStoreUri parameter.
- Amazon S3 Bucket: MLflow stores artifacts (like models) in S3. Create an S3 bucket for the ArtifactRoot parameter.
- ACM SSL Certificate: To secure access to the MLflow interface, you need an SSL certificate from AWS Certificate Manager (ACM).
Setting up an Amazon RDS (PostgreSQL) Database
- Go to the RDS Console in AWS.
- Click Create Database and select PostgreSQL as the database engine.
- Choose an instance type and configure the database settings.
- Make sure the RDS instance is in the same VPC as the CloudFormation stack.
- Under Connectivity & security, ensure the instance is not publicly accessible.
- Record the following details for the BackendStoreUri parameter:
- DB Hostname: The endpoint of the RDS instance.
- DB Username and DB Password: Credentials you set during the RDS setup.
- Add the database password to AWS Systems Manager (SSM) Parameter Store as a secure string.
Example BackendStoreUri Format:
postgresql://<DB_USERNAME>:<DB_PASSWORD>@<DB_HOST>:5432/mlflow
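Passwords often contain characters such as @, : or /, which would break the URI format above if pasted in unescaped. The following is a minimal sketch of assembling the URI with percent-encoded credentials. The function name build_backend_store_uri and all sample values are hypothetical, and the database name mlflow is taken from the example format.

```python
from urllib.parse import quote


def build_backend_store_uri(username, password, host, port=5432, database="mlflow"):
    """Assemble a PostgreSQL URI in the format shown above."""
    # Percent-encode the credentials so that reserved characters
    # ('@', ':', '/') are not mistaken for URI delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"postgresql://{user}:{secret}@{host}:{port}/{database}"


# Placeholder values for illustration only; substitute your own.
uri = build_backend_store_uri(
    "mlflow_user",
    "p@ss:word/1",
    "mydb.example.us-east-1.rds.amazonaws.com",
)
print(uri)
```

The host in the sample is a made-up endpoint; use the DB Hostname recorded from the RDS console.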
Setting up an Amazon S3 Bucket
- Go to the S3 Console in AWS.
- Click Create Bucket and provide a unique name (e.g., mlflow-artifacts-youraccount).
- Configure the bucket settings as needed. Consider enabling versioning to track artifact changes.
Example ArtifactRoot Format:
s3://<YOUR_S3_BUCKET_NAME>/
Setting up an ACM SSL Certificate
- Go to the Certificate Manager Console in AWS.
- Request a public certificate for the domain you’ll use to access MLflow.
- Complete the domain validation process.
- Copy the ARN of the ACM certificate for use in the ACMCertificateArn parameter.
Deploying the CloudFormation Stack
Once you have the RDS instance, S3 bucket, and SSL certificate ready, you can deploy the CloudFormation stack.
1. Launch the CloudFormation Stack
- Go to the CloudFormation Console in AWS.
- Click Create Stack and select With new resources (standard).
- Upload the CloudFormation template and click Next.
2. Enter Stack Parameters
Provide the following parameters:
- VpcId: Select the VPC that contains your RDS instance; the MLflow EC2 instances will be deployed into it.
- SubnetIds: Choose two or more subnets within the selected VPC.
- InstanceType: Choose the instance type for MLflow (e.g., t3.medium for production setups).
- KeyName: Select an EC2 KeyPair to SSH into the instances for troubleshooting.
- BackendStoreUri: Enter the URI for the RDS PostgreSQL database.
- ArtifactRoot: Enter the URI of the S3 bucket.
- DBUsername: Enter the database username you created for the RDS instance.
- DBPassword: Enter the SSM Parameter Store name where the RDS password is stored.
- ACMCertificateArn: Enter the ARN of the ACM SSL certificate for HTTPS access.
- DesiredCapacity: Enter the number of MLflow instances you want in the Auto Scaling group (e.g., 2 for high availability).
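If you script the deployment instead of using the console, the parameters above have to be supplied as a list of ParameterKey/ParameterValue pairs, which is the shape the CloudFormation API expects. The sketch below only builds that list; the helper name to_cfn_parameters and every sample value (IDs, ARN, SSM parameter name) are placeholders, and it assumes the template declares SubnetIds as a list type, which is passed as a comma-separated string.

```python
def to_cfn_parameters(values):
    """Convert a dict into the ParameterKey/ParameterValue list
    used by the CloudFormation API."""
    params = []
    for key, value in values.items():
        # List-type parameters are passed as one comma-separated string.
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        params.append({"ParameterKey": key, "ParameterValue": str(value)})
    return params


# Placeholder values for illustration only.
parameters = to_cfn_parameters({
    "VpcId": "vpc-0123456789abcdef0",
    "SubnetIds": ["subnet-aaa111", "subnet-bbb222"],
    "InstanceType": "t3.medium",
    "KeyName": "my-keypair",
    "BackendStoreUri": "postgresql://<DB_USERNAME>:<DB_PASSWORD>@<DB_HOST>:5432/mlflow",
    "ArtifactRoot": "s3://<YOUR_S3_BUCKET_NAME>/",
    "DBUsername": "mlflow_user",
    "DBPassword": "/mlflow/db-password",  # SSM Parameter Store name, not the password
    "ACMCertificateArn": "arn:aws:acm:us-east-1:111122223333:certificate/example",
    "DesiredCapacity": 2,
})

for p in parameters:
    print(p["ParameterKey"], "=", p["ParameterValue"])
```

With boto3, a list like this can be passed as the Parameters argument of the CloudFormation client's create_stack call.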
3. Review and Create
- Review your parameters to ensure accuracy.
- Click Next and review additional CloudFormation options as needed.
- Click Create Stack to begin deployment.
4. Monitor Deployment
- Monitor progress in the Events tab of the CloudFormation Console.
- Once the stack is created, note the LoadBalancerDNS in the Outputs section. This is the URL to access the MLflow interface.
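When the stack is inspected programmatically, outputs come back as a list of OutputKey/OutputValue pairs. As a rough sketch, the helper below pulls LoadBalancerDNS out of a response shaped like the one returned by the CloudFormation DescribeStacks API. The helper name get_stack_output and the sample response are illustrative only.

```python
def get_stack_output(describe_stacks_response, output_key):
    """Return the value of one stack output, or None if it is absent."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        return None
    for output in stacks[0].get("Outputs", []):
        if output.get("OutputKey") == output_key:
            return output.get("OutputValue")
    return None


# Hand-written sample in the shape of a DescribeStacks response.
sample_response = {
    "Stacks": [
        {
            "StackName": "mlflow",
            "StackStatus": "CREATE_COMPLETE",
            "Outputs": [
                {
                    "OutputKey": "LoadBalancerDNS",
                    "OutputValue": "mlflow-alb-123456.us-east-1.elb.amazonaws.com",
                }
            ],
        }
    ]
}

dns = get_stack_output(sample_response, "LoadBalancerDNS")
print(dns)
```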
Accessing MLflow
- Navigate to the LoadBalancerDNS URL in your browser.
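- Use https to securely access the MLflow UI. The ACM certificate is issued for your own domain, so point a DNS record for that domain at the LoadBalancerDNS value; browsing to the load balancer hostname directly will trigger a certificate name mismatch warning.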
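A quick way to confirm the server is reachable without opening a browser is to request its health endpoint. The sketch below assumes the deployed MLflow version exposes /health on the tracking server, which recent releases do; the function names are hypothetical.

```python
import urllib.request


def mlflow_health_url(hostname):
    """Build the HTTPS health-check URL for an MLflow host."""
    host = hostname.strip().rstrip("/")
    # Drop any scheme the caller included so https is always used.
    for prefix in ("https://", "http://"):
        if host.startswith(prefix):
            host = host[len(prefix):]
    return f"https://{host}/health"


def is_mlflow_healthy(hostname, timeout=5.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(mlflow_health_url(hostname), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers DNS failures, timeouts, TLS errors, and HTTP error statuses.
        return False


# Example (not executed here): is_mlflow_healthy("mlflow.example.com")
print(mlflow_health_url("mlflow.example.com"))
```

Pass the hostname covered by your certificate; a TLS name mismatch is reported as unhealthy by this sketch.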
Managing and Scaling MLflow
- Auto Scaling: Update the DesiredCapacity stack parameter to change the number of instances in the Auto Scaling group.
- Monitoring: Use CloudWatch to monitor instances and view logs for EC2, RDS, and Load Balancer components.
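When scaling through a stack update, only DesiredCapacity needs a new value; the CloudFormation API lets every other parameter keep its current value via UsePreviousValue. A minimal sketch of building such a parameter list follows. The helper name scaling_update_parameters is hypothetical, and the key list mirrors the parameters described in this guide.

```python
def scaling_update_parameters(parameter_keys, desired_capacity):
    """Build update parameters that change only DesiredCapacity."""
    params = []
    for key in parameter_keys:
        if key == "DesiredCapacity":
            params.append({"ParameterKey": key, "ParameterValue": str(desired_capacity)})
        else:
            # Keep whatever value the stack currently has.
            params.append({"ParameterKey": key, "UsePreviousValue": True})
    return params


stack_parameter_keys = [
    "VpcId", "SubnetIds", "InstanceType", "KeyName", "BackendStoreUri",
    "ArtifactRoot", "DBUsername", "DBPassword", "ACMCertificateArn",
    "DesiredCapacity",
]

update_params = scaling_update_parameters(stack_parameter_keys, 4)
print(update_params[-1])
```

With boto3, this list could be passed to the CloudFormation client's update_stack call together with UsePreviousTemplate=True.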
Troubleshooting
- EC2 Instances: SSH into the instances using the provided key pair and check Docker logs for MLflow issues.
- RDS Connection: Ensure the RDS security group allows inbound connections from the MLflow instances’ security group on port 5432.
- S3 Permissions: Verify the IAM role assigned to the instances has permissions to access the S3 bucket.
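For the RDS connection check, a plain TCP connection attempt from an MLflow instance is often enough to tell a security group problem apart from a credentials problem. This is a small sketch using only the standard library; the function name can_connect and the sample hostname are placeholders.

```python
import socket


def can_connect(host, port=5432, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unresolvable: treat all as unreachable.
        return False


# Example, run from an MLflow EC2 instance (not executed here):
# can_connect("mydb.example.us-east-1.rds.amazonaws.com")
```

A False result with a timeout usually points at security group or routing rules, whereas a True result means the network path is open and the issue lies elsewhere, for example in the credentials.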
FAQ
- Can I use an existing RDS PostgreSQL instance? Yes, as long as it’s accessible from the same VPC and configured appropriately.
- Can I use a different database engine? PostgreSQL is recommended. Other engines may work but are not officially supported.
- How do I update the SSL certificate? Update the ACMCertificateArn parameter in the CloudFormation stack to apply a new certificate.