MLflow MLOps Platform: User Guide
Welcome to the MLflow MLOps Platform! This guide will walk you through the required setup steps and parameter inputs needed to deploy MLflow on AWS using our CloudFormation stack.
Prerequisites
Before deploying the CloudFormation stack, ensure you have the following resources set up in your AWS account:
- Amazon RDS (PostgreSQL) Instance: MLflow uses a database to store metadata. You need an RDS PostgreSQL instance for the BackendStoreUri parameter.
- Amazon S3 Bucket: MLflow stores artifacts (like models) in S3. Create an S3 bucket for the ArtifactRoot parameter.
- ACM SSL Certificate: To secure access to the MLflow interface, you need an SSL certificate from AWS Certificate Manager (ACM).
1. Setting up an Amazon RDS (PostgreSQL) Database
- Go to the RDS Console in AWS.
- Click Create Database and select PostgreSQL as the database engine.
- Choose an instance type and configure the database settings.
- Make sure the RDS instance is in the same VPC as the CloudFormation stack.
- Under Connectivity & security, ensure the instance is not publicly accessible.
- Record the following details, as you will need them for the BackendStoreUri parameter:
  - DB Hostname: The endpoint of the RDS instance.
  - DB Username and DB Password: Credentials you set during the RDS setup.
- Add the database password to AWS Systems Manager (SSM) Parameter Store as a secure string.
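The last step can also be scripted. The sketch below only builds the arguments for the SSM `put_parameter` call and does not contact AWS; the parameter name `/mlflow/db-password` and the helper name are hypothetical and not defined by this template.

```python
def ssm_secure_string_args(name: str, password: str) -> dict:
    """Build keyword arguments for SSM put_parameter (SecureString type)."""
    if not password:
        raise ValueError("password must not be empty")
    return {
        "Name": name,            # hypothetical name, e.g. "/mlflow/db-password"
        "Value": password,
        "Type": "SecureString",  # stored encrypted, as this guide requires
        "Overwrite": False,      # refuse to replace an existing parameter
    }


args = ssm_secure_string_args("/mlflow/db-password", "example-password")
# With boto3 installed and credentials configured, the call would be:
#   boto3.client("ssm").put_parameter(**args)
```

The value passed as `Name` is what the DBPassword stack parameter expects later in this guide.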
Example BackendStoreUri Format
postgresql://<DB_USERNAME>:<DB_PASSWORD>@<DB_HOST>:5432/mlflow
Replace <DB_USERNAME>, <DB_PASSWORD>, and <DB_HOST> with your database username, password, and hostname respectively.
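Passwords that contain characters such as `@`, `:` or `/` can break the URI if pasted in verbatim. A minimal sketch of assembling the string safely, assuming the backend parses the URI the way SQLAlchemy does (percent-encoded credentials are decoded on connect); the function name and sample values are illustrative only:

```python
from urllib.parse import quote


def build_backend_store_uri(username: str, password: str, host: str,
                            port: int = 5432, database: str = "mlflow") -> str:
    """Assemble a PostgreSQL URI in the format shown above."""
    # Percent-encode the credentials so reserved characters are not
    # mistaken for URI delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"postgresql://{user}:{secret}@{host}:{port}/{database}"


if __name__ == "__main__":
    # Sample values only; substitute your own RDS endpoint and credentials.
    print(build_backend_store_uri("mlflow_admin", "p@ss:word/1", "db.example.internal"))
```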
2. Setting up an Amazon S3 Bucket
- Go to the S3 Console in AWS.
- Click Create Bucket and provide a unique name (e.g., mlflow-artifacts-youraccount).
- Configure the bucket settings as needed. Consider enabling versioning to keep track of artifact changes.
- Note the bucket name, as you’ll need it for the ArtifactRoot parameter.
Example ArtifactRoot Format
s3://<YOUR_S3_BUCKET_NAME>/
Replace <YOUR_S3_BUCKET_NAME> with the name of your S3 bucket.
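A typo in the bucket name only surfaces once MLflow tries to write an artifact, so it can help to check the name before deploying. The sketch below applies the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit) and returns the URI in the format above. It does not cover every S3 rule, and the function name is illustrative.

```python
import re

# Basic S3 bucket naming rules; not an exhaustive check.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def build_artifact_root(bucket: str) -> str:
    """Return the ArtifactRoot URI for a bucket, rejecting malformed names."""
    if _BUCKET_NAME.fullmatch(bucket) is None:
        raise ValueError(f"not a valid S3 bucket name: {bucket!r}")
    return f"s3://{bucket}/"


if __name__ == "__main__":
    print(build_artifact_root("mlflow-artifacts-youraccount"))
```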
3. Setting up an ACM SSL Certificate
- Go to the Certificate Manager Console in AWS.
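- Request a public certificate for the domain you’ll use to access MLflow. The certificate is attached to the load balancer’s HTTPS listener. ACM cannot issue a certificate for the load balancer’s default AWS-provided DNS name, so if you access MLflow through that name rather than your own domain, browsers will report a certificate name mismatch.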
- Complete the domain validation process.
- Copy the ARN of the ACM certificate for use in the ACMCertificateArn parameter.
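A copied ARN is easy to truncate. The sketch below checks that a string has the general shape of an ACM certificate ARN and extracts its region, which matters because a load balancer can only use ACM certificates from its own region. The pattern is a simplification and the sample ARN is made up.

```python
import re

# Simplified pattern for ACM certificate ARNs.
_ACM_ARN = re.compile(
    r"arn:(?P<partition>aws|aws-cn|aws-us-gov):acm:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):certificate/(?P<cert_id>[0-9a-f-]{36})"
)


def parse_acm_arn(arn: str) -> dict:
    """Split an ACM certificate ARN into its parts or raise ValueError."""
    match = _ACM_ARN.fullmatch(arn.strip())
    if match is None:
        raise ValueError(f"not an ACM certificate ARN: {arn!r}")
    return match.groupdict()


if __name__ == "__main__":
    sample = "arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012"
    print(parse_acm_arn(sample)["region"])
```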
Deploying the CloudFormation Stack
Once you have the RDS instance, S3 bucket, and SSL certificate ready, you can deploy the CloudFormation stack.
1. Launch the CloudFormation Stack
- Go to the CloudFormation Console in AWS.
- Click Create Stack and select With new resources (standard).
- Upload the CloudFormation template (provided by this solution) and click Next.
2. Enter Stack Parameters
Provide the following parameters:
- VpcId: Select the VPC where your RDS and EC2 instances will be deployed.
- SubnetIds: Choose two or more subnets within the selected VPC. The MLflow instances will be deployed across these subnets for high availability.
- InstanceType: Choose the instance type for MLflow. We recommend t3.medium for production setups, but you may adjust based on your expected workload.
- KeyName: Select an EC2 KeyPair that you can use to SSH into the instances for troubleshooting.
- BackendStoreUri: Enter the URI for the RDS PostgreSQL database (see format above).
- ArtifactRoot: Enter the URI of the S3 bucket (see format above).
- DBUsername: Enter the database username you created for the RDS instance.
- DBPassword: Enter the SSM Parameter Store name where the RDS password is stored.
- ACMCertificateArn: Enter the ARN of the ACM SSL certificate for HTTPS access.
- DesiredCapacity: Enter the number of MLflow instances you want in the Auto Scaling group (recommended: 2 for high availability).
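If you deploy from a script rather than the console, the same parameters are passed as a list of `ParameterKey`/`ParameterValue` pairs, which is the shape accepted by the CloudFormation API (for example boto3's `create_stack`). The sketch below only builds that list. All IDs and names are placeholders, and joining SubnetIds with commas assumes the template declares it as a list-typed parameter.

```python
from typing import Dict, List, Union

ParamValue = Union[str, int, List[str]]


def to_cfn_parameters(values: Dict[str, ParamValue]) -> List[Dict[str, str]]:
    """Convert a plain mapping into CloudFormation's parameter list shape."""
    params = []
    for key, value in values.items():
        if isinstance(value, list):
            # List-typed parameters are sent as one comma-delimited string.
            value = ",".join(value)
        params.append({"ParameterKey": key, "ParameterValue": str(value)})
    return params


# Placeholder values; replace every one with details from your account.
stack_parameters = to_cfn_parameters({
    "VpcId": "vpc-0123456789abcdef0",
    "SubnetIds": ["subnet-aaa", "subnet-bbb"],
    "InstanceType": "t3.medium",
    "KeyName": "my-keypair",
    "BackendStoreUri": "postgresql://<DB_USERNAME>:<DB_PASSWORD>@<DB_HOST>:5432/mlflow",
    "ArtifactRoot": "s3://<YOUR_S3_BUCKET_NAME>/",
    "DBUsername": "mlflow_admin",
    "DBPassword": "/mlflow/db-password",  # SSM parameter name, not the password
    "ACMCertificateArn": "<ACM_CERTIFICATE_ARN>",
    "DesiredCapacity": 2,
})
```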
3. Review and Create
- Review your parameters to ensure accuracy.
- Click Next and review any other CloudFormation options as needed (e.g., stack tags, permissions).
- Click Create Stack to begin the deployment.
4. Monitor Deployment
- You can monitor the progress of the stack in the Events tab of the CloudFormation Console.
- Once the stack is successfully created, note the LoadBalancerDNS in the Outputs section. This is the URL to access the MLflow interface.
Accessing MLflow
- Once the stack is deployed, navigate to the LoadBalancerDNS URL in your browser.
- Ensure you use https:// to access the MLflow UI securely.
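To confirm the deployment from a script, something like the following can be used. It normalizes the LoadBalancerDNS output into an https:// URL and probes the tracking server's `/health` route. Whether that route is reachable through the load balancer in this stack is an assumption, and the function names are illustrative.

```python
import urllib.error
import urllib.request


def mlflow_url(load_balancer_dns: str) -> str:
    """Turn the LoadBalancerDNS stack output into an https:// base URL."""
    host = load_balancer_dns.strip()
    for scheme in ("https://", "http://"):
        if host.lower().startswith(scheme):
            host = host[len(scheme):]
    return "https://" + host.rstrip("/")


def is_reachable(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers the health probe with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, timeouts, refused connections and TLS errors.
        return False
```

A certificate error also makes `is_reachable` return False, so a failing probe does not by itself mean MLflow is down.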
Managing and Scaling MLflow
- Auto Scaling: The Auto Scaling group will manage the number of instances based on the DesiredCapacity parameter. You can update this parameter to scale up or down as needed.
- Monitoring: You can monitor the instances via CloudWatch, and view logs for EC2, RDS, and Load Balancer components in the AWS Console.
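When scaling through a stack update, every parameter you do not change has to be carried over explicitly. The sketch below builds the `Parameters` list for a CloudFormation update: changed keys get a new value and all others are marked `UsePreviousValue`. The list of keys mirrors the parameters in this guide; the helper name is illustrative.

```python
from typing import Dict, Iterable, List

ALL_KEYS = [
    "VpcId", "SubnetIds", "InstanceType", "KeyName", "BackendStoreUri",
    "ArtifactRoot", "DBUsername", "DBPassword", "ACMCertificateArn",
    "DesiredCapacity",
]


def build_update_parameters(all_keys: Iterable[str],
                            changes: Dict[str, object]) -> List[dict]:
    """Keep existing values for every key except those listed in changes."""
    keys = list(all_keys)
    unknown = set(changes) - set(keys)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = []
    for key in keys:
        if key in changes:
            params.append({"ParameterKey": key, "ParameterValue": str(changes[key])})
        else:
            params.append({"ParameterKey": key, "UsePreviousValue": True})
    return params


# Example: scale the Auto Scaling group to three instances.
update_parameters = build_update_parameters(ALL_KEYS, {"DesiredCapacity": 3})
```

With boto3, this list would be passed to `update_stack` together with `UsePreviousTemplate=True`. If the template creates IAM resources, the call also needs the matching `Capabilities` entry.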
Troubleshooting
- EC2 Instances: If you encounter issues, SSH into the instances using the key pair you provided and check Docker logs to debug MLflow (docker ps to list containers and docker logs <container_id>).
- RDS Connection: Ensure the RDS security group allows inbound connections from the MLflow instances’ security group on port 5432.
- S3 Permissions: Verify that the IAM role assigned to the instances has the required permissions to access the S3 bucket.
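For the RDS connection check, a quick TCP probe run from one of the MLflow instances shows whether the security groups allow traffic on port 5432. This sketch tests network reachability only, not database credentials; the function name is illustrative and the hostname in the usage comment is a placeholder.

```python
import socket


def can_connect(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Usage from an MLflow instance (placeholder endpoint):
#   can_connect("<DB_HOST>")
```

A timeout usually points to a security group or routing problem, while an immediate refusal suggests the endpoint or port is wrong.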
FAQ
1. Can I use an existing RDS PostgreSQL instance for the BackendStoreUri? Yes, as long as it’s accessible from within the same VPC and configured according to the MLflow requirements.
2. Can I use a database engine other than PostgreSQL? This solution is pre-configured for PostgreSQL. Other database engines may work with additional configuration, but they are not officially supported in this template.
3. How do I update the SSL certificate? You can update the ACMCertificateArn parameter in the CloudFormation stack to apply a new certificate.