Migrating On-Prem SQL Server to Amazon RDS for High Availability and Cost Optimization
Migrating On-Prem SQL Server to Amazon RDS for High Availability and Cost Optimization
Introduction
A mid-sized financial services company with over 500 employees was experiencing significant performance and availability issues with its on-premise Microsoft SQL Server infrastructure. The infrastructure supported core banking applications, customer data management, and internal reporting processes. The SQL Server environment was aging, with limited scalability, lack of built-in high availability features, and growing maintenance costs. Additionally, disaster recovery procedures were manual and error-prone, posing a risk to data integrity and SLA commitments.
The client aimed to modernize their database infrastructure by migrating to a managed cloud service to improve availability, automate maintenance tasks, ensure scalability during peak loads, and reduce operational overhead. The decision was made to migrate to Amazon RDS for SQL Server due to its fully managed capabilities, native high availability features, and seamless integration with other AWS services.
Architecture Overview
The architecture was designed for high availability, security, scalability, and automation. The RDS SQL Server instance was deployed in a Multi-AZ configuration to ensure automatic failover. The migration solution included real-time replication, monitoring, secure access, and automated provisioning using Infrastructure as Code.
Key Architectural Components:
Amazon RDS (SQL Server): Deployed in Multi-AZ for automatic failover, backup automation, and patching.
AWS DMS: Used for schema and data migration with minimal downtime via CDC.
Amazon S3: Used for storing migration scripts, exported data, and logs.
Amazon CloudWatch: Real-time monitoring of database health and performance metrics.
Amazon SNS: Notifications on critical events like failover or performance threshold breaches.
Terraform: Automated provisioning of resources across multiple environments (dev, test, prod).
AWS IAM & KMS: Role-based access and encryption at rest and in transit.
Data Sources Integration
The source database was hosted on a legacy SQL Server 2016 setup within a VMware environment. The integration involved:
Schema Extraction: The AWS Schema Conversion Tool (SCT) was used to analyze the current schema and identify compatibility issues. Necessary transformations were made.
Initial Data Export: Tables were categorized based on size and criticality. Large historical tables were extracted in batches to minimize the load.
Live Replication: Change Data Capture (CDC) was configured using DMS to capture real-time changes until cutover.
Stored Procedures & Views: Extracted and manually validated for compatibility with RDS.
Data Consistency: Verified by running parallel queries and checksums during test migration.
AWS Services Architecture
The project utilized the following AWS services in a modular and scalable architecture:
AWS Service
Purpose
Amazon RDS (SQL Server)
Primary managed DB platform with Multi-AZ high availability
AWS DMS
Continuous data replication and minimal-downtime migration
Amazon S3
Staging area for schema exports, logs, and snapshot backups
CloudWatch
Real-time and historical performance monitoring
SNS
Alerts and notifications for health events
Terraform
Automated infrastructure provisioning
IAM
Role-based access management
KMS
Encryption of RDS, S3, and backups
Data Flow Process
Step-by-Step Process:
1. Assessment & Planning:
Performed detailed workload analysis using CloudEndure and SCT.
Identified long-running queries and performance bottlenecks
Defined RPO and RTO requirements.
2. Infrastructure Provisioning:
Wrote Terraform modules to provision:
RDS SQL Server (db.m5.large) with Multi-AZ
Parameter groups and subnet groups
CloudWatch alarms, IAM roles, and KMS keys
Verified provisioning through AWS Config and Terraform state files.
3. Schema & Data Migration:
Ran AWS SCT to extract and convert schema.
Applied schema on RDS using SQL Server Management Studio.
DMS full load task created and started to replicate entire dataset.
Enabled CDC for real-time synchronization of inserts, updates, deletes.
4. Validation and Testing:
Test applications pointed to staging RDS.
Ran regression, functional, and performance tests.
Validated row counts, referential integrity, and stored procedure execution.
5. Final Cutover:
Paused writes on legacy DB.
Waited for CDC lag to reduce to zero.
Switched endpoints in application config to point to RDS.
Conducted post-migration smoke tests.
6. Post-Migration Activities:
Set up automated daily snapshots with 7-day retention.
Enabled Enhanced Monitoring and Performance Insights.
Applied IAM policies for devs, DBAs, BI users.
Error Handling and Monitoring
1. Error Handling Strategy:
DMS logs errors and warnings in CloudWatch; custom Lambda script checks logs every 15 minutes.
Retry logic for DMS tasks with configurable retry intervals.
Snapshot rollback plan in case of critical failures.
2. Monitoring Implementation:
Created CloudWatch dashboards for DB metrics: CPU, connections, IOPS, storage.
Alarm, CPU > 85% (alarm + SNS email)
Alarm, Storage > 90% (alarm + auto-scale trigger)
Failed login attempts (alarm + security team alert)
RDS Event Subscriptions: Used to track failovers, backup status, and patching events.
Data Access and Security
1. Access Controls:
IAM roles for DB admins, developers, support staff.
RDS integrated with AWS Secrets Manager for secure credential management.
Security groups locked to specific IPs and internal VPC CIDRs.
2. Encryption & Compliance:
Enabled KMS encryption for:
RDS storage and automated backups
Data in S3 used for schema/logs
Snapshots and replicas
Enforced SSL for all application connections to RDS.
Enabled audit logging for compliance with PCI DSS and SOX.
3. Best Practices Implementation
Multi-AZ Deployment: Ensured automatic failover and zero data loss.
Storage Auto-Scaling: Avoided manual intervention as data grew.
Monitoring: CloudWatch + Performance Insights for complete visibility.
Automation via Terraform: Reduced human error and enabled reproducible deployments.
Access Auditing: CloudTrail and RDS logs used to trace user activity.
Security Enhancements: Periodic key rotation and VPC-based access restrictions.
Conclusion
The migration of the on-prem SQL Server database to Amazon RDS resulted in significant improvements in reliability, scalability, and operational efficiency. Notable achievements included:
95% reduction in unplanned downtime thanks to Multi-AZ failover.
30% cost savings annually by eliminating licensing and hardware support.
RPO < 5 minutes, RTO < 1 minute via CDC and managed failover.
Increased developer velocity due to easier access to test environments from RDS snapshots.
Compliance-ready infrastructure with encryption, auditing, and logging.
This case study underscores how organizations can confidently transition from legacy on-prem systems to a modern, fully managed AWS environment while improving SLAs, compliance, and cost structure.