Sathish V
4 min read · Sep 28, 2020


Auto-resume failed DMS tasks through an AWS Lambda function, using the email notification as the event

Written By: Sathish Vemula

Use Case:

Resuming failed DMS tasks manually requires continuous monitoring and someone always being available, which is hard to sustain. There is also a risk of losing data if a failed DMS task is not resumed within the 1-hour retention period that starts when the task fails.

For the 1-hour CDC retention period with SQL Server as the source, refer to https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html#CHAP_Source.SQLServer.CDC.Publication

Scope and Assumptions:

  • This solution is needed only for continuously running DMS tasks created with the Ongoing replication type. Full-load DMS tasks can simply be restarted instead of resumed.
  • The DMS task should be started manually the first time in production through the DMS web console.
  • By default, the RecoverableErrorCount error-handling property of a DMS task is -1, which means the AWS DMS service automatically restarts the task for recoverable errors/failures without moving it to a failed state. Changing RecoverableErrorCount to 0 stops AWS DMS from auto-restarting on recoverable errors; in that case, our Lambda function auto-resumes the DMS task for all types of errors. Change the property if you need to track all errors through email notification; otherwise keep the default value of -1.
    The steps to change the RecoverableErrorCount property are described later in this article.
  • Create an AWS Lambda function to execute the Python code.
  • Configure email alerting through Event Subscriptions to receive emails for DMS task failures.
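The event subscription from the last bullet is normally created in the DMS console, but it can probably also be scripted. The following is a minimal sketch, assuming boto3's DMS client and its `create_event_subscription` call; the subscription name, topic ARN, and task identifier are hypothetical placeholders, not values from this article.

```python
def build_failure_subscription(subscription_name, sns_topic_arn, task_ids):
    """Build keyword arguments for the DMS create_event_subscription call."""
    return {
        "SubscriptionName": subscription_name,
        "SnsTopicArn": sns_topic_arn,
        "SourceType": "replication-task",   # subscribe to task-level events
        "EventCategories": ["failure"],      # only failure events trigger the flow
        "SourceIds": list(task_ids),         # DMS task identifiers to watch
        "Enabled": True,
    }


def create_failure_subscription(dms_client, subscription_name, sns_topic_arn, task_ids):
    """Create the subscription using a boto3 DMS client passed in by the caller."""
    kwargs = build_failure_subscription(subscription_name, sns_topic_arn, task_ids)
    return dms_client.create_event_subscription(**kwargs)


# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# create_failure_subscription(
#     boto3.client("dms"),
#     "dms-task-failures",                                   # placeholder name
#     "arn:aws:sns:us-east-2:111111111111:example-topic",    # placeholder ARN
#     ["my-dms-task"],                                       # placeholder task id
# )
```

The SNS topic then needs two subscribers: the email address for alerting and the Lambda function that performs the resume.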

Limitations:

  • If there is a genuine failure that the user needs to investigate, a large number of Lambda invocations can be triggered, since each resume can fail again and raise a new notification. This is an unusual situation, but it can occur.
  • The Lambda function does not email the status of the DMS task after it has been auto-resumed; it only emails whether the rerun operation on the DMS task was started successfully or not.
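One possible way to soften the first limitation is to cap how often a task is resumed within a time window. This is not part of the solution described in this article; the sketch below only shows the decision logic, and `should_resume`, `max_attempts`, and `window` are hypothetical names. Where the previous attempt timestamps are stored (for example, a small table or a parameter store) is left open.

```python
from datetime import datetime, timedelta


def should_resume(previous_attempts, now, max_attempts=3, window=timedelta(hours=1)):
    """Return True if fewer than max_attempts resumes happened within the window."""
    recent = [t for t in previous_attempts if now - t <= window]
    return len(recent) < max_attempts


now = datetime(2020, 9, 28, 12, 0)
attempts = [now - timedelta(minutes=m) for m in (5, 10, 15)]
print(should_resume(attempts, now))  # False: three resumes in the last hour
```

When the function returns False, the Lambda could skip the resume and send an email asking for manual analysis instead.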

High-level diagram

Solution:

  1. Run the DMS task manually for the initial run.
  2. Lambda is triggered on receiving the failure email notification and performs the steps below.

Modify RecoverableErrorCount error handling task setting for AWS DMS task using AWS CLI:

The default value is -1, which instructs AWS DMS to attempt to restart the task indefinitely. Set this value to 0 by using the AWS CLI so that AWS DMS never attempts to restart a task automatically for any recoverable error.

Copy the JSON task settings of the DMS task, set RecoverableErrorCount to 0, and save them as a JSON file, e.g. testdmsautoresume.json.
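Editing the saved settings by hand works, but the change can also be scripted. Here is a minimal sketch, assuming the saved file is the full task settings document with RecoverableErrorCount under the `ErrorBehavior` section; the helper names are hypothetical.

```python
import json


def set_recoverable_error_count(settings, count=0):
    """Return a copy of the task settings with RecoverableErrorCount changed."""
    updated = json.loads(json.dumps(settings))  # deep copy via JSON round-trip
    updated.setdefault("ErrorBehavior", {})["RecoverableErrorCount"] = count
    return updated


def rewrite_settings_file(path, count=0):
    """Load the settings file, apply the change, and write it back in place."""
    with open(path, encoding="utf-8") as fh:
        settings = json.load(fh)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(set_recoverable_error_count(settings, count), fh, indent=2)


# Example with a trimmed-down settings document (real files have more sections):
sample = {"ErrorBehavior": {"RecoverableErrorCount": -1}}
print(set_recoverable_error_count(sample))
# {'ErrorBehavior': {'RecoverableErrorCount': 0}}
```

All other settings are carried over unchanged, so the file can be passed to the CLI as is.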

AWS CLI Command:

aws dms modify-replication-task --replication-task-arn arn:aws:dms:us-east-2:712392390:task:LBCDERXT54TQ6CSVASCXWB2YFYO277OKMKRI --replication-task-settings file://"C:\Users\Sathish\DMS\testdmsautoresume.json" --profile AWS --region us-east-2

For more information on editing the JSON task settings of a DMS task, refer to the post below.

https://aws.amazon.com/premiumsupport/knowledge-center/dms-error-handling-task-settings/
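For scripts that already use boto3, the same change can likely be applied with the DMS client's `modify_replication_task` call instead of the CLI. The sketch below assumes the settings file has already been edited; `apply_task_settings` is a hypothetical helper, and the client is passed in so the function can be tried with a stub. Note that AWS requires the task to be stopped before it can be modified.

```python
import json


def apply_task_settings(dms_client, task_arn, settings_path):
    """Send the contents of a task settings JSON file to an existing DMS task."""
    with open(settings_path, encoding="utf-8") as fh:
        settings_json = fh.read()
    json.loads(settings_json)  # fail early if the file is not valid JSON
    return dms_client.modify_replication_task(
        ReplicationTaskArn=task_arn,
        ReplicationTaskSettings=settings_json,
    )


# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# apply_task_settings(
#     boto3.client("dms", region_name="us-east-2"),
#     "<replication-task-arn>",
#     "testdmsautoresume.json",
# )
```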

Output:

Lambda (Python Code):

import json
from datetime import datetime

import boto3


def lambda_handler(event, context):
    print("Event output is ", event)
    # The DMS event arrives as a JSON string inside the SNS record.
    message = event['Records'][0]['Sns']['Message']
    print("Payload captured from the DMS event", message)
    print(type(message))
    conv_to_dict = json.loads(message)
    print("Identifier Link is ", conv_to_dict['Identifier Link'])
    identifier_link = conv_to_dict['Identifier Link']
    # The task name is the second space-separated token of the identifier link.
    DMS_Taskname = identifier_link.split(" ")[1]
    # The status is the last word of the first sentence of the event message.
    Error_message = conv_to_dict['Event Message']
    list_of_words1 = Error_message.split(".")
    print("list_of_words1", list_of_words1[0])
    rep_task_str = list_of_words1[0]
    list_of_words2 = rep_task_str.split(" ")
    Status = list_of_words2[-1]
    print(DMS_Taskname, " status is ", Status)

    # Look up the task by its identifier to get the ARN and migration type.
    filter_name = 'replication-task-id'
    filter_value = DMS_Taskname
    dms_client = boto3.client('dms')
    filtersDict = {}
    filtersDict['Name'] = filter_name
    filtersDict['Values'] = [filter_value]
    print("filtersDict =", filtersDict)
    response1 = dms_client.describe_replication_tasks(Filters=[filtersDict])
    DMS_Taskname2 = response1['ReplicationTasks'][0]['ReplicationTaskIdentifier']
    dmstaskarn = response1['ReplicationTasks'][0]['ReplicationTaskArn']
    loadtype = response1['ReplicationTasks'][0]['MigrationType']
    print("loadtype is", loadtype)
    print("ReplicationTaskArn for ", DMS_Taskname2, " is ", dmstaskarn)

    # Resume only failed tasks that include ongoing replication.
    if loadtype != 'full-load' and Status == 'failed':
        reruntype = 'resume-processing'
        response2 = dms_client.start_replication_task(
            ReplicationTaskArn=dmstaskarn,
            StartReplicationTaskType=reruntype)
        print("response2 is ", response2)
        print("Rerun started for the DMS task with 'resume-processing' mode")
        rerun_check = response2['ReplicationTask']['Status']
        print("rerun_check", rerun_check)
        if rerun_check == 'starting':
            dict1 = {}
            dict1['dmstaskname'] = DMS_Taskname
            # Lambda's default time zone is UTC, so now() matches the label below.
            current_time = str(datetime.now())
            dict1['time'] = current_time
            dict1['timezone'] = 'UTC'
            subject = "Rerun started in resume mode for DMS task " + DMS_Taskname + " in Prod"
            send_mail(subject, dict1)
        else:
            dict2 = {}
            dict2['dmstaskname'] = DMS_Taskname
            current_time = str(datetime.now())
            dict2['time'] = current_time
            dict2['timezone'] = 'UTC'
            dict2['errorinfo'] = response2
            subject = "Rerun not started in resume mode for DMS task " + DMS_Taskname + " in Prod"
            send_mail(subject, dict2)


def send_mail(subject, message):
    sns_client = boto3.client('sns')
    topicArn = 'arn:aws:sns:us-east-2:712392390:DMS-ReplicationTask-RerunCheck'
    # default=str keeps json.dumps from failing on datetime values in the DMS response.
    response = sns_client.publish(
        TopicArn=topicArn,
        Subject=subject,
        Message=json.dumps({'default': json.dumps(message, default=str)}),
        MessageStructure='json'
    )
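The parsing at the top of the handler can be tried locally without any AWS access. The following is a minimal sketch that mirrors the same string splitting in a standalone function; `parse_dms_failure_event` is a hypothetical helper, and the sample payload is an assumption shaped to match what the handler expects, so real DMS notifications may differ in detail.

```python
import json


def parse_dms_failure_event(event):
    """Extract (task name, status) from an SNS-wrapped DMS event, as the handler does."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    # Second space-separated token of the identifier link is the task name.
    task_name = message["Identifier Link"].split(" ")[1]
    # Last word of the first sentence of the event message is the status.
    first_sentence = message["Event Message"].split(".")[0]
    status = first_sentence.split(" ")[-1]
    return task_name, status


# Hypothetical sample event; field contents are illustrative only.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "Identifier Link": "https://console.example/dms#tasks:ids=demo-task\nSourceId: demo-task ",
                "Event Message": "Replication task has failed.\nReason: sample error",
            })
        }
    }]
}

print(parse_dms_failure_event(sample_event))  # ('demo-task', 'failed')
```

Because the extraction depends on exact spacing and wording, it is worth checking it against a real notification from your account before relying on it.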
