
Alerting on a failed Fargate task - built using CloudFormation

Most of the infrastructure-as-code posts on my blog so far have been about Terraform, so I thought I would write one about CloudFormation.

I’ve recently been working on a project that involved some kind of conversion. The conversion tool was a Unix-based program, so it made sense to run this one-off job, whenever it was needed, as a Fargate task.

Another important requirement was to report on whether or not the task had completed successfully.

The solution I settled on was a CloudWatch alarm, which is triggered based on the exit code from the Fargate task.

The alarm would trigger an SNS notification, and then users could be notified by email as configured via an SNS subscription.
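The snippets in this post refer to an SNS topic named AlertHigh, which is assumed to already exist in the account. As a rough sketch of what that topic and its email subscription could look like in CloudFormation (the logical names AlertHighTopic and AlertHighTopicPolicy and the email address are placeholders of mine, not part of the original template):

```yaml
Resources:
  # Hypothetical definition of the topic that the alarm and rule publish to.
  AlertHighTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: AlertHigh
      Subscription:
        - Protocol: email
          Endpoint: "alerts@example.com" # placeholder address

  # Lets CloudWatch alarms and CloudWatch Events rules publish to the topic.
  AlertHighTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - !Ref AlertHighTopic
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
                - cloudwatch.amazonaws.com
            Action: sns:Publish
            Resource: !Ref AlertHighTopic
```

An email subscription only starts receiving notifications once the recipient has confirmed it via the link that SNS sends to the address.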

First I need to define the ECR repository, so that I can upload my built Docker image.

Resources:
  MyApplicationRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: !Sub "${AppName}"
      RepositoryPolicyText:
        Version: "2012-10-17"
        Statement:
          - Sid: MyApplicationPushPull
            Effect: Allow
            Principal:
              AWS:
                - !GetAtt MyApplicationTaskExecutionRole.Arn
            Action:
              - ecr:GetDownloadUrlForLayer
              - ecr:BatchGetImage
              - ecr:BatchCheckLayerAvailability
              - ecr:PutImage
              - ecr:InitiateLayerUpload
              - ecr:UploadLayerPart
              - ecr:CompleteLayerUpload

Then I need to create the task definition, which defines how the task will run.

  MyApplicationTask:
    Type: AWS::ECS::TaskDefinition
    Properties:
      ContainerDefinitions:
        - Memory: 1024
          Cpu: 256
          Image: !Sub "${AWS::AccountId}.dkr.ecr.eu-west-2.amazonaws.com/${AppName}:latest"
          Name: !Sub "${AppName}"
          LogConfiguration:
            LogDriver: "awslogs"
            Options:
              awslogs-group: !Sub "/ecs/${AppName}"
              awslogs-region: eu-west-2
              awslogs-stream-prefix: ecs
      TaskRoleArn: !GetAtt MyApplicationTaskExecutionRole.Arn
      ExecutionRoleArn: !GetAtt MyApplicationTaskExecutionRole.Arn
      Family: !Sub "${AppName}"
      NetworkMode: awsvpc
      Cpu: 256
      Memory: 1024
      RequiresCompatibilities:
        - FARGATE

Then I create a role that ECS can assume in order to execute the task.

  MyApplicationTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: MyApplicationTaskExecutionRole
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: "ecs-tasks.amazonaws.com"
            Action:
              - sts:AssumeRole

And I add the associated policies for ECR access and logging.

  MyApplicationRepositoryTaskPolicyEcr:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: MyApplicationRepositoryTaskPolicyEcr
      Roles: [ !Ref MyApplicationTaskExecutionRole ]
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Resource: "*"
            Action:
              - ecr:GetAuthorizationToken  

  MyApplicationRepositoryTaskPolicyLogging:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: MyApplicationRepositoryTaskPolicyLogging
      Roles: [ !Ref MyApplicationTaskExecutionRole ]
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Resource: "*"
            Action:
              - cloudwatch:*
              - sns:*

          - Effect: Allow
            Resource: !Sub "arn:aws:logs:eu-west-2:${AWS::AccountId}:*"
            Action:
              - logs:CreateLogStream
              - logs:PutLogEvents
              - logs:DescribeLogStreams
              - logs:GetLogEvents
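As an alternative for the image pull and log writing permissions, AWS provides a managed policy, AmazonECSTaskExecutionRolePolicy, that can be attached to the role directly. This is only a sketch of that option, not what I used above, and it does not cover the cloudwatch and sns actions in the logging policy:

```yaml
  # Sketch only: the same execution role, with the AWS managed policy attached
  # instead of hand-written ECR and logging statements.
  MyApplicationTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: MyApplicationTaskExecutionRole
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: "ecs-tasks.amazonaws.com"
            Action:
              - sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
```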

I also need to create a log group, in case I want to debug any application errors.

  MyApplicationLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/ecs/${AppName}"
      RetentionInDays: 30

Now the interesting bit. I can create an alarm that triggers if at least one event is raised within a 60-second period.

  MyApplicationFailedAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties: 
      AlarmName: "MyApplicationTaskFailed"
      AlarmDescription: "My application has failed"
      Namespace: "AWS/Events"
      MetricName: "TriggeredRules"
      Period: 60
      Statistic: Minimum
      Threshold: "0"
      TreatMissingData: "ignore"
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      ActionsEnabled: true    
      AlarmActions:
        - !Sub "arn:aws:sns:eu-west-2:${AWS::AccountId}:AlertHigh"
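One thing to be aware of: the alarm above has no Dimensions, so as far as I understand it evaluates TriggeredRules across every rule in the account, not just the failure rule for this application. A possible refinement, assuming the rule is named my-application-failure as in the rule definition in this post, is to scope the alarm with the RuleName dimension:

```yaml
  MyApplicationFailedAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: "MyApplicationTaskFailed"
      AlarmDescription: "My application has failed"
      Namespace: "AWS/Events"
      MetricName: "TriggeredRules"
      # Assumption: restrict the metric to the single rule defined in this post.
      Dimensions:
        - Name: RuleName
          Value: my-application-failure
      Period: 60
      Statistic: Minimum
      Threshold: "0"
      TreatMissingData: "ignore"
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      ActionsEnabled: true
      AlarmActions:
        - !Sub "arn:aws:sns:eu-west-2:${AWS::AccountId}:AlertHigh"
```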

And now the actual rule, which matches on the name of the running container, the last status being STOPPED, and the exit code being 1.

  MyApplicationFailedRule:
    Type: AWS::Events::Rule
    Properties:
      Description: My Application has returned exit code 1
      Name: my-application-failure
      EventPattern: !Sub '{"detail-type": ["ECS Task State Change"],"source": ["aws.ecs"],"detail": {"containers": {"name": ["${AppName}"],"exitCode": [1],"lastStatus": ["STOPPED"]}}}'
      Targets: 
        - Arn: !Sub "arn:aws:sns:eu-west-2:${AWS::AccountId}:AlertHigh"
          Id: MyApplicationFailed
  • Note that I found a bug in CloudFormation: if you specify the exit code using the YAML syntax, it is treated as a string rather than a number, and the rule does not match any events. To work around the issue, I supplied the event pattern as a JSON string rather than as an object.
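The single-line JSON string is hard to read, so a YAML block scalar can be used to spread the same pattern over several lines while still passing it as a string. This is an untested sketch that assumes CloudFormation handles a multi-line string in the same way as the single-line one:

```yaml
  MyApplicationFailedRule:
    Type: AWS::Events::Rule
    Properties:
      Description: My Application has returned exit code 1
      Name: my-application-failure
      # The "|" block scalar keeps the pattern as one string, so the
      # exit code stays a JSON number rather than becoming a YAML string.
      EventPattern: !Sub |
        {
          "detail-type": ["ECS Task State Change"],
          "source": ["aws.ecs"],
          "detail": {
            "containers": {
              "name": ["${AppName}"],
              "exitCode": [1],
              "lastStatus": ["STOPPED"]
            }
          }
        }
      Targets:
        - Arn: !Sub "arn:aws:sns:eu-west-2:${AWS::AccountId}:AlertHigh"
          Id: MyApplicationFailed
```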

And that's it. The full code example is below:

Parameters:
  AppName:
    Type: String
    Default: "my-application"
  
Resources:
  MyApplicationRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: !Sub "${AppName}"
      RepositoryPolicyText:
        Version: "2012-10-17"
        Statement:
          - Sid: MyApplicationPushPull
            Effect: Allow
            Principal:
              AWS:
                - !GetAtt MyApplicationTaskExecutionRole.Arn
            Action:
              - ecr:GetDownloadUrlForLayer
              - ecr:BatchGetImage
              - ecr:BatchCheckLayerAvailability
              - ecr:PutImage
              - ecr:InitiateLayerUpload
              - ecr:UploadLayerPart
              - ecr:CompleteLayerUpload
 
  MyApplicationTask:
    Type: AWS::ECS::TaskDefinition
    Properties:
      ContainerDefinitions:
        - Memory: 1024
          Cpu: 256
          Image: !Sub "${AWS::AccountId}.dkr.ecr.eu-west-2.amazonaws.com/${AppName}:latest"
          Name: !Sub "${AppName}"
          LogConfiguration:
            LogDriver: "awslogs"
            Options:
              awslogs-group: !Sub "/ecs/${AppName}"
              awslogs-region: eu-west-2
              awslogs-stream-prefix: ecs
      TaskRoleArn: !GetAtt MyApplicationTaskExecutionRole.Arn
      ExecutionRoleArn: !GetAtt MyApplicationTaskExecutionRole.Arn
      Family: !Sub "${AppName}"
      NetworkMode: awsvpc
      Cpu: 256
      Memory: 1024
      RequiresCompatibilities:
        - FARGATE
   
  MyApplicationTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: MyApplicationTaskExecutionRole
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: "ecs-tasks.amazonaws.com"
            Action:
              - sts:AssumeRole

  MyApplicationRepositoryTaskPolicyEcr:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: MyApplicationRepositoryTaskPolicyEcr
      Roles: [ !Ref MyApplicationTaskExecutionRole ]
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Resource: "*"
            Action:
              - ecr:GetAuthorizationToken  

  MyApplicationRepositoryTaskPolicyLogging:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: MyApplicationRepositoryTaskPolicyLogging
      Roles: [ !Ref MyApplicationTaskExecutionRole ]
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Resource: "*"
            Action:
              - cloudwatch:*
              - sns:*

          - Effect: Allow
            Resource: !Sub "arn:aws:logs:eu-west-2:${AWS::AccountId}:*"
            Action:
              - logs:CreateLogStream
              - logs:PutLogEvents
              - logs:DescribeLogStreams
              - logs:GetLogEvents

  MyApplicationLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/ecs/${AppName}"
      RetentionInDays: 30

  MyApplicationFailedAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties: 
      AlarmName: "MyApplicationTaskFailed"
      AlarmDescription: "My application has failed"
      Namespace: "AWS/Events"
      MetricName: "TriggeredRules"
      Period: 60
      Statistic: Minimum
      Threshold: "0"
      TreatMissingData: "ignore"
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      ActionsEnabled: true    
      AlarmActions:
        - !Sub "arn:aws:sns:eu-west-2:${AWS::AccountId}:AlertHigh"

  MyApplicationFailedRule:
    Type: AWS::Events::Rule
    Properties:
      Description: My Application has returned exit code 1
      Name: my-application-failure
      EventPattern: !Sub '{"detail-type": ["ECS Task State Change"],"source": ["aws.ecs"],"detail": {"containers": {"name": ["${AppName}"],"exitCode": [1],"lastStatus": ["STOPPED"]}}}'
      Targets: 
        - Arn: !Sub "arn:aws:sns:eu-west-2:${AWS::AccountId}:AlertHigh"
          Id: MyApplicationFailed

Written on May 14, 2019.