Amazon Textract: Extract Text from PDF and Image Files [A How To Guide]

By Yi Ai

Amazon recently released Textract in the Asia Pacific (Sydney) region, so I decided to write a JavaScript OCR demo using Amazon Textract.

Amazon Textract is a service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.

In this post, I show how to use Amazon Textract to extract text from scanned PDF files.

Overview of the process

  • Upload files to an S3 bucket.
  • An S3 event trigger invokes an AWS Lambda function, which calls the Amazon Textract asynchronous operations to analyse the uploaded document. When the analysis job completes, Textract publishes the job status to an SNS topic.
  • The SNS topic invokes a second Lambda function, which reads the job status and, if it is SUCCEEDED, writes the extracted text to a .txt object in the S3 bucket.
  • An HTTP API endpoint returns the job status and the extracted result for a given job ID.

The following diagram shows the architecture of the process.
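
As a rough illustration of the third step, the notification that Textract publishes to SNS carries the job ID, the status, and the location of the source document as a JSON string. The sketch below shows one way to pull those fields out of the SNS event. The helper name parseTextractNotification and the sample event are made up for illustration, and the output key simply swaps the file extension for .txt.

```javascript
const path = require("path");

// Hypothetical helper: extracts the fields the second Lambda needs from an SNS event.
function parseTextractNotification(snsEvent) {
  // SNS delivers the Textract notification as a JSON string in Sns.Message.
  const { Message } = snsEvent.Records[0].Sns;
  const { JobId, Status, DocumentLocation } = JSON.parse(Message);
  const { S3Bucket, S3ObjectName } = DocumentLocation;
  return {
    jobId: JobId,
    succeeded: Status === "SUCCEEDED",
    bucket: S3Bucket,
    sourceKey: S3ObjectName,
    // "ocrscan.pdf" becomes "ocrscan.txt"; any folder prefix in the key is dropped.
    outputKey: `${path.posix.parse(S3ObjectName).name}.txt`,
  };
}

// Made-up sample event, shaped like an SNS-to-Lambda invocation.
const sampleEvent = {
  Records: [
    {
      Sns: {
        Message: JSON.stringify({
          JobId: "job-123",
          Status: "SUCCEEDED",
          DocumentLocation: { S3ObjectName: "ocrscan.pdf", S3Bucket: "aiyi.demo.textract" },
        }),
      },
    },
  ],
};

console.log(parseTextractNotification(sampleEvent));
```

Keeping the parsing in a small pure function like this makes the status check easy to test without deploying anything.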

Complete the following before starting this guide:

  • Set up an AWS account.
  • Install the AWS CLI.
  • Configure the AWS CLI with user credentials.
  • Install jq (optional).

Before getting started, install the AWS SAM CLI and create an application with sample code using:

sam init -r nodejs12.x

The project directory now contains a SAM template file (template.yaml). Let’s define the following resources in the template:

  • Lambda functions and inline policies
  • S3 bucket
  • IAM role
  • SNS topic
  • HTTP API

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Timeout: 60

Parameters:
  Stage:
    Type: String
    Default: dev
  BucketName:
    Type: String
    Default: aiyi.demo.textract

Resources:
  TextractSNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: !Sub "textract-sns-topic"
      TopicName: !Sub "textract-sns-topic"
      Subscription:
        - Protocol: lambda
          Endpoint: !GetAtt TextractEndFunction.Arn

  TextractSNSTopicPolicy:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref TextractEndFunction
      Principal: sns.amazonaws.com
      Action: lambda:InvokeFunction
      SourceArn: !Ref TextractSNSTopic

  TextractEndFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: handler.textractEndHandler
      Runtime: nodejs12.x
      Role: !GetAtt TextractRole.Arn
      Policies:
        - AWSLambdaExecute
        - Statement:
            - Effect: Allow
              Action:
                - "s3:PutObject"
              Resource: !Join [":", ["arn:aws:s3::", !Ref BucketName]]

  TextractStartFunction:
    Type: AWS::Serverless::Function
    Properties:
      Environment:
        Variables:
          TEXT_EXTRACT_ROLE: !GetAtt TextractRole.Arn
          SNS_TOPIC: !Ref TextractSNSTopic
      Role: !GetAtt TextractRole.Arn
      CodeUri: src/
      Handler: handler.textractStartHandler
      Runtime: nodejs12.x
      Events:
        PDFUploadEvent:
          Type: S3
          Properties:
            Bucket: !Ref S3Bucket
            Events: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: suffix
                    Value: ".pdf"

  TextractRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: "TextractRole"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "textract.amazonaws.com"
                - "lambda.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AWSLambdaExecute"
      Policies:
        - PolicyName: "TextractRoleAccess"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "sns:*"
                Resource: "*"
              - Effect: Allow
                Action:
                  - "textract:*"
                Resource: "*"

  GetTextractResult:
    Type: AWS::Serverless::Function
    Properties:
      Role: !GetAtt TextractRole.Arn
      CodeUri: src/
      Handler: handler.getTextractResult
      Runtime: nodejs12.x
      Events:
        TextExactStart:
          Type: HttpApi
          Properties:
            Path: /textract
            Method: post

  MyHttpApi:
    Type: AWS::Serverless::HttpApi
    Properties:
      StageName: !Ref Stage
      Cors:
        AllowMethods: "'OPTIONS,POST,GET'"
        AllowHeaders: "'Content-Type'"
        AllowOrigin: "'*'"

  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref BucketName

Note that the API Gateway HTTP API (AWS::Serverless::HttpApi) was still in beta at the time of writing and is subject to change, so please don’t use it in production.

The following code shows how, in a few lines, one Lambda function sends the PDF to the Amazon Textract asynchronous operations, and how a second Lambda function, triggered once the Textract analysis job completes, fetches the JSON response by calling getDocumentAnalysis. We then iterate over the blocks in the JSON and save the detected text to S3.

const AWS = require("aws-sdk");
const path = require("path");
const { EOL } = require("os");

const textract = new AWS.Textract();
const s3 = new AWS.S3();

// Triggered by the S3 upload event: starts an asynchronous Textract analysis job.
exports.textractStartHandler = async (event) => {
  try {
    const bucket = event.Records[0].s3.bucket.name;
    // Object keys in S3 event notifications are URL-encoded, with spaces sent as "+".
    const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));
    const params = {
      DocumentLocation: { S3Object: { Bucket: bucket, Name: key } },
      FeatureTypes: ["TABLES", "FORMS"],
      NotificationChannel: {
        RoleArn: process.env.TEXT_EXTRACT_ROLE,
        SNSTopicArn: process.env.SNS_TOPIC
      }
    };
    const response = await textract.startDocumentAnalysis(params).promise();
    // The logged response contains the JobId used later to fetch the result.
    console.log(response);
  } catch (err) {
    console.log(err);
  }
};

// Triggered by the SNS notification: saves the extracted text to S3 if the job succeeded.
// Errors are left to propagate so that the invocation is reported as failed.
exports.textractEndHandler = async (event) => {
  const { Sns: { Message } } = event.Records[0];
  const {
    JobId: jobId,
    Status: status,
    DocumentLocation: { S3ObjectName, S3Bucket }
  } = JSON.parse(Message);
  if (status === "SUCCEEDED") {
    const textResult = await getDocumentText(jobId, null);
    const params = {
      Bucket: S3Bucket,
      Key: `${path.parse(S3ObjectName).name}.txt`,
      Body: textResult
    };
    await s3.putObject(params).promise();
  }
};

// Fetches one page of results and recurses while Textract returns a NextToken.
const getDocumentText = async (jobId, nextToken) => {
  console.log("nextToken", nextToken);
  const params = { JobId: jobId, MaxResults: 100 };
  if (nextToken) params.NextToken = nextToken;
  const { NextToken: _nextToken, Blocks: _blocks } = await textract
    .getDocumentAnalysis(params)
    .promise();
  // Keep only LINE blocks; join("") avoids the commas that join() inserts by default.
  let textractResult = _blocks
    .filter(({ BlockType }) => BlockType === "LINE")
    .map(({ Text }) => `${Text}${EOL}`)
    .join("");
  if (_nextToken) {
    textractResult += await getDocumentText(jobId, _nextToken);
  }
  return textractResult;
};
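
To make the block handling easier to follow, here is a small standalone sketch that runs without AWS. A Textract response mixes PAGE, LINE and WORD blocks (plus table and form blocks when those features are requested), so taking only the LINE blocks avoids repeating every word. The helper names blocksToText and collectText, the getPage parameter and the two fake pages are all made up for illustration. collectText is an iterative alternative to the recursive pagination, not the code from this project.

```javascript
const { EOL } = require("os");

// Hypothetical helper: keep only LINE blocks and emit one line of text per block.
function blocksToText(blocks) {
  return blocks
    .filter(({ BlockType }) => BlockType === "LINE")
    .map(({ Text }) => `${Text}${EOL}`)
    .join(""); // explicit separator; join() alone would insert commas
}

// Hypothetical helper: iterative alternative to recursive pagination.
// `getPage` is any function that takes request params and resolves to one
// page of results shaped like { Blocks, NextToken }.
async function collectText(getPage, jobId) {
  let text = "";
  let nextToken;
  do {
    // Only send NextToken when there is one.
    const params = nextToken ? { JobId: jobId, NextToken: nextToken } : { JobId: jobId };
    const page = await getPage(params);
    text += blocksToText(page.Blocks || []);
    nextToken = page.NextToken;
  } while (nextToken);
  return text;
}

// Made-up two-page result standing in for Textract.
const fakePages = {
  first: {
    NextToken: "page-2",
    Blocks: [
      { BlockType: "PAGE" },
      { BlockType: "LINE", Text: "Invoice 2020-001" },
      { BlockType: "WORD", Text: "Invoice" },
    ],
  },
  "page-2": { Blocks: [{ BlockType: "LINE", Text: "Total: 42.00" }] },
};
const fakeGetPage = async ({ NextToken }) => fakePages[NextToken || "first"];

collectText(fakeGetPage, "job-123").then((text) => console.log(JSON.stringify(text)));
```

In a Lambda function, getPage could be something like (params) => textract.getDocumentAnalysis(params).promise(), which keeps the paging logic testable with a fake like the one above.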

Now let’s add another Lambda function as a REST endpoint, using the HTTP API defined in template.yaml. With this endpoint, we can retrieve the text analysis result and job status by Textract job ID.

            return `${Text}${EOL}`;
          })
          .join();
        }
        return callback(null, {
          statusCode: 200,
          body: JSON.stringify({ text: textractResult, jobStatus, nextToken })
        });
      }
    }
  } catch ({ statusCode, message }) {
    return callback(null, {
      statusCode,
      body: JSON.stringify({ message })
    });
  } finally {
    return callback(null);
  }
};
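
For reference, a minimal self-contained sketch of an endpoint like this might look as follows. It is not the handler from the repository. It assumes the request body is JSON with a jobId and an optional nextToken (only jobId appears in the curl example later in this post), and it returns the text, jobStatus and nextToken fields described above. The factory makeGetTextractResult is a made-up name; it takes the Textract client as an argument so that the sketch can run with a fake client.

```javascript
// Hypothetical factory: `textract` is anything with a
// getDocumentAnalysis(params).promise() method, e.g. new AWS.Textract()
// from the aws-sdk package.
const makeGetTextractResult = (textract) => async (event) => {
  try {
    // API Gateway may deliver the body base64-encoded.
    const raw = event.isBase64Encoded
      ? Buffer.from(event.body, "base64").toString("utf8")
      : event.body;
    // Assumed request shape: { "jobId": "...", "nextToken": "..." }
    const { jobId, nextToken } = JSON.parse(raw || "{}");
    if (!jobId) {
      return { statusCode: 400, body: JSON.stringify({ message: "jobId is required" }) };
    }

    const params = { JobId: jobId, MaxResults: 100 };
    if (nextToken) params.NextToken = nextToken;

    const { JobStatus, NextToken, Blocks = [] } = await textract
      .getDocumentAnalysis(params)
      .promise();
    const text = Blocks
      .filter(({ BlockType }) => BlockType === "LINE")
      .map(({ Text }) => Text)
      .join("\n");

    return {
      statusCode: 200,
      body: JSON.stringify({ text, jobStatus: JobStatus, nextToken: NextToken }),
    };
  } catch ({ statusCode, message }) {
    // Errors without a status code (e.g. malformed JSON) are reported as 500.
    return { statusCode: statusCode || 500, body: JSON.stringify({ message }) };
  }
};

// Fake client standing in for AWS.Textract, for local experimentation only.
const fakeTextract = {
  getDocumentAnalysis: () => ({
    promise: async () => ({
      JobStatus: "SUCCEEDED",
      Blocks: [{ BlockType: "LINE", Text: "hello" }],
    }),
  }),
};

const handler = makeGetTextractResult(fakeTextract);
handler({ body: JSON.stringify({ jobId: "job-123" }) }).then((res) =>
  console.log(res.statusCode, res.body)
);
```

Returning the response object from the async function, rather than mixing async with callback, avoids calling the callback twice when an error occurs.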

Note that Amazon Textract retains the results of asynchronous operations for 7 days.

Now let’s deploy the service and test it out!

After the deployment has finished, copy a PDF file to the S3 bucket.

$ aws s3 cp ~/downloads/ocrscan.pdf s3://aiyi.demo.textract

The Textract job ID appears in the CloudWatch log group of the TextractStartFunction Lambda function. To monitor the CloudWatch logs in real time, run the following command:

$ sam logs --name TextractStartFunction -t --region YOUR_REGION --stack-name sam-app-appv2

Let’s check the job status by calling the API endpoint we just deployed.

$ curl -d '{"jobId":"xxxxx2bd5ad43875edxxxx5aee29b65f273fxxxxx"}' -H "Content-Type: application/json" https://xxxx.execute-api.ap-southeast-2.amazonaws.com/textract | jq '.'

The output shows that the job status is SUCCEEDED, so a text file should have been created in the S3 bucket. Let’s go to the AWS S3 console and have a look:

The following image shows the content of ocrscan.txt.

That’s all there is to it. Thanks for reading! I hope you have found this article useful. You can find the complete project in my GitHub repo.