Deploying Amazon Textract via Image Bytes

Vaibhav Tiwari
Published in DataDrivenInvestor
4 min read · Apr 21, 2020


Why detect text on paper by hand when we have Amazon Textract?

Manual Detection: Unsplash

Amazon has proven to be a reliable provider of services for hosting a website or running a startup through its cloud computing platforms and APIs. Many of these services are geared towards Machine Learning, and you don’t need to be a Mathematics guru to use them and get the desired outputs.

Recently, I’ve been working on a project that uses the Amazon Textract API, which was announced in November 2018 and became generally available in May 2019. Amazon Textract is primarily used for detecting and analyzing text in images, PDFs and forms, but other frequently needed tasks, such as document translation and in-document search, can also be built on top of its output. The most distinctive feature I found in Textract is its ability to keep the extracted text in the same order as in the original document, be it a table, a form or a paragraph.

S3 Bucket vs Image Bytes

The idea for this blog came up when I got stuck on Textract’s input parameters. It offers two ways of providing input:

  • By uploading the document to an S3 bucket and then referring to that object in your program. Most resources explain this way of providing input to the API, since it is the easier one. But I found it cumbersome, because every document you want to analyze first has to be uploaded to the bucket manually.
Using S3 Bucket method
  • If you are using the AWS SDK (Boto3), you can pass the image bytes directly, and they need not be a Base64-encoded byte array. The image only has to be present on your local system, where the program can read it from its path. If you are accessing Textract through the AWS CLI instead, you’ll need to Base64-encode the image. I’ll be explaining the Boto3 (SDK) method in the rest of this article.
Delete the S3Object Key-Value pair and use Bytes method for input
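
To make the difference between the two input styles concrete, here is a minimal sketch of the Document argument in both forms. The helper names document_from_s3 and document_from_file are my own (hypothetical, not part of Boto3), and the bucket and file names in the usage note are placeholders.

```python
from pathlib import Path


def document_from_s3(bucket: str, key: str) -> dict:
    """Build the Document argument for a file already uploaded to S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def document_from_file(image_path: str) -> dict:
    """Build the Document argument from a local JPEG or PNG file.

    The raw bytes are passed as they are; no manual Base64 step is needed
    when the call goes through Boto3.
    """
    return {"Bytes": Path(image_path).read_bytes()}
```

Either dictionary can then be handed to the Textract client, for example client.detect_document_text(Document=document_from_file("scan.png")), assuming scan.png exists on your machine.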

Boto3 is the SDK released by AWS for calling its APIs from Python and performing various development tasks. (You can find more on this link)

Simple Steps for Complex Concepts

  • Run pip install boto3 to install the SDK required for calling the API.
  • Download the AWS CLI from this link in order to set up your credentials and configuration.
  • After installing the CLI mentioned above, run aws configure in your command prompt or terminal and fill in the values it asks for (access key ID, secret access key, default region and output format).
  • The code is as follows:
Code snippet of an example program
  • response contains all the detected text information, such as the bounding box coordinates, the text type (Page, Line or Word) and the height and width information, in JSON format.
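
Since the original snippet was an image, here is a minimal sketch of what such a program might look like, not the exact code from the screenshot. It assumes credentials and a region were already set with aws configure. The function names detect_text and lines_of_text are hypothetical, and the optional client parameter is only there so that the function can be tried with a stand-in object instead of a real AWS call.

```python
def detect_text(image_path, client=None):
    """Send a local JPEG/PNG to Textract as raw bytes and return the response."""
    if client is None:
        # Imported here so the helper below also works without Boto3 installed.
        import boto3

        client = boto3.client("textract")

    # Read the image as raw bytes; Boto3 handles the encoding for us.
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()

    return client.detect_document_text(Document={"Bytes": image_bytes})


def lines_of_text(response):
    """Collect the text of every LINE block, in the order Textract returned it."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]
```

A typical use would be response = detect_text("scan.png") followed by print("\n".join(lines_of_text(response))), where scan.png is a placeholder file name. Each block in response["Blocks"] also carries a Geometry entry with the bounding box discussed in the next section.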

Output Jargon

Measurement Format of Textract

Textract returns its output as JSON in a parent-child format (I’ve attached the link for further reading). To display a bounding box at the correct location and size, you have to multiply the BoundingBox values by the document page width or height (depending on the value you want) to get pixel values, which you then use to draw the box. As an example, take a document page 608 pixels wide by 588 pixels high, and the following bounding box values for the analyzed text:

BoundingBox.Left: 0.3922065
BoundingBox.Top: 0.15567766
BoundingBox.Width: 0.284666
BoundingBox.Height: 0.2930403
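
The conversion for these values can be sketched in a few lines of Python. The function name to_pixel_box is hypothetical, and truncating to whole pixels with int() is my own choice; rounding would work just as well for drawing.

```python
def to_pixel_box(bounding_box, page_width, page_height):
    """Convert Textract's ratio-based BoundingBox into whole-pixel values."""
    return {
        # Horizontal values scale with the page width...
        "left": int(bounding_box["Left"] * page_width),
        "width": int(bounding_box["Width"] * page_width),
        # ...and vertical values scale with the page height.
        "top": int(bounding_box["Top"] * page_height),
        "height": int(bounding_box["Height"] * page_height),
    }


box = {
    "Left": 0.3922065,
    "Top": 0.15567766,
    "Width": 0.284666,
    "Height": 0.2930403,
}

pixels = to_pixel_box(box, page_width=608, page_height=588)
print(pixels)  # {'left': 238, 'width': 173, 'top': 91, 'height': 172}
```

So on the 608 x 588 page, the box starts 238 pixels from the left and 91 pixels from the top, and is 173 pixels wide and 172 pixels high. In a real response the same four ratios sit under block["Geometry"]["BoundingBox"].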

Link

You can find the full JSON format output example here.



Data Science Intern at Trell | IIIT Jabalpur | Enjoys writing on ML and abstract topics