Image to Text using Google Cloud AI Services

Recently, at my internship, I was working on an app where users have to fill in fields with the names, emails, and other details of people. So we decided to add a feature where users can snap a photo of a business card and have the fields filled in automatically. After a bit of research, I understood that there were two steps to building this feature: getting the text out of the image, and identifying the entities in that text. These can be accomplished using Computer Vision and Natural Language Processing. Luckily, there are a lot of APIs that provide these services. Google Cloud Vision and Google Cloud Language are two such services, available for public use through the Google Cloud APIs. They are inexpensive and even provide $300 credit for first-time users.

Step 1: Setting up the services

The first step is to create a Google Cloud account and a project. After that, we have to enable billing for the project. Then we have to enable the two APIs, Cloud Vision and Cloud Natural Language. Finally, we have to create a service account and save its key as a JSON file. All these steps can be found in detail in the official docs.

Step 2: Extracting data from business card

The project that I was working on was built using Django. Before working on the app itself, I created a temporary Python project with a virtual environment and a single script.py file, and placed the APIKey.json file at the root of the project.

Extracting data was the easiest step because it could be done with a single API call. Before doing that, we need a Python package called google-cloud-vision, which can be installed using pip.

pip install --upgrade google-cloud-vision

In script.py, I started by creating a function called detect_text that makes a request to the Google Cloud Vision API. For Google Cloud to process the request, the client needs the credentials stored in the APIKey.json file.

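import base64

# Assumption: ValidationError is Django's, since the app is built with Django.
from django.core.exceptions import ValidationError
from google.cloud import vision
from google.oauth2 import service_account

# Load the service account key saved in Step 1.
credentials = service_account.Credentials.from_service_account_file(
    './APIKey.json')


def detect_text(image):
    """Returns the detected text from image
    args:
      image: base64 string without `data:` part
    """
    client = vision.ImageAnnotatorClient(credentials=credentials)

    content = base64.b64decode(image)
    image = vision.Image(content=content)

    vision_response = client.text_detection(image=image)

    # Check for an API error before reading the annotations.
    if vision_response.error.message:
        raise ValidationError(
            '{}\nFor more info on error messages, check: '
            'https://cloud.google.com/apis/design/errors'.format(
                vision_response.error.message))

    # text_annotations is empty when no text is found; otherwise the
    # first entry holds the full detected text.
    if not vision_response.text_annotations:
        return ''
    return vision_response.text_annotations[0].description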
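
The image argument is expected to be the bare base64 payload. If the image arrives from a browser as a data URL, the prefix has to be removed first. Here is a minimal sketch, assuming the input looks like data:image/png;base64,... (the helper name strip_data_url_prefix is hypothetical and not part of the original script):

```python
import base64


def strip_data_url_prefix(data_url):
    """Return only the base64 payload of a data URL.

    Hypothetical helper: strings without a `data:` prefix are
    returned unchanged.
    """
    if data_url.startswith("data:"):
        # Everything after the first comma is the base64 payload.
        _, _, payload = data_url.partition(",")
        return payload
    return data_url


# Usage with a stand-in payload (not a real image).
payload = base64.b64encode(b"fake image bytes").decode("ascii")
data_url = "data:image/png;base64," + payload
clean = strip_data_url_prefix(data_url)
```

The cleaned string can then be passed to detect_text as is.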

Step 3: Analyzing entities in the text

In the next step, using the text from the image, we make a call to the Google Cloud Natural Language API using a Python package called google-cloud-language.

pip install --upgrade google-cloud-language

I created a function called analyze_entities that identifies the entities (such as people, organizations, and locations) in the text returned from the Google Cloud Vision API. It keeps a dictionary of the required entity types and appends each matching entity name to its entry.

from google.cloud import language_v1


def analyze_entities(text_content):
    """Returns the organization, person, and location names in the text."""
    # Reuses the credentials loaded for the Vision client.
    client = language_v1.LanguageServiceClient(credentials=credentials)

    type_ = language_v1.Document.Type.PLAIN_TEXT
    language = "en"
    document = {"content": text_content,
                "type_": type_, "language": language}
    encoding_type = language_v1.EncodingType.UTF8

    language_response = client.analyze_entities(
        request={'document': document, 'encoding_type': encoding_type})

    # Collect the names of the entity types we care about, keyed by type.
    required_entities = {"ORGANIZATION": "",
                         "PERSON": "", "LOCATION": ""}
    for entity in language_response.entities:
        entity_type = language_v1.Entity.Type(entity.type_).name
        entity_name = entity.name
        if entity_type in required_entities:
            # Each name is appended with a leading space.
            required_entities[entity_type] += " {}".format(entity_name)

    return required_entities

To extract phone numbers and emails, I used custom regular expressions instead of the API. With the two API calls, we can fetch all the required data from the image. However, producing the result can take a while depending on the size of the image, so make sure to compress the image before making the API calls. The REST is up to you!
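
As an illustration of the regex step, here is a minimal sketch. The patterns and the extract_contacts name are assumptions, not the exact expressions used in the app, and they are deliberately loose, so real business cards will likely need some tuning.

```python
import re

# Assumed patterns for illustration only.
# Email: local part, "@", domain, and a top-level domain of 2+ letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Phone: optional "+" and "(", then digits mixed with spaces, dots,
# dashes, or parentheses on a single line, ending in a digit.
PHONE_RE = re.compile(r"\+?\(?\d[\d ().-]{7,}\d")


def extract_contacts(text):
    """Return the emails and phone numbers found in the detected text."""
    return {
        "EMAIL": EMAIL_RE.findall(text),
        "PHONE": PHONE_RE.findall(text),
    }


# Usage with made-up business card text.
sample = (
    "Jane Doe\n"
    "Acme Corp\n"
    "jane.doe@example.com\n"
    "+1 (555) 010-9999\n"
    "123 Main Street"
)
contacts = extract_contacts(sample)
```

The phone pattern only allows plain spaces between digits, not newlines, so a number on one line does not run into digits on the next line of the card.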



Written by Roshan R who lives in India, building useful things. You should follow them on Twitter