Language Detection - Python

The Language Detection bot uses Python together with PDF Integration commands and OCR to detect the language of PDF files, both native and scanned.

Top Benefits

  • Classifies PDF documents according to the detected language
  • Identifies the language of native as well as scanned PDFs
  • Simplifies language detection for AP Invoicing processes
  • Detects 97 languages
  • Detects the language of multilingual PDFs based on the detection confidence of each language

Tasks

  • Reads and validates the XML config file to get the input and output paths
  • Extracts the text from native as well as scanned PDFs
  • Detects the language of the document using Python libraries
  • Moves each file to a language-specific folder based on the detection confidence
  • Writes logs to the log folder set in the config

The Language Detection bot takes PathConfig.xml as input. PathConfig.xml and the Python file detectLanguage.py should be uploaded to the Control Room. The user must update the config file to match the local folder structure: the input folder path, output folder path, and log folder path all need to be set.
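As an illustration, reading and validating the three paths could be sketched in Python as follows. The element names `InputFolderPath`, `OutputFolderPath`, and `LogFolderPath` and the sample folder locations are assumptions made for this example; the PathConfig.xml shipped with the bot may use different names, and the bot may read the file through Automation 360 commands rather than Python.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical element names; adjust them to match the real PathConfig.xml.
REQUIRED_KEYS = ("InputFolderPath", "OutputFolderPath", "LogFolderPath")


def load_path_config(xml_text):
    """Parse the config XML and return a dict with the three folder paths."""
    root = ET.fromstring(xml_text)
    paths = {}
    for key in REQUIRED_KEYS:
        node = root.find(key)  # looks for a direct child element of the root
        if node is None or not (node.text or "").strip():
            raise ValueError(f"PathConfig.xml is missing a value for <{key}>")
        paths[key] = Path(node.text.strip())
    return paths


# Made-up sample content, used only to show the expected shape of the file.
sample = """
<PathConfig>
    <InputFolderPath>C:/LanguageDetection/Input</InputFolderPath>
    <OutputFolderPath>C:/LanguageDetection/Output</OutputFolderPath>
    <LogFolderPath>C:/LanguageDetection/Logs</LogFolderPath>
</PathConfig>
"""
config = load_path_config(sample)
print(config["InputFolderPath"])
```

Failing early with a clear error when a path is missing keeps a misconfigured run from moving files into an unexpected location.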

The bot reads the config file and extracts the text from each input PDF. If the extracted text is empty, the PDF is treated as a scanned PDF and the bot extracts its text via OCR. The Python script uses three libraries to detect the language of the text, and the decision goes to the result with the highest confidence. The PDFs are then moved to language-specific folders. As output, the bot creates one folder per detected language, each containing the PDFs in that language.

In AP Invoicing processes, invoices need to be fed into IQ Bots based on their language. Without extra information, it would not be possible to categorize the PDFs by language. This is where the Language Detection bot comes in: it detects the language of a PDF without any further information about the document.

Scanned PDFs are harder to read than native ones. The Language Detection bot can still read the text of a scanned document and categorize it by language, using OCR to extract the text and the Python script to determine its language.

The Language Detection bot can detect 97 languages. Because it combines PDF extraction, OCR, and a Python script, it takes approximately 1.5 seconds per file: in a test run of 209 PDFs, execution took 5 minutes (about 1.4 seconds per file). The bot therefore keeps the detection time per PDF short.
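The detection and sorting step described above could be sketched roughly as follows. This is not the bot's detectLanguage.py: the function names are hypothetical, and combining `langid` and `langdetect` in this particular way is an assumption. The demo at the end feeds in made-up detector results so that it runs without the third-party libraries installed.

```python
import shutil
import tempfile
from pathlib import Path


def pick_language(candidates):
    """Return the (language_code, confidence) pair with the highest confidence."""
    if not candidates:
        raise ValueError("no language candidates to choose from")
    return max(candidates, key=lambda pair: pair[1])


def detect_candidates(text):
    """Collect guesses from langid and langdetect for non-empty text.

    Assumption: both libraries are installed with pip. They are imported
    here so the rest of the sketch works without them.
    """
    from langid.langid import LanguageIdentifier, model
    from langdetect import detect_langs

    # norm_probs=True makes langid report a probability between 0 and 1.
    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    candidates = [identifier.classify(text)]
    candidates += [(guess.lang, guess.prob) for guess in detect_langs(text)]
    return candidates


def move_to_language_folder(pdf_path, output_root, language):
    """Move the PDF into <output_root>/<language>/, creating the folder if needed."""
    target_dir = Path(output_root) / language
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / Path(pdf_path).name
    return Path(shutil.move(str(pdf_path), str(destination)))


# Demo with made-up detector outputs and a placeholder file.
with tempfile.TemporaryDirectory() as workdir:
    pdf = Path(workdir) / "invoice_001.pdf"
    pdf.write_bytes(b"%PDF-1.4 placeholder")
    language, confidence = pick_language([("de", 0.93), ("nl", 0.71)])
    moved = move_to_language_folder(pdf, Path(workdir) / "Output", language)
    print(language, moved.parent.name)
```

In a real run, `pick_language(detect_candidates(text))` would replace the hard-coded candidate list, with `text` coming from the PDF extraction or OCR step.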

Price: Free
Bot Security Program: Level 1
Downloads: 132
Automation Type: Bot
Last Updated: May 20, 2021
First Published: July 31, 2020
Platform: Automation 360

See the Bot in Action

PathConfig
Input folder before bot run
Input folder after bot run
Detected language folders after bot run
Output folder after bot run
Log file after bot run

Setup Process

Install

Download the Bot and follow the instructions to install it in your AAE Control Room.

Configure

Open the Bot to configure your username and other settings the Bot will need (see the Installation Guide or ReadMe for details).

Run

That's it - now the Bot is ready to get going!

Requirements and Inputs

  • Automation 360 should be installed and configured
  • Python 3.8.x needs to be installed and configured
  • Python libraries langid, langdetect, and chardet need to be installed using pip (shutil and math are part of the Python standard library and need no installation)
  • PathConfig.xml needs to be updated and its path set inside the bot
  • Input folder should have the PDF documents
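
Before the first run, the Python side of these requirements can be checked with a short script like the one below. It is a minimal sketch rather than part of the bot; the module names are taken from the list above, and the pip package names are assumed to match them.

```python
import sys
from importlib.util import find_spec

THIRD_PARTY = ("langid", "langdetect", "chardet")  # installed with pip
STANDARD_LIBRARY = ("shutil", "math")              # bundled with Python


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if sys.version_info[:2] != (3, 8):
    print("Note: the listing asks for Python 3.8.x; found", sys.version.split()[0])

missing = missing_modules(THIRD_PARTY)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All third-party libraries are available.")
```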