A2019 - Language Detection - Python

The Language Detection bot uses Python, together with PDF integration commands and OCR, to detect the language of PDF files, both native and scanned.

Top Benefits

  • Classifies PDF documents according to the detected language.
  • Identifies the language of native as well as scanned PDFs.
  • Makes language detection easier for AP invoicing processes.
  • Detects 97 languages.
  • Detects the language of multilingual PDFs based on the confidence of each detected language.

Tasks

  • Reads and validates the XML config file to get the input and output paths.
  • Extracts the text from native as well as scanned PDFs.
  • Detects the language of the document using Python libraries.
  • Moves the file to a language-specific folder based on the confidence of the detected language.
  • Maintains logs of each run.

The Language Detection bot takes PathConfig.xml as its input. Upload PathConfig.xml and the Python file detectLanguage.py to the Control Room. Update the config file to match your local folder structure: the input folder path, output folder path, and log folder path must all be set in the config.
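
The listing does not show the layout of PathConfig.xml, so the following is only a minimal sketch of how the three paths could be read and validated in Python. The tag names (InputFolderPath, OutputFolderPath, LogFolderPath), the sample folder paths, and the function name parse_path_config are assumptions made for illustration, not the bot's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; the real PathConfig.xml may use different ones.
REQUIRED_KEYS = ("InputFolderPath", "OutputFolderPath", "LogFolderPath")

# Hypothetical sample config with placeholder folder paths.
SAMPLE_CONFIG = """
<PathConfig>
    <InputFolderPath>C:/LanguageDetection/Input</InputFolderPath>
    <OutputFolderPath>C:/LanguageDetection/Output</OutputFolderPath>
    <LogFolderPath>C:/LanguageDetection/Logs</LogFolderPath>
</PathConfig>
"""


def parse_path_config(xml_text):
    """Return the three folder paths as a dict, or raise ValueError."""
    root = ET.fromstring(xml_text)
    paths = {}
    for key in REQUIRED_KEYS:
        node = root.find(key)
        # Reject both a missing element and an element with no text.
        if node is None or not (node.text or "").strip():
            raise ValueError(f"PathConfig.xml is missing a value for <{key}>")
        paths[key] = node.text.strip()
    return paths
```

Calling parse_path_config(SAMPLE_CONFIG) returns a dictionary with one entry per folder. A config that omits any of the three paths raises ValueError before any file is processed.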

The bot reads the config file and extracts the text from each input PDF. If the extracted text is empty, the bot treats the PDF as a scanned PDF and extracts the text via OCR instead. The Python script uses three libraries to detect the language of the text, and the result with the highest confidence percentage is used. After determining the language, the bot moves the PDFs to language-specific folders. As its output, the bot creates these language-specific folders, each containing the PDFs of that language.
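
The listing does not include detectLanguage.py, so the following is only a sketch of the "highest confidence wins" step and the move into a language folder. The function names are hypothetical, and only two detectors (langid and langdetect) are shown. It assumes each detector reports a confidence in the 0..1 range, which for langid requires the norm_probs=True option because its default scores are not normalised probabilities.

```python
import shutil
from pathlib import Path


def pick_language(candidates):
    """Return the (language, confidence) pair with the highest confidence.

    `candidates` is a list of (language_code, confidence) tuples, one per
    detector, with confidences already normalised to the 0..1 range.
    """
    if not candidates:
        raise ValueError("no detector produced a result")
    return max(candidates, key=lambda pair: pair[1])


def move_to_language_folder(pdf_path, output_root, language):
    """Move the PDF into <output_root>/<language>/, creating the folder if needed."""
    target_dir = Path(output_root) / language
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / Path(pdf_path).name
    return Path(shutil.move(str(pdf_path), str(destination)))


def detect_candidates(text):
    """Collect one guess per detector (needs langid and langdetect from pip)."""
    # Imported lazily so the helpers above work without these packages.
    from langid.langid import LanguageIdentifier, model
    from langdetect import detect_langs

    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    langid_guess = identifier.classify(text)  # (code, probability)
    best = max(detect_langs(text), key=lambda guess: guess.prob)
    return [langid_guess, (best.lang, best.prob)]
```

In this sketch, pick_language(detect_candidates(text)) gives the winning language code, which move_to_language_folder then uses as the folder name. How the real script weighs its three libraries against each other is not documented here.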
In AP invoicing processes, invoices need to be fed into IQ Bots based on their language. Without extra information, it is not possible to categorize the PDFs by language. This is where the Language Detection bot comes in: it can detect the language of a PDF without any further information associated with the document.
A scanned PDF is also harder to read than a native one. The Language Detection bot can nevertheless read the text of a scanned document and categorize it by language: it uses OCR to extract the text and the Python script to determine the language of that text.
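
The fallback from native extraction to OCR can be sketched as follows. The two callables, extract_native and extract_with_ocr, are hypothetical stand-ins for the bot's PDF extraction and OCR steps, which run as bot commands rather than as Python code.

```python
def extract_pdf_text(pdf_path, extract_native, extract_with_ocr):
    """Return (text, source), where source is "native" or "ocr".

    `extract_native` and `extract_with_ocr` are hypothetical callables that
    take a path and return the extracted text as a string.
    """
    text = extract_native(pdf_path)
    if text and text.strip():
        return text, "native"
    # Empty or whitespace-only output: treat the file as a scanned PDF.
    return extract_with_ocr(pdf_path), "ocr"
```

Returning the source alongside the text makes it easy to record in the log which files needed OCR.
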
The Language Detection bot can detect 97 languages. Because it combines PDF extraction, OCR, and a Python script, it takes approximately 1.5 seconds per file. In testing, a run over 209 PDFs took 5 minutes, which works out to about 1.4 seconds per file. The bot therefore detects the language of a PDF in a reasonable execution time.

  • Price: Free
  • Bot Security Program: Level 1
  • Downloads: 29
  • Automation Type: Bot
  • Last Updated: August 3, 2020
  • First Published: July 31, 2020
  • Enterprise Version: A2019

See the Bot in Action

PathConfig
InputFolderBeforeBotRun
InputFolderAfterBotRun
DetectedLanguageFolders AfterBotRun
OutputFolderAfterBotRun
LogFileAfterBotRun

Setup Process

Install

Download the Bot and follow the instructions to install it in your AAE Control Room.

Configure

Open the Bot to configure your username and other settings the Bot will need (see the Installation Guide or ReadMe for details).

Run

That's it - now the Bot is ready to get going!

Requirements and Inputs

  • A2019 must be installed and configured.
  • Python 3.8.x must be installed and configured.
  • The Python libraries langid, langdetect, and chardet must be installed using pip. The shutil and math modules are part of the Python standard library and need no separate installation.
  • PathConfig.xml must be updated, and its location must be set inside the bot.
  • The input folder must contain the PDF documents.
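
Before running the bot, the Python side of these requirements can be checked with a short script. This is a sketch rather than part of the bot package: the function names are hypothetical, and it assumes the pip packages are importable under the module names langid, langdetect, and chardet.

```python
import importlib.util
import sys

# Third-party modules the listing asks for (installed with pip).
THIRD_PARTY = ("langid", "langdetect", "chardet")


def missing_modules(names=THIRD_PARTY):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_version_ok(version_info=sys.version_info, required=(3, 8)):
    """Return True when the major.minor version matches the required one."""
    return tuple(version_info[:2]) == required


def report():
    """Print a short summary of the environment check."""
    print("Python 3.8.x:", "ok" if python_version_ok() else "not matched")
    missing = missing_modules()
    print("Missing libraries:", ", ".join(missing) if missing else "none")
```

Calling report() prints whether the interpreter is a 3.8.x release and lists any of the three libraries that still need to be installed with pip.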