
How to extract data from documents with Power Platform?


Bank statements, account statements, contracts, application forms, checks, tax documents… All companies have both physical and digital files that need to be reviewed by a team to extract data from these documents – this means many hours of work that could be better spent on tasks that add value to your business. This is another use case for Power Platform to the rescue! 

Save hours of manual work with the help of Azure’s Cognitive Services, OpenAI or ChatGPT along with Power Automate. Power Platform tools let you build automation that scans your documents and extracts key data points from them. 

In this blog we will talk about how our Power Platform consulting services have helped our clients to seamlessly process multiple file formats and layouts to extract relevant data from documents. Are you ready to discover the joy of automation? 

What is document data extraction?

Extracting data means obtaining relevant information from your documents. Some examples of documents you can extract data from include: 

  • Invoices
  • Purchase Orders
  • Bank Statements
  • Account Statements
  • Contracts
  • Job application forms
  • Tax forms
  • Business cards
  • US Checks
  • Health insurance documents 

Whether your documents have a pre-defined structure or are unstructured, you can extract text, paragraphs, tables or specific fields from digitized or hand-written documents. You can apply document extraction to different file formats: JPG, PDF, and PNG.
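As a small illustration, a pre-check like the one below could filter incoming files down to the formats mentioned above before they are sent for extraction. This is a minimal Python sketch: the function names are hypothetical, and the extension set is an assumption based only on the formats named in this article (with ".jpeg" added as a common variant of JPG), not a list taken from any particular service.

```python
from pathlib import Path

# Assumption: formats named in this article, plus ".jpeg" as a JPG variant.
# Adjust this set to whatever your extraction service actually accepts.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".pdf", ".png"}


def is_supported(filename: str) -> bool:
    """Return True when the file extension is one of the supported formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(filenames):
    """Separate file names into (supported, unsupported) lists, keeping order."""
    supported = [name for name in filenames if is_supported(name)]
    unsupported = [name for name in filenames if not is_supported(name)]
    return supported, unsupported
```

Files that land in the unsupported list could be converted first or routed to a person for manual handling.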

Types of documents for data extraction

Structured documents

Structured documents are files whose format, or an approximation of it, you know in advance. For example, checks and purchase orders have similar data points located in similar places. 

You can create your own Artificial Intelligence model to extract data from these types of documents and train it on sample documents you already have. Some services, like Azure’s Document Intelligence, offer pre-built models for a wide variety of document types, such as invoices or US tax forms, so you can extract data without having to train a model yourself. 
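The choice between a pre-built model and a model you train yourself can be sketched as a simple lookup. The Python below is only an illustration: the function name and the custom model name in the usage note are hypothetical, and the pre-built model IDs are the ones we believe Azure Document Intelligence publishes, so please verify them against the current service documentation before relying on them.

```python
from typing import Dict, Optional

# Assumption: model IDs as we understand them from the Document Intelligence
# documentation. Verify against the current docs; IDs can change between versions.
PREBUILT_MODELS = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "identity_document": "prebuilt-idDocument",
    "contract": "prebuilt-contract",
    "us_tax_w2": "prebuilt-tax.us.w2",
}


def choose_model(document_type: str,
                 custom_models: Optional[Dict[str, str]] = None) -> str:
    """Prefer a pre-built model; fall back to a custom-trained one if configured."""
    key = document_type.strip().lower()
    if key in PREBUILT_MODELS:
        return PREBUILT_MODELS[key]
    if custom_models and key in custom_models:
        return custom_models[key]
    raise LookupError(f"No model configured for document type: {document_type!r}")
```

For example, `choose_model("invoice")` picks the pre-built invoice model, while `choose_model("purchase_order", {"purchase_order": "custom-purchase-order-v1"})` falls back to a custom model you would have trained on your own samples.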

Click here if you want to learn how Power Platform can help you read data from invoices.

Unstructured documents

Unstructured documents are free-form documents that do not have a set structure, so the information you need can appear in places that are hard to predict. Contracts and agreements are typical examples. 

Challenges in unstructured document data extraction

Unstructured documents were quite a challenge before NLP (Natural Language Processing) models became so advanced and popular (e.g., ChatGPT). Now, with Artificial Intelligence services, you can write efficient prompts that ask an NLP engine to extract signees, companies, effective dates, or any other information you need from free-form documents.
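To make the idea of an extraction prompt more concrete, here is a minimal Python sketch. It only builds the prompt text and parses a JSON reply; the call to the NLP service itself is left out, and the function names, field names and prompt wording are all assumptions for illustration rather than a recommended template.

```python
import json


def build_extraction_prompt(document_text: str, fields) -> str:
    """Build a prompt asking the model to return the requested fields as JSON."""
    field_list = ", ".join(fields)
    return (
        f"Extract the following fields from the document below: {field_list}.\n"
        "Reply with a single JSON object that uses exactly these field names "
        "as keys. Use null for any field that is not present.\n\n"
        f"Document:\n{document_text}"
    )


def parse_extraction_reply(reply: str, fields) -> dict:
    """Parse the JSON reply, keeping only requested fields (missing ones become None)."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object in the reply")
    return {name: data.get(name) for name in fields}
```

Asking for a fixed JSON shape makes the reply easier to store in a database afterwards, and dropping any keys that were not requested keeps unexpected output out of your records.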

Tools for unstructured data extraction

Azure’s Document Intelligence has a set of pre-trained and ready-to-use models you can leverage to extract data from contracts with a wide list of pre-defined fields. 

Azure OpenAI and Azure Cognitive Search are also great alternatives for any other type of document you may want to read that is not covered by Document Intelligence. 

Methods of document data extraction

Optical Character Recognition (OCR) has been around for quite a while, and it’s still widely used. 

But with all the new Artificial Intelligence tools being developed, AI-based technologies such as Natural Language Processing (NLP) and Machine Learning (ML) are now the main methods of document extraction, and probably among the most efficient too. 


Tools for document data extraction

We mainly recommend Azure’s Document Intelligence for any data extraction service, as it has many benefits: 

  • Pay as you go 
  • Service is constantly improved 
  • You can take advantage of pre-built and ready-to-use models 
  • Great integration with Power Platform and other Microsoft 365 tools 

There are other great alternatives: 

  • OpenAI and ChatGPT 
  • AI Builder 
  • Gemini 

Power Platform Tools for Data Extraction 

The Power Platform works together with other Microsoft technologies to provide a streamlined process for extracting data from both structured and unstructured documents. 


Azure’s Document Intelligence 

  • Document Intelligence lets you read data from a wide variety of document types: Contracts, Agreements, Business Cards, Invoices, Purchase Orders, Paychecks, Receipts. 

Power Automate 

  • It allows you to read documents from email or from almost any file repository where your documents are stored. 
  • Connect to the AI service to extract document data and add it to the data repository of your choice (Dataverse, SQL, Airtable, SharePoint) 
  • Send automatic notifications to staff when new documents are received 
  • Avoid double data entry by integrating with the other systems that need the extracted data 
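In Power Automate these steps are configured as a flow rather than written as code, but the sequence can be sketched in a few lines of Python to show how the pieces fit together. Everything here is a stand-in: `DocumentPipeline`, `fake_extract` and the in-memory lists are hypothetical placeholders for the AI service, the data repository and the notification step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DocumentPipeline:
    # Stand-in for the call to the AI extraction service.
    extract: Callable[[bytes], Dict[str, str]]
    # Stand-in for the data repository (Dataverse, SQL, SharePoint, ...).
    records: List[Dict[str, str]] = field(default_factory=list)
    # Stand-in for notifications sent to staff.
    notifications: List[str] = field(default_factory=list)

    def handle_new_document(self, filename: str, content: bytes) -> Dict[str, str]:
        """Notify staff, extract the data, then store it once in the repository."""
        self.notifications.append(f"New document received: {filename}")
        extracted = self.extract(content)
        record = {"source_file": filename, **extracted}
        self.records.append(record)
        return record


def fake_extract(content: bytes) -> Dict[str, str]:
    """Hypothetical extractor used only for illustration."""
    return {"byte_count": str(len(content))}
```

Calling `DocumentPipeline(extract=fake_extract).handle_new_document("invoice-001.pdf", b"abc")` walks through the same order as the list above: notify, extract, store. Because the record is written once and can then be passed to other systems, nobody has to type the same data twice.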

Power Apps 

  • Add a layer of validation to your process to make sure data is extracted accurately. A beautiful interface can be built in Power Apps so your team can see a centralized list of all processed documents and the relevant information extracted from each one. After validation is complete, the document can move to the next stage. 

 

Benefits of automating document data extraction 

First of all – the joy of automation! Saving your team from having to perform such manual tasks will help them focus on more engaging and added-value tasks. 

Second of all – accuracy. We’re human, and we make mistakes: when extracting data from documents and copying and pasting between systems and screens, small errors naturally happen. With an automated data extraction process, you can rely on modern technologies to extract the data with high accuracy and fewer errors. 

Other benefits include scalability and flexibility: you can keep sending new documents for extraction, and the service will process them. AI-based technologies keep improving, and the extraction process can be configured to handle many different layouts and formats. 

Challenges and how to overcome them

Poor-quality scans

AI-based technologies can read both digital and hand-written documents, although the better the scan quality, the better the results. You can also add a layer of validation so someone on your team can check the results and adjust them if needed (see “Maintaining Data Accuracy” below).

Non-standardized document formats

Unstructured documents can become a challenge, but by picking the right service (OpenAI, AI Builder, Document Intelligence) for the right use case, you can overcome issues with free-form documents. 

Complex data structures

Azure’s Document Intelligence service lets you read different types of documents with many types of layout elements, such as paragraphs, titles, and tables.
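Tables are a good example of why structure matters. Layout-style extraction services typically return a table as a flat list of cells, each with a row and column position, and it is up to you to rebuild the grid. The Python sketch below assumes each cell is a dictionary with `row_index`, `column_index` and `content` keys; those key names and the function name are assumptions for illustration, so map them to whatever your service actually returns.

```python
def cells_to_grid(cells, fill=""):
    """Rebuild a row-by-row grid from a flat list of table cells.

    Each cell is assumed to be a dict with "row_index", "column_index"
    and "content" keys. Positions with no cell are filled with `fill`.
    """
    cells = list(cells)
    if not cells:
        return []
    row_count = max(cell["row_index"] for cell in cells) + 1
    column_count = max(cell["column_index"] for cell in cells) + 1
    grid = [[fill] * column_count for _ in range(row_count)]
    for cell in cells:
        grid[cell["row_index"]][cell["column_index"]] = cell["content"]
    return grid
```

Once the grid is rebuilt, each row can be written to your data repository as a record, for example one row per bank statement transaction.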

Handling mixed content

Natural Language Processing (NLP) and services like OpenAI and ChatGPT can read through a document and understand the context. 

Maintaining Data Accuracy

Automated data extraction is great and allows you to decrease errors. Still, in case a validation layer is required, a great interface can be developed in Power Apps so your staff can access the list of scanned documents and validate the extracted data. 
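One common way to decide what needs a human look is to use the confidence score that many extraction services return with each field. The Python sketch below assumes each field arrives as a (value, confidence) pair with confidence between 0 and 1. The function name and the 0.8 threshold are arbitrary choices for illustration, so tune the threshold to your own documents.

```python
def split_for_review(fields, threshold=0.8):
    """Split extracted fields into auto-accepted values and values needing review.

    `fields` maps a field name to a (value, confidence) pair.
    Returns two dicts: (accepted, needs_review).
    """
    accepted, needs_review = {}, {}
    for name, (value, confidence) in fields.items():
        # Fields at or above the threshold pass; the rest go to a person.
        target = accepted if confidence >= threshold else needs_review
        target[name] = value
    return accepted, needs_review
```

The fields in the second dictionary are the ones you would surface in the validation screen, so your team only spends time on the values the service was unsure about.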

Best practices for successful document data extraction

  • Use and prioritize pre-built models when they are available and work for your use case. Pre-built models in Azure’s Document Intelligence service are pre-trained by Microsoft and you can leverage them to start extracting data from your documents quickly. 
  • If you train your own model, use good-quality samples and update the model regularly with any new format or layout. 
  • Implement a validation layer to make sure all data is extracted correctly. A Power Apps application can be built to show all extracted fields and a copy of the document on the same screen so your team can quickly validate the results and approve or confirm before moving the document to the next processing stage. 

Applications of document data extraction

Different types of documents can be scanned and have data extracted from them. Some examples include but are not limited to:

Finance and Accounting

  • Bank Statements 
  • Account Statements 
  • Tax forms 
  • Income Statement 
  • Balance Sheet 
  • Cashflow 

Legal sector

  • Contracts 
  • Agreements 
  • Complaints 
  • Marriage Certificates (US) 
  • Identity documents

Healthcare

  • Insurance forms 
  • Insurance cards (US) 
  • Identity Documents 
  • Business cards 
  • Checks 

Human Resources

  • Job Application forms 
  • Paycheck (Pay Stubs) 
  • Identity documents 

Frequently asked questions

Can I extract data from scanned documents?

Yes. Scanned documents are a big use case, and relevant information can be extracted from them. Most modern technologies can extract data from both digital and hand-written documents. The better the scan quality, the better the results. 

What are the best tools for extracting data from PDFs?

Our top pick is Azure’s Document Intelligence, as it has plenty of options for many use cases. Every business need is different, though, so you may need something else, like OpenAI. 

Power Automate integrates really well with Azure’s services, and it’s the perfect tool to orchestrate an end-to-end process that goes from receiving PDFs through email to storing data in a database. 

How do I process unstructured documents?

Use pre-built models for documents like contracts or connect to Natural Language Processing (NLP) services like OpenAI, where you can send a prompt to request relevant information to be extracted from a text. 

Power Automate lets you connect to both Document Intelligence and OpenAI. 

Is automated data extraction secure?

We usually recommend enabling data extraction with Microsoft’s services so all your data stays in your environment or tenant. Microsoft operates under the principles of responsible AI (you can read more information here).

Whether you have a document in mind or want to have a conversation on how to start extracting data from your documents – contact us to know how you can achieve all of this with our Power Platform consulting services!