Have you ever wanted to extract data that isn’t available using Connectors? In this post, I’ll explain web scraping with Power Automate and give you some great pointers on the best approach

Web Scraping with Power Automate

Connectors Vs Desktop Flows

Connectors are the best way to get your hands on the data you need.  Connectors are the ‘official’ APIs between the Power Platform and other systems.  They are generally easy to use and serve up the data very quickly

Even if ready-made Connectors aren’t available, building a custom Connector to a freely available web service is a fall-back option and isn’t as complex as it sounds.  Check out my earlier posts for details on how to do this

However, there is a mass of data out there on plain old web pages where connectors cannot be used.  With data presented on the Web, the only option is to scrape the data yourself.  Again, that sounds like it could be difficult, but it doesn’t have to be

Desktop Flows (previously called UI Flows) can be used in Power Automate Cloud to scrape data from web pages.  That’s the first option.  The second option is to use Power Automate Desktop.  Which option do you choose?  That’s what I’ll take you through

Power Automate Desktop has been around now for several months.  If you haven’t yet used it, the first thing you’ll need to do is download and install it, noting that you’ll need Windows 10 Pro.  Once you’ve done that, you’re ready to start

Now let’s work through the up-front decisions to decide on the best approach and make your data scraping easier

Select your Target Web Site

Often the same data is available from more than one website, so choose an established site that isn’t likely to change its page structure or URLs.  It’s also worth inspecting the code, which is easily done using Chrome’s developer tools.  For example, most sites present tabular data in HTML tables, which makes scraping much easier.  Other sites go out of their way to use dynamic content to make a scraper’s life as difficult as possible
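To see why tabular data is the easy case, here’s a minimal Python sketch, purely as an illustration of the underlying structure (nothing to do with Power Automate itself, and the table snippet is made up): HTML tables mark every row and cell explicitly, so pulling the values out is almost mechanical.

```python
from html.parser import HTMLParser

# Hypothetical snippet in the shape most league-table pages use
SAMPLE = """
<table>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Arsenal</td><td>50</td></tr>
  <tr><td>Liverpool</td><td>47</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collects each <tr> as a list of its cell texts."""
    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self._row = None      # row currently being built
        self._in_cell = False # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

parser = TableParser()
parser.feed(SAMPLE)
print(parser.rows)  # [['Team', 'Points'], ['Arsenal', '50'], ['Liverpool', '47']]
```

Because the structure is this regular, tools like Power Automate Desktop’s Web Recorder can detect a table and extract the whole thing for you; dynamically generated content offers no such regular markup, which is what makes it hard.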

Pre-scrape or Scrape On-demand?

The next question is whether you should scrape the data immediately when it’s required, or pre-scrape it so the data is already available when needed.  This decision depends on a couple of factors.  If the data is updated very frequently (sometimes several times a second), such as the £/$ foreign exchange rate, then the answer is clear: scrape the data when it is needed so it is bang up to date

However, consider the example of scraping the data for the English Football Premier League (EPL) table.  The data is updated much less frequently.  In this case, the EPL table would be a good candidate to pre-scrape daily

Another factor to consider is the amount of data you require.  Scraping isn’t an instantaneous process; it is FAR slower than using a Connector.  Scraping the EPL table can take 30 seconds or longer depending on the approach and the number of columns selected

In these two examples the decision is straightforward.  The £/$ exchange rate is a single field and changes very frequently, so it should be scraped on demand.  The EPL table contains a large data set and is updated infrequently, so it is a great candidate to be pre-scraped and stored in a Data Source (Dataverse, SharePoint, Excel, etc) so it is ready to be served up quickly when required

How Complex is the Data Structure?

Now consider the complexity of the data to be scraped.  For one or two items or fields of data, complexity isn’t likely to be an issue.  As I discussed earlier, scraping websites where a large amount of data is presented in an HTML table isn’t particularly complex either

For these basic data structures, which probably cover the large majority of the web scraping you’ll want to do, use the Web Recorder in Power Automate Desktop.  You’ll be able to extract straightforward data structures easily.  Power Automate Desktop is even able to recognise when you want to scrape the full set of data from a table and it will auto-generate the code for you.  If required, the output from the Web Recorder can be supplemented with the in-built Power Automate Desktop Actions

However, some websites do go out of their way to make their content difficult to scrape.  For these more complex structures, knowledge of HTML, CSS and XML is required and you will need to use Selenium with Power Automate Cloud.  Install both the Selenium IDE and the Power Automate Chrome extensions.  Then when you create a Desktop Flow, you’ll be able to launch the Selenium window directly from Power Automate Cloud

Selenium also has a Web Recorder, similar to Power Automate Desktop’s, and its output can likewise be supplemented with direct coding.  Direct coding gives you the ability to extract data from difficult sites in ways that aren’t yet possible with Power Automate Desktop.  Selenium is quite intuitive, so it can be learned fairly quickly.  Microsoft now refers to Selenium IDE flows as ‘legacy’, which implies that the functionality could eventually be replaced by Power Automate Desktop Actions.  That will be great when it happens, but it’s not there yet
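The HTML/CSS/XML knowledge mentioned above mostly boils down to writing selectors: expressions that pick out one element from a page by its tag or attributes.  Here’s a hedged sketch of the idea using only Python’s standard library on a made-up, well-formed fragment (real Selenium commands run the same kind of XPath or CSS selectors against a live browser, which I’m not doing here):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment of the kind a currency page might render
PAGE = """
<div>
  <span class="label">GBP/USD</span>
  <span class="rate">1.38</span>
</div>
"""

root = ET.fromstring(PAGE)

# XPath-style query: find the <span> whose class attribute is "rate",
# anywhere below the root element
rate = root.find(".//span[@class='rate']").text
print(rate)  # 1.38
```

Sites that generate their markup dynamically tend to use auto-generated or shifting class names, which is exactly why hand-written selectors (and some trial and error in the browser’s inspector) become necessary on the difficult sites.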

Connectivity Considerations

Also bear in mind that Power Automate Desktop only runs on Windows 10 Pro or Windows Server.  If you want the scrape to run at a fixed time on your laptop or desktop PC, you’ll need the machine running and the On-premises Data Gateway installed.  You then use a Power Automate Cloud Scheduled Flow to start the Power Automate Desktop Flow.  If you think this is a bit messy, I agree with you.  I hope it will soon be possible to run Scheduled Flows directly within Power Automate Desktop

Alternatively, you could host Power Automate Desktop on an Azure VM which avoids the dependence on your laptop or PC, and doesn’t require the On-Premises Data Gateway, but may incur a cost.  You’ll still need to initiate the Scheduled Flow from Power Automate Cloud

As you can see, the benefit of running a Desktop Flow from Power Automate Cloud is that the scrape runs completely unattended

Summary

So, there you have it.  All you need to get started with Power Automate web scraping.  If you haven’t yet given it a go then I think you’ll be pleasantly surprised how easy it is

In the next post, I will be scraping the EPL table from http://bbc.co.uk/sport.  It is formatted using an HTML table structure, so I’ll be using Power Automate Desktop running on my laptop.  I’ll take you through it step by step
