SQL Server - How to capture data from a webpage (Webscraping with HttpRequest) and insert data into database with SQLCLR (C #)

Views: 1.664
Reading Time: 8 minutes

Hey guys!
All right with you ?

In this post I would like to share a solution that I needed to develop in my company to do a web scraping and extract agency data from the Post Office website (http://www2.correios.com.br/sistemas/agencias/). As you might imagine, in order to achieve this solution and bring data into my database quickly, I used my old companion SQLCLR to accomplish this not-so-simple task, since the post office site had some errors in the site's XHTML and It ended up complicating my life a little to handle the data.

To be able to handle this data, I will use regular expression techniques (Regexp or Regex), which I demonstrated to use in my last post. SQL Server - How to use regular expressions (RegExp) in your databaseTherefore, I recommend reading this Regexp article before this article here.

Information at the source site:

Information Returned by Stored Procedure Using Web Scraping:

How does the solution work in theory

I will try to summarize the steps to perform web scraping and explain how I did to achieve this result.

  • The first step is to identify the URL used to return data via Ajax on the site.

  • Now that I have identified the URL, the next step is to make sure the city name has no accents (Ex: Victory). For this, I will use the function fncRemove_Acentuacao (I will release the codes below).
  • Another important point is encoding the URL to transform spaces and other characters as HTML entities. (Ex: Space =% 20). For this, I will use the function fncCodifica_URL
  • Now I can assemble the full URL to submit the HttpRequest request. For this, I will use the stpWs_Requisition Stored Procedure, which performs the request and returns the result in an output string.
  • The return of this request is the HTML code returned by the site, which is in this format:

  • As you can see, it is not a trivial task to handle this XHTML, especially as it contains some elements that are open and not closed, generating an error when trying to convert to XML and handle by SQL Server. Given this, I will treat the return using another technique: Regular Expression (RegExp)
  • I use the mask “<td. *?> (. *?) </td>” in the fncRegex_Match function to return all values ​​within the tags td. In addition, I also use the fncRemove_Html_String functions to transform an HTML string to text, and also the fncRecover_Letters_Sem_Special_Character function to remove any special characters in the string.
  • From there, I will apply some filters to separate and categorize the results of these treatments so that I can correctly classify the information and return this data in table format.

Solution Source Code

Now that I have explained in theory how the solution works, the time has come to apply it and demonstrate it in practice.

SQLCLR assembly source code, which will be used in this article (T-SQL):

StpBusca_Agency_Courier (T-SQL) source code:
View source

SQLCLR Objects Source Code

Above, I provided the source for creating the ready-made assembly with SQLCLR functions and also the database objects (functions and Stored Procedure) to use the assembly. This greatly facilitates the use of these objects and does not require C # knowledge or having Visual Studio on your machine to compile the SQLCLR DLL.

If you are a guy who likes to know everything in depth and understand the code of SQLCLR, I will also make available here the original code of these objects, for you to include in your SQLCLR project, for example.

If you do not know SQLCLR, you can not miss the following articles:
Introduction to SQL Common Language Runtime (CLR) in SQL Server
SQL Server - Performance Comparison between Scalar Function and CLR Scalar Function
Other posts about SQLCLR

Procedure source code stpWs_Requisition (C #):
View source

FncCode_URL (C #) Source Code:
View source

FncRemove_Acentuacao (C #) Source Code:
View source

FncRegex_Match (C #) Source Code:
View source

FncRemove_Html_String (C #) Source Code:
View source

Fnc Source CodeReturns_Letter_Special_Character (C #):
View source

That's it folks!
I hope you enjoyed this post and see you next time!

Hug!