Effortless Web Scraping in C# (ASP.NET) Using HTML Agility Pack: 10-Minute Guide & GitHub Resources
Web scraping is a powerful tool for a variety of applications. Almost every business eventually needs to automate the retrieval of data from external sources, and this kind of work can be highly valuable to a business that would otherwise spend a lot of time on manual upkeep.
In this guide, we’ll see how to scrape data from websites using HTML Agility Pack, the most popular web scraping library for C# and ASP.NET. It is available as a NuGet package.
We’ll also look at how we can get data directly from an API source, using Newtonsoft.Json to deserialize response data into C# objects.
Download the starting project here
Download the finished project here. Postman collection included.
You can scrape any website on the world wide web, albeit with varying levels of difficulty.
Sophisticated websites like Amazon.com will employ counter-measures to detect automated traffic and prevent bots from scraping their page data. If your goal is to scrape these websites, then you’ll probably need to use paid intermediary services that provide web scraping tools to work around these challenges.
If you are going after simpler websites in niche industries, things will be quite a bit easier. That’s where most of the opportunity is anyway! So let’s get on with it.
Add Web Scraping Service to CRUD App
We’ll attempt to get the top 100 board games from https://boardgamegeek.com/ with their title and average rating, and then save each one to our database.
The starter GitHub repo is a simple application. It has one entity called Product, with a service and controller to perform CRUD operations. You can also get the finished version.
We need to create a new service that will contain the web scraping code; let’s call it AutomationService. This service will have one method called RunAutomation. Create this as an async method that returns a Task<int>. The value returned will be the number of records changed in the database. Here is how the interface and service should look:
IAutomationService:
using crudApp.Services.AutomationService.DTOs;

namespace crudApp.Services.AutomationService
{
    public interface IAutomationService
    {
        Task<int> RunAutomation();
    }
}
AutomationService:
using crudApp.Persistence.Contexts;
using crudApp.Persistence.Models;
using crudApp.Services.AutomationService.DTOs;

namespace crudApp.Services.AutomationService
{
    public class AutomationService : IAutomationService
    {
        private readonly ApplicationDbContext _context; // database context

        public AutomationService(ApplicationDbContext context)
        {
            _context = context;
        }

        public async Task<int> RunAutomation()
        {
            int recordsUpdatedTotal = 0;

            // url to retrieve html document
            string url = $"https://boardgamegeek.com/browse/boardgame/";

            // TODO: code to do some web scraping!

            // return the records change count
            return recordsUpdatedTotal;
        }
    }
}
Next we’ll add HTML Agility Pack, so that we can read the web page structure. By providing XPath selection criteria, we can get specific about which elements we want to retrieve.
Add HTML Agility Pack
We need to install HTML Agility Pack, a NuGet library that provides all the web scraping capability in C#. Right-click the solution and choose Manage NuGet Packages, then search for HTML Agility Pack and install the latest version.
Import HTML Agility Pack with a using statement at the top of the AutomationService class. With this, we can use the HtmlDocument object to read web page content. There are two key methods that come with HTML Agility Pack: SelectNodes finds all nodes matching the given criteria, and SelectSingleNode returns only the first match.
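To illustrate the difference, here is a minimal sketch; the HTML and the selectors are made up for this example:
using HtmlAgilityPack;

HtmlDocument doc = new();
doc.LoadHtml("<html><body><div class='item'>A</div><div class='item'>B</div></body></html>");

// SelectNodes returns every node matching the XPath, or null when nothing matches
HtmlNodeCollection items = doc.DocumentNode.SelectNodes(".//div[@class='item']");
Console.WriteLine(items?.Count);     // 2

// SelectSingleNode returns only the first match, or null
HtmlNode first = doc.DocumentNode.SelectSingleNode(".//div[@class='item']");
Console.WriteLine(first?.InnerText); // A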
Define Selection Criteria with XPath
The way that DOM content is selected is with XPath. XPath has been around since ancient times, and is a standard for selecting HTML content. To quote W3 Schools, “XPath can be used to navigate through elements and attributes in an XML document.”
It has its own syntax and works a bit like regular expressions. There’s nothing to install; XPath is just how you write selection criteria.
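For a feel of the syntax, here are a few illustrative patterns; these are generic examples, not tied to any particular site:
"//div"                              selects every div in the document
".//div"                             selects divs relative to the current node
".//td[@class='price']"              selects td elements whose class attribute is exactly 'price'
".//a[contains(@class, 'primary')]"  selects anchors whose class list contains 'primary'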
The Board Game Geek top 100 page lists games in a table with class 'collection_table', with each game in a row with id 'row_'. We could pass an XPath string like this to the SelectNodes method to obtain all the row elements: ".//table[@class='collection_table']//tr[@id='row_']"
An HtmlNodeCollection will be returned, with each element as an HtmlNode.
Add the following code to the AutomationService:
using crudApp.Persistence.Contexts;
using crudApp.Persistence.Models;
using HtmlAgilityPack;
using System.Globalization;

namespace crudApp.Services.AutomationService
{
    public class AutomationService : IAutomationService
    {
        private readonly ApplicationDbContext _context; // database context

        public AutomationService(ApplicationDbContext context)
        {
            _context = context;
        }

        public async Task<int> RunAutomation()
        {
            int recordsUpdatedTotal = 0;
            HttpClient httpClient = new();

            // url to retrieve html document
            string url = $"https://boardgamegeek.com/browse/boardgame";
            string html = await httpClient.GetStringAsync(url);

            HtmlDocument htmlDocument = new();
            htmlDocument.LoadHtml(html);

            // XPath to select all product rows
            string xpath = ".//table[@class='collection_table']//tr[@id='row_']";
            HtmlNodeCollection productNodes = htmlDocument.DocumentNode.SelectNodes(xpath);

            // check if we found anything
            if (productNodes == null)
            {
                return 0;
            }

            // loop through each html node returned
            foreach (HtmlNode productNode in productNodes)
            {
                // get the product title link and its inner text
                string titleText = productNode.SelectSingleNode(".//a[contains(@class, 'primary')]").InnerText.Trim();
                Console.WriteLine(titleText);

                // get the second rating cell (the average rating) and its inner text
                string ratingText = productNode.SelectSingleNode("(.//td[@class='collection_bggrating'])[2]").InnerText.Trim();

                // try to parse the rating as a decimal
                if (decimal.TryParse(ratingText, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal rating))
                {
                    Console.WriteLine($"Rating: {rating}");
                }
                else
                {
                    Console.WriteLine("invalid rating");
                }
            }

            // TODO: Persist to database

            // return the records change count
            return recordsUpdatedTotal;
        }
    }
}
We can loop through each HtmlNode of the HtmlNodeCollection and perform further actions. This XPath criteria, ".//a[contains(@class, 'primary')]", will return any anchor element whose class list contains 'primary'. If an element has multiple classes, it’s best to use contains() to select it. With this method and selection criteria we can select the title of the board game.
Do the same now to obtain the average rating. Here you can match the class name directly with ".//td[@class='collection_bggrating']"; since each row has more than one cell with that class, the code selects the second match, which holds the average rating. Often you’ll need to clean up the final inner text of whatever element you retrieved: trim any whitespace and attempt to parse the decimal value.
To test that everything is working, we can output the results to the console window.
Remember to wire up the service to the program’s service container with the following in program.cs:
builder.Services.AddTransient<IProductService, ProductService>();
builder.Services.AddTransient<IAutomationService, AutomationService>(); // add the automation service
And finally, create the AutomationsController so that we can call the service from the outside world. This will be a POST endpoint:
using crudApp.Services.AutomationService;
using Microsoft.AspNetCore.Mvc;

namespace crudApp.Controllers
{
    [Route("api/[controller]")]
    [ApiController]
    public class AutomationsController : ControllerBase
    {
        private readonly IAutomationService _automationService;

        public AutomationsController(IAutomationService automationService)
        {
            _automationService = automationService;
        }

        [HttpPost]
        public async Task<IActionResult> Post()
        {
            try
            {
                var result = await _automationService.RunAutomation();
                return Ok(result);
            }
            catch (Exception ex)
            {
                return BadRequest(ex.Message);
            }
        }
    }
}
Test in Postman
Open up Postman and send a new POST request to the controller endpoint we just added. You can leave the request body blank.
Once the request runs, you should see the output in the console window. If you see a list of board games with their ratings, everything is working.
Persist Scraped Data to Database
We could persist these changes to our database now quite easily. Modify the Product entity so that it contains a decimal field for Rating.
namespace crudApp.Persistence.Models
{
    public class Product
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public decimal Rating { get; set; } // add a rating field
    }
}
Create a migration and update the database.
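If you are using the EF Core command-line tools, the commands look something like this (the migration name here is just an example):
dotnet ef migrations add AddProductRating
dotnet ef database update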
In the RunAutomation method, add the following code which will create product entities in the database. Upon saving changes, the number of records changed will be returned as a response. Here is the completed code for the AutomationService:
using crudApp.Persistence.Contexts;
using crudApp.Persistence.Models;
using HtmlAgilityPack;
using System.Globalization;

namespace crudApp.Services.AutomationService
{
    public class AutomationService : IAutomationService
    {
        private readonly ApplicationDbContext _context; // database context

        public AutomationService(ApplicationDbContext context)
        {
            _context = context;
        }

        public async Task<int> RunAutomation()
        {
            int recordsUpdatedTotal = 0;
            HttpClient httpClient = new();

            // build a url and retrieve an html document
            string url = $"https://boardgamegeek.com/browse/boardgame";
            string html = await httpClient.GetStringAsync(url);

            HtmlDocument htmlDocument = new();
            htmlDocument.LoadHtml(html);

            // XPath to select all product rows
            string xpath = ".//table[@class='collection_table']//tr[@id='row_']";
            HtmlNodeCollection productNodes = htmlDocument.DocumentNode.SelectNodes(xpath);

            // check if we found anything
            if (productNodes == null)
            {
                return 0;
            }

            // remove existing products
            List<Product> existingProducts = _context.Products.ToList();
            _context.RemoveRange(existingProducts);

            foreach (HtmlNode productNode in productNodes)
            {
                // get the product title link and its inner text
                string titleText = productNode.SelectSingleNode(".//a[contains(@class, 'primary')]").InnerText.Trim();
                Console.WriteLine(titleText);

                // get the second rating cell (the average rating) and its inner text
                string ratingText = productNode.SelectSingleNode("(.//td[@class='collection_bggrating'])[2]").InnerText.Trim();

                // try to parse the rating as a decimal
                if (decimal.TryParse(ratingText, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal rating))
                {
                    Console.WriteLine($"Rating: {rating}");

                    // if all goes well, create a new product
                    Product newProduct = new();
                    newProduct.Name = titleText;
                    newProduct.Rating = rating;
                    _context.Products.Add(newProduct);
                }
                else
                {
                    Console.WriteLine("invalid rating");
                }
            }

            // save changes to the database
            recordsUpdatedTotal = await _context.SaveChangesAsync();

            // return the records change count
            return recordsUpdatedTotal;
        }
    }
}
Open up Postman again to test our script. This time the response will contain the number of records changed. You can check the database to see that new products have been created in the products table. You could also use the Get Products endpoint to retrieve a list of products.
Hopefully this gives some idea of how powerful this kind of development can be in the right scenarios. You could add a background job library like Hangfire to execute scraping scripts on a daily schedule.
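As a rough sketch of what that could look like in Program.cs, assuming the Hangfire.AspNetCore and Hangfire.InMemory packages are installed (the job id and storage choice here are just examples):
using Hangfire;

builder.Services.AddHangfire(config => config.UseInMemoryStorage()); // example storage; swap for a persistent store in production
builder.Services.AddHangfireServer();

// ... after building the app ...
RecurringJob.AddOrUpdate<IAutomationService>(
    "daily-scrape",                      // recurring job id (example)
    service => service.RunAutomation(),  // calls our automation service
    Cron.Daily());                       // run once per day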
In our scripts, we’re navigating the DOM structure by specific elements. If this DOM structure were ever to change, for example if the owners of the website changed their design, you would need to adapt your code. Keep that in mind if you are targeting sites that change frequently.
Obtaining data from API sources
The other way to get data from external sources is by deserializing API response data. This is actually the better way since it’s more reliable and convenient.
If you have the choice of getting data from an API or building scripts to scrape it from pages, go with the API source. In practice, it’s often the case that you’ll need to use a mix of both.
By inspecting the Network tab in Chrome’s developer tools, you can filter for Fetch/XHR to show only API traffic. Inspecting this while interacting with a website’s search tools, buttons, etc. can help you find the APIs that are providing data to the website in the background.
Once you find the API that’s responsible for providing the data you need, check the Headers tab to obtain the API URL. This is what you can use in the scripts to point at the external API and gather data feeds.
Keep in mind that external API sources will often protect their endpoints with auth credentials and will deny any requests that don’t contain a valid bearer token. This can make things difficult when you are being a data pirate on the internet.
A more typical case may be that you are trying to connect to an external supplier or partner API, and they will provide you with authorization keys. In either case, the following method is how it’s done.
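For reference, attaching a bearer token to requests with HttpClient looks something like this (the token value is a placeholder):
using System.Net.Http.Headers;

HttpClient httpClient = new();
// attach the key or token provided by the API owner to every request from this client
httpClient.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", "your-api-token-here");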
Create a Service for API Data
To take a look at a simple example of how we can obtain data from an external API source, we’ll get a list of facts about cats from Cat Facts API. We’ll build a new service to obtain this information directly and deserialize it into C# objects. No authorization keys are needed for the Cat Facts API.
Create a new AutomationServiceV2 class, which will implement the existing IAutomationService interface. In program.cs, swap out the implementation so that we use the new AutomationServiceV2 class. Being able to swap out classes like this is one of the main benefits of Dependency Injection and using interfaces.
builder.Services.AddTransient<IAutomationService, AutomationServiceV2>();
We won’t need HTML Agility Pack in our API retrieval service, but we will need Newtonsoft.Json. Right-click the project and choose Manage NuGet Packages, then search for Newtonsoft.Json and install it.
Add a using statement for Newtonsoft.Json at the top, and add the following code in the RunAutomation method. We’ll just output the individual facts about cats to the console window and not worry about persistence.
using crudApp.Services.AutomationService.DTOs;
using Newtonsoft.Json;

namespace crudApp.Services.AutomationService
{
    public class AutomationServiceV2 : IAutomationService
    {
        public async Task<int> RunAutomation()
        {
            // here is how you can obtain an api source
            string apiUrl = $"https://catfact.ninja/facts";

            HttpClient httpClient = new();
            httpClient.DefaultRequestHeaders.Add("Accept", "application/json;api_version=2");

            HttpResponseMessage response = await httpClient.GetAsync(apiUrl);
            response.EnsureSuccessStatusCode();
            string jsonResponse = await response.Content.ReadAsStringAsync();

            CatFactsResponseDTO responseObject = JsonConvert.DeserializeObject<CatFactsResponseDTO>(jsonResponse);

            // once the data is deserialized into DTOs you can do anything with them
            foreach (CatFactDTO item in responseObject.Data)
            {
                Console.WriteLine(item.Fact);
            }

            int recordsUpdatedTotal = 0;
            return recordsUpdatedTotal;
        }
    }
}
In this code, we’re using the HttpClient to make an asynchronous GET request to the apiUrl and retrieve an HttpResponseMessage, which contains the raw response data.
To convert the response data into C# objects, we need to create DTOs that model the incoming data and then use JsonConvert.DeserializeObject. You don’t need to include every field from an API response, only the ones you are interested in; sometimes you’ll have response objects with hundreds of fields. If you are feeling lazy, ChatGPT can make short work of this. The cat facts response data is pretty simple, though.
Sending a GET request to https://catfact.ninja/facts will show us how the data is getting sent back.
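The response is a paginated wrapper around a list of facts, roughly like this (abbreviated, with placeholder values; extra paging fields are omitted):
{
    "current_page": 1,
    "data": [
        { "fact": "An example cat fact...", "length": 25 },
        { "fact": "Another example cat fact...", "length": 28 }
    ],
    "per_page": 10,
    "total": 300
}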
The DTO structure will model the wrapping object, CatFactsResponseDTO, as well as the list of individual CatFactDTO objects. The DTOs will look like this:
namespace crudApp.Services.AutomationService.DTOs
{
    public class CatFactsResponseDTO
    {
        public int Current_Page { get; set; }
        public List<CatFactDTO> Data { get; set; }
    }

    public class CatFactDTO
    {
        public string Fact { get; set; }
        public int Length { get; set; }
    }
}
Testing the API Script
Open up Postman and send a POST request to the automations endpoint. The body can be left blank.
If you set a breakpoint or check the console window, you can see that the data is retrieved successfully and parsed into DTO objects, with each cat fact written to the output.
In Conclusion
Web scraping is a powerful tool that any developer can take advantage of. It’s helpful to have some front-end knowledge to navigate the DOM structure, but overall it’s pretty easy to do. HTML Agility Pack is an essential NuGet package which provides most of the tools you’ll need when scraping websites.
For more complex scenarios, you may need to mimic an actual browser, manipulate buttons, perform logins, etc. Not to worry though, people have been web scraping since there was a web to scrape, and there are many tools out there like Selenium, which can handle complex automation. You can get pretty far just by using a combination of web scraping and gathering data from exposed API endpoints.
If you are thinking of building a web scraping app, check out the Nano Boilerplate so that you don’t waste hours coding app infrastructure. Focus on writing the important web scraping code instead!
I hope the explanations were clear and relatable. Good luck with your projects and thanks for reading.