Web-Scraping Techniques to Source Reliable Information for ChatGPT

Quickly and efficiently extract useful data from the web to augment the capabilities of LLMs


Introduction

Since the release of OpenAI's ChatGPT, and later its REST API, the technology world has been captivated by the vast capabilities of this paradigm-shifting technology and has been pushing it to its limits. However, for all its strengths, ChatGPT has two pronounced weaknesses: 1) its training data cuts off in 2021, meaning it has no knowledge of more recent information, and 2) it is prone to hallucinations, confidently making up information.

To circumvent these issues, software developers and prompt engineers alike have taken on the task of supplementing ChatGPT's knowledge base with additional information that can yield more relevant, reliable, and informed responses. This process involves a combination of technologies and prompt engineering that we'll discuss in depth in this series.

In this article, we'll investigate how we can write a program to scrape the internet for content that supplements the knowledge of an LLM like ChatGPT beyond its training data. Specifically, we'll gather information from Nourish by WebMD and process articles to create a databank of sources on nutrition that will be used to power a nutrition chatbot.


This is part one of my series on creating a nutrition chatbot powered by data from WebMD using sentence-transformers, FAISS, and ChatGPT.

  • Part 1: Web-Scraping Techniques to Source Reliable Information for ChatGPT

  • Part 2: Leveraging Vector Embeddings and Similarity Search to Supplement ChatGPT's Training Data (coming soon)

  • Part 3: ChatGPT Prompt-Engineering Techniques for Providing Contextual Information (coming soon)


Nourish by WebMD provides a plethora of articles related to nutrition and eating well. Each article is conveniently broken up into multiple sub-headings, with the text underneath each heading providing additional relevant information to the main article topic. Since each sub-heading contains specific information on a sub-topic related to the main topic, instead of saving the entire article as a single "document," so to speak, we'll break the article up by headings into multiple documents.

All of the articles are listed on WebMD's Health & Diet Medical Reference, which is paginated to 23 pages at the time of writing. Each page in this reference contains additional articles that we will need to scrape.

To perform the web scraping, we'll use the BeautifulSoup library along with the requests library. Recall that we have to iterate through each page of the paginated Reference. We'll then use html2text to convert the raw HTML of each article into markdown and split the page by markdown headings into multiple documents.

You can do this project in a Jupyter Notebook or on a platform like Google Colab. Alternatively, you can create a new Python project in an IDE like PyCharm or VS Code. I recommend using a Jupyter Notebook to follow along with this article.

The code in this article is also available for you to follow along (and run yourself) on Google Colab.

Getting Started

Let's first install the necessary dependencies for this project:

pip install requests beautifulsoup4 html2text

Next, we'll define some custom request headers that spoof a web-browser client; we'll use these whenever we make requests:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

Crawling through the Reference

Because each article we want to scrape is located within WebMD's Health & Diet Medical Reference, and the Reference itself is paginated, we need to first create a web-crawler program to identify each link we will need to scrape.

To do so, let's first make a get_links_from_page function that will retrieve all of the links to articles from a specific page within the Reference:

import requests
from bs4 import BeautifulSoup


def get_links_from_page(index):
  # Send a GET request to the page we want to crawl
  url = f"https://www.webmd.com/diet/medical-reference/default.htm?pg={index}"
  response = requests.get(url, headers=headers)

  # Parse the HTML content using BeautifulSoup
  soup = BeautifulSoup(response.content, 'html.parser')

  # Find the ul element with the links to articles
  ul_element = soup.find('section', class_='dynamic-index-feed').find('ul', class_='list')

  # Find all the anchors (a tags) within the ul element
  anchors = ul_element.find_all('a')

  # Extract the href attribute from each link
  links = [anchor['href'] for anchor in anchors]

  return links
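
Before moving on, it's worth sanity-checking this function on the first page of the Reference. Keep in mind that whether the hrefs come back as absolute or relative URLs depends on WebMD's markup at the time you run this, so the output is only illustrative:

# Fetch the article links from the first page of the Reference
links = get_links_from_page(1)

# Preview how many links were found and what the first few look like
print(len(links))
print(links[:3])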

Next, we'll create a get_all_links_from_reference function that will iterate through all 23 pages in the Reference and return all of the links to articles:

def get_all_links_from_reference():
  all_links = []
  for index in range(1, 24):
    all_links.extend(get_links_from_page(index))
  return all_links
Once we run this function, we'll have all of the links in the WebMD Health & Diet Reference, and we can move on to scraping the individual articles for their content.

Scraping and Processing Articles

As previously discussed, we will break up each article into multiple "documents" separated by headings within the article. Let's make a utility Document class to help us structure our code and keep track of the title and URL of the article, as well as the content of the document:

class Document:
  def __init__(self, title, url, content):
    self.title = title
    self.url = url
    self.content = content

We'll also write a brief function that utilizes requests and BeautifulSoup to scrape a given article via its URL:

import requests
from bs4 import BeautifulSoup


def scrape_webpage(url):
  response = requests.get(url, headers=headers)
  soup = BeautifulSoup(response.content, 'html.parser')
  return soup

We can now begin processing the article HTML into documents. First, let's write a function that converts the webpage into markdown text:

from html2text import html2text


def webpage_to_markdown(soup):
  # Find the article body, either by `class_='article__body'` or `class_='article-body'`
  article_body = soup.find(class_='article__body')
  if article_body is None:
    article_body = soup.find(class_='article-body')

  # Extract the HTML content from the article body
  html_content = str(article_body)

  # Convert the HTML content into markdown
  markdown_content = html2text(html_content, bodywidth=0)
  return markdown_content

For the sake of making the content more legible for ChatGPT, we'll make a function that strips links from the markdown content and replaces them with just the text of the link. We can do so using regular expressions, like so:

import re


def strip_links_from_markdown(markdown_content):
  # Make a RegEx pattern
  link_pattern = re.compile(r'\[(.*?)\]\((.*?)\)')

  # Find all matches in markdown_content
  matches = link_pattern.findall(markdown_content)

  for _match in matches:
    full_match = f'[{_match[0]}]({_match[1]})'

    # Replace the full link with just the text
    markdown_content = markdown_content.replace(full_match, _match[0])

  return markdown_content
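
To illustrate, here's what this function does to a small, made-up snippet of markdown:

sample = "Learn more about [vitamin D](https://www.example.com/vitamin-d) and calcium."

# The markdown link syntax is removed, leaving only the anchor text
print(strip_links_from_markdown(sample))
# Learn more about vitamin D and calcium.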

Lastly, we'll split each article by its markdown headings into multiple documents:

def split_markdown_by_headings(markdown_content):
  # Define the regular expression pattern to match H1 and H2 headings
  pattern = r'(#{1,2}.*)'

  # Split the markdown text based on the headings (the capturing group keeps the headings)
  sections = re.split(pattern, markdown_content)

  # Combine each heading with its corresponding text
  combined_sections = []
  for i in range(1, len(sections), 2):
    # Get the heading from sections[i]
    heading = sections[i].strip()
    # Get the text from sections[i + 1] if it exists, otherwise use an empty string
    text = sections[i + 1].strip() if i + 1 < len(sections) else ''
    # Combine the heading and text with a newline character
    combined_section = f"{heading}\n{text}"
    combined_sections.append(combined_section)

  # If no headings were found, keep the entire article as a single document
  if len(combined_sections) == 0:
    combined_sections = [markdown_content]

  return combined_sections
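
To see how the splitting behaves, here's a small contrived example. Note that with this implementation, any text appearing before the first heading is dropped, and an article with no headings at all is kept as a single document:

sample_markdown = "## Benefits\nFiber supports digestion.\n## Risks\nExcess sugar adds empty calories."

for section in split_markdown_by_headings(sample_markdown):
  print(repr(section))
# '## Benefits\nFiber supports digestion.'
# '## Risks\nExcess sugar adds empty calories.'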

We now have all of the logic to both scrape and process articles from the Reference. We can put it together in a function, create_documents_from_webpage:

def create_documents_from_webpage(url):
  # Wrap function in a try-except block to handle any potential exceptions
  try:
    soup = scrape_webpage(url)

    # Extract the title from the article
    title = soup.find('h1').get_text().strip()

    # Convert the webpage HTML to markdown
    markdown_content = webpage_to_markdown(soup)

    # Strip links from the markdown content
    markdown_content = strip_links_from_markdown(markdown_content)

    # Split the article up by headings
    contents_by_heading = split_markdown_by_headings(markdown_content)
    docs = []
    for content in contents_by_heading:
      docs.append(Document(title=title, url=url, content=content))
    return docs
  except Exception:
    # If anything goes wrong while scraping or parsing, skip this article
    return []

With this function, we can provide any URL from the Reference, and it will generate a list of documents with content from each of the subheadings.
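
Before running the full pipeline, you can try this out on a single article. In the snippet below, example_url is just the first link returned by the crawler; any article URL from the Reference would work:

# Grab one article link from the Reference to test with
example_url = get_all_links_from_reference()[0]

# Scrape and process that article into documents
example_docs = create_documents_from_webpage(example_url)

# Preview each document's title and the start of its content
for doc in example_docs:
  print(doc.title, '|', doc.content[:80])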

Generate Databank of Sources

Now that we have all of the logic required to scrape and process articles, we can move on to generating the databank of nutritional sources that will be utilized by the chatbot. To do so, we must iterate through each of the articles on every page of the Reference and save all of the documents created.

We'll put together all of our previous work in the generate_databank function:

def generate_databank():
  docs = []

  # Get all of the links in the Reference
  all_links = get_all_links_from_reference()

  # Iterate through every link in the Reference
  for link in all_links:
    # Add all of the documents created from the article
    docs_from_article = create_documents_from_webpage(link)
    docs.extend(docs_from_article)

  return docs

Feel free to add additional debug output, such as which link is currently being processed, how many links are left, how many documents were added, and so on; see the sketch below for one way to do this.
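
For example, a version of generate_databank with simple progress output might look like this; the function name and messages here are just illustrative:

def generate_databank_with_progress():
  docs = []
  all_links = get_all_links_from_reference()

  for i, link in enumerate(all_links, start=1):
    # Report which link is being processed and how many remain
    print(f"[{i}/{len(all_links)}] Processing {link}")

    docs_from_article = create_documents_from_webpage(link)
    docs.extend(docs_from_article)

    # Report how many documents this article contributed
    print(f"  Added {len(docs_from_article)} documents (total: {len(docs)})")

  return docs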

After running generate_databank, we now have a databank of nutrition sources!

Keep in mind that this may be a memory-intensive process, and depending on your machine, you may want to consider running this program in a hosted environment like Google Colab. I was able to run this program on my 2023 M2 Pro MacBook Pro, but results may vary by your specific setup.
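
If you want to reuse the databank later without re-scraping everything (for instance, in the next part of this series), one option is to serialize the documents to a JSON file. The filename and structure below are just one possible approach based on the fields of our Document class:

import json

# Run the full pipeline to build the databank
databank = generate_databank()

# Serialize each Document's fields and write them to disk for later reuse
with open('nutrition_databank.json', 'w') as f:
  json.dump([{'title': doc.title, 'url': doc.url, 'content': doc.content} for doc in databank], f)

print(f"Saved {len(databank)} documents to nutrition_databank.json")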

Conclusion

By following the steps in this article, we used real-world web-scraping techniques to gather and process information from the internet and create a databank of sources; this databank will be used to provide reliable information to our nutrition chatbot.

We used the BeautifulSoup and requests libraries to crawl through the WebMD Health & Diet Medical Reference, and wrote functions to convert each webpage into markdown text, strip links from the markdown content, and split the articles by markdown headings. Finally, we generated the databank of nutritional sources by iterating through all of the articles across the pages of the Reference and saving all of the documents created.

In the next article in this series, we'll explore using sentence-transformers and FAISS to create a vector store index to connect the databank we created to an LLM like ChatGPT.
