Detailed walkthrough

Go through the code step by step to really understand what's going on.

First, load the HTML and customize the HTML-to-text conversion with bs4 (BeautifulSoup).

Here is the code:


import bs4
from langchain_community.document_loaders import WebBaseLoader
 
# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()
 
len(docs[0].page_content)
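
To see what the strainer does without fetching the page, here is a minimal sketch that applies the same SoupStrainer to a small inline HTML string. The HTML snippet and its non-post class names (site-nav, site-footer) are made up for illustration. Only elements whose class appears in the strainer's tuple should survive parsing.

```python
import bs4

# Made-up HTML for illustration: two post elements plus navigation and footer noise.
html = """
<html><body>
  <nav class="site-nav">Home | About</nav>
  <h1 class="post-title">Agents</h1>
  <div class="post-content"><p>Planning and memory.</p></div>
  <footer class="site-footer">Copyright</footer>
</body></html>
"""

# Same strainer as in the walkthrough: keep only elements with these classes.
strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))

# parse_only tells BeautifulSoup to discard everything the strainer rejects.
soup = bs4.BeautifulSoup(html, "html.parser", parse_only=strainer)
text = soup.get_text()

print(text)
```

WebBaseLoader forwards bs_kwargs to BeautifulSoup, so passing {"parse_only": bs4_strainer} there has the same filtering effect on the downloaded page.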

DocumentLoader

This is an object that loads data from a source and returns it as a list of Documents.

Docs: Detailed documentation on how to use DocumentLoaders.
Integrations: 160+ integrations to choose from.
Interface: API reference for the base interface.
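
To make the loader idea concrete, here is a minimal standard-library sketch of the pattern: an object that reads from a source and returns a list of documents, each holding text plus metadata. SimpleDocument and InlineTextLoader are hypothetical stand-ins written for this example, not LangChain classes. They only mirror the page_content and metadata fields and the load / lazy_load method names.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class InlineTextLoader:
    """Hypothetical loader that turns in-memory strings into documents."""

    def __init__(self, texts: dict[str, str]):
        # Maps a source name to its raw text.
        self.texts = texts

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield one document per source, recording where it came from.
        for source, text in self.texts.items():
            yield SimpleDocument(page_content=text, metadata={"source": source})

    def load(self) -> list[SimpleDocument]:
        # Eagerly collect every document into a list.
        return list(self.lazy_load())


docs = InlineTextLoader({"a.txt": "hello", "b.txt": "world!"}).load()
print(len(docs), len(docs[0].page_content))
```

In the walkthrough, WebBaseLoader plays the same role, with a web page as the source: docs is a list, and docs[0].page_content is the extracted text.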