Detailed walkthrough
Let's go through the code step by step to really understand what's going on.
First, load the HTML and customize the HTML-to-text parsing with bs4 (BeautifulSoup).
Here is the code:
import bs4
from langchain_community.document_loaders import WebBaseLoader
# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()
len(docs[0].page_content)

DocumentLoader
This is an object that loads data from a source and formats it into a list of Document objects.
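To make the idea concrete, here is a minimal, standard-library-only sketch of what a loader does: take raw HTML, keep only the parts you care about, and return a list of documents. The `Document`, `_ClassFilter`, and `InlineHTMLLoader` names are hypothetical stand-ins written for illustration. They are not LangChain's classes and do not reproduce how `WebBaseLoader` or `bs4.SoupStrainer` work internally. They only mimic the same shape: filter by CSS class, then wrap the text and its source in a document.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so nesting depth stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class _ClassFilter(HTMLParser):
    """Collects text only from elements whose class is in `keep` (assumes well-formed HTML)."""

    def __init__(self, keep):
        super().__init__()
        self.keep = set(keep)
        self.depth = 0  # greater than 0 while inside a kept element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start counting at a kept element, and keep counting for its children.
        if self.depth or self.keep.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


class InlineHTMLLoader:
    """Hypothetical loader: takes {source name: HTML string}, returns a list of Documents."""

    def __init__(self, pages, keep_classes):
        self.pages = pages
        self.keep_classes = keep_classes

    def load(self):
        docs = []
        for source, html in self.pages.items():
            parser = _ClassFilter(self.keep_classes)
            parser.feed(html)
            parser.close()
            docs.append(Document("\n".join(parser.chunks), {"source": source}))
        return docs


# Made-up HTML for illustration; only the title and content classes are kept.
html = (
    '<nav class="menu">Home</nav>'
    '<h1 class="post-title">Agents</h1>'
    '<div class="post-content"><p>Planning and memory.</p></div>'
    '<footer>Footer text</footer>'
)
docs = InlineHTMLLoader({"example.html": html}, ("post-title", "post-content")).load()
print(len(docs), docs[0].metadata)
print(docs[0].page_content)
```

The navigation and footer text are dropped because their elements carry none of the kept classes. This is the same effect the `parse_only` strainer has in the real example above.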
- Docs: Detailed documentation on how to use DocumentLoaders.
- Integrations: 160+ integrations to choose from.
- Interface: API reference for the base interface.