GitbookLoader#
- class langchain_community.document_loaders.gitbook.GitbookLoader(
- web_page: str,
- load_all_paths: bool = False,
- base_url: str | None = None,
- content_selector: str = 'main',
- continue_on_failure: bool = False,
- show_progress: bool = True,
- *,
- sitemap_url: str | None = None,
- allowed_domains: Set[str] | None = None,
- )
Load GitBook data.
Loads either a single page or all (relative) paths listed in the sitemap, handling nested sitemap indexes.
When load_all_paths=True, the loader parses XML sitemaps and requires the lxml package to be installed (pip install lxml).
Initialize with web page and whether to load all paths.
- Parameters:
web_page (str) – The web page to load or the starting point from where relative paths are discovered.
load_all_paths (bool) – If set to True, all relative paths in the navbar are loaded instead of only web_page. Requires lxml package.
base_url (str | None) – If load_all_paths is True, the relative paths are appended to this base URL. Defaults to web_page.
content_selector (str) – The CSS selector for the content to load. Defaults to "main".
continue_on_failure (bool) – Whether to continue loading the sitemap if an error occurs while loading a URL, emitting a warning instead of raising an exception. Setting this to True makes the loader more robust but may also result in missing data. Default: False.
show_progress (bool) – Whether to show a progress bar while loading. Default: True.
sitemap_url (str | None) – Custom sitemap URL to use when load_all_paths is True. Defaults to "{base_url}/sitemap.xml".
allowed_domains (Set[str] | None) – Optional set of allowed domains to fetch from. If None (default), the loader restricts crawling to the domain of the web_page URL to prevent potential SSRF vulnerabilities. Provide an explicit set (e.g., {"example.com", "docs.example.com"}) to allow crawling across multiple domains. Use with caution in server environments where users might control the input URLs.
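As a quick illustration of the parameters above, here is a minimal sketch that instantiates the loader once for a single page and once for a full sitemap crawl. The docs.gitbook.com URL is only a placeholder for any GitBook-hosted site.

```python
from langchain_community.document_loaders import GitbookLoader

# Load a single GitBook page (the URL is an example site, not required).
single_page_loader = GitbookLoader("https://docs.gitbook.com")
docs = single_page_loader.load()

# Load every page listed in the site's sitemap (requires `pip install lxml`).
all_pages_loader = GitbookLoader(
    "https://docs.gitbook.com",
    load_all_paths=True,
)
all_docs = all_pages_loader.load()
```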
Methods
__init__(web_page[, load_all_paths, ...]) – Initialize with web page and whether to load all paths.
alazy_load() – Asynchronously fetch text from GitBook page(s).
aload() – Load data into Document objects.
lazy_load() – Fetch text from one single GitBook page or recursively from sitemap.
load() – Load data into Document objects.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(
- web_page: str,
- load_all_paths: bool = False,
- base_url: str | None = None,
- content_selector: str = 'main',
- continue_on_failure: bool = False,
- show_progress: bool = True,
- *,
- sitemap_url: str | None = None,
- allowed_domains: Set[str] | None = None,
- )
Initialize with web page and whether to load all paths.
- Parameters:
web_page (str) – The web page to load or the starting point from where relative paths are discovered.
load_all_paths (bool) – If set to True, all relative paths in the navbar are loaded instead of only web_page. Requires lxml package.
base_url (str | None) – If load_all_paths is True, the relative paths are appended to this base URL. Defaults to web_page.
content_selector (str) – The CSS selector for the content to load. Defaults to "main".
continue_on_failure (bool) – Whether to continue loading the sitemap if an error occurs while loading a URL, emitting a warning instead of raising an exception. Setting this to True makes the loader more robust but may also result in missing data. Default: False.
show_progress (bool) – Whether to show a progress bar while loading. Default: True.
sitemap_url (str | None) – Custom sitemap URL to use when load_all_paths is True. Defaults to "{base_url}/sitemap.xml".
allowed_domains (Set[str] | None) – Optional set of allowed domains to fetch from. If None (default), the loader restricts crawling to the domain of the web_page URL to prevent potential SSRF vulnerabilities. Provide an explicit set (e.g., {"example.com", "docs.example.com"}) to allow crawling across multiple domains. Use with caution in server environments where users might control the input URLs.
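For the keyword-only options, a short sketch follows; the URLs, domain names, and sitemap path are illustrative placeholders, not real endpoints.

```python
from langchain_community.document_loaders import GitbookLoader

# Crawl a GitBook site whose sitemap lives at a non-default location,
# restricting fetches to an explicit set of domains (illustrative values).
loader = GitbookLoader(
    "https://docs.example.com",
    load_all_paths=True,
    continue_on_failure=True,  # warn and skip pages that fail to load
    sitemap_url="https://docs.example.com/sitemap-pages.xml",
    allowed_domains={"docs.example.com", "example.com"},
)
docs = loader.load()
```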
- async alazy_load() → AsyncIterator[Document] [source]#
Asynchronously fetch text from GitBook page(s).
- Return type:
AsyncIterator[Document]
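A minimal sketch of consuming alazy_load() from asyncio; the URL is again just a placeholder site.

```python
import asyncio

from langchain_community.document_loaders import GitbookLoader

async def collect_docs():
    loader = GitbookLoader("https://docs.gitbook.com", load_all_paths=True)
    # alazy_load() yields Document objects one at a time as pages are fetched.
    return [doc async for doc in loader.alazy_load()]

docs = asyncio.run(collect_docs())
```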
- lazy_load() → Iterator[Document] [source]#
Fetch text from one single GitBook page or recursively from sitemap.
- Return type:
Iterator[Document]
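Because lazy_load() returns an iterator, pages can be processed as they arrive instead of materializing the whole list first. A sketch with a placeholder URL:

```python
from langchain_community.document_loaders import GitbookLoader

loader = GitbookLoader("https://docs.gitbook.com", load_all_paths=True)

# Iterate page by page; each Document carries the page text plus metadata
# (for example the source URL).
for doc in loader.lazy_load():
    print(doc.metadata.get("source"), len(doc.page_content))
```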
- load_and_split(
- text_splitter: TextSplitter | None = None,
- ) → list[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered to be deprecated!
- Parameters:
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns:
List of Documents.
- Return type:
list[Document]
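A short sketch of pairing the loader with an explicit text splitter (RecursiveCharacterTextSplitter is also what is used when none is passed; the chunk sizes and URL below are arbitrary example values):

```python
from langchain_community.document_loaders import GitbookLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = GitbookLoader("https://docs.gitbook.com", load_all_paths=True)

# Split each loaded page into roughly 1000-character chunks with a small overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = loader.load_and_split(text_splitter=splitter)
```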
Examples using GitbookLoader