Crawling depth¶
The depth parameter in the corpus() function indicates the crawling depth, or how Alan AI should index web pages and other resources. The effect of the depth parameter varies depending on the type of corpus:
Static corpuses¶
In static corpuses, the depth parameter indicates how ‘deep’ down the resource hierarchy the crawler must go to retrieve the content for the AI agent.
Here is how the crawling process works for static corpuses:
1. The Q&A service retrieves content from one or more start URLs specified in the corpus() function.
2. The service parses these start pages to find and collect all unique links to sub-pages within the same domain.
3. The service follows these links to access and retrieve the content of each linked page.
4. This process continues recursively: the service follows links from each newly discovered page to explore further into deeper levels of the website or resource.
For example, if depth is set to 1, the crawler will navigate to the start page and retrieve its content, collect all unique links to sub-pages in the same domain, and retrieve the content from the linked sub-pages.
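The crawling process above can be sketched as a depth-limited, breadth-first traversal. The link graph, page paths, and function name below are illustrative assumptions, not Alan AI internals:

```javascript
// Sketch: a depth-limited crawl over a hypothetical link graph, mirroring
// how a static corpus is indexed. Each loop iteration descends one level.
function crawl(startUrls, links, depth) {
  const visited = new Set(startUrls);
  let frontier = [...startUrls];
  for (let level = 0; level < depth; level++) {
    const next = [];
    for (const url of frontier) {
      for (const linked of links[url] || []) {
        if (!visited.has(linked)) { // only unique links are followed
          visited.add(linked);
          next.push(linked);
        }
      }
    }
    frontier = next; // descend one level per iteration
  }
  return visited;
}

// Hypothetical site: /start links to /a and /b; /a links to /a/sub.
const links = {
  '/start': ['/a', '/b'],
  '/a': ['/a/sub'],
};

// With depth 1, /a/sub is never reached; with depth 2, it is.
console.log([...crawl(['/start'], links, 1)]); // ['/start', '/a', '/b']
```

With depth set to 1, only the start page and its direct sub-pages are indexed; raising depth to 2 also pulls in the nested /a/sub page.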
Note
Choose the crawling depth wisely. A deeper crawl can lead to more accurate answers, but it may also impact the performance of the Q&A service.
corpus({
    title: `HTTP corpus`,
    urls: [
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`,
    ],
    depth: 1,
    maxPages: 5,
});
Corpuses with Puppeteer¶
In corpuses with the Puppeteer crawler, the depth parameter indicates the total number of transitions to linked sub-pages. Puppeteer indexes content horizontally, across the same level of a website’s hierarchy, rather than digging deeper into nested pages.
Here is how the Puppeteer crawler works:
1. The crawler retrieves content from one or more start URLs specified in the corpus() function.
2. The crawler parses these start pages to find and collect all unique links to sub-pages, which may reside within the same domain or outside it.
3. The crawler follows these links to access and retrieve the content of each linked page.
4. Instead of following links to deeper, nested pages, the crawler proceeds to pages at the same hierarchical level.
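The difference from the static crawl is that depth here caps the total number of transitions, not the nesting level. A minimal sketch of this behavior, with a hypothetical link graph and function name:

```javascript
// Sketch: with the Puppeteer crawler, depth limits the total number of
// transitions to linked pages. The crawler below works through a queue of
// discovered links and stops once `depth` transitions have been made.
function crawlByTransitions(startUrls, links, depth) {
  const visited = new Set(startUrls);
  const queue = startUrls.flatMap((url) => links[url] || []);
  let transitions = 0;
  while (queue.length > 0 && transitions < depth) {
    const url = queue.shift();
    if (visited.has(url)) continue; // only unique links count
    visited.add(url);
    transitions += 1; // one transition per followed link
    queue.push(...(links[url] || [])); // newly found links join the queue
  }
  return visited;
}

// Hypothetical site with three same-level sub-pages.
const siteLinks = { '/start': ['/a', '/b', '/c'] };

// depth 2 allows only two transitions, so /c is never reached.
console.log([...crawlByTransitions(['/start'], siteLinks, 2)]); // ['/start', '/a', '/b']
```

Because every followed link spends one unit of depth, three sibling sub-pages require a depth of at least 3 to be fully indexed, even though they all sit one level below the start page.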
Note
Mind the following:
- If depth is set to a value lower than maxPages, Alan AI will set it to match the maxPages value.
- If the crawler function uses additional filters to skip certain pages, you may reach the depth limit before gathering the expected maximum number of pages. This can happen because Puppeteer follows any links, including those that lead outside the domain.
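The first rule amounts to a simple normalization of the two settings. A minimal sketch, assuming the adjustment is a plain clamp (the function name is hypothetical, not part of the Alan AI API):

```javascript
// Sketch: if depth is lower than maxPages, raise it to match.
// This mirrors the adjustment described in the note above.
function normalizeDepth(depth, maxPages) {
  return depth < maxPages ? maxPages : depth;
}

console.log(normalizeDepth(5, 10)); // 10: depth raised to match maxPages
console.log(normalizeDepth(10, 5)); // 10: depth already sufficient
```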
corpus({
    title: `Knowledge Base`,
    urls: [`urls to crawl`],
    crawler: {
        puppeteer: crawlPages(),
        browserLog: 'on',
        args: {arg1: 'value1', arg2: 'value2'},
    },
    depth: 10,
    maxPages: 10,
});
async function* crawlPages({url, page, document, args}) {
    // crawlPages function code ...
}
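The crawler body is elided above. As a rough illustration of what such a generator might do, here is one possible shape: it reads the parsed document and yields the page text for indexing. The yielded {url, text} shape is an assumption for illustration, not Alan AI's documented contract:

```javascript
// Hypothetical sketch of a crawler generator. The yielded object shape and
// the use of the parsed document are assumptions, not the documented API.
async function* crawlPages({url, page, document, args}) {
  // Index the page's visible text as a single chunk.
  const text = document.body ? document.body.textContent.trim() : '';
  yield {url, text};
}
```

Because the generator only depends on the arguments it receives, it can be exercised with a mocked document object outside the crawler runtime.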