Crawling depth

The depth parameter in the corpus() function sets the crawling depth: how far Alan AI follows links when indexing web pages and other resources. The effect of the depth parameter depends on the type of corpus:

Static corpuses

In static corpuses, the depth parameter indicates how ‘deep’ down the resource hierarchy the crawler must go to retrieve the content for the AI agent.

Here is how the crawling process works for static corpuses:

  1. The Q&A service retrieves content from one or more start URLs specified in the corpus() function.

  2. The service parses these start pages to find and collect all unique links to sub-pages within the same domain.

  3. The service follows these links to access and retrieve the content of each linked page.

  4. This process continues recursively, as the service follows links from each newly discovered page to explore further into deeper levels of the website or resource.

For example, if depth is set to 1, the crawler navigates to each start page and retrieves its content, collects all unique links to sub-pages in the same domain, and then retrieves the content of those linked sub-pages.
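
To make these semantics concrete, here is a minimal sketch of a depth-limited crawl in plain JavaScript. It only models the behavior described above and is not the actual Q&A service implementation; fetchPage() and extractSameDomainLinks() are hypothetical helpers.

// Illustrative model of depth-limited crawling for a static corpus.
// fetchPage() and extractSameDomainLinks() are hypothetical helpers,
// not part of the Alan AI SDK.
async function crawlStatic(startUrls, depth, maxPages) {
    const visited = new Set();
    let frontier = [...new Set(startUrls)];

    // Level 0 is the start pages themselves; each loop iteration
    // follows links one level deeper into the resource hierarchy.
    for (let level = 0; level <= depth && frontier.length > 0; level++) {
        const nextFrontier = [];
        for (const url of frontier) {
            if (visited.has(url) || visited.size >= maxPages) continue;
            visited.add(url);                                    // index this page
            const html = await fetchPage(url);                   // retrieve content
            nextFrontier.push(...extractSameDomainLinks(html));  // collect unique in-domain links
        }
        frontier = nextFrontier;
    }
    return visited; // the set of indexed pages
}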

Note

Choose the crawling depth wisely. A deeper crawl can lead to more accurate answers, but it may also impact the performance of the Q&A service.

Dialog script
corpus({
    title: `HTTP corpus`,
    urls: [
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`,
    ],
    depth: 1,
    maxPages: 5,
});
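
In this example, the crawler retrieves the three start pages and follows their in-domain links one level down (depth: 1), while maxPages caps the total number of crawled pages at 5.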

Corpuses with Puppeteer

In corpuses with the Puppeteer crawler, the depth parameter indicates the total number of transitions to linked sub-pages. Puppeteer indexes content horizontally, across the same level of a website’s hierarchy, rather than digging deeper into nested pages.

Here is how the Puppeteer crawler works:

  1. The crawler retrieves content from one or more start URLs specified in the corpus() function.

  2. The crawler parses these start pages to find and collect all unique links, which may lead to pages within the same domain or outside it.

  3. The crawler follows these links to access and retrieve the content of each linked page.

  4. Instead of following links to deeper, nested pages, the crawler proceeds to pages at the same hierarchical level.
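
To contrast this with the static model above, the sketch below treats depth as a total budget of transitions that a breadth-first crawl spends as it enqueues links. Again, this is an illustrative model rather than the actual crawler code; fetchPage() and extractLinks() are hypothetical helpers, and extractLinks() may return links outside the start domain.

// Illustrative model: with the Puppeteer crawler, `depth` acts as a total
// budget of transitions to linked pages, spent breadth-first across one
// level before descending into nested pages.
async function crawlHorizontal(startUrls, depth) {
    const visited = new Set();
    const queue = [...new Set(startUrls)];
    let transitions = 0;

    while (queue.length > 0) {
        const url = queue.shift();           // breadth-first: same-level pages first
        if (visited.has(url)) continue;
        visited.add(url);
        const html = await fetchPage(url);
        for (const link of extractLinks(html)) {
            if (transitions >= depth) return visited; // transition budget spent
            if (!visited.has(link)) {
                queue.push(link);
                transitions++; // each transition to a linked page counts once
            }
        }
    }
    return visited;
}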

Note

Mind the following:

  • If depth is set to a value lower than maxPages, Alan AI will set it to match the maxPages value.

  • If the crawler function uses additional filters to skip certain pages, you may reach the depth limit before gathering the expected maximum number of pages. This can happen because Puppeteer follows any link, including links that lead outside the domain, and each transition counts against the depth limit even if the page is then filtered out.
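
For example, if depth is set to 5 and maxPages to 10, Alan AI raises depth to 10 so the crawler can make enough transitions to reach the page limit; even then, if filters skip many pages, the crawl may finish with fewer than 10 pages indexed.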

Dialog script
corpus({
    title: `Knowledge Base`,
    urls: [`urls to crawl`],
    crawler: {
        puppeteer: crawlPages,
        browserLog: 'on',
        args: {arg1: 'value1', arg2: 'value2'},
    },
    depth: 10,
    maxPages: 10,
});

async function* crawlPages({url, page, document, args}) {

    // crawlPages function code ...

}
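
For illustration, one possible body for crawlPages is sketched below. It uses the standard Puppeteer page.$$eval() call to collect links on the current page; the idea that the generator yields URLs for the crawler to visit next is an assumption made for this sketch, so verify it against the crawler API before relying on it.

// A loosely sketched custom crawler function. The signature matches the
// example above; yielding URLs for the crawler to visit next is an
// assumption for this sketch, not a documented guarantee.
async function* crawlPages({url, page, document, args}) {
    // `page` is a Puppeteer Page; collect the href of every anchor on it.
    const links = await page.$$eval('a[href]', anchors => anchors.map(a => a.href));

    for (const link of links) {
        // Example filter: keep only links within the start page's domain.
        if (new URL(link).hostname === new URL(url).hostname) {
            yield link;
        }
    }
}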