Static corpus

The Q&A service lets you create a Q&A AI assistant that uses static data sources: company website pages, product manuals, guidelines, FAQ pages, articles and so on.

You can define the following types of static corpuses in the dialog script:

  • Web corpus: retrieve information from website pages and PDF files available online

  • Text corpus: use plain text as an information source

Web corpus

To define a web corpus for your Q&A AI assistant, use the corpus() function.

Note

The corpus() syntax differs between Alan AI SLU versions. Both variants are shown below: one takes a urls array, the other takes a single url string. Use the one that matches your SLU version.

Dialog script
corpus({
    title: `HTTP corpus`,
    urls: [
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`],
    exclude: [`https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Evolution_of_HTTP`],
    depth: 1,
    maxPages: 5,
    priority: 0,
});

Corpus parameters

| Name | Type | Is Required | Description |
|---|---|---|---|
| title | string | False | Corpus title. |
| urls | string array | True | List of URLs from which information must be retrieved. You can define URLs of website folders and pages. |
| exclude | string array | False | List of URLs to be excluded from indexing. You can define URLs of website folders and pages. |
| depth | integer | False | Crawl depth for web and PDF resources. The minimum value is 0 (crawling only the page content without linked resources). For details, see Data crawling. |
| maxPages | integer | True | Maximum number of pages and files to index. If not set, only 1 page with the defined URL will be indexed. |
| priority | integer | False | Priority level assigned to the corpus. Corpuses with higher priority are considered more relevant when user requests are processed. |
| query | function | False | Function used to process user queries. For details, see Dynamic corpus. |
| transforms | function | False | Transforms function used to format the corpus output. For details, see Static corpus transforms. |
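The parameter rules above can be checked before a configuration is passed to corpus(). The following is a minimal sketch only: validateCorpusConfig is a hypothetical helper, not part of the Alan AI API, and the requirement that urls is non-empty is an assumption. The types and the required flags are taken from the parameter table.

```javascript
// Hypothetical pre-flight check for a web corpus configuration.
// Not part of the Alan AI API; it only mirrors the parameter table.
function validateCorpusConfig(config) {
  const errors = [];
  const isStringArray = (v) =>
    Array.isArray(v) && v.every((s) => typeof s === "string");

  // urls: required string array (non-empty is an assumption of this sketch).
  if (!isStringArray(config.urls) || config.urls.length === 0) {
    errors.push("urls: required, non-empty string array");
  }
  // maxPages: the table marks it as required.
  if (!Number.isInteger(config.maxPages)) {
    errors.push("maxPages: required integer");
  }
  // Optional parameters are checked only when present.
  if (config.title !== undefined && typeof config.title !== "string") {
    errors.push("title: must be a string");
  }
  if (config.exclude !== undefined && !isStringArray(config.exclude)) {
    errors.push("exclude: must be a string array");
  }
  if (config.depth !== undefined &&
      !(Number.isInteger(config.depth) && config.depth >= 0)) {
    errors.push("depth: must be an integer >= 0");
  }
  if (config.priority !== undefined && !Number.isInteger(config.priority)) {
    errors.push("priority: must be an integer");
  }
  for (const name of ["query", "transforms"]) {
    if (config[name] !== undefined && typeof config[name] !== "function") {
      errors.push(`${name}: must be a function`);
    }
  }
  return errors; // an empty array means no problems were found
}
```

A configuration that passes returns an empty array. Otherwise, each entry names the parameter that needs attention.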

Note

Mind the following:

  • Make sure the websites and pages you define in the corpus() function are not protected from crawling. The Q&A service cannot retrieve content from such resources.

  • The indexing process may take some time. To check the progress and results, use the Alan AI Studio logs.

  • The maximum number of indexed pages depends on your pricing plan. For details, contact the Alan AI Sales Team.

Data crawling

The crawl depth defines how far down the resource hierarchy the crawler goes to retrieve content for the Q&A assistant. For example, if you set the crawl depth to 1, the crawler opens the page at the start URL, extracts all unique links to other pages in the same domain from this page, and retrieves information from the start page and the linked pages.

Choose the crawl depth carefully. A deeper crawl indexes more content, which makes it more likely that users receive accurate answers to their questions. However, a deeper crawl depth may have an impact on the Q&A service’s performance.
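To make the interplay of depth and maxPages concrete, here is a minimal sketch of a depth-limited, breadth-first crawl over an in-memory link graph. It illustrates the behavior described above and is not the Q&A service's actual crawler. The site object, the crawl() function and the example.com URLs are hypothetical, and treating exclude entries as URL prefixes is an assumption.

```javascript
// Hypothetical site: each URL maps to the links found on that page.
const site = {
  "https://example.com/docs/": [
    "https://example.com/docs/a",
    "https://example.com/docs/b",
    "https://other.org/x", // different domain, never followed
  ],
  "https://example.com/docs/a": ["https://example.com/docs/a1"],
  "https://example.com/docs/b": [],
  "https://example.com/docs/a1": [],
};

// Breadth-first crawl limited by depth and maxPages.
// depth 0 indexes only the start page; maxPages defaults to 1.
function crawl(startUrl, { depth = 0, maxPages = 1, exclude = [] } = {},
               getLinks = (url) => site[url] || []) {
  const host = new URL(startUrl).hostname;
  const seen = new Set([startUrl]);
  const indexed = [];
  let frontier = [startUrl];

  for (let level = 0; level <= depth && frontier.length > 0; level++) {
    const next = [];
    for (const url of frontier) {
      if (indexed.length >= maxPages) return indexed; // page budget used up
      indexed.push(url);
      for (const link of getLinks(url)) {
        const sameDomain = new URL(link).hostname === host;
        // Assumption: an exclude entry matches any URL that starts with it.
        const excluded = exclude.some((prefix) => link.startsWith(prefix));
        if (sameDomain && !excluded && !seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next; // pages one level deeper
  }
  return indexed;
}
```

In this sketch, a depth of 0 returns only the start page, a depth of 1 adds the pages it links to within the same domain, and maxPages stops the crawl as soon as the page budget is reached, whatever the depth.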

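For the SLU version in which corpus() takes a single url string instead of a urls array, define the web corpus as follows:

Dialog script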
corpus({
    url: `https://developer.mozilla.org/en-US/docs/Web/HTTP/`,
    depth: 1,
    maxPages: 10,
});

Corpus parameters

| Name | Type | Is Required | Description |
|---|---|---|---|
| url | string | True | Resource URL from which information must be retrieved. You can define the URL of a website folder or page. |
| depth | integer | False | Crawl depth for web and PDF resources. The minimum value is 0 (crawling only the page content without linked resources). For details, see Data crawling. |
| maxPages | integer | True | Maximum number of pages and files to index. If not set, only 1 page with the defined URL will be indexed. |

Text corpus

To define a text corpus for the Q&A AI assistant, add plain text strings to the corpus() function:

Dialog script
corpus(`
    Hi, I am your HTTP AI assistant.
    I'm here to offer insights into the HTTP protocol.
    I can answer any questions regarding HTTP requests and responses, status codes, sessions and more.
    Need assistance unraveling the complexities of HTTP protocol? I'm at your service with clear explanations.
`)