Static corpus

The Q&A service lets you create a Q&A AI agent that uses static data sources: company website pages, product manuals, guidelines, FAQ pages, articles and so on.

You can define the following types of static corpuses in the dialog script:

  • Web corpus: retrieve information from website pages and PDF files available online

  • Text corpus: use plain text as an information source

Web corpus

To define a web corpus for your Q&A AI agent, use the corpus() function.

Dialog script
corpus({
    title: `HTTP corpus`,
    urls: [
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`],
    exclude: [`https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Evolution_of_HTTP`],
    depth: 1,
    maxPages: 5,
    priority: 0,
});

Corpus parameters

| Name | Type | Is Required | Description |
|---|---|---|---|
| title | string | False | Corpus title. |
| urls | string array | True | List of URLs from which information must be retrieved. You can define URLs of website folders and pages. |
| auth | JSON object | False | Credentials to access resources that require basic authentication: {username: 'johnsmith', password: 'password'}. For details, see Protected web resources. |
| include | string array | False | Resources that must be indexed. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes. |
| exclude | string array | False | Resources to be excluded from indexing. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes. |
| depth | integer | False | Crawl depth for web and PDF resources. The minimum value is 0 (crawling only the page content without linked resources). For details, see Data crawling. |
| maxPages | integer | True | Maximum number of pages and files to index. If not set, only 1 page with the defined URL will be indexed. |
| query | function | False | Transforms function used to process user queries. For details, see Dynamic corpus. |
| transforms | function | False | Transforms function used to format the corpus output. For details, see Static corpus transforms. |
| priority | integer | False | Priority level assigned to the corpus. Corpuses with higher priority are considered more relevant when user requests are processed. |
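
The parameter types above can be checked before a config object is passed to corpus(). The following is a minimal sketch in plain JavaScript; checkWebCorpusConfig and httpCorpusConfig are hypothetical names used for illustration and are not part of the Alan AI API. The checks only mirror the types and the minimum depth listed in the table, and the service itself may validate differently.

```javascript
// Hypothetical helper, not part of the Alan AI API. It checks a web corpus
// config object against the types listed in the parameter table.
function checkWebCorpusConfig(config) {
  const problems = [];
  const isStringArray = (value) =>
    Array.isArray(value) && value.every((item) => typeof item === 'string');

  // urls is listed as required and must contain at least one URL.
  if (!isStringArray(config.urls) || config.urls.length === 0) {
    problems.push('urls must be a non-empty string array');
  }
  if ('title' in config && typeof config.title !== 'string') {
    problems.push('title must be a string');
  }
  for (const key of ['include', 'exclude']) {
    if (key in config && !isStringArray(config[key])) {
      problems.push(`${key} must be a string array`);
    }
  }
  // The table gives 0 as the minimum crawl depth.
  if ('depth' in config && !(Number.isInteger(config.depth) && config.depth >= 0)) {
    problems.push('depth must be an integer of 0 or more');
  }
  for (const key of ['maxPages', 'priority']) {
    if (key in config && !Number.isInteger(config[key])) {
      problems.push(`${key} must be an integer`);
    }
  }
  for (const key of ['query', 'transforms']) {
    if (key in config && typeof config[key] !== 'function') {
      problems.push(`${key} must be a function`);
    }
  }
  return problems;
}

// Example config that reuses values from the web corpus sample.
const httpCorpusConfig = {
  title: 'HTTP corpus',
  urls: ['https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview'],
  depth: 1,
  maxPages: 5,
  priority: 0,
};

console.log(checkWebCorpusConfig(httpCorpusConfig));
console.log(checkWebCorpusConfig({ urls: [], depth: -1 }));
```

An empty result means the object matches the listed types; each string in a non-empty result names one parameter to fix.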

Note

Keep the following in mind:

  • Make sure the websites and pages you define in the corpus() function are not protected from crawling. The Q&A service cannot retrieve content from such resources.

  • The indexing process may take some time. To check the progress and results, use the Alan AI Studio logs.

  • The maximum number of indexed pages depends on your pricing plan. For details, contact the Alan AI Sales Team.

Data crawling

The crawl depth defines how far down the resource hierarchy the crawler goes to retrieve content for the Q&A agent. For example, if you set the crawl depth to 1, the crawler opens the page at the start URL, extracts all unique links from this page to other pages in the same domain, and retrieves information from both the start page and the linked pages.

Choose the crawl depth wisely. A greater depth gives the agent more content to draw on, so users are more likely to receive accurate answers to their questions. However, a greater depth may also affect the Q&A service’s performance.
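
To see how depth and maxPages interact, the crawl can be approximated as a breadth-first walk over links. The sketch below is an illustration only: siteLinks is a made-up in-memory link map that stands in for real pages, pagesToIndex is a hypothetical function, and the actual crawler may visit pages in a different order or apply additional rules.

```javascript
// Hypothetical in-memory site: each key is a page, each value lists the
// same-domain pages it links to. This stands in for real HTTP fetching.
const siteLinks = {
  '/Overview': ['/Messages', '/Session', '/Overview'],
  '/Messages': ['/Headers', '/Overview'],
  '/Session': ['/Cookies'],
  '/Headers': ['/Caching'],
  '/Cookies': [],
  '/Caching': [],
};

// Breadth-first walk that does not follow links past `depth`
// and stops collecting pages once `maxPages` is reached.
function pagesToIndex(links, startUrls, depth, maxPages) {
  const seen = new Set();
  const result = [];
  let frontier = startUrls;

  for (let level = 0; level <= depth && frontier.length > 0; level++) {
    const next = [];
    for (const url of frontier) {
      if (seen.has(url)) continue; // keep only unique pages
      if (result.length >= maxPages) return result;
      seen.add(url);
      result.push(url);
      // Links found on this page become candidates for the next level.
      for (const linked of links[url] || []) {
        if (!seen.has(linked)) next.push(linked);
      }
    }
    frontier = next;
  }
  return result;
}

console.log(pagesToIndex(siteLinks, ['/Overview'], 0, 5)); // start page only
console.log(pagesToIndex(siteLinks, ['/Overview'], 1, 5)); // start page plus its direct links
console.log(pagesToIndex(siteLinks, ['/Overview'], 2, 4)); // cut off by maxPages
```

In this model, depth limits how many link hops away from the start URL a page can be, while maxPages caps the total regardless of depth.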

Text corpus

To define a text corpus for the Q&A AI agent, pass plain text strings to the corpus() function:

Dialog script
corpus(`
    Hi, I am your HTTP AI agent.
    I'm here to offer insights into the HTTP protocol.
    I can answer any questions regarding HTTP requests and responses, status codes, sessions and more.
    Need assistance unraveling the complexities of the HTTP protocol? I'm at your service with clear explanations.
`)