Static corpus¶
The Q&A service lets you create a Q&A AI agent that uses static data sources: company website pages, product manuals, guidelines, FAQ pages, articles and so on.
You can define the following types of static corpuses in the dialog script:
- Web corpus: retrieves information from website pages and PDF files available online
- Text corpus: uses plain text as an information source
Web corpus¶
To define a web corpus for your Q&A AI agent, use the `corpus()` function:
```javascript
corpus({
    title: `HTTP corpus`,
    urls: [
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`,
    ],
    exclude: [`https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Evolution_of_HTTP`],
    depth: 1,
    maxPages: 5,
    priority: 0,
});
```
![../../../_images/corpus-testing.png](../../../_images/corpus-testing.png)
Corpus parameters¶
| Name | Type | Is Required | Description |
|---|---|---|---|
| `title` | string | False | Corpus title. |
| `urls` | string array | True | List of URLs from which information must be retrieved. You can define URLs of website folders and pages. |
| `auth` | JSON object | False | Credentials to access resources that require basic authentication. |
| `include` | string array | False | Resources that must be indexed. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes. |
| `exclude` | string array | False | Resources to be excluded from indexing. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes. |
| `depth` | integer | False | Crawl depth for web and PDF resources. The minimum value is 0 (crawl only the page content, without linked resources). For details, see Data crawling. |
| `maxPages` | integer | False | Maximum number of pages and files to index. If not set, only the page at the defined URL is indexed. |
| `query` | function | False | Transforms function used to process user queries. For details, see Dynamic corpus. |
| `transforms` | function | False | Transforms function used to format the corpus output. For details, see Static corpus transforms. |
| `priority` | integer | False | Priority level assigned to the corpus. Corpuses with higher priority are considered more relevant when user requests are processed. |
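The basic-authentication and inclusion options described above can be sketched as follows. This is an illustrative example only: the `https://docs.example.com` URLs are placeholders, and the shape of the `auth` object (`username`/`password` fields) is an assumption not confirmed by this page, so check the Alan AI reference before relying on it.

```javascript
// Sketch: a web corpus that crawls pages behind basic authentication
// and restricts indexing to a section of the site with a RegEx rule.
// The auth object's field names are assumed for illustration.
corpus({
    title: `Internal docs corpus`,
    urls: [`https://docs.example.com/guides/`],
    auth: {
        username: `crawler`,     // assumed field name
        password: `s3cret`,      // assumed field name
    },
    include: [`https://docs.example.com/guides/.*`],
    depth: 2,
    maxPages: 50,
});
```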
Note

Mind the following:

- Make sure the websites and pages you define in the `corpus()` function are not protected from crawling. The Q&A service cannot retrieve content from such resources.
- The indexing process may take some time. To check the progress and results, use the Alan AI Studio logs.
- The maximum number of indexed pages depends on your pricing plan. For details, contact the Alan AI Sales Team.
Data crawling¶
The crawl depth defines how 'far' down the resource hierarchy the crawler goes to retrieve content for the Q&A agent. For example, if you set the crawl depth to 1, the crawler accesses the page at the start URL, extracts all unique links to other pages in the same domain from that page, and retrieves information from both the start page and the linked pages.

Choose the crawl depth wisely. The deeper the level, the more likely users are to receive accurate answers to their questions. However, a greater crawl depth may affect the Q&A service's performance.
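The contrast between depth levels can be sketched with two corpus definitions. The URL is reused from the example above; the titles and `maxPages` values are arbitrary illustrations:

```javascript
// depth: 0 — index only the page at the start URL, ignoring its links.
corpus({
    title: `Single page`,
    urls: [`https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`],
    depth: 0,
    maxPages: 1,
});

// depth: 1 — index the start page plus the same-domain pages it links to.
corpus({
    title: `Start page and linked pages`,
    urls: [`https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`],
    depth: 1,
    maxPages: 20,
});
```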
Text corpus¶
To define a text corpus for the Q&A AI agent, add plain text strings to the `corpus()` function:

```javascript
corpus(`
    Hi, I am your HTTP AI agent.
    I'm here to offer insights into the HTTP protocol.
    I can answer any questions regarding HTTP requests and responses, status codes, sessions and more.
    Need assistance unraveling the complexities of HTTP protocol? I'm at your service with clear explanations.
`);
```
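Since this page describes both corpus types, a dialog script can plausibly define them side by side; the sketch below assumes that multiple `corpus()` calls can coexist in one script and uses the `priority` parameter from the table above to favor the web source:

```javascript
// Sketch: combining a web corpus and a text corpus in one script.
// Assumes multiple corpus() calls are allowed; values are illustrative.
corpus({
    title: `HTTP reference`,
    urls: [`https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`],
    depth: 1,
    maxPages: 5,
    priority: 1,
});

corpus(`
    Hi, I am your HTTP AI agent.
    I can answer questions about HTTP requests, responses and status codes.
`);
```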