Create a static corpus

You can create a static corpus for your AI agent. To build the corpus, Alan AI automatically crawls static data sources such as web pages, PDF documents, and text, CSV and Markdown files, and creates a knowledge memory for the AI agent. The AI agent then uses this memory to answer user queries, providing clear, well-formatted responses that may include:

  • Formatted text

  • Lists

  • Images

  • Diagrams

  • Formulas

  • Code snippets

  • Links to the original source

Use case

You are developing an AI agent for a cloud service platform. Your goal is to enable the AI agent to assist users with questions about managing resources like VMs, buckets and services. To achieve this, you need to build a static corpus by crawling the platform documentation that covers these topics.

Prerequisites

To successfully follow this exercise, make sure you have signed up for Alan AI Studio and created a project for the AI agent. For details, see Sign up for Alan AI Studio.

Defining a static corpus

To define a static corpus:

  1. In Alan AI Studio, add the corpus() function to the dialog script:

    Dialog script
    corpus({
        title: `Cloud documentation`,
        urls: [
            `https://cloud.google.com/compute/docs/overview`,
            `https://cloud.google.com/compute/docs/images`,
            `https://cloud.google.com/compute/docs/disks`,
            `https://cloud.google.com/storage/docs/buckets`],
        depth: 1,
        maxPages: 30,
        priority: 1,
    });
    

    Here, the data corpus uses the following parameters:

    • title: corpus name

    • urls: URLs of the web pages you want to crawl

    • depth: crawling depth, which determines how many levels deep the crawler goes to retrieve content

    • maxPages: maximum number of pages the crawler retrieves

    • priority: corpus priority level

  2. Save the dialog script.
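
To illustrate how depth and maxPages can limit a crawl together, here is a minimal sketch of a breadth-first traversal over a made-up, in-memory link graph. It is not Alan AI's crawler: the simulateCrawl() function, the example.com URLs, and the assumption that depth 0 means "only the listed URLs" are all hypothetical and exist only for illustration. The actual crawler may count levels or order pages differently.

```javascript
// Hypothetical link graph: page URL -> URLs that page links to.
// The example.com pages are invented for this illustration.
const links = {
    'https://example.com/docs': [
        'https://example.com/docs/vms',
        'https://example.com/docs/buckets',
    ],
    'https://example.com/docs/vms': ['https://example.com/docs/vms/disks'],
    'https://example.com/docs/buckets': ['https://example.com/docs/buckets/naming'],
    'https://example.com/docs/vms/disks': [],
    'https://example.com/docs/buckets/naming': [],
};

// Breadth-first traversal (an assumed model, not the real crawler).
// Links are followed only while the current level is below `depth`,
// and collection stops as soon as `maxPages` pages have been gathered.
function simulateCrawl(graph, urls, depth, maxPages) {
    const visited = new Set();
    const queue = urls.map((url) => ({ url, level: 0 }));
    const pages = [];

    while (queue.length > 0 && pages.length < maxPages) {
        const { url, level } = queue.shift();
        if (visited.has(url)) continue; // skip pages already collected
        visited.add(url);
        pages.push(url);

        if (level < depth) {
            for (const next of graph[url] || []) {
                queue.push({ url: next, level: level + 1 });
            }
        }
    }
    return pages;
}

// Same limits as in the corpus() call above: depth 1, at most 30 pages.
console.log(simulateCrawl(links, ['https://example.com/docs'], 1, 30));
```

In this model, depth 1 collects the start page and the two pages it links to directly, while the deeper pages are skipped. Raising depth widens the crawl, and maxPages caps the total regardless of depth.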

Validation

To make sure the pages have been crawled and the data corpus is successfully created:

  1. At the bottom of Alan AI Studio, open the logs and make sure the corpus task is marked as ready.

  2. At the top of Alan AI Studio, click Crawler Tasks and make sure the crawler task status is complete.

    [Image: crawler-tasks.png]

  3. In the code editor, to the left of the corpus() function, click the magnifying glass icon. Use the Corpus Explorer to examine the content that has been added to the knowledge memory.

    [Image: corpus-explorer.png]

  4. In the Debugging Chat on the right, ask questions about infrastructure objects, for example:

    • What is a regional persistent disk, and when should it be used?

    • How do I list all FreeBSD images?

    • What are the limitations in naming buckets?

    [Image: corpus-questions.png]