Corpus filtering

When developing an AI assistant that works with multiple data corpuses, it’s often necessary to filter the data based on criteria like user roles, product versions, or individual preferences. By segmenting the content this way, you ensure that users only access relevant information, enhancing the AI assistant’s functionality and efficiency.

To filter corpuses, you can use the visual state: create a filter for each data segment, such as a user role or product version, and define the corpus() function within that filter.
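In outline, the pattern looks like this; the state key, value, and URL below are illustrative placeholders:

Dialog script
const docsFilter = visual(state => state.someKey === "someValue");

docsFilter(() => {
    corpus({
        title: `Segment docs`,
        urls: [`https://example.com/docs`],
    });
});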

Example of use

Assume you want to allow users to access GitHub documentation based on their role: user or admin. To do this, you can implement the following script:

Dialog script
const user = visual(state => state.role === "user");
const admin = visual(state => state.role === "admin");

user(() => {
    corpus({
        title: `User docs`,
        urls: [`https://docs.github.com/en`],
        maxPages: 10,
        maxDepth: 3,
    });
});


admin(() => {
    corpus({
        title: `Admin docs`,
        urls: [`https://docs.github.com/en/enterprise-cloud@latest/admin`],
        maxPages: 10,
        maxDepth: 3,
    });
});

To test it, in Alan AI Studio, set the visual state to {"role": "admin"} and ask a question: How do I create an account? The AI assistant will provide an answer from the Admin docs corpus.

[Screenshot: corpus-filtering-static.png]
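In Alan AI Studio, the visual state is set manually for testing. In a production app, the client typically sends it with the SDK's setVisualState() method. Below is a minimal sketch for the Alan AI Web SDK; the project key and element ID are placeholders:

import alanBtn from "@alan-ai/alan-sdk-web";

const alanBtnInstance = alanBtn({
    key: "YOUR_PROJECT_KEY_FROM_ALAN_STUDIO",
    rootEl: document.getElementById("alan-btn"),
});

// Make the admin filter match in the dialog script
alanBtnInstance.setVisualState({role: "admin"});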

Assume you want to crawl the Python documentation for two versions: 3.11 and 3.12. You can implement the following script using the Puppeteer crawler and the visual state:

Dialog script
async function* crawlDocs({url, page, document, args}) {
    let {version} = args;

    if (url.includes(`docs.python.org`)) {
        // Waiting for the main content to render; a missing selector should not abort the crawl
        try {
            await page.waitForSelector(`div.documentwrapper`, {timeout: 15000});
        } catch(e) {
        }

        // Extracting the main content HTML from the page
        const html = await page.evaluate(() => {
            const selector = 'div.body';
            return document.querySelector(selector)?.innerHTML ?? '';
        });

        // Getting a list of URLs from the page
        let urls = await page.evaluate(() => {
            const anchorElements = document.querySelectorAll('a');
            const linksArray = Array.from(anchorElements).map(anchor => anchor.href);
            return linksArray;
        });
        console.log("Intial URLs list: " + urls);

        // Converting the page HTML to Markdown and yielding the content
        let content = await api.html2md_v2({html});
        console.log("Page content: " + content);
        yield {content, mimeType: 'text/markdown'};

        // Keeping only the links that match the requested docs version
        urls = urls.filter(u => u.includes(version));
        console.log("Filtered URLs: " + urls);
        yield {urls};
    }
}

const v311 = visual(state => state.version === "3.11");
const v312 = visual(state => state.version === "3.12");

v311(() => {
    corpus({
        title: `Python docs 3.11`,
        urls: [`https://docs.python.org/3/`],
        crawler: {
            args: {version: '3.11'},
            puppeteer: crawlDocs,
            browserLog: 'on',
        },
        maxPages: 10,
        maxDepth: 100,
    });
});

v312(() => {
    corpus({
        title: `Python docs 3.12`,
        urls: [`https://docs.python.org/3/`],
        crawler: {
            args: {version: '3.12'},
            puppeteer: crawlDocs,
            browserLog: 'on',
        },
        maxPages: 10,
        maxDepth: 100,
    });
});

Here is how this script works:

  1. The crawlDocs() function specifies the behavior of the Puppeteer crawler and determines which content is crawled. The required docs version is passed to it through the args argument.

    1. Puppeteer waits for the div.documentwrapper selector to be displayed and then crawls the content of the main docs page passed to the corpus() function: https://docs.python.org/3/.

    2. From the main docs page, Puppeteer retrieves all links to sub-pages.

    3. The function filters these links to keep only the ones that match the version passed in args. For example, for version 3.11, only pages whose URLs contain 3.11 are selected for further crawling (see the sketch after this list).

    4. Steps 1-3 are repeated for each page from the filtered list until the maxPages limit is reached.

  2. The script contains two filters created with the visual state: v311 and v312.

  3. Each corpus() is defined within its respective filter. The corpuses specify the URL of the docs to be crawled, the crawler type and function to be used, and the arguments passed to the crawlDocs() function: args: {version: '3.11'} or args: {version: '3.12'}. They also set other parameters, such as the maximum number of pages (maxPages) and the crawl depth (maxDepth).
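To see what the filtering in step 3 does, here is a standalone sketch with made-up URLs; it is an illustration, not part of the dialog script:

// Hypothetical links collected from a crawled page
let urls = [
    `https://docs.python.org/3.11/library/asyncio.html`,
    `https://docs.python.org/3.12/whatsnew/3.12.html`,
    `https://docs.python.org/3.11/tutorial/index.html`,
];

const version = '3.11';
urls = urls.filter(u => u.includes(version));
// urls now contains only the two 3.11 pages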

To test it, in Alan AI Studio, set the visual state to {"version": "3.11"} and ask a question: What are the new features in Python? The AI assistant will provide an answer from the Python docs 3.11 corpus.

[Screenshot: corpus-filtering-puppeteer.png]