Corpus filtering
When developing an AI assistant that works with multiple data corpuses, you often need to filter the data by criteria such as user role, product version, or individual preferences. Segmenting the content this way ensures that the AI assistant answers each user only from the information relevant to them.
To filter corpuses, you can use the visual state. Create a filter for each data segment, such as a user role or product version, and define the corpus() function within that filter.
Example of use
Assume you want to allow users to access GitHub documentation based on their role: user or admin. To do this, you can implement the following script:
const user = visual(state => state.role === "user");
const admin = visual(state => state.role === "admin");

user(() => {
  corpus({
    title: `User docs`,
    urls: [`https://docs.github.com/en`],
    maxPages: 10,
    maxDepth: 3,
  });
});

admin(() => {
  corpus({
    title: `Admin docs`,
    urls: [`https://docs.github.com/en/enterprise-cloud@latest/admin`],
    maxPages: 10,
    maxDepth: 3,
  });
});
To test it, in Alan AI Studio, set the visual state to "role":"admin" and ask a question: How do I create an account? The AI assistant will provide an answer from the Admin docs corpus.
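The role check can be tried outside Alan AI Studio with plain JavaScript. The sketch below reuses the two predicates from the script. The activeCorpora() helper and the filters array are hypothetical names introduced only for illustration. They are not part of the Alan AI API and do not show how the platform evaluates filters internally.

```javascript
// Predicates copied from the script above.
const isUser = state => state.role === "user";
const isAdmin = state => state.role === "admin";

// Hypothetical lookup table pairing each predicate with its corpus title.
const filters = [
  {title: "User docs", matches: isUser},
  {title: "Admin docs", matches: isAdmin},
];

// Hypothetical helper: returns the titles of the corpuses whose filter
// accepts the given visual state object.
function activeCorpora(state) {
  return filters.filter(f => f.matches(state)).map(f => f.title);
}

console.log(activeCorpora({role: "admin"})); // ["Admin docs"]
console.log(activeCorpora({role: "user"}));  // ["User docs"]
console.log(activeCorpora({}));              // [] because no role is set
```

A state object without a role key matches neither predicate, so in this sketch neither corpus is selected for it.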
Assume you want to crawl Python documentation for two versions: 3.11 and 3.12. You can implement the following script using the Puppeteer crawler and visual state:
async function* crawlDocs({url, page, document, args}) {
  // Docs version to keep, passed through crawler.args in corpus()
  let {version} = args;
  if (url.includes(`docs.python.org`)) {
    // Waiting until the main docs container is rendered
    await page.waitForSelector(`div.documentwrapper`, {timeout: 15000});
    const html = await page.evaluate(() => {
      const selector = 'div.body';
      return document.querySelector(selector).innerHTML;
    });
    // Getting a list of URLs from the page
    let urls = await page.evaluate(() => {
      const anchorElements = document.querySelectorAll('a');
      const linksArray = Array.from(anchorElements).map(anchor => anchor.href);
      return linksArray;
    });
    console.log("Initial URLs list: " + urls);
    // Getting the page content
    let content = await api.html2md_v2({html});
    console.log("Page content: " + content);
    yield {content, mimeType: 'text/markdown'};
    // Filtering URLs: only links containing the version string are crawled next
    urls = urls.filter(f => f.includes(version));
    console.log("Filtered URLs: " + urls);
    yield {urls};
  }
}

const v311 = visual(state => state.version === "3.11");
const v312 = visual(state => state.version === "3.12");

v311(() => {
  corpus({
    title: `Python docs 3.11`,
    urls: [`https://docs.python.org/3/`],
    crawler: {
      args: {version: '3.11'},
      puppeteer: crawlDocs,
      browserLog: 'on',
    },
    maxPages: 10,
    maxDepth: 100,
  });
});

v312(() => {
  corpus({
    title: `Python docs 3.12`,
    urls: [`https://docs.python.org/3/`],
    crawler: {
      args: {version: '3.12'},
      puppeteer: crawlDocs,
      browserLog: 'on',
    },
    maxPages: 10,
    maxDepth: 100,
  });
});
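The filtering step in crawlDocs() uses a plain substring check, so a URL is kept whenever the version string appears anywhere in it. If that turns out to be too permissive, one possible alternative is to compare the first path segment instead. The sketch below is only an illustration: filterByVersion() is a hypothetical helper, not part of the Alan AI API, and the sample links are assumptions chosen to show the difference.

```javascript
// Hypothetical helper: keeps only URLs whose first path segment equals
// the requested version, for example /3.11/... for version "3.11".
function filterByVersion(urls, version) {
  return urls.filter(u => {
    try {
      const segments = new URL(u).pathname.split("/").filter(Boolean);
      return segments[0] === version;
    } catch (e) {
      return false; // skip strings that cannot be parsed as URLs
    }
  });
}

// Sample links, assumed for illustration.
const links = [
  "https://docs.python.org/3.11/tutorial/index.html",
  "https://docs.python.org/3.12/tutorial/index.html",
  "https://docs.python.org/3/whatsnew/3.11.html",
  "mailto:docs@python.org",
];

// Substring check, as in crawlDocs(): keeps the first and third links.
console.log(links.filter(u => u.includes("3.11")));

// Path segment check: keeps only the first link.
console.log(filterByVersion(links, "3.11"));
```

Which check is appropriate depends on how the target site structures its URLs. The substring version is shorter and is the one used in the script above.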
Here is how this script works:
1. The crawlDocs() function specifies the behavior of the Puppeteer crawler and indicates which content should be crawled. It takes the args argument in which the required docs version is passed.
   a. Puppeteer waits for the div.documentwrapper selector to be displayed and then crawls the content of the main docs page passed to the corpus() function: https://docs.python.org/3/.
   b. From the main docs page, Puppeteer retrieves all links to sub-pages.
   c. The function filters these links to make sure they match the specified version, for example 3.11. In that case, only pages with URLs containing 3.11 are selected for further crawling.
   d. Steps a-c are repeated for each page from the filtered list until the maxPages limit is reached.
2. The script contains two filters created with the visual state: v311 and v312.
3. Each corpus() is defined within its respective filter. The corpuses define the URL of the docs to be crawled, the crawler type, the function to be used, and the arguments passed to the crawlDocs() function: args: {version: '3.11'} or args: {version: '3.12'}. They also include other parameters, such as the maximum number of pages and the corpus depth.
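The crawl-and-filter loop described above can be simulated without a browser. The sketch below is an illustration under stated assumptions, not the Alan AI crawler: the site object is a made-up link graph, simulateCrawl() is a hypothetical function, the breadth-first visiting order is an assumption, and the maxDepth limit is left out for brevity.

```javascript
// Made-up link graph: each URL maps to the links found on that page.
const site = {
  "https://docs.python.org/3/": [
    "https://docs.python.org/3.11/",
    "https://docs.python.org/3.12/",
  ],
  "https://docs.python.org/3.11/": [
    "https://docs.python.org/3.11/tutorial/",
    "https://docs.python.org/3.11/library/",
    "https://docs.python.org/3.12/",
  ],
  "https://docs.python.org/3.11/tutorial/": [
    "https://docs.python.org/3.11/library/",
  ],
  "https://docs.python.org/3.11/library/": [],
};

// Hypothetical simulation: visit a page, collect its links, keep only the
// links containing the version, and repeat until maxPages pages are visited.
function simulateCrawl(startUrl, version, maxPages) {
  const visited = [];
  const seen = new Set([startUrl]);
  const queue = [startUrl];
  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift();
    visited.push(url);                                      // crawl the page
    const links = site[url] || [];                          // collect its links
    const matching = links.filter(l => l.includes(version)); // filter by version
    for (const link of matching) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return visited;
}

console.log(simulateCrawl("https://docs.python.org/3/", "3.11", 3));
// ["https://docs.python.org/3/",
//  "https://docs.python.org/3.11/",
//  "https://docs.python.org/3.11/tutorial/"]
```

In this simulation the 3.12 links are never queued, and the crawl stops after three pages even though the library page is still waiting in the queue.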
To test it, in Alan AI Studio, set the visual state to "version":"3.11" and ask a question: What are the new features in Python? The AI assistant will provide an answer from the Python docs 3.11 corpus.