Corpus includes and excludes

To make the corpus() function selectively include or exclude pages from the data corpus, use the include and exclude parameters. Each parameter accepts an array of URLs or regular expressions (RegEx) used to match documents.

Assume you want to index a page that contains a link to a PDF document, and you want to make sure this PDF document is always crawled. At the same time, you want to exclude auxiliary pages such as Jobs and Web Accessibility Statement. To do this, use the following script:

Dialog script
corpus({
    urls: ["https://catalog.manhattan.edu/undergraduate/"],
    include: [/.*\.pdf/],
    exclude: [
        "https://manhattan.edu/web-accessibility-statement",
        "https://inside.manhattan.edu/offices/human-resources/jobs"
    ],
    maxPages: 10,
    depth: 1
});
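
To reason about which URLs a given set of include and exclude patterns would affect, it can help to test the patterns outside the dialog script. The following is a minimal sketch in plain JavaScript, not the actual implementation of corpus(): the helpers matchesPattern and classifyUrl are hypothetical, and the sketch assumes that string entries must match the URL exactly, that RegEx entries are tested against the full URL, and that an exclude match takes precedence over an include match. The PDF URL in the usage lines is made up for illustration.

```javascript
// Hypothetical helpers for experimenting with patterns; not part of the corpus() API.

// Assumption: a string pattern must equal the URL exactly,
// while a RegExp pattern is tested against the whole URL string.
function matchesPattern(url, pattern) {
    return pattern instanceof RegExp ? pattern.test(url) : url === pattern;
}

// Returns "exclude", "include", or "default" for a URL.
// Assumption: an exclude match wins over an include match.
function classifyUrl(url, include = [], exclude = []) {
    if (exclude.some((p) => matchesPattern(url, p))) return "exclude";
    if (include.some((p) => matchesPattern(url, p))) return "include";
    return "default"; // left to the regular crawl rules (urls, depth, maxPages)
}

// Same patterns as in the dialog script above.
const include = [/.*\.pdf/];
const exclude = [
    "https://manhattan.edu/web-accessibility-statement",
    "https://inside.manhattan.edu/offices/human-resources/jobs"
];

// The PDF URL below is a made-up example.
console.log(classifyUrl("https://catalog.manhattan.edu/undergraduate/catalog.pdf", include, exclude));
console.log(classifyUrl("https://manhattan.edu/web-accessibility-statement", include, exclude));
console.log(classifyUrl("https://catalog.manhattan.edu/undergraduate/", include, exclude));
```

Under these assumptions, the three calls classify the URLs as "include", "exclude", and "default" respectively. Note that /.*\.pdf/ is unanchored, so it matches any URL that contains ".pdf" anywhere; if only URLs ending in ".pdf" should match, a pattern such as /\.pdf$/ would be stricter.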