Corpus includes and excludes
To instruct the corpus() function to selectively include or exclude pages from the data corpus, use the include and exclude parameters. These parameters accept arrays of URLs or regular expressions (RegEx) for matching documents.
Assume you want to index a page that contains a link to a PDF document, and you want to make sure this PDF document is always crawled. At the same time, you want to exclude auxiliary pages such as Jobs and Web Accessibility Statement. To do this, use the following script:
corpus({
    urls: ["https://catalog.manhattan.edu/undergraduate/"],
    // Make sure linked PDF documents are crawled
    include: [/.*\.pdf/],
    // Skip the auxiliary Web Accessibility Statement and Jobs pages
    exclude: [
        "https://manhattan.edu/web-accessibility-statement",
        "https://inside.manhattan.edu/offices/human-resources/jobs"
    ],
    maxPages: 10,
    depth: 1
});
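Before passing patterns to corpus(), it can help to check which URLs they would match. The sketch below is only an illustration: matchesAny is a hypothetical helper, not part of the corpus() API, and it assumes that string entries are compared for exact equality and that RegEx entries are tested against the full URL. The actual crawler may match differently, and the sketch does not model how include and exclude interact. The PDF URL in the sample list is made up for the example.

```javascript
// Hypothetical helper, not part of the corpus() API.
// Assumption: string entries must equal the URL exactly;
// RegExp entries are tested against the whole URL string.
function matchesAny(url, patterns) {
  return patterns.some((p) =>
    p instanceof RegExp ? p.test(url) : p === url
  );
}

// Same include and exclude values as in the script above
const include = [/.*\.pdf/];
const exclude = [
  "https://manhattan.edu/web-accessibility-statement",
  "https://inside.manhattan.edu/offices/human-resources/jobs",
];

// Sample URLs to check; the PDF path is invented for illustration
const samples = [
  "https://catalog.manhattan.edu/undergraduate/catalog.pdf",
  "https://manhattan.edu/web-accessibility-statement",
  "https://catalog.manhattan.edu/undergraduate/",
];

for (const url of samples) {
  console.log(url, {
    matchesInclude: matchesAny(url, include),
    matchesExclude: matchesAny(url, exclude),
  });
}
```

Note that /.*\.pdf/ is not anchored, so as a plain JavaScript regular expression it matches any URL that contains ".pdf" anywhere, for example a URL ending in ".pdf.html". If only URLs that end in ".pdf" should match, a pattern such as /\.pdf$/ is stricter.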