# 🕷 Serverless Web Crawler and Search Engine with Step Functions and Kendra

## Overview

This sample demonstrates how to create a serverless web crawler (or web scraper) using [AWS Lambda](https://aws.amazon.com/lambda/) and [AWS Step Functions](https://aws.amazon.com/step-functions/). It scales to crawl large websites that would time out if we used a single Lambda function to crawl the whole site. The web crawler is written in TypeScript, and uses [Puppeteer](https://github.com/puppeteer/puppeteer) to extract content and URLs from a given webpage.

Additionally, this sample demonstrates an example use case for the crawler by indexing crawled content into [Amazon Kendra](https://aws.amazon.com/kendra/), providing a machine-learning-powered search over our crawled content. The CloudFormation stack for the Kendra resources is optional; you can deploy just the web crawler if you like. Make sure to review [Kendra's pricing and free tier](https://aws.amazon.com/kendra/pricing/) before deploying the Kendra part of the sample.

The [AWS Cloud Development Kit (CDK)](https://aws.amazon.com/cdk/) is used to define the infrastructure for this sample as code.

## Architecture

![architecture-diagram](.readme-resources/architecture.png)

* The Start Crawl Lambda is invoked with details of the website to crawl.
* The Start Crawl Lambda creates a DynamoDB table which will be used as the URL queue for the crawl.
* The Start Crawl Lambda writes the initial URLs to the queue.
* The Start Crawl Lambda triggers an execution of the web crawler state machine (see the section below).
* The Web Crawler State Machine crawls the website, visiting URLs it discovers and optionally writing content to S3.
* Kendra provides us with the ability to search our crawled content in S3.

## The Web Crawler

The web crawler is best explained by the AWS Step Functions state machine diagram:

![state-machine-diagram](.readme-resources/state-machine.png)

* Read Queued Urls: Reads all non-visited URLs from the URL queue DynamoDB table.
* Crawl Page And Queue Urls: Visits a single webpage, extracts its content, and writes newly discovered URLs to the URL queue. This step is executed in parallel across a batch of URLs; the batch size is configured in `lambda/config/constants.ts`.
* Continue Execution: Spawns a new state machine execution as we approach the [execution history limit](https://docs.aws.amazon.com/step-functions/latest/dg/bp-history-limit.html).
* Complete Crawl: Deletes the URL queue DynamoDB table and triggers a sync of the Kendra data source, if applicable.
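To make these steps concrete, the hedged sketches below show roughly what each step could look like in TypeScript with the AWS SDK for JavaScript v3. They are minimal illustrations, not the sample's actual code: the table schema (a `url` string attribute plus a `visited` flag), the function names, and the input shapes are all assumptions.

A sketch of Read Queued Urls, scanning the URL queue table for entries that haven't been visited yet:

```ts
import { DynamoDBClient, ScanCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

// Read a batch of not-yet-visited URLs from the queue table. The table schema
// here (url + visited attributes) is illustrative, not the sample's actual one.
// Note: Scan's Limit caps the items examined *before* the filter is applied,
// so a real implementation would paginate with ExclusiveStartKey.
export const readQueuedUrls = async (
  tableName: string,
  batchSize: number
): Promise<string[]> => {
  const result = await ddb.send(
    new ScanCommand({
      TableName: tableName,
      FilterExpression: "visited = :notVisited",
      ExpressionAttributeValues: { ":notVisited": { BOOL: false } },
      Limit: batchSize,
    })
  );
  return (result.Items ?? []).map((item) => item.url.S!);
};
```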
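Crawl Page And Queue Urls is where Puppeteer does the work: visit the page, pull out its text, and collect in-scope links. A sketch along those lines (the deployed Lambda would need a headless Chromium build packaged for Lambda; plain Puppeteer is shown here for simplicity):

```ts
import puppeteer from "puppeteer";

// Visit a single page, extract its text content, and return the links that
// stay within the crawl's base URL and match at least one keyword (mirroring
// the --base-url and --keywords options described later in this README).
export const crawlPage = async (url: string, baseUrl: string, keywords: string[]) => {
  const browser = await puppeteer.launch({ args: ["--no-sandbox"] });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });

    // Page text, destined for S3 (and later indexed by Kendra).
    const content = await page.evaluate(() => document.body.innerText);

    // Absolute hrefs of every anchor on the page, filtered down to in-scope URLs.
    const hrefs = await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href)
    );
    const newUrls = hrefs.filter(
      (href) =>
        href.startsWith(baseUrl) &&
        (keywords.length === 0 || keywords.some((keyword) => href.includes(keyword)))
    );

    return { content, newUrls };
  } finally {
    await browser.close();
  }
};
```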
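Continue Execution amounts to starting a fresh execution of the same state machine and handing it the crawl context, so the new execution picks up where the old one left off without accumulating more history events. A sketch (the input shape is an assumption):

```ts
import { SFNClient, StartExecutionCommand } from "@aws-sdk/client-sfn";

const sfn = new SFNClient({});

// Start a new execution of this same state machine, carrying the crawl
// context forward before the current execution hits the history limit.
export const continueExecution = async (
  stateMachineArn: string,
  crawlContext: object
) => {
  await sfn.send(
    new StartExecutionCommand({
      stateMachineArn,
      input: JSON.stringify(crawlContext),
    })
  );
};
```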
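Finally, Complete Crawl tears down the queue table and, when the Kendra stack was deployed, kicks off a data source sync so the freshly written S3 content gets indexed. A sketch under the same assumptions:

```ts
import { DynamoDBClient, DeleteTableCommand } from "@aws-sdk/client-dynamodb";
import { KendraClient, StartDataSourceSyncJobCommand } from "@aws-sdk/client-kendra";

const ddb = new DynamoDBClient({});
const kendra = new KendraClient({});

// Delete the URL queue table, then trigger a Kendra data source sync if the
// optional Kendra stack was deployed.
export const completeCrawl = async (
  tableName: string,
  kendraIndexId?: string,
  kendraDataSourceId?: string
) => {
  await ddb.send(new DeleteTableCommand({ TableName: tableName }));
  if (kendraIndexId && kendraDataSourceId) {
    await kendra.send(
      new StartDataSourceSyncJobCommand({ IndexId: kendraIndexId, Id: kendraDataSourceId })
    );
  }
};
```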
## Prerequisites

* The [aws-cli](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) must be installed *and* configured with an AWS account profile (see the [AWS CLI user guide](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) for instructions for your preferred development platform). Please ensure your profile is configured with a default AWS region.
* This project requires [Node.js](http://nodejs.org/) ≥ 16 and [NPM](https://npmjs.org/) ≥ 8.3.0. To make sure you have them available on your machine, try running the following command:

  ```sh
  npm -v && node -v
  ```

* Install or update the [AWS CDK CLI](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html#getting_started_install) from npm. This sample uses CDK v2.

  ```sh
  npm i -g aws-cdk
  ```

* [Bootstrap your AWS account for CDK](https://docs.aws.amazon.com/cdk/latest/guide/bootstrapping.html) if you haven't done so already.

## Deploy

This repository provides a utility script to build and deploy the sample. To deploy the web crawler on its own, run:

`./deploy --profile <profile>`

Or you can deploy the web crawler with Kendra too:

`./deploy --profile <profile> --with-kendra`

Note that if deploying with Kendra, ensure your profile is configured with one of the AWS regions that supports Kendra. See [the AWS Regional Services List](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/) for details.

## Run The Crawler

When the infrastructure has been deployed, you can trigger a run of the crawler with the included utility script:

`./crawl --profile <profile> --name lambda-docs --base-url https://docs.aws.amazon.com/ --start-paths /lambda --keywords lambda/latest/dg`

You can play with the arguments above to try different websites.

* `--base-url` specifies the target website to crawl; only URLs starting with the base URL will be queued.
* `--start-paths` specifies one or more paths in the website to start at.
* `--keywords` filters the queued URLs to only those containing one or more of the given keywords (above, only URLs containing `lambda/latest/dg` are added to the queue).
* `--name` is optional, and helps identify which Step Functions execution or DynamoDB table corresponds to which crawl.

The `crawl` script will print a link to the AWS console so you can watch your Step Functions state machine execution in action.

## Search Crawled Content

If you also deployed the Kendra stack (`--with-kendra`), you can visit the Kendra console to see an example search page for the Kendra index. The `crawl` script will print a link to this page if you deployed Kendra. Note that once the crawler has completed, it will take a few minutes for Kendra to index the newly stored content.

![kendra-screenshot](./.readme-resources/kendra-screenshot.png)

## Run The Crawler Locally

If you're playing with the core crawler logic, it might be handy to test it out locally. You can run the crawler locally with:

`./local-crawl --base-url https://docs.aws.amazon.com/ --start-paths /lambda --keywords lambda/latest/dg`

## Cleaning Up

You can clean up all your resources when you're done via the `destroy` script. If you deployed just the web crawler:

`./destroy --profile <profile>`

Or if you deployed the web crawler with Kendra too:

`./destroy --profile <profile> --with-kendra`

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License. See the [LICENSE](LICENSE) file.