Firecrawl crawls websites thoroughly, ensuring comprehensive data extraction even on pages guarded by web-blocker mechanisms. Here’s how it works:

  1. URL Analysis: Starts from a specified URL and looks for a sitemap to identify links. If no sitemap is found, it crawls the website by following links instead.

  2. Recursive Traversal: Recursively follows each link to uncover all subpages.

  3. Content Scraping: Gathers content from every visited page while handling any complexities like JavaScript rendering or rate limits.

  4. Result Compilation: Converts collected data into clean markdown or structured output, perfect for LLM processing or any other task.

This method guarantees an exhaustive crawl and data collection from any starting URL.
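The traversal in steps 1–3 can be sketched as a breadth-first walk that visits each reachable page exactly once. The sketch below is purely illustrative: it uses a hypothetical in-memory `SITE` map in place of real HTTP fetching, JavaScript rendering, and rate-limit handling, all of which Firecrawl manages for you.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (content, outgoing links).
# Stands in for real HTTP fetching for illustration only.
SITE = {
    "https://example.com/": ("Home page", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("Page A", ["https://example.com/b"]),
    "https://example.com/b": ("Page B", []),
}

def crawl(start_url):
    """Breadth-first traversal: visit every reachable subpage exactly once."""
    seen = {start_url}
    queue = deque([start_url])
    pages = []
    while queue:
        url = queue.popleft()
        content, links = SITE[url]
        pages.append({"sourceURL": url, "content": content})
        for link in links:
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return pages

pages = crawl("https://example.com/")
```

The `seen` set is what keeps the recursion from revisiting pages when links form cycles, which is common on real sites.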

Crawling

/crawl endpoint

Used to crawl a URL and all of its accessible subpages. This submits a crawl job and returns a job ID you can use to check the status of the crawl.

Installation

pip install firecrawl-py

Usage

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})

# Get the markdown
for result in crawl_result:
    print(result['markdown'])

Job ID Response

If you are not using the SDK, or prefer a webhook or another polling method, you can set wait_until_done to false. This returns a jobId. With cURL, /crawl always returns a jobId that you can use to check the status of the crawl.

{ "jobId": "1234-5678-9101" }
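For callers not using the SDK, the request can be built directly over HTTP. The sketch below only constructs the request without sending it; the base URL (https://api.firecrawl.dev/v0/crawl) and header names are assumptions based on Firecrawl's v0 REST interface, so check them against the API reference.

```python
import json
from urllib import request

# Hypothetical direct HTTP call (no SDK). The endpoint URL is an assumption.
req = request.Request(
    "https://api.firecrawl.dev/v0/crawl",
    data=json.dumps({"url": "https://mendable.ai"}).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    method="POST",
)
# Sending this request (not done here) returns a JSON body like
# {"jobId": "1234-5678-9101"}, which you then use to poll the crawl status.
```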

Checking crawl job status

Used to check the status of a crawl job and retrieve its result.

status = app.check_crawl_status(job_id)

Response

{
  "status": "completed",
  "current": 22,
  "total": 22,
  "data": [
    {
      "content": "Raw Content",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www.mendable.ai/"
      }
    }
  ]
}
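A typical polling loop calls check_crawl_status until the job reports "completed", then reads the data field. The sketch below uses a hypothetical FakeApp stub in place of a real FirecrawlApp so the flow can run offline; wait_for_crawl is an illustrative helper, not part of the SDK.

```python
import time

class FakeApp:
    """Stand-in for FirecrawlApp so the polling flow runs without the API."""
    def __init__(self):
        self._calls = 0

    def check_crawl_status(self, job_id):
        self._calls += 1
        if self._calls < 3:  # report progress for the first two polls
            return {"status": "active", "current": self._calls, "total": 3}
        return {
            "status": "completed", "current": 3, "total": 3,
            "data": [{"markdown": "# Markdown Content"}],
        }

def wait_for_crawl(app, job_id, poll_interval=0.01):
    """Poll check_crawl_status until the job finishes, then return its data."""
    while True:
        status = app.check_crawl_status(job_id)
        if status["status"] == "completed":
            return status["data"]
        if status["status"] == "failed":
            raise RuntimeError(f"crawl {job_id} failed")
        time.sleep(poll_interval)

data = wait_for_crawl(FakeApp(), "1234-5678-9101")
```

With the real SDK you would pass your FirecrawlApp instance instead of FakeApp, and choose a poll interval appropriate for the size of the crawl.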