Note: This documentation covers the v0 version of the Firecrawl API, which has been deprecated. We recommend switching to v1.

Installation

To install the Firecrawl Rust SDK, add the following to your Cargo.toml file:

[dependencies]
firecrawl = "^0.1"
tokio = { version = "^1", features = ["full"] }
serde = { version = "^1.0", features = ["derive"] }
serde_json = "^1.0"
uuid = { version = "^1.10", features = ["v4"] }

[build-dependencies]
tokio = { version = "^1", features = ["full"] } # only needed if your build script (build.rs) runs async code

Usage

  1. Get an API key from firecrawl.dev.
  2. Set the API key as an environment variable named FIRECRAWL_API_KEY, or pass it as a parameter to the FirecrawlApp struct (see the sketch below).
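
As a minimal sketch of the environment-variable approach (assuming FirecrawlApp::new accepts string slices, as in the example below), you can read FIRECRAWL_API_KEY at startup:

use firecrawl::FirecrawlApp;

#[tokio::main]
async fn main() {
  // Read the key from the environment instead of hard-coding it.
  let api_key = std::env::var("FIRECRAWL_API_KEY")
    .expect("FIRECRAWL_API_KEY environment variable is not set");
  let api_url = "https://api.firecrawl.dev";
  let app = FirecrawlApp::new(&api_key, api_url).expect("Failed to initialize FirecrawlApp");
  // ... use `app` as shown in the examples below
}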

Here is an example of how to use the SDK in Rust:

use firecrawl::FirecrawlApp;
use serde_json::json;

#[tokio::main]
async fn main() {
  let api_key = "YOUR_API_KEY";
  let api_url = "https://api.firecrawl.dev";
  let app = FirecrawlApp::new(api_key, api_url).expect("Failed to initialize FirecrawlApp");

  // Scrape a single URL
  let scrape_result = app.scrape_url("https://docs.firecrawl.dev", None).await;
  match scrape_result {
    Ok(data) => println!("Scraped Data: {}", data),
    Err(e) => eprintln!("Error occurred while scraping: {}", e),
  }

  // Crawl a website
  let crawl_params = json!({
    "pageOptions": {
      "onlyMainContent": true
    }
  });
  let crawl_result = app.crawl_url("https://docs.firecrawl.dev", Some(crawl_params)).await;
  match crawl_result {
    Ok(data) => println!("Crawl Result: {}", data),
    Err(e) => eprintln!("Error occurred while crawling: {}", e),
  }
}

Scraping a URL

Use the scrape_url method to scrape a single URL and handle any errors. It takes the URL as a parameter and returns the scraped data as a serde_json::Value.

let scrape_result = app.scrape_url("https://docs.firecrawl.dev", None).await;

match scrape_result {
  Ok(data) => println!("Scraped Data: {}", data),
  Err(e) => eprintln!("Failed to scrape URL: {}", e),
}
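
Since the result is a serde_json::Value, individual fields can be pulled out of it. A minimal sketch, assuming the response carries a "markdown" field (field name based on the v0 scrape response; inspect your response to confirm):

if let Ok(data) = app.scrape_url("https://docs.firecrawl.dev", None).await {
  // "markdown" is an assumed field name from the v0 response shape.
  if let Some(markdown) = data["markdown"].as_str() {
    println!("Markdown content:\n{}", markdown);
  }
}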

Crawling a Website

Use the crawl_url method to crawl a website. It takes the starting URL and optional parameters as arguments. The params argument lets you specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.

let crawl_params = json!({
  "crawlerOptions": {
    "excludes": ["blog/"],
    "includes": [], // 留空表示所有页面
    "limit": 1000
  },
  "pageOptions": {
    "onlyMainContent": true
  }
});
let crawl_result = app.crawl_url("https://docs.firecrawl.dev", Some(crawl_params)).await;

match crawl_result {
  Ok(data) => println!("Crawl Result: {}", data),
  Err(e) => eprintln!("Failed to crawl URL: {}", e),
}

Checking Crawl Status

Use the check_crawl_status method to check the status of a crawl job. It takes the job ID as a parameter and returns the current status of the crawl job.

let job_id = "your_job_id_here";
let status = app.check_crawl_status(job_id).await;

match status {
  Ok(data) => println!("Crawl Status: {}", data),
  Err(e) => eprintln!("Failed to check crawl status: {}", e),
}
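
In the v0 API, crawl jobs run asynchronously, so a common pattern is to poll until the job finishes. A minimal sketch, assuming the returned Value contains a "status" field that becomes "completed" (field name and value taken from the v0 API responses):

use std::time::Duration;

let job_id = "your_job_id_here";
loop {
  let status = app.check_crawl_status(job_id).await.expect("Failed to check crawl status");
  // "status" == "completed" is assumed from the v0 response shape.
  if status["status"] == "completed" {
    println!("Crawl finished: {}", status);
    break;
  }
  tokio::time::sleep(Duration::from_secs(5)).await;
}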

Canceling a Crawl Job

Use the cancel_crawl_job method to cancel a crawl job. It takes the job ID as a parameter and returns the cancellation status of the crawl job.

let job_id = "your_job_id_here";
let canceled = app.cancel_crawl_job(job_id).await;

match canceled {
  Ok(status) => println!("Cancellation Status: {}", status),
  Err(e) => eprintln!("Failed to cancel crawl job: {}", e),
}

Extracting Structured Data from a URL

With LLM extraction, you can easily extract structured data from any URL. Here is how to use it:

let json_schema = json!({
  "type": "object",
  "properties": {
    "top": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "points": {"type": "number"},
          "by": {"type": "string"},
          "commentsURL": {"type": "string"}
        },
        "required": ["title", "points", "by", "commentsURL"]
      },
      "minItems": 5,
      "maxItems": 5,
      "description": "Top 5 stories on Hacker News"
    }
  },
  "required": ["top"]
});

let llm_extraction_params = json!({
  "extractorOptions": {
    "extractionSchema": json_schema
  }
});

let scrape_result = app.scrape_url("https://news.ycombinator.com", Some(llm_extraction_params)).await;

match scrape_result {
  Ok(data) => println!("LLM Extraction Result: {}", data),
  Err(e) => eprintln!("Failed to perform LLM extraction: {}", e),
}
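
Because serde with the derive feature is already in your dependencies, the result can be deserialized into typed structs instead of printed as a raw Value. A minimal sketch that replaces the match above, assuming the extracted object is exposed under an "llm_extraction" field (field name taken from the v0 API response; inspect your response to confirm):

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Story {
  title: String,
  points: f64,
  by: String,
  #[serde(rename = "commentsURL")]
  comments_url: String,
}

#[derive(Debug, Deserialize)]
struct TopStories {
  top: Vec<Story>,
}

match scrape_result {
  Ok(data) => {
    // "llm_extraction" is an assumed field name from the v0 response shape.
    let stories: TopStories = serde_json::from_value(data["llm_extraction"].clone())
      .expect("response did not match the expected schema");
    println!("{:#?}", stories);
  }
  Err(e) => eprintln!("Failed to perform LLM extraction: {}", e),
}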

Searching for a Query

To search the web, get the most relevant results, scrape each page, and return the content as Markdown, use the search method. It takes the query as a parameter and returns the search results.

let query = "What is firecrawl?";
let search_result = app.search(query).await;

match search_result {
  Ok(data) => println!("Search Result: {}", data),
  Err(e) => eprintln!("Failed to search: {}", e),
}
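
The search result is likewise a serde_json::Value. A minimal sketch of iterating over it, assuming the SDK returns an array of result objects that each carry a "markdown" field (shape based on the v0 search response; inspect your response to confirm):

if let Ok(data) = app.search("What is firecrawl?").await {
  if let Some(results) = data.as_array() {
    for item in results {
      // "markdown" is an assumed field name from the v0 response shape.
      println!("--- result ---\n{}", item["markdown"]);
    }
  }
}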

Error Handling

The SDK handles errors returned by the Firecrawl API as well as errors that occur during the request. Since Rust has no exceptions, every method returns a Result: on failure, the Err variant carries a descriptive error message, which you can match on as the examples above demonstrate.
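
For example, this handling can be centralized in a small helper. A minimal sketch using only the scrape_url call shown above (try_scrape is a hypothetical name, not part of the SDK):

use firecrawl::FirecrawlApp;
use serde_json::Value;

// Hypothetical helper: log the failure and convert the Result into an Option.
async fn try_scrape(app: &FirecrawlApp, url: &str) -> Option<Value> {
  match app.scrape_url(url, None).await {
    Ok(data) => Some(data),
    Err(e) => {
      eprintln!("Failed to scrape {}: {}", url, e);
      None
    }
  }
}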