Note: this documentation covers v0 of the Firecrawl API, which is deprecated. We recommend switching to v1.
To install the Firecrawl Rust SDK, add the following to your Cargo.toml file:
[dependencies]
firecrawl = "^0.1"
tokio = { version = "^1", features = ["full"] }
serde = { version = "^1.0", features = ["derive"] }
serde_json = "^1.0"
uuid = { version = "^1.10", features = ["v4"] }
[build-dependencies]
tokio = { version = "1", features = ["full"] }
- Get an API key from firecrawl.dev.
- Set the API key as an environment variable named FIRECRAWL_API_KEY, or pass it as a parameter to the FirecrawlApp struct.
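The two options in the list above can be combined into a single lookup. The following is a minimal sketch using only the standard library, not the SDK's own logic: resolve_api_key and resolve_api_key_with are hypothetical helper names, and preferring an explicitly passed key over the environment variable is an assumption made for illustration.

```rust
use std::env;

/// Returns true when the key contains something other than whitespace.
fn is_non_empty(key: &String) -> bool {
    !key.trim().is_empty()
}

/// Hypothetical helper: use the explicit key if one is given and non-empty,
/// otherwise fall back to whatever `lookup` finds for FIRECRAWL_API_KEY.
/// The precedence (explicit first) is an assumption made for this sketch.
fn resolve_api_key_with<F>(explicit: Option<&str>, lookup: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    explicit
        .map(str::to_owned)
        .filter(is_non_empty)
        .or_else(|| lookup("FIRECRAWL_API_KEY").filter(is_non_empty))
}

/// Same as above, reading from the real process environment.
fn resolve_api_key(explicit: Option<&str>) -> Option<String> {
    resolve_api_key_with(explicit, |name| env::var(name).ok())
}

fn main() {
    match resolve_api_key(None) {
        Some(_) => println!("API key found"),
        None => eprintln!("Set FIRECRAWL_API_KEY or pass a key explicitly"),
    }
}
```

Taking the lookup as a closure keeps the precedence rule testable without modifying the process environment.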
Here is an example of how to use the SDK in Rust:
use firecrawl::FirecrawlApp;
use serde_json::json; // provides the json! macro used below

#[tokio::main]
async fn main() {
    let api_key = "YOUR_API_KEY";
    let api_url = "https://api.firecrawl.dev";
    let app = FirecrawlApp::new(api_key, api_url).expect("Failed to initialize FirecrawlApp");

    // Scrape a single URL
    let scrape_result = app.scrape_url("https://docs.firecrawl.dev", None).await;
    match scrape_result {
        Ok(data) => println!("Scraped Data: {}", data),
        Err(e) => eprintln!("Error occurred while scraping: {}", e),
    }

    // Crawl a website
    let crawl_params = json!({
        "pageOptions": {
            "onlyMainContent": true
        }
    });
    let crawl_result = app.crawl_url("https://docs.firecrawl.dev", Some(crawl_params)).await;
    match crawl_result {
        Ok(data) => println!("Crawl Result: {}", data),
        Err(e) => eprintln!("Error occurred while crawling: {}", e),
    }
}
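Scraping a URL
Use the scrape_url method to scrape a single URL with error handling. It takes the URL, along with optional parameters, and returns the scraped data as a serde_json::Value.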
let scrape_result = app.scrape_url("https://docs.firecrawl.dev", None).await;
match scrape_result {
    Ok(data) => println!("Scraped Data: {}", data),
    Err(e) => eprintln!("Failed to scrape URL: {}", e),
}
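Crawling a Website
Use the crawl_url method to crawl a website. It takes the starting URL and optional parameters as arguments. The params argument lets you specify additional options for the crawl job, such as the maximum number of pages to crawl, the allowed domains, and the output format.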
let crawl_params = json!({
    "crawlerOptions": {
        "excludes": ["blog/"],
        "includes": [], // leave empty to include all pages
        "limit": 1000
    },
    "pageOptions": {
        "onlyMainContent": true
    }
});
let crawl_result = app.crawl_url("https://docs.firecrawl.dev", Some(crawl_params)).await;
match crawl_result {
    Ok(data) => println!("Crawl Result: {}", data),
    Err(e) => eprintln!("Failed to crawl URL: {}", e),
}
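Checking Crawl Status
Use the check_crawl_status method to check the status of a crawl job. It takes the job ID as a parameter and returns the current status of that crawl job.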
let job_id = "your_job_id_here";
let status = app.check_crawl_status(job_id).await;
match status {
    Ok(data) => println!("Crawl Status: {}", data),
    Err(e) => eprintln!("Failed to check crawl status: {}", e),
}
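A status check like the one above is usually repeated until the job finishes. The following is a minimal, std-only sketch of such a polling loop, not part of the SDK: poll_until_done and PollOutcome are hypothetical names, the check closure stands in for a status lookup, and the "completed" and "failed" status strings are assumptions that this document does not confirm.

```rust
use std::thread;
use std::time::Duration;

/// Outcome of the polling loop in this sketch.
#[derive(Debug, PartialEq)]
enum PollOutcome {
    /// The job reported the status this sketch treats as finished.
    Completed,
    /// The job reported the status this sketch treats as failed.
    Failed,
    /// The attempt budget ran out before a terminal status appeared.
    TimedOut,
}

/// Hypothetical helper: call `check` until it reports a terminal status or
/// `max_attempts` is reached. `check` returns the status string or an error
/// message; the first error stops the loop and is passed to the caller.
fn poll_until_done<F>(
    mut check: F,
    max_attempts: u32,
    delay: Duration,
) -> Result<PollOutcome, String>
where
    F: FnMut() -> Result<String, String>,
{
    for attempt in 1..=max_attempts {
        match check()?.as_str() {
            "completed" => return Ok(PollOutcome::Completed), // assumed status value
            "failed" => return Ok(PollOutcome::Failed),       // assumed status value
            _ => {
                // Any other status counts as "still running": wait, then retry.
                if attempt < max_attempts {
                    thread::sleep(delay);
                }
            }
        }
    }
    Ok(PollOutcome::TimedOut)
}

fn main() {
    // Simulated status source: two "active" answers, then "completed".
    let mut answers = vec!["active", "active", "completed"].into_iter();
    let outcome = poll_until_done(
        || {
            answers
                .next()
                .map(str::to_owned)
                .ok_or_else(|| "no more answers".to_string())
        },
        5,
        Duration::from_millis(10),
    );
    println!("{:?}", outcome);
}
```

Inside an async function running on tokio, tokio::time::sleep(delay).await would be the non-blocking counterpart to thread::sleep.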
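Canceling a Crawl Job
Use the cancel_crawl_job method to cancel a crawl job. It takes the job ID as a parameter and returns the cancellation status of that crawl job.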
let job_id = "your_job_id_here";
let canceled = app.cancel_crawl_job(job_id).await;
match canceled {
    Ok(status) => println!("Cancellation Status: {}", status),
    Err(e) => eprintln!("Failed to cancel crawl job: {}", e),
}
Extracting structured data from a URL
With LLM extraction, you can easily extract structured data from any URL. Here is how to use it:
let json_schema = json!({
    "type": "object",
    "properties": {
        "top": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "points": {"type": "number"},
                    "by": {"type": "string"},
                    "commentsURL": {"type": "string"}
                },
                "required": ["title", "points", "by", "commentsURL"]
            },
            "minItems": 5,
            "maxItems": 5,
            "description": "Top 5 stories on Hacker News"
        }
    },
    "required": ["top"]
});
let llm_extraction_params = json!({
    "extractorOptions": {
        "extractionSchema": json_schema
    }
});
let scrape_result = app.scrape_url("https://news.ycombinator.com", Some(llm_extraction_params)).await;
match scrape_result {
    Ok(data) => println!("LLM Extraction Result: {}", data),
    Err(e) => eprintln!("Failed to perform LLM extraction: {}", e),
}
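Search for a query
Use the search method to search the web, get the most relevant results, scrape each page, and return the content as Markdown. The method takes the query as a parameter and returns the search results.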
let query = "What is firecrawl?";
let search_result = app.search(query).await;
match search_result {
    Ok(data) => println!("Search Result: {}", data),
    Err(e) => eprintln!("Failed to search: {}", e),
}
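Error Handling
The SDK handles errors returned by the Firecrawl API and reports them to the caller. If an error occurs during a request, the method returns an Err value carrying a descriptive error message, which the match arms in the examples above print.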
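Because the SDK methods in the earlier examples are matched on Ok and Err, their failures can be handled on the spot or passed up to the caller with the ? operator. The following is a minimal, std-only sketch of both styles. SketchError, fetch_page, and page_length are hypothetical names that stand in for an SDK error type and an SDK call; they are not part of the firecrawl crate.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type standing in for whatever the SDK returns.
#[derive(Debug, PartialEq)]
enum SketchError {
    MissingApiKey,
    Http { status: u16, message: String },
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::MissingApiKey => write!(f, "no API key provided"),
            SketchError::Http { status, message } => {
                write!(f, "request failed with status {}: {}", status, message)
            }
        }
    }
}

impl Error for SketchError {}

/// Hypothetical stand-in for a request; fails for an empty key or URL.
fn fetch_page(api_key: &str, url: &str) -> Result<String, SketchError> {
    if api_key.is_empty() {
        return Err(SketchError::MissingApiKey);
    }
    if url.is_empty() {
        return Err(SketchError::Http {
            status: 400,
            message: "url is required".to_string(),
        });
    }
    Ok(format!("content of {}", url))
}

/// Propagates any failure to the caller with `?` instead of handling it here.
fn page_length(api_key: &str, url: &str) -> Result<usize, Box<dyn Error>> {
    let body = fetch_page(api_key, url)?;
    Ok(body.len())
}

fn main() {
    // Handle the error where it occurs, as the examples in this document do.
    match page_length("", "https://docs.firecrawl.dev") {
        Ok(len) => println!("Fetched {} bytes", len),
        Err(e) => eprintln!("Request failed: {}", e),
    }
}
```

Returning Box<dyn Error> from page_length lets one function propagate several error types at the cost of losing the concrete variant; matching on a concrete enum is preferable when the caller needs to react differently to each failure.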