SourceSync.ai is a Retrieval Augmented Generation (RAG)-as-a-service platform that helps you build AI applications on your own data. This guide explains how to use Firecrawl with SourceSync.ai for web scraping.
First, get your Firecrawl API key from your Firecrawl dashboard.
Configure your SourceSync.ai namespace to use Firecrawl as its web scraping provider:
curl -X PATCH https://api.sourcesync.ai/v1/namespaces/YOUR_NAMESPACE_ID \
  -H "Authorization: Bearer YOUR_SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "webScraperConfig": {
      "provider": "FIRECRAWL",
      "apiKey": "YOUR_FIRECRAWL_API_KEY"
    }
  }'
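
For readers who prefer to script this step, the same PATCH request could be assembled in Python roughly as follows. This is a minimal sketch using only the standard library; the function name `build_namespace_patch_request` is hypothetical, the URL, headers, and body are copied from the curl example, and the request is built but not sent.

```python
import json
import urllib.request

# Base URL taken from the curl example in this guide.
API_BASE = "https://api.sourcesync.ai/v1"


def build_namespace_patch_request(namespace_id, sourcesync_key, firecrawl_key):
    """Build (but do not send) the PATCH request that selects Firecrawl.

    Hypothetical helper: it only mirrors the curl example, nothing more.
    """
    body = {
        "webScraperConfig": {
            "provider": "FIRECRAWL",
            "apiKey": firecrawl_key,
        }
    }
    return urllib.request.Request(
        url=f"{API_BASE}/namespaces/{namespace_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {sourcesync_key}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )


request = build_namespace_patch_request(
    "YOUR_NAMESPACE_ID", "YOUR_SOURCE_SYNC_API_KEY", "YOUR_FIRECRAWL_API_KEY"
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it. Keeping construction and sending separate makes it easy to inspect the request before any API key leaves your machine.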
Once configured, you can call SourceSync.ai's web scraping endpoints and have them use Firecrawl. The main ingestion methods are:
URL List Ingestion
Scrape specific URLs:
curl -X POST https://api.sourcesync.ai/v1/ingest/urls \
  -H "Authorization: Bearer YOUR_SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "YOUR_NAMESPACE_ID",
    "ingestConfig": {
      "source": "URLS_LIST",
      "config": {
        "urls": [
          "https://example.com/page1",
          "https://example.com/page2"
        ],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      }
    }
  }'
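
When the URL list comes from another program, it can help to generate the request body rather than hand-edit JSON. The sketch below builds the same body as the curl example; `build_urls_ingest_payload` is a hypothetical helper, and treating `scrapeOptions` as optional is an assumption that this guide does not confirm.

```python
import json


def build_urls_ingest_payload(namespace_id, urls,
                              include_selectors=None, exclude_selectors=None):
    """Return the JSON body for POST /v1/ingest/urls as a dict.

    Assumption: scrapeOptions may be omitted when no selectors are given.
    """
    config = {"urls": list(urls)}
    scrape_options = {}
    if include_selectors:
        scrape_options["includeSelectors"] = list(include_selectors)
    if exclude_selectors:
        scrape_options["excludeSelectors"] = list(exclude_selectors)
    if scrape_options:
        config["scrapeOptions"] = scrape_options
    return {
        "namespaceId": namespace_id,
        "ingestConfig": {"source": "URLS_LIST", "config": config},
    }


payload = build_urls_ingest_payload(
    "YOUR_NAMESPACE_ID",
    ["https://example.com/page1", "https://example.com/page2"],
    include_selectors=["article", "main"],
    exclude_selectors=[".navigation", ".footer"],
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed to curl with `-d @file.json` or sent from any HTTP client with the same headers as the example.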
Website Crawling
Crawl an entire website with custom rules:
curl -X POST https://api.sourcesync.ai/v1/ingest/website \
  -H "Authorization: Bearer YOUR_SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "YOUR_NAMESPACE_ID",
    "ingestConfig": {
      "source": "WEBSITE",
      "config": {
        "url": "https://example.com",
        "maxDepth": 3,
        "maxLinks": 100,
        "includePaths": ["/docs", "/blog"],
        "excludePaths": ["/admin"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      }
    }
  }'
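
A crawl has more knobs than the other ingestion methods, so a small typed wrapper can catch obvious mistakes before a request is sent. The following is a sketch only: `WebsiteCrawlConfig` is a hypothetical class, its defaults are copied from the curl example, and its validation rules are local sanity checks rather than documented API limits.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WebsiteCrawlConfig:
    """Hypothetical local wrapper around the WEBSITE ingest config."""

    url: str
    max_depth: int = 3
    max_links: int = 100
    include_paths: List[str] = field(default_factory=list)
    exclude_paths: List[str] = field(default_factory=list)
    scrape_options: Dict[str, List[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Local sanity checks only; the real API may accept or reject other values.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError("url must start with http:// or https://")
        if self.max_depth < 0:
            raise ValueError("max_depth must not be negative")
        if self.max_links < 1:
            raise ValueError("max_links must be at least 1")
        conflicting = set(self.include_paths) & set(self.exclude_paths)
        if conflicting:
            raise ValueError(f"paths both included and excluded: {sorted(conflicting)}")

    def to_payload(self, namespace_id: str) -> dict:
        """Return the JSON body for POST /v1/ingest/website as a dict."""
        config = {
            "url": self.url,
            "maxDepth": self.max_depth,
            "maxLinks": self.max_links,
            "includePaths": self.include_paths,
            "excludePaths": self.exclude_paths,
        }
        if self.scrape_options:
            config["scrapeOptions"] = self.scrape_options
        return {
            "namespaceId": namespace_id,
            "ingestConfig": {"source": "WEBSITE", "config": config},
        }


crawl = WebsiteCrawlConfig(
    url="https://example.com",
    include_paths=["/docs", "/blog"],
    exclude_paths=["/admin"],
)
payload = crawl.to_payload("YOUR_NAMESPACE_ID")
```

Listing the same path under both `include_paths` and `exclude_paths` raises a `ValueError` locally, which is cheaper to debug than a crawl that silently returns nothing.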
Sitemap Processing
Process all URLs in a sitemap:
curl -X POST https://api.sourcesync.ai/v1/ingest/sitemap \
  -H "Authorization: Bearer YOUR_SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "YOUR_NAMESPACE_ID",
    "ingestConfig": {
      "source": "SITEMAP",
      "config": {
        "url": "https://example.com/sitemap.xml",
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      }
    }
  }'
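
Before submitting a sitemap, you may want to preview which URLs it lists. The sketch below parses a standard `urlset` sitemap with Python's built-in XML parser. It is a local illustration only: `list_sitemap_urls` and the sample XML are hypothetical, it does not handle sitemap index files, and it says nothing about how SourceSync.ai or Firecrawl read sitemaps internally.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_urls(sitemap_xml):
    """Return every <loc> value found under <url> entries of a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample; a real sitemap would be downloaded from the site first.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/blog/launch</loc></url>
</urlset>"""

sitemap_urls = list_sitemap_urls(SAMPLE_SITEMAP)
print(sitemap_urls)
```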
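When using Firecrawl with SourceSync.ai, you have access to the following features:
- JavaScript rendering support
- Automatic rate limiting
- Content extraction with CSS selectors
- Recursive crawling with depth control
- Sitemap processing
For additional support: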