Extract | Firecrawl

Scrape and extract structured data with Firecrawl

Firecrawl uses large language models (LLMs) to extract structured data from web pages. The process has three steps:

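Schema definition: Specify the URL to scrape and define the desired data schema using JSON Schema (following the OpenAI tool schema). The schema describes the structure of the data you expect to extract from the page.
Scrape endpoint: Pass the URL and the schema to the scrape endpoint. The endpoint is described in the Scrape Endpoint Documentation.
Structured data retrieval: Receive the scraped data in the structure defined by your schema. You can then use the data in your application or process it further.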

This approach streamlines data extraction and reduces manual effort.
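The three steps above can be sketched without the SDK. The following is a minimal illustration, assuming the request body uses the same formats and jsonOptions keys as the examples in this document. The schema fields and the build_scrape_body helper are hypothetical names chosen for illustration, and nothing is sent over the network.

```python
import json

# Step 1: describe the data you expect as a JSON Schema object.
# The field names are illustrative; use whatever your page calls for.
company_schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
    },
    "required": ["company_mission", "supports_sso"],
}


def build_scrape_body(url: str, schema: dict) -> str:
    """Step 2: serialize the URL and schema into a JSON request body.

    Hypothetical helper; the key names mirror the examples in this document.
    """
    body = {
        "url": url,
        "formats": ["json"],
        "jsonOptions": {"schema": schema},
    }
    return json.dumps(body)


body = build_scrape_body("https://docs.firecrawl.dev/", company_schema)

# Step 3 happens on the response side: the extracted fields come back
# shaped like the schema that was sent.
print(body)
```

Writing the schema as a plain dictionary keeps the sketch dependency-free; a Pydantic model, as used in the SDK example, can generate an equivalent schema.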

Extract structured data

/scrape (with extract) endpoint

Used to extract structured data from scraped pages.

from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field

# Initialize FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')

class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

data = app.scrape_url('https://docs.firecrawl.dev/', {
    'formats': ['json'],
    'jsonOptions': {
        'schema': ExtractSchema.model_json_schema(),
    }
})
print(data["json"])

Output:

JSON

{
    "success": true,
    "data": {
      "json": {
        "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
        "supports_sso": true,
        "is_open_source": false,
        "is_in_yc": true
      },
      "metadata": {
        "title": "Mendable",
        "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "robots": "follow, index",
        "ogTitle": "Mendable",
        "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "ogUrl": "https://docs.firecrawl.dev/",
        "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
        "ogLocaleAlternate": [],
        "ogSiteName": "Mendable",
        "sourceURL": "https://docs.firecrawl.dev/"
      }
    }
}
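A response shaped like the one above can be checked before the extracted fields are used. This is a minimal sketch, assuming the raw response layout shown here (a success flag and a data.json object). The read_extraction helper and the sample values are hypothetical.

```python
from typing import Any

# Expected Python type for each field of ExtractSchema.
EXPECTED_TYPES = {
    "company_mission": str,
    "supports_sso": bool,
    "is_open_source": bool,
    "is_in_yc": bool,
}


def read_extraction(response: dict[str, Any]) -> dict[str, Any]:
    """Return the extracted fields, raising if the response looks wrong."""
    if not response.get("success"):
        raise ValueError("scrape request did not succeed")
    extracted = (response.get("data") or {}).get("json")
    if not isinstance(extracted, dict):
        raise ValueError("response has no extracted JSON object")
    for name, expected in EXPECTED_TYPES.items():
        if not isinstance(extracted.get(name), expected):
            raise TypeError(f"field {name!r} is missing or not {expected.__name__}")
    return extracted


# Sample response with made-up values, trimmed to the relevant keys.
sample = {
    "success": True,
    "data": {
        "json": {
            "company_mission": "Example mission statement",
            "supports_sso": True,
            "is_open_source": False,
            "is_in_yc": True,
        }
    },
}

fields = read_extraction(sample)
print(fields["supports_sso"])
```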

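Schema-less extraction (new)

You can now extract data without a schema by passing only a prompt to the endpoint. The LLM chooses the structure of the data.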

curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev/",
      "formats": ["json"],
      "jsonOptions": {
        "prompt": "Extract the company mission from the page."
      }
    }'

Output:

JSON

{
    "success": true,
    "data": {
      "json": {
        "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to"
      },
      "metadata": {
        "title": "Mendable",
        "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "robots": "follow, index",
        "ogTitle": "Mendable",
        "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "ogUrl": "https://docs.firecrawl.dev/",
        "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
        "ogLocaleAlternate": [],
        "ogSiteName": "Mendable",
        "sourceURL": "https://docs.firecrawl.dev/"
      }
    }
}
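The curl request above can be mirrored in Python using only the standard library. This is a rough sketch rather than the SDK's own method: build_prompt_request and send are hypothetical helper names, and the example only constructs the request so that it can be inspected without network access.

```python
import json
import urllib.request

API_URL = "https://api.firecrawl.dev/v1/scrape"


def build_prompt_request(api_key: str, page_url: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a schema-less extraction request."""
    body = {
        "url": page_url,
        "formats": ["json"],
        "jsonOptions": {"prompt": prompt},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (requires network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_prompt_request(
    "YOUR_API_KEY",
    "https://docs.firecrawl.dev/",
    "Extract the company mission from the page.",
)
print(request.get_method(), request.full_url)
```

Calling send(request) with a real API key would perform the scrape. Because the LLM chooses the structure in schema-less mode, it is safer not to assume particular key names in the returned data.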

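The extract object

The extract object (passed as jsonOptions in the examples above) accepts the following parameters:

schema: The schema to use for the extraction.
systemPrompt: The system prompt to use for the extraction.
prompt: The prompt to use for schema-less extraction.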
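The three parameters can be assembled into a single options object. The sketch below is illustrative only: build_extract_options is a hypothetical helper, the system prompt text is made up, and leaving unset parameters out of the object is an assumption rather than documented behavior.

```python
from typing import Any, Optional


def build_extract_options(
    schema: Optional[dict[str, Any]] = None,
    system_prompt: Optional[str] = None,
    prompt: Optional[str] = None,
) -> dict[str, Any]:
    """Collect the extraction parameters, skipping any that are unset."""
    candidates = {
        "schema": schema,
        "systemPrompt": system_prompt,
        "prompt": prompt,
    }
    return {key: value for key, value in candidates.items() if value is not None}


# Schema-based extraction with an extra system prompt.
with_schema = build_extract_options(
    schema={"type": "object", "properties": {"company_mission": {"type": "string"}}},
    system_prompt="Answer using only the page content.",
)

# Schema-less extraction: only a prompt is given.
prompt_only = build_extract_options(prompt="Extract the company mission from the page.")

print(sorted(with_schema))
print(prompt_only)
```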
