# Extract

> Extract structured data from a webpage

## What is `extract()`?

```typescript theme={null}
page.extract("extract the name of the repository");
```

`extract` grabs structured data from a webpage. You can define your schema with [zod](https://github.com/colinhacks/zod) (TypeScript) or [pydantic](https://github.com/pydantic/pydantic) (Python). If you do not want to define a schema, you can also call `extract` with just a [natural language prompt](#prompt-only-extraction), or call `extract` [with no parameters](#extract-with-no-parameters).

## Why use `extract()`?

<CardGroup cols={2}>
  <Card title="Structured" icon="brackets-curly" href="#list-of-objects-extraction">
    Turn messy webpage data into clean objects that follow a schema.
  </Card>

  <Card title="Resilient" icon="dumbbell" href="#extract-with-context">
    Build resilient extractions that don't break when the website changes.
  </Card>
</CardGroup>

<Note>
  For TypeScript, the extract schemas are defined using zod schemas.

  For Python, the extract schemas are defined using pydantic models.
</Note>

## Using `extract()`

### Single object Extraction

Here is how an `extract` call might look for a single object:

<CodeGroup>
  ```typescript TypeScript theme={null}
  import { z } from 'zod/v3';

  const item = await page.extract({
    instruction: "extract the price of the item",
    schema: z.object({
      price: z.number(),
    }),
  });
  ```

  ```python Python theme={null}
  from pydantic import BaseModel

  class Extraction(BaseModel):
      price: float

  item = await page.extract(
      "extract the price of the item", 
      schema=Extraction
  )
  ```
</CodeGroup>

Your output schema will look like:

```Example theme={null}
{ price: number }
```

### List of objects Extraction

Here is how an `extract` call might look for a list of objects.

<CodeGroup>
  ```typescript TypeScript theme={null}
  import { z } from 'zod/v3';

  const apartments = await page.extract({
    instruction:
      "Extract ALL the apartment listings and their details, including address, price, and square feet.",
    schema: z.object({
      list_of_apartments: z.array(
        z.object({
          address: z.string(),
          price: z.string(),
          square_feet: z.string(),
        }),
      ),
    })
  });

  console.log("the apartment list is: ", apartments);
  ```

  ```python Python theme={null}
  from pydantic import BaseModel

  class Apartment(BaseModel):
      address: str
      price: str
      square_feet: str

  class Apartments(BaseModel):
      list_of_apartments: list[Apartment]

  apartments = await page.extract(
      "Extract ALL the apartment listings and their details as a list, including address, price, and square feet for each apartment",
      schema=Apartments
  )

  print("the apartment list is: ", apartments)
  ```
</CodeGroup>

Your output will look like:

```Example theme={null}
{
  list_of_apartments: [
    {
      address: "street address here",
      price: "$1234.00",
      square_feet: "700"
    },
    {
      address: "another address here",
      price: "1010.00",
      square_feet: "500"
    },
    ...
  ]
}
```
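
Because `price` and `square_feet` are typed as strings in the schema above, the values come back as they are displayed on the page, so one listing may include a currency symbol and another may not. The following is a minimal post-processing sketch, not part of Stagehand: `parse_number` is a hypothetical helper, and the sample `result` is a plain dictionary standing in for the extracted object.

```python
import re
from typing import Optional

def parse_number(text: str) -> Optional[float]:
    """Pull the first numeric value out of a display string such as '$1,234.00'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None  # no digits found, e.g. "call for price"
    return float(match.group().replace(",", ""))

# Illustrative data shaped like the output above (a plain dict, used here as a stand-in).
result = {
    "list_of_apartments": [
        {"address": "street address here", "price": "$1234.00", "square_feet": "700"},
        {"address": "another address here", "price": "1010.00", "square_feet": "500"},
    ]
}

# Convert the display strings to numbers while keeping the address untouched.
normalized = [
    {
        "address": apt["address"],
        "price": parse_number(apt["price"]),
        "square_feet": parse_number(apt["square_feet"]),
    }
    for apt in result["list_of_apartments"]
]
```

Keeping the schema fields as strings and converting afterwards avoids validation failures when the page formats numbers inconsistently.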

### Prompt-only Extraction

You can call `extract` with just a natural language prompt:

<CodeGroup>
  ```typescript TypeScript theme={null}
  const result = await page.extract("extract the name of the repository");
  ```

  ```python Python theme={null}
  result = await page.extract("extract the name of the repository")
  ```
</CodeGroup>

When you call `extract` with just a prompt, your output schema will look like:

```Example theme={null}
{ extraction: string }
```

### Extract with no parameters

Here is how you can call `extract` with no parameters.

<CodeGroup>
  ```typescript TypeScript theme={null}
  const pageText = await page.extract();
  ```

  ```python Python theme={null}
  pageText = await page.extract()
  ```
</CodeGroup>

Output schema:

```Example theme={null}
{ pageText: string }
```

Calling `extract` with no parameters returns a hierarchical tree representation of the root DOM. This output is not passed through an LLM. It looks something like this:

```
Accessibility Tree:
[0-2] RootWebArea: What is Stagehand? - 🤘 Stagehand
  [0-37] scrollable
    [0-118] body
      [0-241] scrollable
        [0-242] div
          [0-244] link: 🤘 Stagehand home page light logo
            [0-245] span
              [0-246] StaticText: 🤘 Stagehand
              [0-247] StaticText: home page
```
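
Since this output is plain text, you can process it with ordinary string handling. The sketch below assumes the line format seen in the sample above: two-space indentation per level, an ID in square brackets, a role, and an optional `: name` suffix. That format is inferred from the sample rather than a documented contract, and `Node` and `parse_tree` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

class Node(NamedTuple):
    depth: int            # nesting level, derived from indentation
    node_id: str          # e.g. "0-244"
    role: str             # e.g. "link", "StaticText"
    name: Optional[str]   # text after "role: ", if any

# Assumed line shape: "<indent>[<id>] <role>" with an optional ": <name>" suffix.
LINE = re.compile(r"^(?P<indent> *)\[(?P<id>[^\]]+)\] (?P<role>[^:]+?)(?:: (?P<name>.*))?$")

def parse_tree(page_text: str) -> list[Node]:
    nodes = []
    for line in page_text.splitlines():
        m = LINE.match(line)
        if m is None:
            continue  # skips the "Accessibility Tree:" header and blank lines
        nodes.append(Node(len(m["indent"]) // 2, m["id"], m["role"], m["name"]))
    return nodes

# Abbreviated sample in the same layout as the output shown above.
sample = """Accessibility Tree:
[0-2] RootWebArea: What is Stagehand?
  [0-37] scrollable
    [0-118] body
      [0-244] link: Stagehand home page
"""

nodes = parse_tree(sample)
links = [n for n in nodes if n.role == "link"]
```

Filtering by `role` this way gives a quick overview of links or text nodes without an LLM call.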

## Best practices

### Extract with Context

You can add a description to each field in your schema to give the model extra context and help it extract the data more accurately.

<CodeGroup>
  ```typescript TypeScript theme={null}
  import { z } from 'zod/v3';

  const apartments = await page.extract({
    instruction:
      "Extract ALL the apartment listings and their details, including address, price, and square feet.",
    schema: z.object({
      list_of_apartments: z.array(
        z.object({
          address: z.string().describe("the address of the apartment"),
          price: z.string().describe("the price of the apartment"),
          square_feet: z.string().describe("the square footage of the apartment"),
        }),
      ),
    }),
  });
  ```

  ```python Python theme={null}
  from pydantic import BaseModel, Field

  class Apartment(BaseModel):
      address: str = Field(..., description="the address of the apartment")
      price: str = Field(..., description="the price of the apartment")
      square_feet: str = Field(..., description="the square footage of the apartment")

  class Apartments(BaseModel):
      list_of_apartments: list[Apartment]

  apartments = await page.extract(
      "Extract ALL the apartment listings and their details as a list. For each apartment, include: the address of the apartment, the price of the apartment, and the square footage of the apartment",
      schema=Apartments
  )
  ```
</CodeGroup>

### Link Extraction

<Note>
  To extract links or URLs, define the relevant field as `z.string().url()` in TypeScript or as `HttpUrl` in Python.
</Note>

Here is how an `extract` call might look for extracting a link or URL. This also works for image links.

<CodeGroup>
  ```typescript TypeScript theme={null}
  import { z } from 'zod/v3';

  const extraction = await page.extract({
    instruction: "extract the link to the 'contact us' page",
    schema: z.object({
      link: z.string().url(), // note the usage of z.string().url() here
    }),
  });

  console.log("the link to the contact us page is: ", extraction.link);
  ```

  ```python Python theme={null}
  from pydantic import BaseModel, HttpUrl

  class Extraction(BaseModel):
      link: HttpUrl # note the usage of HttpUrl here

  extraction = await page.extract(
      "extract the link to the 'contact us' page", 
      schema=Extraction
  )

  print("the link to the contact us page is: ", extraction.link)
  ```
</CodeGroup>

<Tip>
  Inside Stagehand, extracting links works by asking the LLM to select an ID. Stagehand looks up that ID in a mapping of IDs -> URLs. When logging the LLM trace, you should expect to see IDs. The actual URLs will be included in the final `ExtractResult`.
</Tip>
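
The lookup described in the tip can be pictured roughly as follows. This is a conceptual sketch only, not Stagehand's actual implementation: `resolve_links`, `id_to_url`, `url_fields`, and the data shapes are all hypothetical.

```python
from typing import Any, Optional

def resolve_links(value: Any, id_to_url: dict, url_fields: set, key: Optional[str] = None) -> Any:
    """Replace numeric IDs in URL fields with the URLs they stand for."""
    if isinstance(value, dict):
        return {k: resolve_links(v, id_to_url, url_fields, k) for k, v in value.items()}
    if isinstance(value, list):
        # List items inherit the key of the field that holds the list.
        return [resolve_links(v, id_to_url, url_fields, key) for v in value]
    if key in url_fields and isinstance(value, int):
        return id_to_url.get(value)  # None if the ID is unknown
    return value

# Hypothetical data: what the model returned, and the ID-to-URL map built from the page.
llm_output = {"link": 7}
id_to_url = {7: "https://example.com/contact"}

final_result = resolve_links(llm_output, id_to_url, url_fields={"link"})
```

Under this picture, the LLM trace would show `7` while the final result contains the full URL, which matches what the tip tells you to expect in the logs.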

## Troubleshooting

<AccordionGroup>
  <Accordion title="Empty or partial results">
    **Problem**: `extract()` returns empty or incomplete data

    **Solutions**:

    * **Check your instruction clarity**: Make sure your instruction is specific and describes exactly what data you want to extract
    * **Verify the data exists**: Use `page.observe()` first to confirm the data is present on the page
    * **Wait for dynamic content**: If the page loads content dynamically, use `page.act("wait for the content to load")` before extracting

    **Solution: Wait for content before extracting**

    <CodeGroup>
      ```typescript TypeScript theme={null}
      import { z } from 'zod/v3';

      // Wait for content before extracting
      await page.act("wait for the product listings to load");
      const products = await page.extract({
        instruction: "extract all product names and prices",
        schema: z.object({
          products: z.array(z.object({
            name: z.string(),
            price: z.string()
          }))
        })
      });
      ```

      ```python Python theme={null}
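      # ProductList is assumed to be a pydantic model with a `products` list
      # of name/price items, defined elsewhere.
      # Wait for content before extracting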
      await page.act("wait for the product listings to load")
      products = await page.extract(
          "extract all product names and prices",
          schema=ProductList
      )
      ```
    </CodeGroup>
  </Accordion>

  <Accordion title="Schema validation errors">
    **Problem**: Getting schema validation errors or type mismatches

    **Solutions**:

    * **Use optional fields**: Make fields optional with `z.optional()` (TypeScript) or `Optional[type]` (Python) if the data might not always be present
    * **Use flexible types**: Consider using `z.string()` instead of `z.number()` for prices that might include currency symbols
    * **Add descriptions**: Use `.describe()` (TypeScript) or `Field(description="...")` (Python) to help the model understand field requirements

    **Solution: More flexible schema**

    <CodeGroup>
      ```typescript TypeScript theme={null}
      import { z } from 'zod/v3';

      const schema = z.object({
        price: z.string().describe("price including currency symbol, e.g., '$19.99'"),
        availability: z.string().optional().describe("stock status if available"),
        rating: z.number().optional()
      });
      ```

      ```python Python theme={null}
      from typing import Optional
      from pydantic import BaseModel, Field

      class FlexibleProduct(BaseModel):
          price: str = Field(description="price including currency symbol, e.g., '$19.99'")
          availability: Optional[str] = Field(default=None, description="stock status if available")
          rating: Optional[float] = None
      ```
    </CodeGroup>
  </Accordion>

  <Accordion title="Inconsistent results">
    **Problem**: Extraction results vary between runs

    **Solutions**:

    * **Be more specific in instructions**: Instead of "extract prices", use "extract the numerical price value for each item"
    * **Use context in schema descriptions**: Add field descriptions to guide the model
    * **Combine with observe**: Use `page.observe()` to understand the page structure first

    **Solution: Validate with observe first**

    <CodeGroup>
      ```typescript TypeScript theme={null}
      import { z } from 'zod/v3';

      // First observe to understand the page structure
      const elements = await page.observe("find all product listings");
      console.log("Found elements:", elements.map(e => e.description));

      // Then extract with specific targeting
      const products = await page.extract({
        instruction: "extract name and price from each product listing shown on the page",
        schema: z.object({
          products: z.array(z.object({
            name: z.string().describe("the product title or name"),
            price: z.string().describe("the price as displayed, including currency")
          }))
        })
      });
      ```

      ```python Python theme={null}
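      # ProductSchema is assumed to be a pydantic model mirroring the
      # TypeScript schema (a `products` list of name/price items), defined elsewhere.
      # First observe to understand the page structure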
      elements = await page.observe("find all product listings")
      print("Found elements:", [e.description for e in elements])

      # Then extract with specific targeting
      products = await page.extract(
          "extract name and price from each product listing shown on the page",
          schema=ProductSchema
      )
      ```
    </CodeGroup>
  </Accordion>

  <Accordion title="Performance issues">
    **Problem**: Extraction is slow or timing out

    **Solutions**:

    * **Reduce scope**: Extract smaller chunks of data in multiple calls rather than everything at once
    * **Use targeted instructions**: Be specific about which part of the page to focus on
    * **Consider pagination**: For large datasets, extract one page at a time
    * **Increase timeout**: Use the `timeoutMs` parameter (`timeout_ms` in Python) for complex extractions

    **Solution: Break down large extractions**

    <CodeGroup>
      ```typescript TypeScript theme={null}
      // ProductPageSchema is assumed to be a zod object with a `products` array, defined elsewhere.
      // Instead of extracting everything at once
      const allData = [];
      const pageNumbers = [1, 2, 3, 4, 5];

      for (const pageNum of pageNumbers) {
        await page.act(`navigate to page ${pageNum}`);
        
        const pageData = await page.extract({
          instruction: "extract product data from the current page only",
          schema: ProductPageSchema,
          timeoutMs: 60000 // 60 second timeout
        });
        
        allData.push(...pageData.products);
      }
      ```

      ```python Python theme={null}
      # ProductPageSchema is assumed to be a pydantic model with a `products` list, defined elsewhere.
      # Instead of extracting everything at once
      all_data = []
      page_numbers = [1, 2, 3, 4, 5]

      for page_num in page_numbers:
          await page.act(f"navigate to page {page_num}")
          
          page_data = await page.extract(
              "extract product data from the current page only",
              schema=ProductPageSchema,
              timeout_ms=60000  # 60 second timeout
          )
          
          all_data.extend(page_data.products)
      ```
    </CodeGroup>
  </Accordion>
</AccordionGroup>

## Next steps

<CardGroup cols={2}>
  <Card title="Act" icon="play" href="/v2/basics/act">
    Execute actions efficiently using observe results
  </Card>

  <Card title="Observe" icon="magnifying-glass" href="/v2/basics/observe">
    Analyze pages with observe()
  </Card>
</CardGroup>