---
name: chrome-automation
description: Automate Chrome browser tasks using agent-browser CLI. Navigate pages, fill forms, click buttons, take screenshots, extract data, and replay recorded workflows — all inside the user's real Chrome session.
---

# Skill: Chrome Automation (agent-browser)

Automate browser tasks in the user's real Chrome session via the [agent-browser](https://github.com/vercel-labs/agent-browser) CLI.

> **Prerequisite**: agent-browser must be installed and Chrome must have remote debugging enabled. See `references/agent-browser-setup.md` if unsure.

---

## Core Principle: Reuse the User's Existing Chrome

This skill operates on a **single Chrome process** — the user's real browser. There is no session management, no separate profiles, no launching a fresh Playwright browser.

### Always Start by Listing Tabs

Before opening any new page, **always list existing tabs first**:

```bash
agent-browser --auto-connect tab list
```

This returns all open tabs with their index numbers, titles, and URLs. Check if the page you need is already open:

- **If the target page is already open** → switch to that tab directly instead of opening a new one. The user likely has it open because they are already logged in and the page is in the right state.
  ```bash
  agent-browser --auto-connect tab <index>
  ```
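- **If the target page is NOT open** → navigate to it with `open`, which loads the URL in the current tab. If the current tab holds something the user may still need, run `tab new` first.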
  ```bash
  agent-browser --auto-connect open <url>
  ```
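
The list-then-reuse rule can be wrapped in a small shell function. This is a sketch, not part of agent-browser: `ensure_tab` and the `AGENT_BROWSER` override (useful for substituting a stub when testing) are hypothetical names. The parsing assumes, without confirmation from this document, that each `tab list` line contains the tab's URL and that the first number on that line is the tab index; check the real output before relying on it.

```bash
# ensure_tab <url-fragment> <url>
# Switch to the first tab whose `tab list` line contains <url-fragment>;
# otherwise navigate the current tab to <url>.
# ASSUMPTION: the first number on a matching line is the tab index.
# AGENT_BROWSER is a hypothetical override for substituting a stub.
ensure_tab() {
  local ab="${AGENT_BROWSER:-agent-browser}" index
  index=$("$ab" --auto-connect tab list | grep -F -- "$1" | head -n 1 | grep -oE '[0-9]+' | head -n 1)
  if [ -n "$index" ]; then
    "$ab" --auto-connect tab "$index"
  else
    "$ab" --auto-connect open "$2"
  fi
}
```

Usage: `ensure_tab "linkedin.com/feed" "https://www.linkedin.com/feed/"`.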

### Why This Matters

- The user's Chrome has their cookies, login sessions, and browser state
- Opening a fresh page when the target is already open wastes time, and navigating a tab away discards whatever state it held (unsaved drafts, filters, scroll position)
- Many marketing platforms (social media dashboards, ad managers, CMS tools) require login — reusing an existing logged-in tab avoids re-authentication

---

## Connection

Always use `--auto-connect` to connect to the user's running Chrome instance:

```bash
agent-browser --auto-connect <command>
```

This auto-discovers Chrome with remote debugging enabled. If connection fails, guide the user through enabling remote debugging (see `references/agent-browser-setup.md`).
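
A quick connectivity check before starting a task can be sketched as below, under the assumption that agent-browser exits with a non-zero status when `--auto-connect` cannot find Chrome. `check_chrome` and `AGENT_BROWSER` are hypothetical names; `tab list` serves only as a cheap probe.

```bash
# check_chrome: fail fast if --auto-connect cannot reach Chrome.
# ASSUMPTION: agent-browser exits non-zero when the connection fails.
# AGENT_BROWSER is a hypothetical override for substituting a stub.
check_chrome() {
  if "${AGENT_BROWSER:-agent-browser}" --auto-connect tab list >/dev/null 2>&1; then
    echo "connected"
  else
    echo "cannot connect: is Chrome running with remote debugging enabled?" \
      "See references/agent-browser-setup.md" >&2
    return 1
  fi
}
```

Usage: `check_chrome || exit 1` at the top of a script.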

---

## Common Workflows

### 1. Navigate and Interact

```bash
# List tabs to find existing pages
agent-browser --auto-connect tab list

# Switch to an existing tab (if found)
agent-browser --auto-connect tab <index>

# Or open a new page
agent-browser --auto-connect open https://example.com
agent-browser --auto-connect wait --load networkidle

# Take a snapshot to see interactive elements
agent-browser --auto-connect snapshot -i

# Click, fill, etc.
agent-browser --auto-connect click @e3
agent-browser --auto-connect fill @e5 "some text"
```

### 2. Extract Data from a Page

```bash
# Get all text content
agent-browser --auto-connect get text body

# Take a screenshot for visual inspection
agent-browser --auto-connect screenshot

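# Execute JavaScript for structured data (here: every table row as an array of cell texts)
agent-browser --auto-connect eval "JSON.stringify([...document.querySelectorAll('table tr')].map(r => [...r.cells].map(c => c.innerText.trim())))"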
```
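
To make the extracted rows easy to post-process, the `eval` output can be piped through `jq`. A sketch: `table_rows_tsv` and `AGENT_BROWSER` are hypothetical names, and because this document does not specify how `eval` prints a returned string, the `jq` filter accepts either raw JSON or a JSON-encoded string.

```bash
# table_rows_tsv [css-selector]: print each table row as one TSV line.
# ASSUMPTION: `eval` prints the result either as raw JSON or as a JSON
# string literal; the jq filter handles both. Requires jq.
table_rows_tsv() {
  local sel="${1:-table}"   # a single CSS selector; must not contain single quotes
  "${AGENT_BROWSER:-agent-browser}" --auto-connect eval \
    "JSON.stringify([...document.querySelectorAll('$sel tr')].map(r => [...r.cells].map(c => c.innerText.trim())))" |
    jq -r 'if type == "string" then fromjson else . end | .[] | @tsv'
}
```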

### 3. Replay a Chrome DevTools Recording

The user may provide a recording exported from Chrome DevTools Recorder (JSON, Puppeteer JS, or @puppeteer/replay JS format). See [Replaying Recordings](#replaying-recordings) below.

---

## Step-by-Step Interaction Guide

### Taking Snapshots

Use `snapshot -i` to see all interactive elements with refs (`@e1`, `@e2`, ...):

```bash
agent-browser --auto-connect snapshot -i
```

The output lists each interactive element with its role, text, and ref. Use these refs for subsequent actions.
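
When a snapshot is long, a small filter can pull out the ref for a known label. This sketch assumes each snapshot line contains the element's accessible name and its ref, written either as `[ref=eN]` or `@eN`; the exact format is not specified here, so verify it against real output. `ref_for` and `AGENT_BROWSER` are hypothetical names.

```bash
# ref_for <text>: print the first ref (as @eN) on a snapshot line containing <text>.
# ASSUMPTION: refs appear as "ref=eN" or "@eN" on the same line as the element name.
ref_for() {
  "${AGENT_BROWSER:-agent-browser}" --auto-connect snapshot -i |
    grep -F -- "$1" |
    grep -oE '(ref=|@)e[0-9]+' |
    head -n 1 |
    sed -E 's/^(ref=|@)/@/'
}
```

Usage: `ref=$(ref_for '"Send"') && [ -n "$ref" ] && agent-browser --auto-connect click "$ref"`. Matching on the quoted name avoids also hitting labels such as "Send later".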

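### Step Type Mapping

Every command below runs as `agent-browser --auto-connect <command>`; after the first row the prefix is omitted for brevity.

| Action | Command |
|--------|---------|
| Navigate | `agent-browser --auto-connect open <url>` (optionally followed by `wait --load networkidle`; some sites, such as Reddit, never reach networkidle, so skip the wait if `open` already shows the page title) |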
| Click | `snapshot -i` → find ref → `click @eN` |
| Fill standard input | `click @eN` → `fill @eN "text"` |
| Fill rich text editor | `click @eN` → `keyboard inserttext "text"` |
| Press key | `press <key>` (Enter, Tab, Escape, etc.) |
| Scroll | `scroll down <amount>` or `scroll up <amount>` |
| Wait for element | `wait @eN` or `wait "<css-selector>"` |
| Screenshot | `screenshot` or `screenshot --annotate` |
| Get page text | `get text body` |
| Get current URL | `get url` |
| Run JavaScript | `eval <js>` |

### How to Distinguish Input Types

- **Standard input/textarea** → use `fill`
- **Contenteditable div / rich text editor** (LinkedIn message box, Gmail compose, Slack, CMS editors) → click/focus first, then use `keyboard inserttext`
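
The choice between `fill` and `keyboard inserttext` can be captured in one helper. A minimal sketch; `put_text` and `AGENT_BROWSER` are hypothetical names, and the caller still decides (from the snapshot or a screenshot) whether the target is a rich text editor.

```bash
# put_text <ref> <text> [rich]
# Standard input/textarea: fill. Contenteditable ("rich"): focus, then insert text.
put_text() {
  local ab="${AGENT_BROWSER:-agent-browser}"
  if [ "${3:-}" = "rich" ]; then
    "$ab" --auto-connect click "$1" &&
      "$ab" --auto-connect keyboard inserttext "$2"
  else
    "$ab" --auto-connect fill "$1" "$2"
  fi
}
```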

### Ref Lifecycle

Refs (`@e1`, `@e2`, ...) are **invalidated when the page changes**. Always re-snapshot after:
- Clicking links or buttons that trigger navigation
- Submitting forms
- Triggering dynamic content loads (AJAX, SPA navigation)
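
Since refs go stale after page changes, pairing each click with a fresh snapshot keeps the next step working from current refs. A sketch with hypothetical names (`click_then_snapshot`, `AGENT_BROWSER`); the optional pause uses `wait <ms>` from the command reference.

```bash
# click_then_snapshot <ref> [wait-ms]
# Click, optionally pause, then re-snapshot so later steps use fresh refs.
click_then_snapshot() {
  local ab="${AGENT_BROWSER:-agent-browser}"
  "$ab" --auto-connect click "$1" || return   # propagate the click's exit status
  if [ -n "${2:-}" ]; then
    "$ab" --auto-connect wait "$2"            # give navigation/AJAX time to start
  fi
  "$ab" --auto-connect snapshot -i
}
```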

### Verification

After each significant action, verify the result:
```bash
agent-browser --auto-connect snapshot -i   # check interactive state
agent-browser --auto-connect screenshot     # visual verification
```
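
Alongside snapshots and screenshots, checking the URL is a cheap way to confirm that a navigation or submission landed where expected. A sketch assuming `get url` prints only the URL; `expect_url` and `AGENT_BROWSER` are hypothetical names.

```bash
# expect_url <fragment>: succeed if the current URL contains <fragment>.
expect_url() {
  local url
  url=$("${AGENT_BROWSER:-agent-browser}" --auto-connect get url) || return
  case "$url" in
    *"$1"*) return 0 ;;
    *) echo "expected URL containing '$1', got: $url" >&2; return 1 ;;
  esac
}
```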

---

## Replaying Recordings

### Accepted Formats

1. **JSON** (recommended) — structured, can be read progressively:
   ```bash
   # Count steps
   jq '.steps | length' recording.json

   # Read first 5 steps
   jq '.steps[0:5]' recording.json
   ```

2. **@puppeteer/replay JS** (`import { createRunner }`)
3. **Puppeteer JS** (`require('puppeteer')`, `page.goto`, `Locator.race`)
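
Beyond counting steps, `jq` can condense a JSON recording into one line per step, which helps when summarizing what a recording does. This sketch assumes the DevTools Recorder JSON layout: a `steps` array whose entries carry `type`, optional `selectors` (an array of selector arrays), and fields such as `url`, `value`, or `key`. `recording_summary` is a hypothetical name.

```bash
# recording_summary <recording.json>
# One TSV line per step: index, type, target, value.
# target = first "aria/..." selector if any, else the first selector;
# value  = url (navigate), value (change) or key (keyDown/keyUp).
recording_summary() {
  jq -r '
    .steps
    | to_entries[]
    | .key as $i
    | .value as $s
    | ([$s.selectors[]?[]? | select(startswith("aria/"))] | .[0]) as $aria
    | [$i, $s.type, ($aria // ($s.selectors[0][0]?) // ""), ($s.url // $s.value // $s.key // "")]
    | @tsv
  ' "$1"
}
```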

### How to Replay

1. **Parse the recording** — understand the full intent before acting. Summarize what the recording does.
2. **List tabs first** — check if the target page is already open.
3. **Navigate** — execute `navigate` steps, reusing existing tabs when possible.
4. **For each interaction step**:
   - Take a snapshot (`snapshot -i`) to see current interactive elements
   - Match the recording's `aria/...` selectors against the snapshot
   - Fall back to `text/...`, then CSS class hints, then screenshot
   - **Do not rely on Ember-generated IDs (e.g. `#ember123`), other numeric IDs, or exact XPaths**; they are auto-generated or position-dependent and often change between page loads
5. **Verify after each step** — snapshot or screenshot to confirm

---

## Iframe-Heavy Sites

`snapshot -i` operates on the main frame only and **cannot penetrate iframes**. Sites like LinkedIn, Gmail, and embedded editors render content inside iframes.

### Detecting Iframe Issues

- `snapshot -i` returns unexpectedly short or empty results
- Recording references elements not appearing in snapshot output
- `get text body` content doesn't match what a screenshot shows

### Workarounds

1. **Use `eval` to access iframe content**:
   ```bash
   agent-browser --auto-connect eval --stdin <<'EVALEOF'
   (() => {
     const frame = document.querySelector('iframe[data-testid="interop-iframe"]');
     const doc = frame?.contentDocument; // null for cross-origin iframes
     const btn = doc?.querySelector('button[aria-label="Send"]');
     if (!btn) return 'Send button not found';
     btn.click();
     return 'clicked';
   })();
   EVALEOF
   ```
   Note: Only works for same-origin iframes.

2. **Use `keyboard` for blind input**: If the iframe element has focus, `keyboard inserttext "..."` sends text regardless of frame boundaries.

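3. **Use `get text body` with care**: it reads the main frame only (which is why its output can disagree with a screenshot). To read a same-origin iframe's text, use `eval` with `frame.contentDocument.body.innerText`.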

4. **Use `screenshot`** for visual verification when snapshot is unreliable.
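
Before trying the `eval` workaround, it helps to know which iframes exist and whether they are reachable. A sketch using the same `eval --stdin` form as above; `list_iframes` and `AGENT_BROWSER` are hypothetical names. A cross-origin frame reports `sameOrigin: false` because its `contentDocument` is `null`.

```bash
# list_iframes: print index, src, data-testid and same-origin status of each iframe.
list_iframes() {
  "${AGENT_BROWSER:-agent-browser}" --auto-connect eval --stdin <<'EVALEOF'
JSON.stringify([...document.querySelectorAll('iframe')].map((f, i) => ({
  index: i,
  src: f.src,
  testid: f.dataset.testid || null,
  sameOrigin: f.contentDocument !== null,
})));
EVALEOF
}
```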

### When to Ask the User

If workarounds fail after 2 attempts on the same step, pause and explain:
- The page uses iframes that cannot be accessed via snapshot
- Which element you need and what you expected
- Ask the user to perform that step manually, then continue

---

## Handling Unexpected Situations

### Handle Automatically (do not stop):
- Popups or banners → dismiss them (`find text "Dismiss" click` or `find text "Close" click`)
- Cookie consent dialogs → accept or dismiss
- Tooltip overlays → close them first
- Element not in snapshot → try `find text "..." click`, or scroll to reveal with `scroll down 300`
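
Dismissing popups can be scripted as an ordered list of labels to try. A sketch assuming `find text ... click` exits non-zero when nothing matches (or when several elements match, per Known Limitations); `dismiss_overlays`, `AGENT_BROWSER`, and the label list are hypothetical choices.

```bash
# dismiss_overlays: try common dismiss labels in order; print the one that worked.
dismiss_overlays() {
  local label
  for label in "Dismiss" "Close" "Accept"; do
    if "${AGENT_BROWSER:-agent-browser}" --auto-connect find text "$label" click >/dev/null 2>&1; then
      echo "$label"
      return 0
    fi
  done
  return 1   # nothing matched; fall back to snapshot -i
}
```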

### Pause and Ask the User:
- Login / authentication is required
- A CAPTCHA appears
- Page structure is completely different from expected
- A destructive action is about to happen (deleting data, sending real content) — confirm first
- Still stuck after 2 attempts on the same step
- All iframe workarounds have failed

When pausing, explain clearly: what step you are on, what you expected, and what you see.
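
The "2 attempts, then ask" rule can be enforced mechanically for steps that fail transiently (for example, an element that has not rendered yet). A sketch with hypothetical names (`try_step`, `AGENT_BROWSER`); it waits 1 second before the second attempt. It does not help with stale refs: if the page changed, re-snapshot and use the new ref instead of retrying.

```bash
# try_step <description> <agent-browser args...>
# Run a step up to 2 times, then print a pause message for the user.
try_step() {
  local ab="${AGENT_BROWSER:-agent-browser}" desc="$1" attempt
  shift
  for attempt in 1 2; do
    if [ "$attempt" -eq 2 ]; then
      "$ab" --auto-connect wait 1000   # brief pause before the retry
    fi
    if "$ab" --auto-connect "$@"; then
      return 0
    fi
  done
  echo "Paused at step: $desc (failed twice: $*). Please check the page or do this step manually." >&2
  return 1
}
```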

---

## Key Commands Reference

| Command | Description |
|---------|-------------|
| `tab list` | List all open tabs with index, title, and URL |
| `tab <index>` | Switch to an existing tab by index |
| `tab new` | Open a new empty tab |
| `tab close` | Close the current tab |
| `open <url>` | Navigate to URL |
| `snapshot -i` | List interactive elements with refs |
| `click @eN` | Click element by ref |
| `fill @eN "text"` | Clear and fill standard input/textarea |
| `type @eN "text"` | Type without clearing |
| `keyboard inserttext "text"` | Insert text (best for contenteditable) |
| `press <key>` | Press keyboard key |
| `scroll down/up <amount>` | Scroll page in pixels |
| `wait @eN` | Wait for element to appear |
| `wait --load networkidle` | Wait for network to settle |
| `wait <ms>` | Wait for a duration |
| `screenshot [path]` | Take screenshot |
| `screenshot --annotate` | Screenshot with numbered labels |
| `eval <js>` | Execute JavaScript in page |
| `get text body` | Get all text content |
| `get url` | Get current URL |
| `set viewport <w> <h>` | Set viewport size |
| `find text "..." click` | Semantic find and click |
| `close` | Close browser session |

---

## Known Limitations

1. **Iframe blindness**: `snapshot -i` cannot see inside iframes. See [Iframe-Heavy Sites](#iframe-heavy-sites).
2. **`find text` strict mode**: Fails when multiple elements match. Use `snapshot -i` to locate the specific ref instead.
3. **`fill` vs contenteditable**: `fill` is reliable only on `<input>` and `<textarea>`; rich text editors (e.g. DraftJS on X) often ignore it. Use `keyboard inserttext` there.
4. **`eval` is main-frame only**: To interact with iframe content, traverse via `document.querySelector('iframe').contentDocument...`

---

## Multi-Platform Operations

When the user requests an action across **multiple platforms** (e.g., "publish this article to Dev.to, LinkedIn, and X"), do NOT attempt all platforms in a single conversation. Instead, launch **sequential subagents/workers**, one per platform. Selecting this skill is explicit permission to use subagents/workers for this multi-platform workflow.

### Why Subagents

Each platform operation consumes ~25-40K tokens (reference file + snapshots + interactions). Running 3-5 platforms in one context risks hitting the context limit and degrades accuracy on the later platforms. Each subagent/worker gets its own fresh context window.

### How to Execute

1. **Prepare the content** — confirm the post text, title, tags, and any platform-specific adaptations with the user.
2. **For each platform**, launch a subagent/worker with a prompt that includes:
   - The full content to publish
   - Instructions to read the relevant reference file (e.g., `Read /path/to/skills/chrome-automation/references/x.md`)
   - Instructions to read the agent-browser skill file for command reference
   - The specific task (post, comment, reply, etc.)
   - Any platform-specific instructions (e.g., "use these hashtags on LinkedIn")
3. **Run subagents/workers sequentially** (one at a time), because they all share the same Chrome browser via `--auto-connect`. Parallel subagents/workers would cause tab conflicts.
4. **After each subagent/worker completes**, report the result to the user before launching the next one.

### Prompt Template for Subagents

```
You are automating a browser task on [PLATFORM].

First, read these files for context:
- /absolute/path/to/skills/chrome-automation/references/[platform].md
- The installed agent-browser skill file, if available (agent-browser command reference)

Then connect to the user's Chrome browser using `agent-browser --auto-connect` and perform the following task:

[TASK DESCRIPTION]

Content to publish:
[CONTENT]

Important:
- Always list tabs first (`tab list`) and reuse existing logged-in tabs
- Re-snapshot after every navigation or action
- Confirm with the user before submitting/publishing (destructive action)
- If login is required or a CAPTCHA appears, stop and explain
```
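
When launching several subagents in sequence, filling the template by hand is error-prone. A sketch of a shell helper that prints the prompt for one platform; `subagent_prompt` and its argument order are hypothetical, and how the prompt is handed to a subagent or worker depends on the agent runtime, so that part is left out. Task and content are inserted verbatim: `$` and backticks in them are not expanded.

```bash
# subagent_prompt <skill-dir> <platform-name> <reference-slug> <task> <content>
subagent_prompt() {
  cat <<EOF
You are automating a browser task on $2.

First, read these files for context:
- $1/references/$3.md
- The installed agent-browser skill file, if available (agent-browser command reference)

Then connect to the user's Chrome browser using \`agent-browser --auto-connect\` and perform the following task:

$4

Content to publish:
$5

Important:
- Always list tabs first (\`tab list\`) and reuse existing logged-in tabs
- Re-snapshot after every navigation or action
- Confirm with the user before submitting/publishing (destructive action)
- If login is required or a CAPTCHA appears, stop and explain
EOF
}
```

Usage (the skill path is hypothetical): `subagent_prompt "$HOME/skills/chrome-automation" "Dev.to" devto "Publish the article below as a new post" "$(cat article.md)"`.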

### When NOT to Use Subagents

- **Single platform** — just do it directly in the current conversation.
- **Read-only tasks** (browsing, searching, extracting data) — context usage is lighter; a single conversation can handle 2-3 platforms.

---

## Platform References

When automating tasks on specific platforms, consult the relevant reference document for page structure details, common operations, and known quirks:

| Platform | Reference | Key Notes |
|----------|-----------|-----------|
| Reddit | [`references/reddit.md`](./references/reddit.md) | Custom `faceplate-*` components; `networkidle` never reached; unlabeled comment textbox; `find text` fails due to duplicate elements |
| X (Twitter) | [`references/x.md`](./references/x.md) | `open` often times out (use `tab list` to reuse existing tabs); click **timestamp** for post detail (not username); DraftJS contenteditable input (`data-testid="tweetTextarea_0"`); avoid `networkidle` |
| LinkedIn | [`references/linkedin.md`](./references/linkedin.md) | Ember.js SPA; Enter submits comments (use Shift+Enter for newlines); comment box and compose box share the same label; avoid `networkidle`; messaging overlay may block content |
| Dev.to | [`references/devto.md`](./references/devto.md) | Fast server-rendered HTML (Forem/Rails); standard `<textarea>` for comments/posts (Markdown); 5 reaction types; Algolia-powered search; `networkidle` works normally |
| Hacker News | [`references/hackernews.md`](./references/hackernews.md) | Minimal plain HTML; all form fields are unlabeled; `link "reply"` navigates to separate page; `networkidle` works instantly; rate limiting on posts/comments |

---

> For installation and Chrome setup instructions, see [`references/agent-browser-setup.md`](./references/agent-browser-setup.md).

Source: Creator's repository · zc277584121/marketing-skills
