content-parser

Skill file

Preview skill file↓↑
---
name: content-parser
description: |
  Extract and parse content from URLs. Triggers on: user provides a URL to extract
  content from, another skill needs to parse source material, "parse this URL",
  "extract content", "解析链接", "提取内容".
metadata:
  openclaw:
    emoji: "🔗"
    requires:
      env: ["LISTENHUB_API_KEY"]
    primaryEnv: "LISTENHUB_API_KEY"
---

## When to Use

- User provides a URL and wants to extract/read its content
- Another skill needs to parse source material from a URL before generation
- User says "parse this URL", "extract content from this link"
- User says "解析链接", "提取内容"

## When NOT to Use

- User already has text content and doesn't need URL parsing
- User wants to generate audio/video content (not content extraction)
- User wants to read a local file (use standard file reading tools)

## Purpose

Extract and normalize content from URLs across supported platforms. Returns structured data including content body, metadata, and references. Useful as a preprocessing step for content generation skills or standalone content extraction.

## Hard Constraints

- No shell scripts. Construct curl commands from the API Reference (Inlined) section below
- See § API Reference (Inlined) below for API key and headers
- See § API Reference (Inlined) below for polling, errors, and interaction patterns
- URL must be a valid HTTP(S) URL
- Always read config following `shared/config-pattern.md` before any interaction
- Never save files to `~/Downloads/` or `.listenhub/` — save to the current working directory

<HARD-GATE>
Use the AskUserQuestion tool for every multiple-choice step — do NOT print options as plain text. Ask one question at a time. Wait for the user's answer before proceeding to the next step. After collecting URL and options, confirm with the user before calling the extraction API.
</HARD-GATE>

## Step -1: API Key Check

Follow `shared/config-pattern.md` § API Key Check. If the key is missing, stop immediately.

## Step 0: Config Setup

Follow `shared/config-pattern.md` Step 0 (Zero-Question Boot).

**If file doesn't exist** — silently create with defaults and proceed:
```bash
mkdir -p ".listenhub/content-parser"
echo '{"autoDownload":true}' > ".listenhub/content-parser/config.json"
CONFIG_PATH=".listenhub/content-parser/config.json"
CONFIG=$(cat "$CONFIG_PATH")
```
**Do NOT ask any setup questions.** Proceed directly to the Interaction Flow.

**If file exists** — read config silently and proceed:
```bash
CONFIG_PATH=".listenhub/content-parser/config.json"
[ ! -f "$CONFIG_PATH" ] && CONFIG_PATH="$HOME/.listenhub/content-parser/config.json"
CONFIG=$(cat "$CONFIG_PATH")
```

### Setup Flow (user-initiated reconfigure only)

Only run when the user explicitly asks to reconfigure. Display current settings:
```
当前配置 (content-parser)：
  自动下载：{是 / 否}
```

Then ask:

1. **autoDownload**: "自动保存提取的内容到当前目录？"
   - "是（推荐）" → `autoDownload: true`
   - "否" → `autoDownload: false`

Save immediately:
```bash
NEW_CONFIG=$(echo "$CONFIG" | jq --argjson dl {true/false} '. + {"autoDownload": $dl}')
echo "$NEW_CONFIG" > "$CONFIG_PATH"
CONFIG=$(cat "$CONFIG_PATH")
```

## Interaction Flow

### Step 1: URL Input

Free text input. Ask the user:

> What URL would you like to extract content from?

### Step 2: Options (optional)

Ask if the user wants to configure extraction options:

```
Question: "Do you want to configure extraction options?"
Options:
  - "No, use defaults" — Extract with default settings
  - "Yes, configure options" — Set summarize, maxLength, or Twitter tweet count
```

If "Yes", ask follow-up questions:
- **Summarize**: "Generate a summary of the content?" (Yes/No)
- **Max Length**: "Set maximum content length?" (Free text, e.g., "5000")
- **Twitter count** (only if URL is Twitter/X profile): "How many tweets to fetch?" (1-100, default 20)

### Step 3: Confirm & Extract

Summarize:

```
Ready to extract content:

  URL: {url}
  Options: {summarize: true, maxLength: 5000, twitter.count: 50} / default

  Proceed?
```

Wait for explicit confirmation before calling the API.

## Workflow

1. **Validate URL**: Must be HTTP(S). Normalize if needed (see `references/supported-platforms.md`)
2. **Build request body**:
   ```json
   {
     "source": {
       "type": "url",
       "uri": "{url}"
     },
     "options": {
       "summarize": true/false,
       "maxLength": 5000,
       "twitter": {
         "count": 50
       }
     }
   }
   ```
   Omit `options` if user chose defaults.
3. **Submit (foreground)**: `POST /v1/content/extract` → extract `taskId`
4. Tell the user extraction is in progress
5. **Poll (background)**: Run the following **exact** bash command with `run_in_background: true` and `timeout: 300000`. Note: status field is `.data.status` (not `processStatus`), interval is 5s, values are `processing`/`completed`/`failed`:

   ```bash
   TASK_ID="<id-from-step-3>"
   for i in $(seq 1 60); do
     RESULT=$(curl -sS "https://api.marswave.ai/openapi/v1/content/extract/$TASK_ID" \
       -H "Authorization: Bearer $LISTENHUB_API_KEY" \
       -H "X-Source: skills" 2>/dev/null)
     STATUS=$(echo "$RESULT" | tr -d '\000-\037\177' | jq -r '.data.status // "processing"')
     case "$STATUS" in
       completed) echo "$RESULT"; exit 0 ;;
       failed) echo "FAILED: $RESULT" >&2; exit 1 ;;
       *) sleep 5 ;;
     esac
   done
   echo "TIMEOUT" >&2; exit 2
   ```
6. When notified, **download and present result**:

   If `autoDownload` is `true`, generate a slug from the extracted title (falling back to domain name if no title). Follow `shared/config-pattern.md` § Artifact Naming for slug generation and dedup.

   - Write `{slug}.md` to the **current directory** — full extracted content in markdown
   - Write `{slug}.json` to the **current directory** — full raw API response data

   ```bash
   SLUG="{title-slug}"  # e.g. "topology-wikipedia"
   # Dedup: check if files exist
   BASE="$SLUG"; i=2
   while [ -e "${SLUG}.md" ] || [ -e "${SLUG}.json" ]; do SLUG="${BASE}-${i}"; i=$((i+1)); done
   echo "$CONTENT_MD" > "${SLUG}.md"
   echo "$RESULT" > "${SLUG}.json"
   ```

   Present:
   ```
   内容提取完成！

   来源：{url}
   标题：{metadata.title}
   长度：~{character count} 字符
   消耗积分：{credits}

   已保存到当前目录：
     {slug}.md
     {slug}.json
   ```

7. Show a preview of the extracted content (first ~500 chars)
8. Offer to use content in another skill (e.g. `/podcast`, `/tts`)

**Estimated time**: 10-30 seconds depending on content size and platform.

## API Reference (Inlined)

### Authentication

**Environment variable**: `LISTENHUB_API_KEY` (format: `lh_sk_...`)

Store in `~/.zshrc` (macOS) or `~/.bashrc` (Linux):

```bash
export LISTENHUB_API_KEY="lh_sk_..."
```

**How to obtain**: Visit https://listenhub.ai/settings/api-keys (Pro plan required).

**Base URL**: `https://api.marswave.ai/openapi/v1`

**Required headers** (every request):

```
Authorization: Bearer $LISTENHUB_API_KEY
Content-Type: application/json
X-Source: skills
```

The `X-Source: skills` header identifies requests as coming from Claude Code skills (CLI tool).

**curl template:**

```bash
curl -sS -X POST "https://api.marswave.ai/openapi/v1/{endpoint}" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{ ... }'
```

For GET requests, omit `-d` and change `-X POST` to `-X GET`.

**Security notes:**
- Never log or display full API keys in output
- API keys are transmitted via HTTPS only
- Do not pass sensitive or confidential information as content input — it is sent to external APIs for processing

---

### POST /v1/content/extract

Create a content extraction task for a URL. Returns a `taskId` for polling.

**Request body:**

| Field | Required | Type | Description |
|-------|----------|------|-------------|
| source | **Yes** | object | Source to extract from |
| source.type | **Yes** | string | Must be `"url"` |
| source.uri | **Yes** | string | Valid HTTP(S) URL to extract content from |
| options | No | object | Extraction options |
| options.summarize | No | boolean | Whether to generate a summary |
| options.maxLength | No | integer | Maximum content length |
| options.twitter | No | object | Twitter/X specific options |
| options.twitter.count | No | integer | Number of tweets to fetch (1-100, default 20) |

**Response:**

```json
{
  "code": 0,
  "message": "success",
  "data": {
    "taskId": "69a7dac700cf95938f86d9bb"
  }
}
```

**Error codes:**

| Code | Meaning |
|------|---------|
| 29003 | Validation error (`"source.uri" is required`, `"source.uri" must be a valid uri`) |
| 21007 | Invalid API key |

---

### GET /v1/content/extract/{taskId}

Get extraction task status and results.

**Path params:**

| Param | Type | Description |
|-------|------|-------------|
| taskId | string | 24-char hex task ID |

**Response states:**

- **processing** — Task is still running
- **completed** — Extraction finished, data available
- **failed** — Extraction failed, check `failCode` and `message`

**Response (processing):**

```json
{
  "code": 0,
  "message": "success",
  "data": {
    "taskId": "69a7dac700cf95938f86d9bb",
    "status": "processing",
    "createdAt": "2025-04-09T12:00:00Z",
    "data": null,
    "credits": 0,
    "failCode": null,
    "message": null
  }
}
```

**Response (completed):**

```json
{
  "code": 0,
  "message": "success",
  "data": {
    "taskId": "69a7dac700cf95938f86d9bb",
    "status": "completed",
    "createdAt": "2025-04-09T12:00:00Z",
    "data": {
      "content": "Extracted text content...",
      "metadata": {
        "title": "Article Title",
        "author": "Author Name",
        "publishedAt": "2025-04-01T08:00:00Z"
      },
      "references": [
        "https://example.com/related-article"
      ]
    },
    "credits": 5,
    "failCode": null,
    "message": null
  }
}
```

**Response (failed):**

```json
{
  "code": 0,
  "message": "success",
  "data": {
    "taskId": "69a7dac700cf95938f86d9bb",
    "status": "failed",
    "createdAt": "2025-04-09T12:00:00Z",
    "data": null,
    "credits": 0,
    "failCode": "EXTRACT_FAILED",
    "message": "Unable to extract content from the provided URL"
  }
}
```

**Key fields:**

| Field | Type | Description |
|-------|------|-------------|
| status | string | `processing`, `completed`, or `failed` |
| data.data.content | string | Extracted text content |
| data.data.metadata | object | Page metadata (title, author, publishedAt) |
| data.data.references | array | Referenced URLs (array of strings) |
| credits | integer | Credits consumed |
| failCode | string | Error code (null on success) |
| message | string | Error message (null on success) |

**Error codes:**

| Code | Meaning |
|------|---------|
| 29003 | Invalid taskId format |
| 25002 | Task not found |

---

### Polling Pattern

5-second interval, 60 polls max. Run with `run_in_background: true` and `timeout: 300000`.

**Two-step pattern:**

1. **Submit (foreground)**: POST the creation request, extract `taskId` from the response.
2. **Poll (background)**: Run the polling loop with `run_in_background: true`. You will be notified automatically when it completes.

The exact polling bash command is already specified in the Workflow section (Step 5).

---

### Error Handling

**HTTP status codes:**

| Code | Meaning | Action |
|------|---------|--------|
| 200 | Success | Parse response body |
| 400 | Bad request | Check parameters |
| 401 | Invalid API key | Re-check `LISTENHUB_API_KEY` |
| 402 | Insufficient credits | Inform user to recharge |
| 403 | Forbidden | No permission for this resource |
| 429 | Rate limited | Exponential backoff, retry after delay |
| 500/502/503/504 | Server error | Retry up to 3 times |

**Retry strategy:**

- **429 rate limit**: Wait 15 seconds, then retry (exponential backoff)
- **5xx server errors**: Retry up to 3 times with 5-second intervals
- **Network errors**: Retry up to 3 times

**Application error codes:**

| Code | Meaning |
|------|---------|
| 21007 | Invalid user API key |
| 25429 | Rate limited (application-level) |

## Example

**User**: "Parse this article: https://en.wikipedia.org/wiki/Topology"

**Agent workflow**:
1. URL: `https://en.wikipedia.org/wiki/Topology`
2. Options: defaults (omit options)
3. Submit extraction

```bash
curl -sS -X POST "https://api.marswave.ai/openapi/v1/content/extract" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{
    "source": {
      "type": "url",
      "uri": "https://en.wikipedia.org/wiki/Topology"
    }
  }'
```

4. Poll until complete:

```bash
curl -sS "https://api.marswave.ai/openapi/v1/content/extract/69a7dac700cf95938f86d9bb" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "X-Source: skills"
```

5. Present extracted content preview and offer next actions.

---

**User**: "Extract recent tweets from @elonmusk, get 50 tweets"

**Agent workflow**:
1. URL: `https://x.com/elonmusk`
2. Options: `{"twitter": {"count": 50}}`
3. Submit extraction

```bash
curl -sS -X POST "https://api.marswave.ai/openapi/v1/content/extract" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{
    "source": {
      "type": "url",
      "uri": "https://x.com/elonmusk"
    },
    "options": {
      "twitter": {
        "count": 50
      }
    }
  }'
```

4. Poll until complete, present results.
Source

Creator's repository · marswaveai/skills
View on GitHub ↗
Security

Security checks in progress
Results will appear here once audits complete
What this skill can do
Reads your filesConnects to the internetRuns code on your machine
Checked by 3 independent security firms
Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub
Does it sneak in hidden code?Not yet checkedPending · Socket
Does it have known bugs?Not yet checkedPending · Snyk