Disco

Integration Options

- MCP server: remote server at https://disco.leap-labs.com/mcp, no install required. Best for datasets at a URL.
- Python SDK: pip install discovery-engine-api. Use this for local files of any size. It runs on your machine and streams files directly, with no base64 encoding and no size limits.

Quick rule: if the data is at a URL, use file_url in discovery_upload. If it's a local file, use the Python SDK — or if Python isn't available, upload directly via the presign API and pass the result to discovery_analyze. Don't use file_content (base64) unless the file is already in memory and tiny.
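
The quick rule can be written down as a small decision helper. This is only a sketch: the function name and the returned labels are hypothetical and are not part of the Disco SDK or MCP server.

```python
from urllib.parse import urlparse


def choose_upload_path(source: str, python_available: bool = True) -> str:
    """Return the ingestion route suggested by the quick rule (labels are illustrative)."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        # Remote data: let the server fetch it via file_url in discovery_upload.
        return "file_url"
    if python_available:
        # Local file and Python is available: use the Python SDK.
        return "python_sdk"
    # Local file without Python: presigned upload, then discovery_analyze.
    return "presign_api"
```

For example, `choose_upload_path("/home/user/data.csv", python_available=False)` returns `"presign_api"`. The base64 `file_content` route is deliberately left out, since it is described as a last resort.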

MCP Server

Add to your MCP config:

CODEBLOCK0

MCP Tools

Discovery workflow

Tool	Purpose
discovery_upload	Upload a dataset. Supports URL download (`file_url`), local path (`file_path`), or base64 content (`file_content`). Returns a `file_ref` for use with `discovery_analyze`.
discovery_analyze	Submit a dataset for analysis using a file_ref from discovery_upload. Returns a run_id.
discovery_status	Poll a running analysis by run_id.
discovery_get_results	Fetch completed results: patterns, p-values, citations, feature importance.
discovery_estimate	Estimate the credit cost before committing to a run.

Account management

Tool	Purpose
discovery_signup	Start account creation. Sends a verification code to the email address.
discovery_signup_verify	Complete signup with the emailed code. Returns a disco_ API key.

MCP Workflow

Analyses take 3–15 minutes. Do not block — submit, continue other work, poll for completion.

CODEBLOCK1
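
A minimal sketch of the polling step, assuming the status tool returns a dict with a `status` field that becomes `"completed"`. The `get_status` callable is a stand-in you would wire to discovery_status yourself; the interval, the timeout, and any status value other than `"completed"` are assumptions.

```python
import time
from typing import Callable, Dict


def wait_for_run(
    get_status: Callable[[str], Dict],
    run_id: str,
    poll_seconds: float = 30.0,
    timeout_seconds: float = 20 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll get_status(run_id) until it reports 'completed' or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status(run_id)
        if status.get("status") == "completed":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"run {run_id} not finished after {waited:.0f}s")
        # In an agent, this pause is where other work would happen.
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter lets an agent substitute its own scheduling instead of blocking on `time.sleep`.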

Getting Data In

Choose the right path for your situation:

Situation	Best approach
Data is at an http/https URL	file_url in discovery_upload
Local file, Python available	Python SDK

Data at a URL:

CODEBLOCK2

The server downloads the file directly — nothing passes through the agent or the model context. Works with public URLs, S3 presigned URLs, or any accessible http/https link.

Local file — Python SDK (recommended for any local file):

CODEBLOCK3

Handles upload, polling, and results in one call. No size limit. See the Python SDK section for full documentation.

Local file — MCP server running locally (cloned from GitHub, stdio transport):

If you've cloned the repo and are running server.py locally, the process can read your filesystem directly:

CODEBLOCK4

Reads the file locally and streams it directly to cloud storage — nothing passes through the model context. No size limit. file_path is silently ignored by the hosted server at disco.leap-labs.com/mcp — it only works with a locally-running server.

Local file — hosted MCP, direct upload (works from any language):

If you're using the hosted MCP server and Python isn't available, you can upload directly via the REST API in three steps, then pass the result to discovery_analyze as normal.

CODEBLOCK5

Pass the finalize response directly to discovery_analyze as file_ref. No size limit.
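
The three steps can be sketched in Python by building the requests without sending them. The endpoint paths and JSON field names are the ones shown in the curl walkthrough elsewhere in this document; the helper names are hypothetical, and a real upload of a large file would stream the body rather than hold it in memory.

```python
import json
from urllib.request import Request

BASE = "https://disco.leap-labs.com/api/data/upload"


def presign_request(api_key: str, file_name: str, content_type: str, file_size: int) -> Request:
    """Step 1: ask for a presigned upload URL."""
    body = json.dumps({"fileName": file_name, "contentType": content_type, "fileSize": file_size})
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    return Request(f"{BASE}/presign", data=body.encode(), headers=headers, method="POST")


def put_request(upload_url: str, content_type: str, payload: bytes) -> Request:
    """Step 2: PUT the bytes to storage. The URL is presigned, so no auth header is added."""
    return Request(upload_url, data=payload, headers={"Content-Type": content_type}, method="PUT")


def finalize_request(api_key: str, presign_response: dict) -> Request:
    """Step 3: finalize using the key and uploadToken returned by step 1."""
    body = json.dumps({"key": presign_response["key"], "uploadToken": presign_response["uploadToken"]})
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    return Request(f"{BASE}/finalize", data=body.encode(), headers=headers, method="POST")
```

Each request can be sent with `urllib.request.urlopen`; the JSON body of the finalize response is what gets passed on as file_ref.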

Small file — direct upload (single HTTP call, simpler than presign):

CODEBLOCK6

Pass the response directly to discovery_analyze as file_ref. Simpler than the 3-step presign flow but the entire file must fit in the request body. For large files, use presigned uploads or the Python SDK.

Last resort — tiny file already in memory:

Only use this if the file is already loaded into memory and none of the above options apply. The base64-encoded content passes through the model's context window, so this only works for very small files.

CODEBLOCK7

CODEBLOCK8
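
Because the encoded content travels through the model's context window, it can help to refuse large files before encoding. The sketch below assumes a cutoff of 50,000 bytes; that number is purely illustrative, since the text above only says "very small".

```python
import base64
import os

MAX_INLINE_BYTES = 50_000  # hypothetical cutoff, not a documented limit


def encode_if_tiny(path: str, limit: int = MAX_INLINE_BYTES) -> str:
    """Return base64 text for file_content, or raise if the file exceeds the limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; use file_url, the SDK, or a presigned upload")
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("ascii")
```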

MCP Parameters

discovery_upload:

Provide exactly one of file_url, file_path, or file_content.

- file_url: http/https URL. The server downloads it directly. Best option for hosted MCP.
- file_path: Absolute path to a local file. Only works when the MCP server is running locally. Silently ignored by the hosted server.
- file_content: File contents, base64-encoded. Last resort only: the content passes through the model's context window, so this only works for very small files.
- file_name: Filename with extension (e.g. "data.csv"), used for format detection. Required with file_content. Default: "data.csv".

Returns a file_ref (pass it directly to discovery_analyze) and columns (list of column names and types, useful if you need to inspect before choosing a target column).
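
The "exactly one source" rule is easy to check before calling the tool. A minimal sketch follows; the helper name is hypothetical, and the fallback to "data.csv" mirrors the documented default.

```python
def upload_arguments(file_url=None, file_path=None, file_content=None, file_name=None):
    """Build discovery_upload arguments, enforcing exactly one data source."""
    sources = {"file_url": file_url, "file_path": file_path, "file_content": file_content}
    provided = {name: value for name, value in sources.items() if value is not None}
    if len(provided) != 1:
        raise ValueError("provide exactly one of file_url, file_path, or file_content")
    args = dict(provided)
    if file_name is not None:
        args["file_name"] = file_name
    elif "file_content" in args:
        # Base64 content carries no filename, so fall back to the documented default.
        args["file_name"] = "data.csv"
    return args
```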

discovery_analyze:

- file_ref: File reference returned by discovery_upload. Required.
- target_column: The column to predict/explain
- INLINECODE63: 2 = default, higher = deeper analysis. Max: num_columns - 2
- INLINECODE64: "public" (free, results published) or "private" (costs credits)
- column_descriptions: JSON object mapping column names to descriptions. Significantly improves pattern explanations; always provide it if column names are non-obvious
- excluded_columns: JSON array of column names to exclude from analysis (see Preparing Your Data below)
- INLINECODE69: Optional title for the analysis
- INLINECODE70: Optional description of the dataset
- INLINECODE71: false (default) or true. When true, runs are slower and more expensive, but you get smarter pre-processing, literature context and novelty assessment. Public runs always use LLMs regardless of this setting. Tradeoffs when false: pattern descriptions are generic, novelty is not assessed (no citations), report summaries are omitted, ambiguous integer columns (e.g. "month" 1-12) may be misclassified as categorical, and text cluster names are generic.
- INLINECODE74: Optional author name for the dataset
- INLINECODE75: Optional URL of the original data source
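
Before submitting, the column-related arguments can be checked against the `columns` list that discovery_upload returns. This is a sketch under the assumption that each entry is a dict with a `name` key, as in the upload response shown elsewhere in this document; the helper itself is hypothetical.

```python
def analyze_arguments(file_ref, columns, target_column,
                      column_descriptions=None, excluded_columns=None):
    """Validate column names and assemble discovery_analyze arguments."""
    names = {col["name"] for col in columns}
    if target_column not in names:
        raise ValueError(f"target column {target_column!r} is not in the dataset")
    excluded = list(excluded_columns or [])
    if target_column in excluded:
        raise ValueError("the target column cannot also be excluded")
    unknown = (set(excluded) | set(column_descriptions or {})) - names
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    args = {"file_ref": file_ref, "target_column": target_column}
    if column_descriptions:
        args["column_descriptions"] = dict(column_descriptions)
    if excluded:
        args["excluded_columns"] = excluded
    return args
```

The returned dict is plain JSON-serializable data, so `json.dumps(args)` works if your client expects a string.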

No API key?

New account: Call discovery_signup with the user's email. This sends a verification code — the user must check their email. Then call discovery_signup_verify with the code to receive a disco_ API key. Free tier: 10 credits/month, unlimited public runs. No password, no credit card.

Existing account (lost key or new session): Call discovery_login with the user's email. Same OTP flow — sends a code, then call discovery_login_verify to get a new API key.

Insufficient credits?

1. Call discovery_estimate to show what it would cost.
2. Suggest running publicly (free, but results are published and depth is locked to 2).
3. Or guide them through discovery_purchase_credits / INLINECODE83.

Preparing Your Data

Before running an analysis, you must exclude columns that would produce meaningless findings. Disco finds statistically real patterns — but if the input includes columns that are definitionally related to the target, the patterns will be true by definition, not by discovery.

Always exclude these column types via excluded_columns:

1. Identifiers

Row IDs, patient IDs, UUIDs, accession numbers, sample codes. These are arbitrary labels with no analytical signal.

2. Data leakage

Columns that are the target column renamed, reformatted, or binned. Example: diagnosis_text when the target is diagnosis_code.

3. Tautological / definitional columns

This is the most important category. Columns that encode the same underlying construct as the target — through alternative classifications, component parts, or derived calculations. These produce findings that are trivially true.

Examples:

- FAERS data: If the target is serious, then serious_outcome (categories like death, disability, hospitalisation), not_serious, and death are all part of the same seriousness classification. A finding that "death predicts seriousness" is a tautology, not a discovery.
- Clinical trials: If the target is response, then response_category, responder_flag, and RECIST_response are all encodings of the same outcome.
- Financial data: If the target is profit, then revenue and cost together compose it (profit = revenue − cost).
- Surveys: If the target is a composite index score, the sub-items that make up the index are tautological.
- Derived columns: BMI when height and weight are present, age when birth_date is present.

How to identify them: Ask "is this column just a different way of expressing what the target already measures?" If yes, exclude it.

CODEBLOCK9
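
A rough first pass over the columns can surface candidates for excluded_columns. The heuristics below (all-unique values with an id-like name, or a name sharing the target's first word) are assumptions chosen for illustration, not Disco's own logic. They cannot detect tautological columns with unrelated names, so the "different way of expressing the target" question still needs human judgment.

```python
import re
from typing import Dict, List

ID_NAME = re.compile(r"(^|_)(id|uuid)$", re.IGNORECASE)


def exclusion_candidates(rows: List[dict], target: str) -> Dict[str, str]:
    """Map column name to the reason it looks like an identifier or a restated target."""
    reasons: Dict[str, str] = {}
    target_stem = target.lower().split("_")[0]
    for col in rows[0]:
        if col == target:
            continue
        values = [row[col] for row in rows]
        if len(set(values)) == len(values) and ID_NAME.search(col):
            reasons[col] = "identifier"
        elif target_stem in col.lower():
            reasons[col] = "shares a name stem with the target"
    return reasons
```

With target `diagnosis_code`, a `diagnosis_text` column is flagged for review, matching the leakage example above.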

Python SDK

When To Use This Tool

Disco is not another AI data analyst that writes pandas or SQL for you. It is a discovery pipeline — it finds patterns in data that you, the user, and other analysis tools would miss because they don't know to look for them.

Use it when you need to go beyond answering questions about data, and start finding things nobody thought to ask:

- Novel pattern discovery: feature interactions, subgroup effects, and conditional relationships you wouldn't think to look for
- Statistical validation: FDR-corrected p-values tested on hold-out data, not just correlations
- A target column you want to understand: what really drives it, beyond what's obvious

Use Disco when the user says: "what's really driving X?", "are there patterns we're missing?", "find something new in this data", "what predicts Y that we haven't considered?", "go deeper than correlation", "discover non-obvious relationships"

Use pandas/SQL instead when the user says: "summarize this data", "make a chart", "what's the average?", "filter rows where X > 5", "show me the distribution"

What It Does (That You Cannot Do Yourself)

Disco finds complex patterns in your data — feature interactions, nonlinear thresholds, and meaningful subgroups — without requiring prior hypotheses about what matters. Each pattern is validated on hold-out data, corrected for multiple testing, and checked for novelty against academic literature with citations.

This is a computational pipeline, not prompt engineering over data. You cannot replicate what it does by writing pandas code or asking an LLM to look at a CSV. It finds structure that hypothesis-driven analysis misses because it doesn't start with hypotheses.

Getting an API Key

Programmatic (for agents): Two-step signup — send a verification code to the email, then submit it to receive the API key. The email must be real: the code is sent there and must be read to complete signup.

CODEBLOCK10

Existing account (lost key or new session): Same OTP flow via /api/login and /api/login/verify, or in the SDK:

CODEBLOCK11

Manual (for humans): Sign up at https://disco.leap-labs.com/sign-up, create key at https://disco.leap-labs.com/developers.

Installation

CODEBLOCK12

Quick Start

Disco runs are async and can take a while. Do not block on them — submit the run, continue with other work, and retrieve results when ready.

CODEBLOCK13

Inspecting Columns Before Running

If you need to see the dataset's columns before choosing a target column, upload first and inspect:

CODEBLOCK14

Running in the Background

If you need to do other work while Disco runs (recommended for agent workflows):

CODEBLOCK15

This is the preferred pattern for agents. engine.discover() is a convenience wrapper that does this internally with wait=True.

Non-async contexts: use engine.discover_sync() — same signature as discover(), runs in a managed event loop.
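
The background pattern is ordinary asyncio: start the run as a task, do other work, then await the result. In the sketch below `fake_discover` is a stand-in coroutine, not the SDK; with the real SDK you would presumably schedule the awaitable returned by engine.discover() in the same way.

```python
import asyncio


async def fake_discover(path: str, target_column: str) -> dict:
    """Stand-in for a long-running analysis (hypothetical, returns an empty result)."""
    await asyncio.sleep(0.01)
    return {"target_column": target_column, "patterns": []}


async def main():
    # Schedule the run without waiting for it.
    task = asyncio.create_task(fake_discover("data.csv", "outcome"))
    # Other work proceeds while the task is pending.
    other_work = sum(range(10))
    # Collect the result once it is needed.
    result = await task
    return other_work, result


if __name__ == "__main__":
    print(asyncio.run(main()))
```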

Example Output

Here's a truncated real response from a crop yield analysis (target column: yield_tons_per_hectare). This is what engine.discover() returns:

CODEBLOCK16

Key things to notice:

- Patterns are combinations of conditions (humidity AND wind speed), not single correlations
- Specific threshold ranges (72-89%), not just "higher humidity is better"
- Novel vs confirmatory: each pattern is classified and explained. Novel findings are what you came for; confirmatory ones validate known science
- Citations show what IS known, so you can see what's genuinely new
- Summary gives the agent a narrative to present to the user immediately
- report_url links to an interactive web report. Drop this in your response so the user can explore visually. Private runs require sign-in: tell the user to sign in at the dashboard using the same email address the account was created with (email verification code, no password needed). Public runs are accessible to anyone.

Parameters

CODEBLOCK17

Tip: Providing column_descriptions significantly improves pattern explanations. If your columns have non-obvious names (e.g., col_7, feat_a), always describe them.

Cost

- Public runs: Free. Results published to public gallery. Locked to depth=2.
- Private runs: Credits scale with file size, depth, and run configuration. $0.10 per credit. Use discovery_estimate to check cost before running.
- API keys: https://disco.leap-labs.com/developers
- Credits: https://disco.leap-labs.com/account
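
The pricing arithmetic is simple enough to show directly. The sketch uses the stated $0.10 per credit and the 100-credit pack size mentioned in the purchasing section; the `balance` parameter and the function name are hypothetical. Amounts are kept in integer cents to avoid floating-point drift.

```python
import math

PRICE_PER_CREDIT_CENTS = 10  # $0.10 per credit
PACK_SIZE = 100              # credits per pack


def credit_cost(credits_needed: int, balance: int = 0) -> dict:
    """Work out the dollar cost of a run and how many packs cover any shortfall."""
    shortfall = max(0, credits_needed - balance)
    packs = math.ceil(shortfall / PACK_SIZE)
    return {
        "run_cost_usd": credits_needed * PRICE_PER_CREDIT_CENTS / 100,
        "shortfall": shortfall,
        "packs_to_buy": packs,
        "purchase_usd": packs * PACK_SIZE * PRICE_PER_CREDIT_CENTS / 100,
    }
```

A run estimated at 250 credits with 10 credits on hand costs $25.00 in credits and needs 3 packs ($30.00) to cover the 240-credit shortfall.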

Paying for Credits (Programmatic)

Agents can attach a payment method and purchase credits entirely via the API — no browser required.

Step 1 — Get your Stripe publishable key

CODEBLOCK18

Or via REST:

CODEBLOCK19

Step 2 — Tokenize a card using the Stripe API

Use the publishable key to create a Stripe PaymentMethod. Card data goes directly to Stripe — Disco never sees it.

CODEBLOCK20

Step 3 — Attach the payment method

CODEBLOCK21

Or via REST:

CODEBLOCK22

Step 4 — Purchase credits

Credits are sold in packs of 100 ($10/pack, $0.10/credit).

CODEBLOCK23

Or via REST:

CODEBLOCK24

Subscriptions (optional)

For regular usage, subscribe to a paid plan instead of buying packs:

CODEBLOCK25

Requires a payment method on file. See GET /api/plans for full plan details.

Estimate Before Running

Before submitting a private analysis, estimate the credit cost:

CODEBLOCK26

Result Structure

CODEBLOCK27

Working With Results

CODEBLOCK28

Error Handling

CODEBLOCK29

All errors inherit from DiscoveryError and include a suggestion field with actionable instructions.
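
The shape described here, a common base class carrying a `suggestion`, can be illustrated with local stand-ins. The classes below are defined in the example and are not imported from the SDK, and `InsufficientCreditsError` is a hypothetical subclass name used only to show that catching the base class covers every error.

```python
class DiscoveryError(Exception):
    """Local stand-in for the base class described above (not the SDK class)."""

    def __init__(self, message: str, suggestion: str = ""):
        super().__init__(message)
        self.suggestion = suggestion


class InsufficientCreditsError(DiscoveryError):
    """Hypothetical subclass, for illustration only."""


def describe_failure(exc: DiscoveryError) -> str:
    """Format an error together with its actionable suggestion, if it has one."""
    text = f"{type(exc).__name__}: {exc}"
    if exc.suggestion:
        text += f" (suggestion: {exc.suggestion})"
    return text


def run_and_report(action) -> str:
    """Run a callable and turn any DiscoveryError into a user-facing message."""
    try:
        action()
    except DiscoveryError as exc:
        return describe_failure(exc)
    return "ok"
```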

Expected Data Format

Disco expects a flat table — columns for features, rows for samples.

- One row per observation: a patient, a sample, a transaction, a measurement, etc.
- One column per feature: numeric, categorical, datetime, or free text are all fine
- One target column: the outcome to analyze. Must have at least 2 distinct values.
- Missing values are OK: Disco handles them automatically. Don't drop rows or impute beforehand.

Supported formats: CSV, TSV, Excel (.xlsx), JSON, Parquet, ARFF, Feather. Max 5 GB.

Not supported: images, raw text documents, nested/hierarchical JSON, multi-sheet Excel (use the first sheet or export to CSV).
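
A local pre-flight check can catch the most common format problems before any upload. This sketch makes several assumptions: the file extensions (other than .xlsx) are the conventional ones for each listed format, "5 GB" is read as 5 × 1024³ bytes, and the target check is only implemented for CSV.

```python
import csv
import os
from typing import List

SUPPORTED = {".csv", ".tsv", ".xlsx", ".json", ".parquet", ".arff", ".feather"}
MAX_BYTES = 5 * 1024 ** 3  # assumed reading of "Max 5 GB"


def preflight(path: str, target_column: str) -> List[str]:
    """Return a list of problems; an empty list means nothing obvious is wrong."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED:
        problems.append(f"unsupported extension {ext!r}")
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 5 GB")
    if ext == ".csv":
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            if target_column not in (reader.fieldnames or []):
                problems.append(f"target column {target_column!r} not found")
            else:
                # The target needs at least 2 distinct non-missing values.
                distinct = set()
                for row in reader:
                    if row[target_column] != "":
                        distinct.add(row[target_column])
                    if len(distinct) >= 2:
                        break
                if len(distinct) < 2:
                    problems.append("target column has fewer than 2 distinct values")
    return problems
```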

Displaying Results

When presenting Disco results, render interactive visualizations -- don't just dump text. Default order (adapt based on what the user asked):

1. Summary -- show summary.overview and summary.key_insights. Links to dashboard_urls.summary.url.

2. Pattern plots -- for the top patterns, render a violin plot: one violin per condition, one for all conditions combined, one for the overall dataset. Y-axis is the target variable. Shows how each condition narrows the distribution. Links to dashboard_urls.patterns.url.

3. Territory map -- 3D surface where X and Y axes are two features from a pattern's conditions, Z axis is the target. Shows the interaction landscape. Best when patterns involve feature interactions. Links to dashboard_urls.territory.url.

4. Feature importance -- horizontal waterfall bars floating from zero, sorted by absolute contribution. Links to dashboard_urls.features.url.

5. Correlation heatmap -- square matrix of feature correlations, sorted by correlation with target. Links to dashboard_urls.features.url.

Use judgment: if the user asked "what drives X?", lead with feature importance. If they asked "find something new", lead with novel patterns. If they're exploring interactions, lead with territory.
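
The ordering rule can be expressed as a small lookup. The section keys and intent labels below are hypothetical names chosen for this sketch; only the default order and the three "lead with" cases come from the text above.

```python
DEFAULT_ORDER = ["summary", "patterns", "territory", "feature_importance", "correlation_heatmap"]

# Which section to lead with for each kind of user question (labels are illustrative).
LEAD_FOR_INTENT = {
    "drivers": "feature_importance",   # "what drives X?"
    "novelty": "patterns",             # "find something new"
    "interactions": "territory",       # exploring interactions
}


def display_order(intent=None):
    """Return the section order, moving the intent's lead section to the front."""
    lead = LEAD_FOR_INTENT.get(intent)
    if lead is None:
        return list(DEFAULT_ORDER)
    return [lead] + [section for section in DEFAULT_ORDER if section != lead]
```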

For exact colors, scales, and layout details, follow the full visualization spec: https://disco.leap-labs.com/visualization-spec

Always link to the relevant dashboard_urls page so users can explore the full interactive version.

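Disco

Integration Options

- MCP server: remote server at https://disco.leap-labs.com/mcp, no install required. Best for datasets at a URL.
- Python SDK: pip install discovery-engine-api. Use this for local files of any size. It runs on your machine and streams files directly, with no base64 encoding and no size limits.

Quick rule: if the data is at a URL, use file_url in discovery_upload. If it's a local file, use the Python SDK; if Python isn't available, upload directly via the presign API and pass the result to discovery_analyze. Don't use file_content (base64) unless the file is already in memory and very small.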

MCP Server

Add to your MCP config:

```json
{
  "mcpServers": {
    "discovery-engine": {
      "url": "https://disco.leap-labs.com/mcp",
      "env": { "DISCOVERY_API_KEY": "disco_..." }
    }
  }
}
```

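MCP Tools

Discovery workflow

Tool	Purpose
discovery_upload	Upload a dataset. Supports URL download (file_url), local path (file_path), or base64 content (file_content). Returns a file_ref for use with discovery_analyze.
discovery_analyze	Submit a dataset for analysis using a file_ref from discovery_upload. Returns a run_id.

Account management

Tool	Purpose
discovery_signup	Start account creation. Sends a verification code to the email address.
discovery_signup_verify	Complete signup with the emailed code. Returns a disco_ API key.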

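MCP Workflow

Analyses take 3–15 minutes. Do not block: submit, continue other work, and poll for completion.

1. discovery_estimate → check the credit cost (always do this for private runs)
2. discovery_upload → upload the dataset, get a file_ref
3. discovery_analyze → submit the analysis using the file_ref, get a run_id
4. discovery_status → poll until the status is completed
   Returns: status, queue_position, current_step, estimated_wait_seconds
5. discovery_get_results → fetch patterns, summary, feature importance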

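Getting Data In

Choose the right path for your situation:

Situation	Best approach
Data is at an http/https URL	file_url in discovery_upload
Local file, Python available	Python SDK

Data at a URL:

```
discovery_upload(file_url="https://example.com/dataset.csv")
→ {"file": {...}, "columns": [{"name": "col1", "type": "continuous", ...}], "rowCount": 5000}

discovery_analyze(file_ref=<result above>, target_column="outcome")
```

The server downloads the file directly; nothing passes through the agent or the model context. Works with public URLs, S3 presigned URLs, or any accessible http/https link.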

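Local file, Python SDK (recommended for any local file):

```python
from discovery import Engine

engine = Engine(api_key="disco_...")
# Run inside an async function (or a notebook that supports top-level await).
result = await engine.discover("data.csv", target_column="outcome")
```

Handles upload, polling, and results in one call. No size limit. See the Python SDK section for full documentation.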

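Local file, MCP server running locally (cloned from GitHub, stdio transport):

If you've cloned the repo and are running server.py locally, the process can read your filesystem directly:

```
discovery_upload(file_path="/home/user/data/dataset.csv")
→ {"file": {...}, "columns": [...], "rowCount": 5000}

discovery_analyze(file_ref=<result above>, target_column="outcome")
```

Reads the file locally and streams it directly to cloud storage; nothing passes through the model context. No size limit. file_path is silently ignored by the hosted server at disco.leap-labs.com/mcp and only works with a locally running server.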

Local file, hosted MCP, direct upload (works from any language):

If you're using the hosted MCP server and Python isn't available, you can upload directly via the REST API in three steps, then pass the result to discovery_analyze as normal.

```bash
# 1. Get a presigned upload URL
curl -X POST https://disco.leap-labs.com/api/data/upload/presign \
  -H "Authorization: Bearer disco_..." \
  -H "Content-Type: application/json" \
  -d '{"fileName": "data.csv", "contentType": "text/csv", "fileSize": 1048576}'
# → {"uploadUrl": "https://storage.googleapis.com/...", "key": "uploads/abc/data.csv", "uploadToken": "tok_..."}

# 2. PUT the file directly to cloud storage (uploadUrl is presigned, so no auth header is needed)
curl -X PUT "<uploadUrl from step 1>" \
  -H "Content-Type: text/csv" \
  --data-binary @data.csv

# 3. Finalize the upload
curl -X POST https://disco.leap-labs.com/api/data/upload/finalize \
  -H "Authorization: Bearer disco_..." \
  -H "Content-Type: application/json" \
  -d '{"key": "uploads/abc/data.csv", "uploadToken": "tok_..."}'
# → {"ok": true, "file": {...}, "columns": [...], "rowCount": 5000}
```

Pass the finalize response directly to discovery_analyze as file_ref. No size limit.

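Small file, direct upload (single HTTP call, simpler than presign):

```bash
curl -X POST https://disco.leap-labs.com/api/data/upload/direct \
  -H "Authorization: Bearer disco_..." \
  -H "Content-Type: application/json" \
  -d '{"fileName": "data.csv", "content": "..."}'
# → {"ok": true, "file": {...}, "columns": [...], "rowCount": 5000}
```

Pass the response directly to discovery_analyze as file_ref. Simpler than the 3-step presign flow, but the entire file must fit in the request body. For large files, use presigned uploads or the Python SDK.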

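Last resort, tiny file already in memory:

Only use this if the file is already loaded into memory and none of the above options apply. The base64-encoded content passes through the model's context window, so this only works for very small files.

```python
import base64

# Read the file as bytes and base64-encode it for file_content.
with open("data.csv", "rb") as f:
    content = base64.b64encode(f.read()).decode()
```

```
discovery_upload(file_content=content, file_name="data.csv")
→ {"file": {...}, "columns": [...], "rowCount": 500}

discovery_analyze(file_ref=<result above>, target_column="outcome")
```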

MCP Parameters

discovery_upload:

Provide exactly one of file_url, file_path, or file_content.

- file_url: http/https URL. The server downloads it directly. Hosted
