Disco
Integration Options
- MCP server: remote server at https://disco.leap-labs.com/mcp, no install required. Best for datasets at a URL.
- Python SDK: pip install discovery-engine-api. Use this for local files of any size. Runs on your machine and streams files directly, with no base64 and no size limits.
Quick rule: if the data is at a URL, use file_url in discovery_upload. If it's a local file, use the Python SDK — or if Python isn't available, upload directly via the presign API and pass the result to discovery_analyze. Don't use file_content (base64) unless the file is already in memory and tiny.
MCP Server
Add to your MCP config:
CODEBLOCK0
MCP Tools
Discovery workflow
| Tool | Purpose |
|---|---|
| discovery_upload | Upload a dataset. Supports URL download (file_url), local path (file_path), or base64 content (file_content). Returns a file_ref for use with discovery_analyze. |
| discovery_analyze | Submit a dataset for analysis using a file_ref from discovery_upload. Returns a run_id. |
| discovery_status | Poll a running analysis by run_id. |
| discovery_get_results | Fetch completed results: patterns, p-values, citations, feature importance. |
| discovery_estimate | Estimate the credit cost before committing to a run. |
Account management
| Tool | Purpose |
|---|---|
| discovery_signup | Start account creation. Sends a verification code to the email address. |
| discovery_signup_verify | Complete signup by submitting the verification code. Returns an API key. |
| discovery_login | Get a new API key for an existing account. Sends a verification code to the email address. |
| discovery_login_verify | Complete login by submitting the verification code. Returns a new API key. |
| discovery_account | Check credits, plan, and usage. |
| discovery_list_plans | View available plans and pricing. |
| discovery_subscribe | Subscribe to or change plan. |
| discovery_purchase_credits | Buy credit packs. |
| discovery_add_payment_method | Attach a Stripe payment method. |
MCP Workflow
Analyses take 3–15 minutes. Do not block — submit, continue other work, poll for completion.
CODEBLOCK1
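The submit-then-poll pattern can be sketched in Python as follows. This is a minimal illustration, not the MCP server's or the SDK's implementation: `get_status` is a hypothetical callable standing in for a discovery_status call, and the `"failed"` state and the shape of the status dict are assumptions.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    get_status: Callable[[], Dict[str, str]],
    interval_s: float = 30.0,
    timeout_s: float = 20 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Call get_status() until it reports a terminal state or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        # "completed" is the terminal state named in the workflow; "failed" is an assumption.
        if status.get("status") in ("completed", "failed"):
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"run still {status.get('status')!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Usage with a fake status source that completes on the third poll.
_responses = iter([{"status": "queued"}, {"status": "running"}, {"status": "completed"}])
final = poll_until_done(lambda: next(_responses), interval_s=0.0)
```

In an agent, the sleep between polls is where other work would happen; the default interval and timeout here are arbitrary choices sized for a 3–15 minute run.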
Getting Data In
Choose the right path for your situation:
| Situation | Best approach |
|---|---|
| Data is at an http/https URL | file_url in discovery_upload |
| Local file, Python available | Python SDK (engine.discover(...)) |
| Local file, MCP server running locally | file_path in discovery_upload |
| Local file, hosted MCP, no Python | Direct upload API (3 steps, see below) |
| Small file, any language | POST /api/data/upload/direct (single step, see below) |
| Tiny file already in memory | file_content in discovery_upload (last resort) |
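The table can be read as a priority order. The sketch below encodes that reading as a plain function; the function name, its flags, and the precedence between rows are illustrative assumptions, and the file_content route is deliberately left out because it is described as a last resort.

```python
def choose_upload_path(
    is_url: bool,
    python_available: bool = False,
    local_mcp_server: bool = False,
    small_file: bool = False,
) -> str:
    """Return the recommended ingestion route for a dataset, following the table above."""
    if is_url:
        return "file_url in discovery_upload"
    if python_available:
        return "Python SDK (engine.discover)"
    if local_mcp_server:
        return "file_path in discovery_upload"
    if small_file:
        return "POST /api/data/upload/direct"
    # Hosted MCP, no Python, file too large for a single request body.
    return "3-step presigned upload"


route = choose_upload_path(is_url=False, python_available=False, small_file=True)
```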
Data at a URL:
CODEBLOCK2
The server downloads the file directly — nothing passes through the agent or the model context. Works with public URLs, S3 presigned URLs, or any accessible http/https link.
Local file — Python SDK (recommended for any local file):
CODEBLOCK3
Handles upload, polling, and results in one call. No size limit. See the Python SDK section for full documentation.
Local file — MCP server running locally (cloned from GitHub, stdio transport):
If you've cloned the repo and are running server.py locally, the process can read your filesystem directly:
CODEBLOCK4
Reads the file locally and streams it directly to cloud storage — nothing passes through the model context. No size limit. file_path is silently ignored by the hosted server at disco.leap-labs.com/mcp — it only works with a locally-running server.
Local file — hosted MCP, direct upload (works from any language):
If you're using the hosted MCP server and Python isn't available, you can upload directly via the REST API in three steps, then pass the result to discovery_analyze as normal.
CODEBLOCK5
Pass the finalize response directly to discovery_analyze as file_ref. No size limit.
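For callers working in a language other than shell, the three requests can be sketched with Python's standard library. The endpoint paths and JSON field names (fileName, contentType, fileSize, key, uploadToken) follow the REST example given elsewhere in this document; the helper names are hypothetical, and the sketch only builds the requests without sending them or handling errors.

```python
import json
import urllib.request

BASE_URL = "https://disco.leap-labs.com"


def _json_post(path: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request against the Disco REST API."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def presign_request(api_key: str, file_name: str, content_type: str, file_size: int):
    """Step 1: ask for a presigned upload URL."""
    payload = {"fileName": file_name, "contentType": content_type, "fileSize": file_size}
    return _json_post("/api/data/upload/presign", api_key, payload)


def put_request(upload_url: str, content_type: str, body: bytes):
    """Step 2: PUT the bytes to storage. The URL is presigned, so no auth header is set."""
    return urllib.request.Request(
        upload_url, data=body, headers={"Content-Type": content_type}, method="PUT"
    )


def finalize_request(api_key: str, key: str, upload_token: str):
    """Step 3: tell the API the upload is complete."""
    return _json_post("/api/data/upload/finalize", api_key, {"key": key, "uploadToken": upload_token})


req = presign_request("disco_...", "data.csv", "text/csv", 1048576)
```

Each request would be sent with `urllib.request.urlopen(req)` and the JSON response decoded with `json.load`. For very large files, pass an open binary file object as `data` instead of bytes so the content is not read into memory at once.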
Small file — direct upload (single HTTP call, simpler than presign):
CODEBLOCK6
Pass the response directly to discovery_analyze as file_ref. Simpler than the 3-step presign flow but the entire file must fit in the request body. For large files, use presigned uploads or the Python SDK.
Last resort — tiny file already in memory:
Only use this if the file is already loaded into memory and none of the above options apply. The base64-encoded content passes through the model's context window, so this only works for very small files.
CODEBLOCK7
CODEBLOCK8
MCP Parameters
discovery_upload:
Provide exactly one of file_url, file_path, or file_content.
- file_url: http/https URL. The server downloads it directly. Best option for hosted MCP.
- file_path: Absolute path to a local file. Only works when the MCP server is running locally. Silently ignored by the hosted server.
- file_content: File contents, base64-encoded. Last resort only. The content passes through the model's context window, so this only works for very small files.
- file_name: Filename with extension (e.g. "data.csv"), used for format detection. Required with file_content. Default: "data.csv".
Returns a file_ref (pass it directly to discovery_analyze) and columns (list of column names and types, useful if you need to inspect before choosing a target column).
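Because file_path is silently ignored by the hosted server, it can help to check the "exactly one source" rule on the client before calling the tool. A minimal sketch, assuming a hypothetical helper that only assembles the argument dict:

```python
from typing import Optional


def build_upload_args(
    file_url: Optional[str] = None,
    file_path: Optional[str] = None,
    file_content: Optional[str] = None,
    file_name: str = "data.csv",
) -> dict:
    """Validate that exactly one data source is given and return the tool arguments."""
    sources = {"file_url": file_url, "file_path": file_path, "file_content": file_content}
    provided = {name: value for name, value in sources.items() if value is not None}
    if len(provided) != 1:
        raise ValueError(
            f"provide exactly one of file_url, file_path, file_content (got {len(provided)})"
        )
    args = dict(provided)
    if "file_content" in provided:
        # file_name drives format detection when raw content is sent.
        args["file_name"] = file_name
    return args


upload_args = build_upload_args(file_url="https://example.com/dataset.csv")
```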
discovery_analyze:
- file_ref: File reference returned by discovery_upload. Required.
- target_column: The column to predict/explain.
- INLINECODE63: 2 = default, higher = deeper analysis. Max: num_columns - 2.
- INLINECODE64: "public" (free, results published) or "private" (costs credits).
- column_descriptions: JSON object mapping column names to descriptions. Significantly improves pattern explanations; always provide it if column names are non-obvious.
- excluded_columns: JSON array of column names to exclude from analysis (see Preparing Your Data below).
- INLINECODE69: Optional title for the analysis.
- INLINECODE70: Optional description of the dataset.
- INLINECODE71: false (default) or true. Slower and more expensive, but you get smarter pre-processing, literature context and novelty assessment. Public runs always use LLMs regardless of this setting. Tradeoffs when false: pattern descriptions are generic, novelty is not assessed (no citations), report summaries are omitted, ambiguous integer columns (e.g. "month" 1-12) may be misclassified as categorical, and text cluster names are generic.
- INLINECODE74: Optional author name for the dataset.
- INLINECODE75: Optional URL of the original data source.
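The depth limits can be checked before submitting. The sketch below combines the stated maximum (num_columns - 2) with the rule, stated elsewhere in this document, that public runs are locked to depth 2. The function and argument names are hypothetical, and raising an error rather than clamping is a design choice of this sketch, not documented server behavior.

```python
def effective_depth(requested_depth: int, num_columns: int, public: bool) -> int:
    """Return the depth a run would use under the limits stated above."""
    if public:
        # Public runs are locked to depth 2 regardless of what was requested.
        return 2
    max_depth = num_columns - 2
    if requested_depth > max_depth:
        raise ValueError(
            f"depth {requested_depth} exceeds the maximum of {max_depth} "
            f"for a dataset with {num_columns} columns"
        )
    return requested_depth


depth = effective_depth(requested_depth=4, num_columns=12, public=False)
```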
No API key?
New account: Call discovery_signup with the user's email. This sends a verification code — the user must check their email. Then call discovery_signup_verify with the code to receive a disco_ API key. Free tier: 10 credits/month, unlimited public runs. No password, no credit card.
Existing account (lost key or new session): Call discovery_login with the user's email. Same OTP flow — sends a code, then call discovery_login_verify to get a new API key.
Insufficient credits?
1. Call discovery_estimate to show what it would cost.
2. Suggest running publicly (free, but results are published and depth is locked to 2).
3. Or guide them through discovery_purchase_credits / INLINECODE83.
Preparing Your Data
Before running an analysis, you must exclude columns that would produce meaningless findings. Disco finds statistically real patterns — but if the input includes columns that are definitionally related to the target, the patterns will be true by definition, not by discovery.
Always exclude these column types via excluded_columns:
1. Identifiers
Row IDs, patient IDs, UUIDs, accession numbers, sample codes. These are arbitrary labels with no analytical signal.
2. Data leakage
Columns that are the target column renamed, reformatted, or binned. Example: diagnosis_text when the target is diagnosis_code.
3. Tautological / definitional columns
This is the most important category. Columns that encode the same underlying construct as the target — through alternative classifications, component parts, or derived calculations. These produce findings that are trivially true.
Examples:
- FAERS data: If the target is serious, then serious_outcome (categories like death, disability, hospitalisation), not_serious, and death are all part of the same seriousness classification. A finding that "death predicts seriousness" is a tautology, not a discovery.
- Clinical trials: If the target is response, then response_category, responder_flag, and RECIST_response are all encodings of the same outcome.
- Financial data: If the target is profit, then revenue and cost together compose it (profit = revenue − cost).
- Surveys: If the target is a composite index score, the sub-items that make up the index are tautological.
- Derived columns: BMI when height and weight are present, age when birth_date is present.
How to identify them: Ask "is this column just a different way of expressing what the target already measures?" If yes, exclude it.
CODEBLOCK9
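Two of the three categories can be partly screened for mechanically. The sketch below is a heuristic of my own, not part of Disco: it flags columns whose names look like identifiers and columns whose values map one-to-one onto the target (a renamed or recoded target). Tautological columns such as components of a composite score still need human judgment.

```python
import re
from typing import Dict, List, Sequence

# Name pattern for identifier-like columns; the token list is an assumption.
ID_NAME = re.compile(r"(^|_)(id|uuid|guid|accession)($|_)", re.IGNORECASE)


def suggest_excluded_columns(rows: Sequence[Dict[str, object]], target: str) -> List[str]:
    """Return candidate columns for excluded_columns, for a human to review."""
    if not rows:
        return []
    target_values = {row[target] for row in rows}
    flagged = []
    for col in rows[0]:
        if col == target:
            continue
        values = {row[col] for row in rows}
        pairs = {(row[col], row[target]) for row in rows}
        looks_like_id = bool(ID_NAME.search(col))
        # One-to-one with the target: same number of distinct values and distinct pairs.
        recoded_target = len(values) == len(target_values) == len(pairs)
        if looks_like_id or recoded_target:
            flagged.append(col)
    return flagged


rows = [
    {"patient_id": "p1", "diagnosis_code": "A", "diagnosis_text": "alpha", "age": 40},
    {"patient_id": "p2", "diagnosis_code": "B", "diagnosis_text": "beta", "age": 52},
    {"patient_id": "p3", "diagnosis_code": "A", "diagnosis_text": "alpha", "age": 40},
    {"patient_id": "p4", "diagnosis_code": "B", "diagnosis_text": "beta", "age": 61},
]
candidates = suggest_excluded_columns(rows, target="diagnosis_code")
```

Treat the output as a review list, not a final answer: on small samples an ordinary feature can look one-to-one with the target by chance.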
Python SDK
When To Use This Tool
Disco is not another AI data analyst that writes pandas or SQL for you. It is a discovery pipeline — it finds patterns in data that you, the user, and other analysis tools would miss because they don't know to look for them.
Use it when you need to go beyond answering questions about data, and start finding things nobody thought to ask:
- Novel pattern discovery: feature interactions, subgroup effects, and conditional relationships you wouldn't think to look for
- Statistical validation: FDR-corrected p-values tested on hold-out data, not just correlations
- A target column you want to understand: what really drives it, beyond what's obvious
Use Disco when the user says: "what's really driving X?", "are there patterns we're missing?", "find something new in this data", "what predicts Y that we haven't considered?", "go deeper than correlation", "discover non-obvious relationships"
Use pandas/SQL instead when the user says: "summarize this data", "make a chart", "what's the average?", "filter rows where X > 5", "show me the distribution"
What It Does (That You Cannot Do Yourself)
Disco finds complex patterns in your data — feature interactions, nonlinear thresholds, and meaningful subgroups — without requiring prior hypotheses about what matters. Each pattern is validated on hold-out data, corrected for multiple testing, and checked for novelty against academic literature with citations.
This is a computational pipeline, not prompt engineering over data. You cannot replicate what it does by writing pandas code or asking an LLM to look at a CSV. It finds structure that hypothesis-driven analysis misses because it doesn't start with hypotheses.
Getting an API Key
Programmatic (for agents): Two-step signup — send a verification code to the email, then submit it to receive the API key. The email must be real: the code is sent there and must be read to complete signup.
CODEBLOCK10
Existing account (lost key or new session): Same OTP flow via /api/login and /api/login/verify, or in the SDK:
CODEBLOCK11
Manual (for humans): Sign up at https://disco.leap-labs.com/sign-up, create key at https://disco.leap-labs.com/developers.
Installation
CODEBLOCK12
Quick Start
Disco runs are async and can take a while. Do not block on them — submit the run, continue with other work, and retrieve results when ready.
CODEBLOCK13
Inspecting Columns Before Running
If you need to see the dataset's columns before choosing a target column, upload first and inspect:
CODEBLOCK14
Running in the Background
If you need to do other work while Disco runs (recommended for agent workflows):
CODEBLOCK15
This is the preferred pattern for agents. engine.discover() is a convenience wrapper that does this internally with wait=True.
Non-async contexts: use engine.discover_sync() — same signature as discover(), runs in a managed event loop.
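The general shape of "start the run, do other work, collect the result later" in asyncio looks like the sketch below. It uses a hypothetical stand-in coroutine, `fake_discover`, rather than the SDK, so it shows only the scheduling pattern and makes no claim about the SDK's own method names or return types.

```python
import asyncio


async def fake_discover(path: str, target_column: str) -> dict:
    """Hypothetical stand-in for a long-running analysis call."""
    await asyncio.sleep(0.05)
    return {"file": path, "target_column": target_column, "status": "completed"}


async def main() -> dict:
    # Schedule the run without awaiting it yet.
    task = asyncio.create_task(fake_discover("data.csv", target_column="outcome"))
    # Placeholder for whatever else the agent does in the meantime.
    other_work = sum(range(1000))
    # Collect the result once it is actually needed.
    result = await task
    return {"other_work": other_work, "result": result}


outcome = asyncio.run(main())
```

The scheduled task only makes progress while the event loop is free, so long-running "other work" should itself be awaitable or run in a thread.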
Example Output
Here's a truncated real response from a crop yield analysis (target column: yield_tons_per_hectare). This is what engine.discover() returns:
CODEBLOCK16
Key things to notice:
- Patterns are combinations of conditions (humidity AND wind speed), not single correlations
- Specific threshold ranges (72-89%), not just "higher humidity is better"
- Novel vs confirmatory: each pattern is classified and explained. Novel findings are what you came for; confirmatory ones validate known science
- Citations show what IS known, so you can see what's genuinely new
- Summary gives the agent a narrative to present to the user immediately
- report_url links to an interactive web report. Drop this in your response so the user can explore visually. Private runs require sign-in: tell the user to sign in at the dashboard using the same email address the account was created with (email verification code, no password needed). Public runs are accessible to anyone.
Parameters
CODEBLOCK17
Tip: Providing column_descriptions significantly improves pattern explanations. If your columns have non-obvious names (e.g., col_7, feat_a), always describe them.
Cost
- Public runs: Free. Results published to public gallery. Locked to depth=2.
- Private runs: Credits scale with file size, depth, and run configuration. $0.10 per credit. Use discovery_estimate to check cost before running.
- API keys: https://disco.leap-labs.com/developers
- Credits: https://disco.leap-labs.com/account
Paying for Credits (Programmatic)
Agents can attach a payment method and purchase credits entirely via the API — no browser required.
Step 1 — Get your Stripe publishable key
CODEBLOCK18
Or via REST:
CODEBLOCK19
Step 2 — Tokenize a card using the Stripe API
Use the publishable key to create a Stripe PaymentMethod. Card data goes directly to Stripe — Disco never sees it.
CODEBLOCK20
Step 3 — Attach the payment method
CODEBLOCK21
Or via REST:
CODEBLOCK22
Step 4 — Purchase credits
Credits are sold in packs of 100 ($10/pack, $0.10/credit).
CODEBLOCK23
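Since credits come only in packs of 100, the number of packs to buy is the shortfall rounded up to the next pack. A small worked sketch of that arithmetic (the function names are illustrative, not SDK calls):

```python
import math

CREDITS_PER_PACK = 100
USD_PER_PACK = 10


def packs_needed(required_credits: int, balance: int = 0) -> int:
    """Number of 100-credit packs needed to cover a run, given the current balance."""
    shortfall = max(0, required_credits - balance)
    return math.ceil(shortfall / CREDITS_PER_PACK)


def purchase_cost_usd(packs: int) -> int:
    """Price in US dollars for the given number of packs."""
    return packs * USD_PER_PACK


# A run estimated at 250 credits with 10 credits on hand: 240 short, so 3 packs.
packs = packs_needed(250, balance=10)
cost = purchase_cost_usd(packs)
```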
Or via REST:
CODEBLOCK24
Subscriptions (optional)
For regular usage, subscribe to a paid plan instead of buying packs:
CODEBLOCK25
Requires a payment method on file. See GET /api/plans for full plan details.
Estimate Before Running
Before submitting a private analysis, estimate the credit cost:
CODEBLOCK26
Result Structure
CODEBLOCK27
Working With Results
CODEBLOCK28
Error Handling
CODEBLOCK29
All errors inherit from DiscoveryError and include a suggestion field with actionable instructions.
Expected Data Format
Disco expects a flat table — columns for features, rows for samples.
- One row per observation: a patient, a sample, a transaction, a measurement, etc.
- One column per feature: numeric, categorical, datetime, or free text are all fine
- One target column: the outcome to analyze. Must have at least 2 distinct values.
- Missing values are OK: Disco handles them automatically. Don't drop rows or impute beforehand.
Supported formats: CSV, TSV, Excel (.xlsx), JSON, Parquet, ARFF, Feather. Max 5 GB.
Not supported: images, raw text documents, nested/hierarchical JSON, multi-sheet Excel (use the first sheet or export to CSV).
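A quick local pre-check can catch unsupported files before any upload. In the sketch below, the extension spellings other than .xlsx are assumptions inferred from the format names, and "5 GB" is interpreted as 5 × 1024³ bytes, which the document does not specify.

```python
import os
from typing import List

# Extensions inferred from the supported-format list (assumed spellings).
SUPPORTED_EXTENSIONS = {".csv", ".tsv", ".xlsx", ".json", ".parquet", ".arff", ".feather"}
MAX_BYTES = 5 * 1024**3  # assumed interpretation of "Max 5 GB"


def check_dataset_file(file_name: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file passes the pre-check."""
    problems = []
    ext = os.path.splitext(file_name)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(no extension)'}")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 5 GB")
    return problems


issues = check_dataset_file("scan.png", 2048)
```

For a file on disk, `os.path.getsize(path)` supplies `size_bytes`. The check does not look inside the file, so nested JSON or multi-sheet Excel still has to be caught separately.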
Displaying Results
When presenting Disco results, render interactive visualizations -- don't just dump text. Default order (adapt based on what the user asked):
1. Summary -- show summary.overview and summary.key_insights. Links to dashboard_urls.summary.url.
2. Pattern plots -- for the top patterns, render a violin plot: one violin per condition, one for all conditions combined, one for the overall dataset. Y-axis is the target variable. Shows how each condition narrows the distribution. Links to dashboard_urls.patterns.url.
3. Territory map -- 3D surface where X and Y axes are two features from a pattern's conditions, Z axis is the target. Shows the interaction landscape. Best when patterns involve feature interactions. Links to dashboard_urls.territory.url.
4. Feature importance -- horizontal waterfall bars floating from zero, sorted by absolute contribution. Links to dashboard_urls.features.url.
5. Correlation heatmap -- square matrix of feature correlations, sorted by correlation with target. Links to dashboard_urls.features.url.
Use judgment: if the user asked "what drives X?", lead with feature importance. If they asked "find something new", lead with novel patterns. If they're exploring interactions, lead with territory.
For exact colors, scales, and layout details, follow the full visualization spec: https://disco.leap-labs.com/visualization-spec
Always link to the relevant dashboard_urls page so users can explore the full interactive version.
Links
Disco
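Integration Options
- MCP server: remote server at https://disco.leap-labs.com/mcp, no install required. Best for datasets at a URL.
- Python SDK: pip install discovery-engine-api. Use this for local files of any size. Runs on your machine and streams files directly, with no base64 and no size limits.
Quick rule: if the data is at a URL, use file_url in discovery_upload. If it's a local file, use the Python SDK, or if Python isn't available, upload directly via the presign API and pass the result to discovery_analyze. Don't use file_content (base64) unless the file is already in memory and tiny.
MCP Server
Add to your MCP config:
```json
{
  "mcpServers": {
    "discovery-engine": {
      "url": "https://disco.leap-labs.com/mcp",
      "env": { "DISCOVERY_API_KEY": "disco_..." }
    }
  }
}
```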
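MCP Tools
Discovery workflow
| Tool | Purpose |
|---|---|
| discovery_upload | Upload a dataset. Supports URL download (file_url), local path (file_path), or base64 content (file_content). Returns a file_ref for use with discovery_analyze. |
| discovery_analyze | Submit a dataset for analysis using a file_ref from discovery_upload. Returns a run_id. |
| discovery_status | Poll a running analysis by run_id. |
| discovery_get_results | Fetch completed results: patterns, p-values, citations, feature importance. |
| discovery_estimate | Estimate the credit cost before committing to a run. |
Account management
| Tool | Purpose |
|---|---|
| discovery_signup | Start account creation. Sends a verification code to the email address. |
| discovery_signup_verify | Complete signup by submitting the verification code. Returns an API key. |
| discovery_login | Get a new API key for an existing account. Sends a verification code to the email address. |
| discovery_login_verify | Complete login by submitting the verification code. Returns a new API key. |
| discovery_account | Check credits, plan, and usage. |
| discovery_list_plans | View available plans and pricing. |
| discovery_subscribe | Subscribe to or change plan. |
| discovery_purchase_credits | Buy credit packs. |
| discovery_add_payment_method | Attach a Stripe payment method. |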
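MCP Workflow
Analyses take 3–15 minutes. Do not block: submit, continue other work, and poll for completion.
1. discovery_estimate → check the credit cost (always do this for private runs)
2. discovery_upload → upload the dataset, get a file_ref
3. discovery_analyze → submit the analysis with the file_ref, get a run_id
4. discovery_status → poll until the status is completed. Returns: status, queue_position, current_step, estimated_wait_seconds
5. discovery_get_results → fetch patterns, summary, feature importance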
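Getting Data In
Choose the right path for your situation:
| Situation | Best approach |
|---|---|
| Data is at an http/https URL | file_url in discovery_upload |
| Local file, Python available | Python SDK (engine.discover(...)) |
| Local file, MCP server running locally | file_path in discovery_upload |
| Local file, hosted MCP, no Python | Direct upload API (3 steps, see below) |
| Small file, any language | POST /api/data/upload/direct (single step, see below) |
| Tiny file already in memory | file_content in discovery_upload (last resort) |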
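Data at a URL:
```
discovery_upload(file_url="https://example.com/dataset.csv")
→ {"file": {...}, "columns": [{"name": "col1", "type": "continuous", ...}], "rowCount": 5000}
discovery_analyze(file_ref=<result above>, target_column="outcome")
```
The server downloads the file directly; nothing passes through the agent or the model context. Works with public URLs, S3 presigned URLs, or any accessible http/https link.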
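Local file, Python SDK (recommended for any local file):
```python
from discovery import Engine

engine = Engine(api_key="disco_...")
# Run inside an async function; in non-async code use engine.discover_sync() instead.
result = await engine.discover("data.csv", target_column="outcome")
```
Handles upload, polling, and results in one call. No size limit. See the Python SDK section for full documentation.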
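Local file, MCP server running locally (cloned from GitHub, stdio transport):
If you've cloned the repo and are running server.py locally, the process can read your filesystem directly:
```
discovery_upload(file_path="/home/user/data/dataset.csv")
→ {"file": {...}, "columns": [...], "rowCount": 5000}
discovery_analyze(file_ref=<result above>, target_column="outcome")
```
Reads the file locally and streams it directly to cloud storage; nothing passes through the model context. No size limit. file_path is silently ignored by the hosted server at disco.leap-labs.com/mcp; it only works with a locally running server.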
Local file, hosted MCP, direct upload (works from any language):
If you're using the hosted MCP server and Python isn't available, you can upload directly via the REST API in three steps, then pass the result to discovery_analyze as normal.
```bash
# 1. Get a presigned upload URL
curl -X POST https://disco.leap-labs.com/api/data/upload/presign \
  -H "Authorization: Bearer disco_..." \
  -H "Content-Type: application/json" \
  -d '{"fileName": "data.csv", "contentType": "text/csv", "fileSize": 1048576}'
# → {"uploadUrl": "https://storage.googleapis.com/...", "key": "uploads/abc/data.csv", "uploadToken": "tok_..."}

# 2. PUT the file directly to cloud storage (uploadUrl is presigned, so no auth header is needed)
curl -X PUT "<uploadUrl from step 1>" \
  -H "Content-Type: text/csv" \
  --data-binary @data.csv

# 3. Finalize the upload
curl -X POST https://disco.leap-labs.com/api/data/upload/finalize \
  -H "Authorization: Bearer disco_..." \
  -H "Content-Type: application/json" \
  -d '{"key": "uploads/abc/data.csv", "uploadToken": "tok_..."}'
# → {"ok": true, "file": {...}, "columns": [...], "rowCount": 5000}
```
Pass the finalize response directly to discovery_analyze as file_ref. No size limit.
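Small file, direct upload (single HTTP call, simpler than presign):
```bash
curl -X POST https://disco.leap-labs.com/api/data/upload/direct \
  -H "Authorization: Bearer disco_..." \
  -H "Content-Type: application/json" \
  -d '{"fileName": "data.csv", "content": "..."}'
# → {"ok": true, "file": {...}, "columns": [...], "rowCount": 5000}
```
Pass the response directly to discovery_analyze as file_ref. Simpler than the 3-step presign flow, but the entire file must fit in the request body. For large files, use presigned uploads or the Python SDK.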
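Last resort, tiny file already in memory:
Only use this if the file is already loaded into memory and none of the above options apply. The base64-encoded content passes through the model's context window, so this only works for very small files.
```python
import base64

# Read the file as bytes and base64-encode it into a text string.
with open("data.csv", "rb") as f:
    content = base64.b64encode(f.read()).decode()
```
```
discovery_upload(file_content=content, file_name="data.csv")
→ {"file": {...}, "columns": [...], "rowCount": 500}
discovery_analyze(file_ref=<result above>, target_column="outcome")
```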
MCP Parameters
discovery_upload:
Provide exactly one of file_url, file_path, or file_content.
- file_url: http/https URL. The server downloads it directly. Hosted