Dataset Compilation
We built a comprehensive dataset of datacenter networking hardware companies using LLM-powered search and filtering. The pipeline recursively expands a topic description into diverse search queries, extracts structured data with batch inference, then validates each candidate with a second LLM pass. The final dataset contains 188 unique companies with 100% recall against the Gartner Magic Quadrant.
To run this yourself, sign up at app.doubleword.ai and generate an API key.
Why This Works
Suppose you're doing competitive analysis for a networking hardware company, or you're an investor trying to map the datacenter infrastructure landscape, or you're identifying acquisition targets in a fragmented market. You need a comprehensive list of companies in the space, not just the obvious names that show up on the first page of Google.
The traditional options are expensive. You can pay a junior analyst to spend a few days searching and compiling a spreadsheet, which typically costs $500-800 and produces maybe 50 companies before they run out of search variations to try. Or you can subscribe to a commercial database like Crunchbase or PitchBook, which costs thousands per year and tends to be strong on VC-backed startups but weak on established players, international companies, and the long tail of smaller vendors.
The approach here uses LLMs for both query generation and filtering. We start with a topic description and recursively expand it into diverse search queries: the LLM takes "datacenter networking hardware companies" and generates sub-queries that approach it from different angles, and each of those expands again until we have 100+ specific queries covering geographic regions, product categories, business models, and recent events. Each query surfaces candidates the others miss.
Raw extraction produces a lot of noise. Our initial pass found over 1,700 company mentions, but many of them were cloud providers, software vendors, and data center operators that mention networking hardware in their marketing without actually manufacturing it. Equinix, CyrusOne, and Azure all showed up. So we run a second LLM pass: for each candidate, we do a fresh web search and ask the model to classify whether the company actually makes networking hardware or is something else entirely. This filtering step removed 62 false positives while preserving every company from the Gartner Magic Quadrant.
Results
We compiled a dataset of datacenter networking hardware vendors using this pipeline:
| Stage | Count |
|---|---|
| Search queries generated | 125 |
| Unique URLs searched | 714 |
| Company mentions extracted | 1,762 |
| After LLM deduplication | 285 |
| After validation filtering | 223 |
| Unique companies (final) | 188 |
The filtering removed 22% of candidates, mostly data center operators (Equinix, CyrusOne), cloud providers (Azure), and software/services companies that showed up because their marketing mentions networking hardware. We run deduplication twice: once on the raw extraction results, and again on the filtered list. The second pass catches duplicates like "Hewlett Packard Enterprise" and "Hewlett-Packard Enterprise (HPE)" that both passed the filter independently.
Here's what got cut and what survived:
| Excluded (62 companies) | Kept (188 unique) |
|---|---|
| DC operators: Equinix, CyrusOne, Aligned | Network equipment: Cisco, Juniper, Arista, Extreme |
| Cloud providers: Azure | Hardware OEMs: Dell, HPE, Supermicro |
| Software/services: DXC, Ensoft, SolveDirect | White-box vendors: Celestica, Edgecore, Accton |
| | Silicon + hardware: NVIDIA, Broadcom, Intel, Marvell |
| | Enterprise: Alcatel-Lucent Enterprise, H3C, Nokia |
Validation Against Gartner
To verify we weren't just finding well-known names, we checked recall against the Gartner Magic Quadrant for Data Center Networking:
| Quadrant | Companies | Found | Passed Filter |
|---|---|---|---|
| Leaders | Cisco, Juniper, Arista, Huawei | ✓ | ✓ |
| Challengers | HPE, NVIDIA | ✓ | ✓ |
| Visionaries | Dell, Nokia | ✓ | ✓ |
| Niche Players | H3C, Extreme, Alcatel-Lucent Enterprise | ✓ | ✓ |
We found all 11 vendors and the filtering step classified each one correctly, giving us 100% recall against the analyst benchmark.
Cost Breakdown
| Step | API | Requests | Cost |
|---|---|---|---|
| Generate queries | Doubleword (real-time) | ~50 | $0.10 |
| Initial search | Serper | 125 | $0.13 |
| Extract companies | Doubleword (batch) | 1,216 | $0.40 |
| LLM deduplication | Doubleword (batch) | 7 | $0.02 |
| Validation search | Serper | 285 | $0.29 |
| Validation classify | Doubleword (batch) | 285 | $0.09 |
| Final deduplication | Doubleword (batch) | 3 | $0.01 |
| Total | | | ~$1.05 |
Doubleword batch pricing is 50% off real-time rates; Serper charges $0.001 per search.
For comparison:
| Approach | Cost | Coverage |
|---|---|---|
| Junior analyst (2 days) | $500-800 | Finds the obvious names |
| Crunchbase subscription | $5,000+/year | Good for VC-backed startups, weak on established players |
| LLM + search (this approach) | ~$1 | 188 validated companies, 100% Gartner recall |
The cost difference is roughly 500x. You can run hundreds of queries, validate every candidate, and iterate on the pipeline freely.
Replication
The pipeline has five commands (dedupe runs twice, giving six steps), each with explicit --input and --output flags so you can see exactly what flows where. A separate validate command checks the result against ground truth.
Setup
```bash
cd dataset-compilation && uv sync
export DOUBLEWORD_API_KEY="your-key"  # from app.doubleword.ai
export SERPER_API_KEY="your-key"      # from serper.dev
```
Step 1: Generate Search Queries
```bash
uv run dataset-compilation generate-queries \
  --topic "datacenter networking hardware companies" \
  --max-depth 3 \
  --output queries.json
```
The command recursively expands the topic into diverse search queries. Starting from the topic description, the LLM generates 3-5 sub-queries that approach it from different angles (geography, product type, company type, etc.). Each sub-query expands again until the queries are specific enough to search. A depth-3 tree typically yields 100-150 queries covering:
- Geographic regions: "datacenter switch manufacturers in Taiwan", "Israeli networking hardware startups"
- Product categories: "100GbE NIC vendors", "open source network operating systems"
- Business models: "white-box switch ODMs", "networking hardware for hyperscale datacenters"
- Temporal: "datacenter hardware acquisitions 2024", "emerging networking startups Series A"
Query diversity matters more than volume. The recursive expansion ensures we probe different dimensions of the space rather than generating variations on the same query.
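To make the recursion concrete, here's a minimal sketch of what generate-queries does under the hood. It assumes Doubleword exposes an OpenAI-compatible chat completions endpoint; the base URL, model name, and prompt wording are illustrative stand-ins, not the pipeline's actual values:

```python
import json
import os

from openai import OpenAI

# Hypothetical base URL and model name; swap in the real values from
# app.doubleword.ai.
client = OpenAI(
    api_key=os.environ["DOUBLEWORD_API_KEY"],
    base_url="https://api.doubleword.ai/v1",
)

PROMPT = (
    "Generate 3-5 search queries that approach this topic from different "
    "angles (geography, product category, business model, recent events). "
    "Return only a JSON list of strings.\n\nTopic: {topic}"
)

def expand(topic: str, depth: int = 0, max_depth: int = 3) -> list[str]:
    """Recursively expand a topic until queries are specific enough to search."""
    if depth >= max_depth:
        return [topic]
    resp = client.chat.completions.create(
        model="doubleword-default",  # hypothetical model name
        messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
    )
    # Assumes the model returns bare JSON; production code should handle
    # malformed output.
    sub_queries = json.loads(resp.choices[0].message.content)
    queries: list[str] = []
    for q in sub_queries:
        queries.extend(expand(q, depth + 1, max_depth))
    return sorted(set(queries))

queries = expand("datacenter networking hardware companies")
json.dump(queries, open("queries.json", "w"), indent=2)
```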
Step 2: Run Searches
```bash
uv run dataset-compilation search \
  --input queries.json \
  --output search_results.json \
  --results-per-query 50
```
This runs each query through Serper. We use 50 results per query to get broad coverage.
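Under the hood this is one POST per query against Serper's JSON endpoint. A sketch, assuming queries.json is a flat list of strings (Serper's organic results carry title, link, and snippet fields):

```python
import json
import os

import requests

def search(query: str, num_results: int = 50) -> list[dict]:
    """Run one query through Serper and return its organic results."""
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query, "num": num_results},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("organic", [])

queries = json.load(open("queries.json"))
results = {q: search(q) for q in queries}
json.dump(results, open("search_results.json", "w"), indent=2)
```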
Step 3: Extract Companies
```bash
uv run dataset-compilation extract \
  --input search_results.json \
  --output companies_raw.json
```
For each search result, we extract structured company data: name, description, products, headquarters, and website. Since we're processing 1,200+ pages, we use batch inference at 50% off real-time pricing. The command waits for the batch to complete by default.
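Batch inference typically means writing one request per page into a JSONL file and submitting it as a job. A sketch of building that file, assuming Doubleword accepts OpenAI-style batch request lines and the search_results.json shape from Step 2; the prompt and model name are illustrative:

```python
import json

EXTRACT_PROMPT = (
    "From this search result, extract every company mentioned as a JSON list "
    "of objects with keys: name, description, products, headquarters, website. "
    "Return [] if no companies are mentioned.\n\n"
    "Title: {title}\nSnippet: {snippet}\nURL: {link}"
)

search_results = json.load(open("search_results.json"))
pages = [r for rs in search_results.values() for r in rs]

with open("batch_requests.jsonl", "w") as f:
    for i, page in enumerate(pages):
        content = EXTRACT_PROMPT.format(
            title=page.get("title", ""),
            snippet=page.get("snippet", ""),
            link=page.get("link", ""),
        )
        request = {
            "custom_id": f"extract-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "doubleword-default",  # hypothetical model name
                "messages": [{"role": "user", "content": content}],
            },
        }
        f.write(json.dumps(request) + "\n")
# Upload and polling are omitted; the CLI handles both and waits by default.
```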
Step 4: First Dedupe Pass
```bash
uv run dataset-compilation dedupe \
  --input companies_raw.json \
  --output companies.json
```
Raw extraction produces multiple mentions of the same company with variant names ("Cisco" vs "Cisco Systems" vs "Cisco Systems, Inc."). The LLM clusters these duplicates semantically, which handles cases like "HPE" vs "Hewlett Packard Enterprise" that string normalization would miss.
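A sketch of the clustering call, reusing the client assumption from Step 1 and assuming companies_raw.json is a list of objects with a name field; in practice the command chunks the name list across several batch requests (seven in the run above):

```python
import json
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DOUBLEWORD_API_KEY"],
    base_url="https://api.doubleword.ai/v1",  # hypothetical base URL
)

DEDUPE_PROMPT = (
    "Group these company names into clusters that refer to the same company, "
    "e.g. 'HPE' and 'Hewlett Packard Enterprise'. Return only a JSON list of "
    "lists of names; keep singletons as one-element lists.\n\n{names}"
)

companies = json.load(open("companies_raw.json"))
names = sorted({c["name"] for c in companies})

resp = client.chat.completions.create(
    model="doubleword-default",  # hypothetical model name
    messages=[{"role": "user",
               "content": DEDUPE_PROMPT.format(names="\n".join(names))}],
)
clusters = json.loads(resp.choices[0].message.content)

# Treat the first name in each cluster as canonical and collapse the rest.
canonical = {name: cluster[0] for cluster in clusters for name in cluster}
deduped = {canonical[c["name"]]: c for c in companies}
json.dump(list(deduped.values()), open("companies.json", "w"), indent=2)
```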
Step 5: Filter and Validate
```bash
uv run dataset-compilation filter \
  --input companies.json \
  --output companies_filtered.json
```
This step does the heavy lifting. Raw extraction found "Amazon" and "Microsoft" because their marketing pages mention networking hardware, even though they don't manufacture it. For each candidate, we run a fresh web search and ask the LLM to classify whether the company actually makes networking hardware or is something else entirely.
The command also saves companies_filtered_excluded.json with the LLM's reasoning for each rejection.
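A sketch of the per-candidate check: one fresh Serper search for context, then a single classification call. The prompt, JSON output shape, and model name are illustrative:

```python
import json
import os

import requests
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DOUBLEWORD_API_KEY"],
    base_url="https://api.doubleword.ai/v1",  # hypothetical base URL
)

CLASSIFY_PROMPT = (
    "Does '{name}' design or manufacture datacenter networking hardware "
    "(switches, NICs, optics, networking silicon)? Data center operators, "
    "cloud providers, and pure software/services companies do not count. "
    'Answer with only JSON: {{"keep": true, "reason": "..."}}.\n\n'
    "Search context:\n{context}"
)

def classify(name: str) -> dict:
    """Search the web for a candidate, then ask the LLM whether to keep it."""
    hits = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": f"{name} networking hardware products"},
        timeout=30,
    ).json().get("organic", [])
    context = "\n".join(h.get("snippet", "") for h in hits[:5])
    resp = client.chat.completions.create(
        model="doubleword-default",  # hypothetical model name
        messages=[{"role": "user",
                   "content": CLASSIFY_PROMPT.format(name=name, context=context)}],
    )
    return json.loads(resp.choices[0].message.content)
```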
Step 6: Second Dedupe Pass
```bash
uv run dataset-compilation dedupe \
  --input companies_filtered.json \
  --output companies_final.json
```
We run dedupe again on the smaller, cleaner filtered list. This catches remaining duplicates like "NVIDIA" and "Nvidia Corporation" that both passed the filter independently.
Validate Against Ground Truth
```bash
uv run dataset-compilation validate \
  --input companies_final.json \
  --ground-truth ground_truth.json
```
This checks recall against known companies (like the Gartner Magic Quadrant) and identifies any remaining duplicate clusters in the output.
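The recall check itself is simple. A naive sketch, assuming ground_truth.json is a flat list of vendor names; the real command likely matches names more robustly than substring comparison:

```python
import json

found = {c["name"].lower() for c in json.load(open("companies_final.json"))}
truth = [v.lower() for v in json.load(open("ground_truth.json"))]

# Substring matching tolerates variants like "juniper" vs "juniper networks".
hits = [v for v in truth if any(v in f or f in v for f in found)]
print(f"Recall: {len(hits)}/{len(truth)} = {len(hits) / len(truth):.0%}")
print("Missing:", [v for v in truth if v not in hits])
```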
Adapting to Other Domains
The pipeline works for any "find all X that match Y" task. Some examples:
| Topic | What the filter validates |
|---|---|
| AI chip startups | Actually builds silicon, not just AI software |
| European fintech companies | Offers financial products, not B2B SaaS |
| Open source database companies | Has an open source project, not just "open" marketing |
| Climate tech hardware | Manufactures physical products, not carbon credits |
The important thing is defining a clear filter criterion that separates true positives from noise. The LLM handles this classification well when you give it fresh web search context about each candidate.
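In code, adapting the pipeline is mostly a matter of swapping the criterion string into the classification prompt. These criterion strings are illustrative restatements of the table above, not values the pipeline ships with:

```python
# Hypothetical per-domain criteria; each one names what counts as a true
# positive and which common false positives to reject.
CRITERIA = {
    "ai-chip-startups": "designs or sells AI accelerator silicon, not just AI software",
    "european-fintech": "offers financial products in Europe, not B2B SaaS tooling",
    "oss-databases": "maintains an open source database project, not just 'open' marketing",
    "climate-hardware": "manufactures physical climate tech products, not carbon credits",
}

def build_filter_prompt(name: str, domain: str, context: str) -> str:
    """Instantiate the classification prompt for one candidate and domain."""
    return (
        f"Does '{name}' match this criterion: {CRITERIA[domain]}? "
        'Answer with only JSON: {"keep": true, "reason": "..."}.\n\n'
        f"Search context:\n{context}"
    )
```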
Limitations
Filtering accuracy. The LLM classifier isn't perfect. We manually reviewed a sample and found a ~5% error rate, mostly false positives (companies incorrectly kept). For high-stakes applications, you might want a human review step for low-confidence classifications.
Deduplication. LLM-based deduplication catches most variants ("HPE" = "Hewlett Packard Enterprise"), but edge cases slip through. Running the dedupe step twice helps, but for critical applications you should spot-check the output for remaining duplicates.
Search depth. We used 125 queries × 50 results. More aggressive expansion (deeper trees, more results per query) would find more companies, though with diminishing returns as coverage saturates.
Temporal coverage. The pipeline captures what's currently indexed. Recently founded companies or those with minimal web presence will be missed.
Conclusion
The pipeline runs LLMs at two stages. First, during query expansion and extraction: we turn a topic description into 125 diverse search queries, then extract structured company data from 1,200+ search results. Second, during filtering: we run a fresh web search for each candidate and ask the LLM to classify whether it actually matches the topic. This second pass cut 62 false positives (data center operators, cloud providers, companies with insufficient web presence) while keeping all 11 vendors from the Gartner Magic Quadrant.
Batch inference makes this practical. Processing 1,500+ LLM requests at real-time API rates would cost several dollars and take hours. Batch pricing brings the total under $1, and results come back in minutes. At that price point, you can run the pipeline monthly to catch new entrants, or adapt it to adjacent markets.