
Classifying Security Vulnerabilities

Classification is everywhere: categorizing support tickets, labeling documents, tagging content, sorting alerts by severity. It's often the first thing teams try to automate. The question is whether LLMs can do it well enough, and whether it's economical at scale.

We tested this on a hard problem, classifying security vulnerabilities by type, using 4,642 real vulnerabilities from CVEfixes. Qwen3-30B achieved 46.5% accuracy on grouped classification (Memory Safety, Pointer, Integer, etc.) for a total cost of $0.40 across the full dataset. Running twice and using agreement as a calibration signal pushes accuracy to 66% on the 58% of samples where both runs agree.

To run this yourself, sign up at app.doubleword.ai and generate an API key.
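
If you want to poke at a single sample before committing to a batch run, here's a minimal sketch. It assumes Doubleword exposes an OpenAI-compatible chat endpoint; the base URL and the prompt wording are illustrative assumptions, not confirmed specifics (the actual batch workflow is in the Replication section below).

import os
from openai import OpenAI

# Single-sample sketch. base_url and the prompt are illustrative
# assumptions -- check the Doubleword docs for the real endpoint.
client = OpenAI(
    api_key=os.environ["DOUBLEWORD_API_KEY"],
    base_url="https://api.doubleword.ai/v1",  # hypothetical
)

function_source = 'int f(char *s) { char buf[8]; strcpy(buf, s); return 0; }'

resp = client.chat.completions.create(
    model="Qwen3-30B-A3B-Instruct",
    messages=[{
        "role": "user",
        "content": "Classify this function's vulnerability as a single "
                   "CWE ID (e.g. CWE-125, CWE-787). Reply with the ID "
                   "only.\n\n" + function_source,
    }],
)
print(resp.choices[0].message.content)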

Why This Matters

At $0.40 per run, you can afford to try things. Fine-grained CWE classification doesn't work—20% accuracy isn't useful. Grouped classification does, at least for Memory Safety where the model hits 82%. Running twice gives you a confidence signal. These are things you'd want to know before committing to a classification approach, and batch inference lets you learn them for under a dollar.

Results

We used CVEfixes, a dataset of functions from real vulnerability-fixing commits. Each function is labeled with a CWE type—4,642 samples across 24 CWEs.

Fine-grained classification is hard

Model        Accuracy (24 classes)   Cost
GPT-5.2      19.5%                   $8.00
Qwen3-30B    19.2%                   $0.40
GPT-5-mini   17.7%                   $0.80
Qwen3-235B   16.2%                   $1.20

Random baseline: 4.2%. All models are ~4-5x better than random, but 20% accuracy isn't useful for production. The models confuse similar CWEs—CWE-125 (out-of-bounds read) vs CWE-787 (out-of-bounds write) requires understanding whether the bug allows reading or writing, and they often get this wrong.

Grouped classification works

Grouping into broader categories improves accuracy substantially:

Model        Accuracy (8 groups)   Memory Safety   Pointer   Integer   Cost
Qwen3-30B    46.5%                 82.2%           34.9%     14.3%     $0.40
Qwen3-235B   38.6%                 59.5%           33.2%     10.7%     $1.20
GPT-5-mini   38.3%                 56.0%           47.0%     21.1%     $0.80
GPT-5.2      35.2%                 44.6%           41.5%     27.7%     $8.00

Random baseline: 12.5%. Qwen3-30B hits 46.5%, driven by 82% accuracy on Memory Safety (buffer overflows, out-of-bounds access). Memory Safety is half the dataset, so this specialization pays off in the aggregate numbers.
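
For reference, the per-category numbers in the table are just accuracy computed within each true group. A minimal sketch, where the record field names and example rows are assumptions for illustration:

from collections import defaultdict

# Per-category accuracy over grouped predictions. Field names
# and the example rows are illustrative, not the tool's output.
records = [
    {"label": "Memory Safety", "prediction": "Memory Safety"},
    {"label": "Pointer/Lifetime", "prediction": "Memory Safety"},
    {"label": "Integer", "prediction": "Integer"},
]

hits, totals = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["label"]] += 1
    hits[r["label"]] += r["prediction"] == r["label"]

for group, total in totals.items():
    print(f"{group}: {hits[group] / total:.1%}")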

Which model to use

Need                         Model       Accuracy        Cost
Best value                   Qwen3-30B   46.5% grouped   $0.40
Balanced across categories   GPT-5.2     35.2% grouped   $8.00

Qwen3-30B is best if your vulnerabilities are mostly memory safety issues—common in C/C++ codebases. GPT-5.2 is more balanced across categories but costs 20x more with lower overall accuracy.

Calibration: Run Twice

At $0.40 for 4,600 samples, you can run twice and use agreement as a confidence signal.

Agreement         Samples   Accuracy
Both runs agree   58%       66%
Runs disagree     42%       Flag for review

When both runs agree, accuracy jumps to 66%. When they disagree, flag for manual review. Two runs cost $0.80 total—still under a dollar for 4,600 samples with a calibration signal included.
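
The mechanics are simple: treat matching predictions as accepted labels and route everything else to a human. A minimal sketch, assuming each run produces a mapping from sample ID to predicted group (the data here is illustrative):

# Agreement between two independent runs as a confidence signal.
run1 = {"s1": "Memory Safety", "s2": "Pointer/Lifetime", "s3": "Integer"}
run2 = {"s1": "Memory Safety", "s2": "Input Validation", "s3": "Integer"}

accepted = {sid: pred for sid, pred in run1.items() if run2.get(sid) == pred}
needs_review = sorted(set(run1) - set(accepted))

print(f"accepted: {accepted}")
print(f"flag for review: {needs_review}")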

Scaling

Volume            Qwen3-30B (1 run)   Qwen3-30B (2 runs)
1,000 samples     $0.09               $0.18
10,000 samples    $0.86               $1.72
100,000 samples   $8.60               $17.20
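
Cost scales linearly with sample count: the table is just the measured per-sample rate ($0.40 / 4,642 samples) multiplied out, agreeing with the figures above to within a cent or two of rounding.

# Linear cost projection from the measured run; matches the
# table above up to rounding.
PER_SAMPLE = 0.40 / 4642  # ~$0.086 per 1,000 samples

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} samples: ${n * PER_SAMPLE:.2f} (1 run), "
          f"${2 * n * PER_SAMPLE:.2f} (2 runs)")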

Error Analysis

Top confusions on fine-grained classification:

Actual                      Predicted As         Count
CWE-787 (OOB Write)         CWE-125 (OOB Read)   370
CWE-20 (Input Validation)   CWE-125 (OOB Read)   362
CWE-119 (Buffer Overflow)   CWE-125 (OOB Read)   313

The model over-predicts CWE-125, the most common class in the dataset. These confusions all fall within Memory Safety, which is why grouped classification works better.
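
Reproducing this kind of breakdown only requires tallying (actual, predicted) pairs. A sketch with assumed field names and illustrative rows:

from collections import Counter

# Tally misclassified (actual, predicted) pairs to surface the
# top confusions. Field names and rows are illustrative.
records = [
    {"label": "CWE-787", "prediction": "CWE-125"},
    {"label": "CWE-20", "prediction": "CWE-125"},
    {"label": "CWE-125", "prediction": "CWE-125"},
]

confusions = Counter(
    (r["label"], r["prediction"])
    for r in records
    if r["label"] != r["prediction"]
)
for (actual, predicted), count in confusions.most_common(3):
    print(f"{actual} -> {predicted}: {count}")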

Replication

cd bug-detection-ensemble
uv sync

# Download CVEfixes (~2GB SQLite database)
uv run bug-ensemble fetch-cvefixes

# Set API key
export DOUBLEWORD_API_KEY="your-key"

# Run classification
uv run bug-ensemble classify -m 30b -o results/run1
uv run bug-ensemble classify -m 30b -o results/run2

# Check batch status
uv run bug-ensemble status -o results/run1 --wait

# Analyze results
uv run bug-ensemble classify-analyze -o results/run1

Available Models

Alias   Model
30b     Qwen3-30B-A3B-Instruct
235b    Qwen3-235B-A22B-Instruct

Customizing Categories

Edit src/classify.py to change the groupings:

CWE_GROUPS = {
    "Memory Safety": ["CWE-125", "CWE-787", "CWE-119", "CWE-120", "CWE-122"],
    "Pointer/Lifetime": ["CWE-476", "CWE-416", "CWE-415", "CWE-763"],
    "Integer": ["CWE-190"],
    "Resource": ["CWE-400", "CWE-401", "CWE-772"],
    "Input Validation": ["CWE-20", "CWE-22", "CWE-78"],
    "Concurrency": ["CWE-362"],
    "Control Flow": ["CWE-617", "CWE-835", "CWE-674"],
    "Other": ["CWE-59", "CWE-295", "CWE-269", "CWE-200"],
}
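
To route a fine-grained prediction into its group, one straightforward approach is to invert the mapping so each CWE ID points at its group:

# Inverted lookup: CWE ID -> group name.
CWE_TO_GROUP = {
    cwe: group
    for group, cwes in CWE_GROUPS.items()
    for cwe in cwes
}

assert CWE_TO_GROUP["CWE-787"] == "Memory Safety"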

Limitations

Category imbalance. Memory Safety is 49% of the dataset. Models that specialize in Memory Safety look better than balanced models.

Label quality. CVEfixes labels come from CVE metadata, not expert annotation.

Generalization. Results are specific to C/C++ vulnerabilities from open-source projects.


Data: CVEfixes v1.0.7, 4,642 C/C++ functions across 24 CWE types. Costs use Doubleword batch pricing.