The new Claude Haiku 4.5 and OpenAI GPT-5 Pro models have been released. I wanted to test them, alongside the other LLM models I use, on code analysis of my csfa.sh nftables wrapper script and its GitHub Actions test workflow. This is a test of code analysis, not code generation. Code analysis is useful for understanding code bases, writing documentation, troubleshooting code and planning.
CSFA (v1.3.1) is a CSF-like wrapper for nftables that provides familiar ConfigServer Security & Firewall commands mapped to modern nftables equivalents. The project uses a single Bash script (csfa.sh) that manages firewall rules through a dedicated inet table called "csfa".
I have paid subscriptions and accounts with:
OpenAI ChatGPT Plus
Claude AI Max $100
Gemini AI Pro
T3 Chat
OpenRouter AI
KiloCode
I tested 27 AI LLM models for code analysis and summaries and then used Claude Code Sonnet 4.5 to evaluate and rank all 27 AI LLM model responses.
The 27 AI LLM models evaluated are (including costs for usage):
Claude Code Sonnet 4.5 included in subscription cost
OpenAI Codex GPT-5 Medium Thinking included in subscription cost
OpenAI ChatGPT GPT-5 Thinking included in subscription cost
Claude Code Opus 4.1 included in subscription cost
Claude AI Web Opus 4.1 Thinking included in subscription cost
You can easily replicate these tests by asking AI LLM models to summarize/analyse your code bases/scripts and saving their responses to markdown files. Then feed those responses into an AI LLM model for evaluation.
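As a rough sketch of that replication workflow, the Python below gathers saved markdown responses and assembles them into a single evaluation prompt. The function names, the one-file-per-model layout and the prompt wording are all assumptions for illustration; this is not the exact prompt used for the rankings in this post.

```python
from pathlib import Path


def load_responses(folder):
    """Return {model_name: response_text} for every .md file in folder.

    Assumes one markdown file per model, named after the model
    (hypothetical layout, e.g. responses/claude-haiku-4.5.md).
    """
    responses = {}
    for path in sorted(Path(folder).glob("*.md")):
        responses[path.stem] = path.read_text(encoding="utf-8")
    return responses


def build_evaluation_prompt(responses, criteria=("accuracy", "thoroughness")):
    """Combine all saved responses into one prompt for the evaluating model."""
    parts = [
        "Evaluate and rank the following code analysis responses.",
        "Score each one out of 100 for: " + ", ".join(criteria) + ".",
    ]
    for name, text in responses.items():
        # Label each response so the evaluator can refer to it by model name.
        parts.append(f"\n## Response from {name}\n\n{text.strip()}")
    return "\n".join(parts)
```

Calling `build_evaluation_prompt(load_responses("responses"))` (with `responses` being whatever folder you saved the markdown files in) gives you text you can paste into the evaluating model, together with the original script so it can check each response against the real code.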
| Rank | AI Model | Accuracy | Thoroughness | Overall Score | Key Strength |
|------|----------|----------|--------------|---------------|--------------|
| 1 | Claude Code Haiku 4.5 | 100/100 | 100/100 | 100.0/100 | Executive summary, comprehensive architecture, production readiness scoring |
| 2 | Claude Code Sonnet 4.5 | 99/100 | 100/100 | 99.5/100 | Test coverage matrix, bug fix analysis, performance characteristics |
| 3 | Claude Code Opus 4.1 | 98/100 | 97/100 | 97.5/100 | Architecture decisions, design patterns, innovation areas |
| 4 | OpenRouter Qwen3 Max | 93/100 | 93/100 | 93.0/100 | Outstanding formatting, emoji organization, clear recommendations |
| 5 | KiloCode Claude Sonnet 4 | 92/100 | 93/100 | 92.5/100 | Strong code references, excellent line numbers, architectural overview |
| 6 | Kilo Code OpenAI GPT-5 Pro | 91/100 | 90/100 | 90.5/100 | Concise yet comprehensive, excellent technical accuracy |
| 7 | Qwen3 Next 80B A3B Thinking | 90/100 | 89/100 | 89.5/100 | Strong technical accuracy, good depth, temp rule coverage |
| 8 | Kilo Code xAI Grok Code Fast 1 | 89/100 | 88/100 | 88.5/100 | Mermaid diagrams, workflow visualization, usage examples |
| 9 | Google Gemini 2.5 Pro Web | 88/100 | 87/100 | 87.5/100 | Great narrative style, clear explanations, systemd focus |
| 10 | Claude AI Web Opus 4.1 Thinking | 87/100 | 86/100 | 86.5/100 | Clear organization, OUTPUT cleanup emphasis, good feature list |
| 11 | Qwen Plus 0728 (thinking) | 85/100 | 84/100 | 84.5/100 | Excellent CI workflow, phase breakdown, validation methodology |
| 12 | Qwen3 30B A3B Thinking 2507 | 84/100 | 83/100 | 83.5/100 | Good organization, accurate descriptions, matrix testing |
| 13 | Kilo Code Gemini 2.5 Flash Lite | 92/100 | 84/100 | 88.0/100 | Well-organized, accurate technical details, clear sections |
| 14 | Kilo Code Gemini 2.5 Flash Preview | 91/100 | 83/100 | 87.0/100 | Good structure, technical depth, workflow breakdown |
| 15 | Kilo Code DeepSeek V3.2 Exp | 90/100 | 80/100 | 85.0/100 | Structured analysis, code references, clear architecture |
| 16 | KiloCode Sonoma Dusk Alpha | 89/100 | 78/100 | 83.5/100 | Concise yet comprehensive, clear command breakdown |
| 17 | KiloCode Sonoma Sky Alpha | 87/100 | 65/100 | 76.0/100 | Good for quick reference, accurate basic information |
| 18 | KiloCode MoonshotAI Kimi K2 0905 | 88/100 | 72/100 | 80.0/100 | Brief but accurate, clear feature list, correct terminology |
| 19 | KiloCode Qwen3 Coder | 86/100 | 70/100 | 78.0/100 | Clear and organized, concise overview, identifies components |
| 20 | KiloCode Mistral Medium 3.1 | 89/100 | 76/100 | 82.5/100 | Good concise analysis, effective code refs, workflow summary |
| 21 | Kilo Code Code-Supernova | 85/100 | 75/100 | 80.0/100 | Good high-level summary, CSF identification, systemd understanding |
| 22 | Kilo Code Grok 4 Fast | 82/100 | 81/100 | 81.5/100 | Feature coverage, clear structure, command understanding |
| 23 | Kilo Code OpenRouter zAI GLM 4.6 | 80/100 | 72/100 | 76.0/100 | Technical accuracy, feature list, basic understanding |
| 24 | OpenAI Codex GPT-5 Medium Thinking | 90/100 | 60/100 | 75.0/100 | Accurate terminology, concise bullet format, good for reference |
| 25 | OpenAI ChatGPT GPT-5 Thinking | 88/100 | 55/100 | 71.5/100 | Concise format, main features covered, clear structure |
| 26 | Kilo Code OpenRouter Grok 4 Fast | 70/100 | 45/100 | 57.5/100 | Basic features identified, systemd understanding (TRUNCATED) |
| 27 | DeepSeek V3.1 | 40/100 | 30/100 | 35.0/100 | None (CRITICAL FAILURE) |
Model Performance Summary
Comprehensive performance breakdown for all 27 AI models evaluated. Scores include overall rating, accuracy, thoroughness, technical coverage (out of 29 critical points), key strengths, and weaknesses.
| Rank | Model Name | Overall Score | Accuracy | Thoroughness | Tech Coverage | Key Strengths | Key Weaknesses |
|------|------------|---------------|----------|--------------|---------------|---------------|-----------------|
| 1 | Claude Code Haiku 4.5 | 100.0/100 | 100/100 | 100/100 | 28/29 (97%) | Executive summary, 13 component breakdown, feature timeline v1.2.0→v1.3.1, dual parsing with examples, systemd integration docs, OUTPUT chain fix coverage, production readiness matrix, CI phase breakdown, 5 implementation patterns | None identified |
| 2 | Claude Code Sonnet 4.5 | 99.5/100 | 99/100 | 100/100 | 27/29 (93%) | 1,274 lines comprehensive, 5-phase test matrix, 13 edge cases, feature breakdown, bug fix analysis, performance metrics, parser comparison, architecture patterns, flock mechanism, recommendations | Slight verbosity, no visual diagrams |
| 3 | Claude Code Opus 4.1 | 97.5/100 | 98/100 | 97/100 | 25/29 (86%) | Architecture decisions, single-script benefits, dual parsing rationale, 7 innovations identified, production readiness checklist, dependencies clear, code quality assessment, systemd transient units, atomic updates, IPv4/IPv6 support | Less test detail than Sonnet, missing performance section |
| 4 | OpenRouter Qwen3 Max | 93.0/100 | 93/100 | 93/100 | 19/29 (66%) | Emoji-based organization, clear purpose statements, systemd integration coverage, v1.3.0/v1.3.1 features, CSF compatibility understanding, comprehensive tables, OUTPUT chain analysis, clear recommendations | Verbosity without insight, formal audience concerns |
| 5 | KiloCode Claude Sonnet 4 | 92.5/100 | 92/100 | 93/100 | 18/29 (62%) | Exceptional file:line code references, detailed function breakdown, architectural overview, strong technical accuracy, good markdown links, clear feature categorization | Less narrative flow, systemd explanation gaps |
| 6 | Kilo Code OpenAI GPT-5 Pro | 90.5/100 | 91/100 | 90/100 | 17/29 (59%) | Concise comprehensive explanation, excellent technical accuracy, all major features, clear command reference, dual parsing handling, good organization | Less detailed than Tier 1, fewer code examples |
| 7 | Qwen3 Next 80B A3B Thinking | 89.5/100 | 90/100 | 89/100 | 15/29 (52%) | Very detailed analysis, temp rules explanation, systemd coverage, clear command examples, solid technical depth, good overall structure | Poor organization, limited v1.3.0+ features |
| 8 | Kilo Code xAI Grok Code Fast 1 | 88.5/100 | 89/100 | 88/100 | 23/29 (79%) | Only model with Mermaid diagrams, command dispatch flow visualization, comprehensive feature analysis, strong systemd timer details, limitations discussion, usage examples | Very lengthy, section redundancy |
| 9 | Google Gemini 2.5 Pro Web | 87.5/100 | 88/100 | 87/100 | 16/29 (55%) | Great narrative style, clear systemd explanations, well-formatted sections, good workflow breakdown, user-friendly for broader audiences | Less technical depth, missing v1.3.1 specifics |
| 10 | Claude AI Web Opus 4.1 Thinking | 86.5/100 | 87/100 | 86/100 | 13/29 (45%) | Clear section organization, strong OUTPUT cleanup emphasis, good test coverage, comprehensive feature list, balanced depth | Architectural depth gaps, fewer code examples |
| 11 | Qwen Plus 0728 (thinking) | 84.5/100 | 85/100 | 84/100 | 17/29 (59%) | Excellent CI workflow coverage, detailed test strategy, phase-by-phase breakdown, good validation methodology, clear structure | Less comprehensive on script features, missing implementation |
| 12 | Qwen3 30B A3B Thinking 2507 | 83.5/100 | 84/100 | 83/100 | 13/29 (45%) | Good organization, accurate descriptions, solid temp rule understanding, matrix testing mentioned, clear explanations | Limited advanced features, missing v1.3.1 depth |
| 13 | Kilo Code Gemini 2.5 Flash Lite | 88.0/100 | 92/100 | 84/100 | 15/29 (52%) | Well-organized structure, accurate technical details, good JSON tracking coverage, clear section delineation, solid basics | Less comprehensive than top tier, advanced gaps |
| 14 | Kilo Code Gemini 2.5 Flash Preview | 87.0/100 | 91/100 | 83/100 | 14/29 (48%) | Good structure, technical depth, accurate core functionality, clear workflow breakdown, well-formatted | Missing test phase details, limited v1.3.1 |
| 15 | Kilo Code DeepSeek V3.2 Exp | 85.0/100 | 90/100 | 80/100 | 12/29 (41%) | Structured analysis with code references, clear architectural overview, accurate version identification, good basics | Brief on advanced features, missing handle details |
| 16 | KiloCode Sonoma Dusk Alpha | 83.5/100 | 89/100 | 78/100 | 11/29 (38%) | Concise yet comprehensive, covers essential functionality, clear command breakdown, good basics | Limited technical analysis, missing OUTPUT cleanup |
| 17 | KiloCode Sonoma Sky Alpha | 76.0/100 | 87/100 | 65/100 | 8/29 (28%) | Good for quick reference, accurate basic information, concise format | Minimal depth, no advanced coverage, lacking examples |
| 18 | KiloCode MoonshotAI Kimi K2 0905 | 80.0/100 | 88/100 | 72/100 | 9/29 (31%) | Brief but accurate on basics, clear feature list, correct technical terminology | Major depth gaps, no examples, missing critical features |
| 19 | KiloCode Qwen3 Coder | 78.0/100 | 86/100 | 70/100 | 9/29 (31%) | Clear and organized, concise overview, identifies key components, good structure | Minimal technical detail, missing v1.3.0+ features |
| 20 | KiloCode Mistral Medium 3.1 | 82.5/100 | 89/100 | 76/100 | 11/29 (38%) | Good concise analysis, effective code refs, clear file relationships, good workflow summary | Lacks technical depth, misses advanced features |
| 21 | Kilo Code Code-Supernova | 80.0/100 | 85/100 | 75/100 | 10/29 (34%) | Good high-level summary, correct CSF identification, systemd timer understanding, reasonable basics | Missing OUTPUT cleanup, lacking CI depth |
| 22 | Kilo Code Grok 4 Fast | 81.5/100 | 82/100 | 81/100 | 11/29 (38%) | Feature coverage, clear structure, command understanding, concise format, good organization | Limited depth on advanced features, missing specifics |
| 23 | Kilo Code OpenRouter zAI GLM 4.6 | 76.0/100 | 80/100 | 72/100 | 8/29 (28%) | Technical accuracy, feature list coverage, basic understanding, clear terminology | Very brief, lacks depth, minimal examples |
| 24 | OpenAI Codex GPT-5 Medium Thinking | 75.0/100 | 90/100 | 60/100 | 8/29 (28%) | Accurate technical terminology, concise bullet format, good for quick reference, precise language | Too brief, missing detailed examples, insufficient depth |
| 25 | OpenAI ChatGPT GPT-5 Thinking | 71.5/100 | 88/100 | 55/100 | 7/29 (24%) | Concise bullet format, covers main features, clear structure, good basics | Incomplete OUTPUT chain, minimal depth, missing advanced features |
| 26 | Kilo Code OpenRouter Grok 4 Fast | 57.5/100 | 70/100 | 45/100 | 5/29 (17%) | Basic feature identification, systemd timer understanding | Response truncated mid-sentence, incomplete analysis, no CI coverage, missing v1.3.1 |
| 27 | DeepSeek V3.1 | 35.0/100 | 40/100 | 30/100 | 1/29 (3%) | None | CRITICAL FAILURE: Misinterpreted as "Code Security and Formatting Analysis", completely wrong purpose, generic assumptions, inaccurate descriptions |
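The overall scores in the table match a plain average of the accuracy and thoroughness scores, and the tech coverage percentages match the points hit divided by 29, rounded to a whole number. That is an observation from the numbers rather than a documented formula from the evaluating model, so treat the sketch below (with hypothetical function names) as a way to sanity-check your own tables.

```python
def overall_score(accuracy, thoroughness):
    """Average of the two 0-100 scores, e.g. 99 and 100 give 99.5."""
    return (accuracy + thoroughness) / 2


def tech_coverage(points_hit, total_points=29):
    """Format coverage the way the table does, e.g. '28/29 (97%)'."""
    percent = round(100 * points_hit / total_points)
    return f"{points_hit}/{total_points} ({percent}%)"
```

For example, `overall_score(92, 84)` returns `88.0` and `tech_coverage(15)` returns `"15/29 (52%)"`, which matches the Kilo Code Gemini 2.5 Flash Lite row.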
Performance Tier Breakdown
🏆 Elite Tier (95-100): 3 models
Claude Code Haiku 4.5 (100.0), Claude Code Sonnet 4.5 (99.5), Claude Code Opus 4.1 (97.5)
Kilo Code OpenAI GPT-5 Pro (90.5), Qwen3 Next 80B (89.5), Kilo Code xAI Grok (88.5), Google Gemini 2.5 Pro (87.5), Claude AI Web Opus (86.5), Qwen Plus 0728 (84.5)
Characteristics: Solid analysis, good organization, adequate depth, some feature gaps
🥉 Good Tier (80-84): 5 models
Qwen3 30B A3B (83.5), Kilo Code Gemini 2.5 Flash Lite (88.0), Kilo Code Gemini 2.5 Flash Preview (87.0), Kilo Code DeepSeek V3.2 (85.0), KiloCode Sonoma Dusk (83.5)
Characteristics: Competent analysis, accurate basics, missing advanced features
📋 Adequate Tier (75-79): 5 models
KiloCode Qwen3 Coder (78.0), Kilo Code Code-Supernova (80.0), KiloCode MoonshotAI Kimi (80.0), Kilo Code Grok 4 Fast (81.5), KiloCode Mistral (82.5)
Characteristics: Basic accuracy, limited depth, suitable for quick reference
⚠️ Weak Tier (70-74): 3 models
Kilo Code OpenRouter zAI GLM (76.0), OpenAI Codex GPT-5 Medium (75.0), OpenAI ChatGPT GPT-5 (71.5)
Characteristics: Brief but accurate, insufficient detail, too concise
❌ Poor Tier (Below 70): 2 models
Kilo Code OpenRouter Grok 4 Fast (57.5), DeepSeek V3.1 (35.0)
Characteristics: Truncated/incomplete or fundamentally incorrect
Key Performance Insights
Highest Accuracy: Claude Code Haiku 4.5 (100/100)
Perfect technical accuracy with comprehensive coverage
Most Thorough: Claude Code Haiku 4.5 (100/100)
575+ lines with detailed analysis covering all aspects
Best Technical Coverage: Claude Code Haiku 4.5 (28/29 = 97%)
Only missing one obscure technical point
Most Innovative: Kilo Code xAI Grok Code Fast 1
Only model with Mermaid flow diagrams for visualization
Best Formatting: OpenRouter Qwen3 Max
Outstanding visual hierarchy with emojis and section markers
Most Concise Excellence: KiloCode Claude Sonnet 4
Strong technical depth with efficient code references
Biggest Disappointment: DeepSeek V3.1
Fundamental misunderstanding of project purpose (35/100)
Concurrency Safety: Only 8/27 models (30%) mentioned flock-based file locking
Test Phase 3: Only 8/27 models (30%) explained the 45-second wait test for OUTPUT cleanup
Performance Characteristics: Only 3/27 models (11%) covered performance characteristics, all of them Claude models
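If you want a quick first pass at this kind of gap count on your own saved responses, a keyword check is a rough starting point. The sketch below is an assumption about how you might do it, with a hypothetical `mention_rate` helper; it only detects whether a term appears, not whether the model explained it correctly, so it is cruder than having an LLM evaluate the responses.

```python
def mention_rate(responses, keyword):
    """Return (count, total, percent) of responses that mention keyword.

    responses is a {model_name: response_text} dict; matching is
    case-insensitive and purely textual.
    """
    total = len(responses)
    needle = keyword.lower()
    count = sum(1 for text in responses.values() if needle in text.lower())
    percent = round(100 * count / total) if total else 0
    return count, total, percent
```

With 27 responses of which 8 contain the word `flock`, `mention_rate(responses, "flock")` returns `(8, 27, 30)`.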
Key Strengths by Category
Best Understanding of csfa.sh Architecture
1. Claude Code Haiku 4.5 - 13-component breakdown with precise line references + feature timeline (v1.2.0→v1.3.1)
1. Claude Code Sonnet 4.5 - Detailed subsystem breakdown with 6 architectural decisions + extensive code examples
(Tie: Complementary strengths - Haiku excels at component organization, Sonnet at technical rationale)
3. Claude Code Opus 4.1 - Strong design pattern focus
Best Understanding of CI/CD Workflow
Claude Code Haiku 4.5 - Phase-by-phase breakdown with verification
Qwen Plus 0728 - Detailed test strategy with methodology
Claude Code Sonnet 4.5 - Comprehensive test matrix with edge cases
Best Dual Parsing Mode Coverage
Claude Code Sonnet 4.5 - Detailed JSON vs. text comparison
Claude Code Haiku 4.5 - Clear mode selection logic and fallback
Kilo Code OpenAI GPT-5 Pro - Concise, accurate explanation
Best v1.3.0/v1.3.1 Feature Coverage
Claude Code Haiku 4.5 - Complete feature timeline (v1.2.0→v1.3.0→v1.3.1) + dedicated Phase 5 testing for OUTPUT cleanup
Claude Code Sonnet 4.5 - Comprehensive bug fix analysis with Issue-Cause-Fix-Impact structure
Claude Code Opus 4.1 - Innovation areas clearly documented
Most Unique Insights
Kilo Code xAI Grok Code Fast 1: Mermaid diagrams for visualization
OpenRouter Qwen3 Max: Emoji-based organization system
Claude Code Sonnet 4.5: Performance metrics and self-evaluation
Claude Code Haiku 4.5: Production readiness scoring
Claude Code Opus 4.1: Design rationale documentation
Conclusion
The evaluation reveals significant variation in AI model performance on technical code analysis. The Claude Code models (Haiku, Sonnet, Opus) dominate with scores of 100, 99.5, and 97.5, demonstrating exceptional capability in:
Understanding complex infrastructure systems (nftables, systemd)
Organizing information logically with clear hierarchies
Maintaining accuracy while providing comprehensive coverage
For production-critical technical analysis: Claude Code models (Haiku 4.5, Sonnet 4.5, Opus 4.1) are unambiguously superior, with xAI Grok and Qwen3 Max offering strong alternatives.
Models to avoid: DeepSeek V3.1 (35/100 - critical failure) and Kilo Code OpenRouter Grok 4 Fast, whose response was truncated (57.5/100).