Overview
RAPTOR’s CodeQL integration provides fully autonomous semantic analysis with automatic language detection, build system detection, database creation, and security query execution.
Architecture
CodeQL analysis consists of multiple specialized components:
packages/codeql/
├── agent.py # Main orchestrator
├── language_detector.py # Auto-detect languages
├── build_detector.py # Detect build systems
├── database_manager.py # Create/cache databases
├── query_runner.py # Execute security queries
├── dataflow_validator.py # Validate dataflow paths
├── dataflow_visualizer.py # Generate path visualizations
└── autonomous_analyzer.py # LLM-powered analysis
Language Detection
Automatic Detection
The language detector scans repositories and assigns confidence scores:
from packages.codeql.language_detector import LanguageDetector
detector = LanguageDetector(repo_path)
detected = detector.detect_languages(min_files=3)
for lang, info in detected.items():
    print(f"{lang}: {info.file_count} files (confidence: {info.confidence:.2f})")
Detection Algorithm
Confidence scoring factors:
- File extensions (base: 0.3)
- Build files (+0.2 per file, max +0.4)
- Structural indicators (+0.1 per indicator, max +0.3)
- File count ratio (up to +0.3)
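The factors above can be combined as in the following minimal sketch. It is not RAPTOR's actual implementation: the function name `confidence_score`, the linear scaling of the ratio term, and the clamp to 1.0 (the listed maxima add up to 1.3) are all assumptions made for illustration.

```python
def confidence_score(build_files: int, indicators: int, file_ratio: float) -> float:
    """Combine the four listed factors into a single score.

    Assumptions (not confirmed by the source): the ratio term scales
    linearly with file_ratio, and the total is clamped to 1.0.
    """
    base = 0.3                                     # file extensions for the language were found
    build_bonus = min(0.2 * build_files, 0.4)      # +0.2 per build file, max +0.4
    indicator_bonus = min(0.1 * indicators, 0.3)   # +0.1 per indicator, max +0.3
    ratio = max(0.0, min(file_ratio, 1.0))         # share of repo files in this language
    ratio_bonus = 0.3 * ratio                      # up to +0.3
    return min(base + build_bonus + indicator_bonus + ratio_bonus, 1.0)
```

Under these assumptions, a language with one build file, two structural indicators, and half of the repository's files would score 0.3 + 0.2 + 0.2 + 0.15.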
Supported Languages
CodeQL-supported languages:
| Language | Extensions | Build Files | Indicators |
|---|---|---|---|
| Java | .java | pom.xml, build.gradle | src/main/java/ |
| Python | .py | setup.py, pyproject.toml | __init__.py |
| JavaScript | .js, .jsx, .mjs | package.json, yarn.lock | node_modules/ |
| TypeScript | .ts, .tsx | tsconfig.json | src/, dist/ |
| Go | .go | go.mod, go.sum | main.go, cmd/ |
| C/C++ | .c, .cpp, .h, .hpp | CMakeLists.txt, Makefile | src/, include/ |
| C# | .cs | .csproj, .sln | Properties/, bin/ |
| Ruby | .rb | Gemfile, Rakefile | lib/, spec/ |
| Swift | .swift | Package.swift, Podfile | Sources/, Tests/ |
| Kotlin | .kt, .kts | build.gradle.kts | src/main/kotlin/ |
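As an illustration of how the Extensions column could drive detection, the sketch below counts files per language and drops languages below a threshold. The mapping is a subset of the table; the helper name `count_languages` and the interpretation of `min_files` as a minimum file count are assumptions, not the detector's confirmed behavior.

```python
from collections import Counter
from pathlib import PurePosixPath

# Subset of the Extensions column above; language keys are illustrative.
EXTENSION_TO_LANGUAGE = {
    ".java": "java",
    ".py": "python",
    ".js": "javascript", ".jsx": "javascript", ".mjs": "javascript",
    ".ts": "typescript", ".tsx": "typescript",
    ".go": "go",
    ".rb": "ruby",
}

def count_languages(paths, min_files=3):
    """Count files per language and keep languages with at least min_files files."""
    counts = Counter()
    for path in paths:
        language = EXTENSION_TO_LANGUAGE.get(PurePosixPath(path).suffix.lower())
        if language:
            counts[language] += 1
    return {language: n for language, n in counts.items() if n >= min_files}
```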
Language Filtering
Filter to CodeQL-supported languages only:
supported = detector.filter_codeql_supported(detected)
# Unsupported languages are excluded automatically, with a warning
Build System Detection
Automatic Build Detection
The build detector identifies appropriate build commands:
from packages.codeql.build_detector import BuildDetector
detector = BuildDetector(repo_path)
build_system = detector.detect_build_system('java')
print(f"Type: {build_system.type}") # maven/gradle/make
print(f"Command: {build_system.command}") # mvn clean compile
print(f"Working dir: {build_system.working_dir}")
Supported Build Systems
Java:
- Maven: pom.xml → mvn clean compile -DskipTests
- Gradle: build.gradle → gradle clean build -x test
C/C++:
- CMake: CMakeLists.txt → cmake . && make
- Make: Makefile → make
- Autotools: configure → ./configure && make
JavaScript/TypeScript:
- npm: package.json → npm install && npm run build
- Yarn: yarn.lock → yarn install && yarn build
Go:
- Go modules: go.mod → go build ./...
Python/Ruby:
- No-build mode (interpreted languages)
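The marker-file mapping above can be expressed as a small lookup table. This is a sketch only: the commands are copied from the list, but the `BUILD_RULES` structure, the language keys, the `pick_build` helper, and the priority order between competing markers (for example `yarn.lock` before `package.json`) are assumptions.

```python
# (marker file, build system type, command), checked in order per language.
# Commands come from the list above; the ordering is an assumption.
BUILD_RULES = {
    "java": [
        ("pom.xml", "maven", "mvn clean compile -DskipTests"),
        ("build.gradle", "gradle", "gradle clean build -x test"),
    ],
    "cpp": [
        ("CMakeLists.txt", "cmake", "cmake . && make"),
        ("configure", "autotools", "./configure && make"),
        ("Makefile", "make", "make"),
    ],
    "javascript": [
        ("yarn.lock", "yarn", "yarn install && yarn build"),
        ("package.json", "npm", "npm install && npm run build"),
    ],
    "go": [("go.mod", "go", "go build ./...")],
}

def pick_build(language, root_files):
    """Return (type, command), or None for no-build languages or when no marker matches."""
    present = set(root_files)
    for marker, kind, command in BUILD_RULES.get(language, []):
        if marker in present:
            return kind, command
    return None
```

Returning `None` for Python and Ruby mirrors the no-build mode described above.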
Custom Build Commands
Override auto-detection:
python3 raptor_codeql.py \
--repo /path/to/code \
--languages java \
--build-command "mvn clean compile -DskipTests -Dcheckstyle.skip"
Database Creation
Autonomous Database Creation
CodeQL databases are created with automatic caching:
from packages.codeql.database_manager import DatabaseManager
manager = DatabaseManager()
results = manager.create_databases_parallel(
    repo_path,
    language_build_map,
    force=False  # Use cached if available
)
for lang, result in results.items():
    if result.success:
        print(f"✓ {lang}: {result.database_path}")
    else:
        print(f"✗ {lang}: {result.errors}")
Database Caching
Databases are cached to avoid redundant creation:
# Cache key: repo_hash + language
cache_key = f"{repo_hash}_{language}"
db_path = RaptorConfig.CODEQL_DB_DIR / cache_key
if db_path.exists() and not force:
    logger.info(f"Using cached database: {db_path}")
    return DatabaseResult(cached=True, database_path=db_path)
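The source does not say how `repo_hash` is computed. One plausible derivation, shown here purely as an assumption, hashes the resolved repository path and keeps a short hex prefix; the helper name `cache_key` and the 7-character length are illustrative.

```python
import hashlib
from pathlib import Path

def cache_key(repo_path: str, language: str) -> str:
    """Build a '<repo_hash>_<language>' key.

    Assumption: repo_hash is the first 7 hex characters of the SHA-256
    of the resolved repository path. RAPTOR may derive it differently.
    """
    resolved = str(Path(repo_path).resolve())
    repo_hash = hashlib.sha256(resolved.encode("utf-8")).hexdigest()[:7]
    return f"{repo_hash}_{language}"
```

A path-based hash like this one would not change when the code changes, which is one reason a forced rebuild option is useful alongside the cache.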
Database Structure
codeql_dbs/
└── a7f8e92_java/
├── db-java/
├── src.zip
├── codeql-database.yml
└── log/
├── database-create.log
└── ext/
Cache Management
# Configuration from core.config:
CODEQL_DB_CACHE_DAYS = 7 # Keep for 7 days
CODEQL_DB_AUTO_CLEANUP = True # Auto-cleanup old DBs
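A sketch of how a retention window like this might be applied follows. The helpers `is_expired` and `expired_entries` are hypothetical, and using the last-modified time as the age of a database is an assumption; the source only states the two settings.

```python
import time

CODEQL_DB_CACHE_DAYS = 7  # value taken from the configuration above

def is_expired(mtime: float, now: float, max_age_days: int = CODEQL_DB_CACHE_DAYS) -> bool:
    """True when the entry is older than the retention window."""
    return (now - mtime) > max_age_days * 86400

def expired_entries(entries, now=None):
    """entries maps a database directory name to its last-modified timestamp (seconds)."""
    now = time.time() if now is None else now
    return sorted(name for name, mtime in entries.items() if is_expired(mtime, now))
```

In a real cleanup pass the timestamps would come from the cached database directories, and the returned names would be the ones to delete.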
Query Execution
Security Suites
RAPTOR uses CodeQL’s security suites:
# Standard security suite
suite_name = f"{language}-security-queries"
# Extended security suite (more queries, slower)
if use_extended:
    suite_name = f"{language}-security-extended"
Parallel Query Execution
codeql database analyze \
/path/to/db \
--format=sarif-latest \
--output=results.sarif \
--threads=0 \
--ram=8192 \
java-security-queries
Query Configuration
From core.config.RaptorConfig:
CODEQL_RAM_MB = 8192 # 8GB RAM for analysis
CODEQL_THREADS = 0 # Use all available CPUs
CODEQL_MAX_PATHS = 4 # Max dataflow paths per query
CODEQL_ANALYZE_TIMEOUT = 2400 # 40 minutes
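These settings could be turned into a `codeql database analyze` argument list roughly as follows. The helper `build_analyze_command` is hypothetical. `--format`, `--output`, `--threads`, and `--ram` appear in the command shown earlier; passing `CODEQL_MAX_PATHS` through the CLI's `--max-paths` option is an assumption about how RAPTOR uses that value.

```python
def build_analyze_command(db_path, suite, output, ram_mb=8192, threads=0, max_paths=4):
    """Assemble the argument list for one analysis run (nothing is executed here)."""
    return [
        "codeql", "database", "analyze", str(db_path),
        "--format=sarif-latest",
        f"--output={output}",
        f"--threads={threads}",        # 0 = use all available CPUs
        f"--ram={ram_mb}",             # memory budget in MB
        f"--max-paths={max_paths}",    # assumption: maps to CODEQL_MAX_PATHS
        suite,
    ]
```

The list can be passed to `subprocess.run(cmd, timeout=2400)` so that the timeout setting bounds the run; building a list rather than a shell string avoids quoting problems with unusual paths.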
Dataflow Validation
Dataflow Path Structure
CodeQL findings include source-to-sink dataflow paths:
@dataclass
class DataflowPath:
    source: DataflowStep                    # Where tainted data originates
    sink: DataflowStep                      # Where dangerous operation occurs
    intermediate_steps: List[DataflowStep]  # Data transformations
    sanitizers: List[str]                   # Validation functions in path
    rule_id: str
    message: str
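For orientation, this is roughly how source, sink, and intermediate steps could be read out of a SARIF result. The `codeFlows` / `threadFlows` / `locations` layout follows the SARIF 2.1.0 format; the `Step` class and both helpers are hypothetical stand-ins and not RAPTOR's `DataflowStep` or its parser. Only the first code flow is read, which is a simplification.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    file: str
    line: int
    message: str

def steps_from_result(result: dict) -> List[Step]:
    """Flatten the first threadFlow of the first codeFlow into ordered steps."""
    flows = result.get("codeFlows") or []
    if not flows:
        return []
    thread_flows = flows[0].get("threadFlows") or []
    if not thread_flows:
        return []
    steps = []
    for entry in thread_flows[0].get("locations", []):
        location = entry.get("location", {})
        physical = location.get("physicalLocation", {})
        steps.append(Step(
            file=physical.get("artifactLocation", {}).get("uri", ""),
            line=physical.get("region", {}).get("startLine", 0),
            message=location.get("message", {}).get("text", ""),
        ))
    return steps

def split_path(steps: List[Step]):
    """Return (source, intermediate_steps, sink); source and sink are None for an empty path."""
    if not steps:
        return None, [], None
    return steps[0], steps[1:-1], steps[-1]
```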
LLM-Powered Validation
Go beyond static detection to validate exploitability:
from packages.codeql.dataflow_validator import DataflowValidator
validator = DataflowValidator(llm_client)
validation = validator.validate_finding(sarif_result, repo_path)
if validation.is_exploitable:
    print(f"Exploitable (confidence: {validation.confidence:.2f})")
    print(f"Attack complexity: {validation.attack_complexity}")
    if validation.bypass_strategy:
        print(f"Bypass: {validation.bypass_strategy}")
Validation Criteria
The validator checks:
- Sanitizers: Are they truly effective?
- Reachability: Is the path reachable in practice?
- Barriers: Are there hidden constraints?
- Complexity: What’s the real attack difficulty?
Validation Output
@dataclass
class DataflowValidation:
    is_exploitable: bool
    confidence: float  # 0.0-1.0
    sanitizers_effective: bool
    bypass_possible: bool
    bypass_strategy: Optional[str]
    attack_complexity: str  # "low", "medium", "high"
    reasoning: str
    barriers: List[str]
    prerequisites: List[str]
Dataflow Visualization
Generate visual representations of dataflow paths:
from packages.codeql.dataflow_visualizer import DataflowVisualizer
visualizer = DataflowVisualizer()
visualizer.generate_visualization(
    sarif_result,
    repo_path,
    output_dir / "visualizations"
)
Output formats:
- GraphViz DOT - Graph structure
- PNG - Rendered visualization
- HTML - Interactive web view
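To give a sense of the GraphViz DOT output, the sketch below turns an ordered list of steps into a DOT graph. It is an illustration only: the helper `path_to_dot`, the tuple input, and the node styling are assumptions and not the output of `DataflowVisualizer`.

```python
def path_to_dot(steps, name="dataflow"):
    """steps: list of (label, file, line) tuples ordered from source to sink."""
    lines = [f"digraph {name} {{", "  rankdir=TB;", "  node [shape=box];"]
    for i, (label, file, line) in enumerate(steps):
        # "\n" inside a DOT label is a line break; escape embedded double quotes.
        text = f"{label}\\n{file}:{line}".replace('"', '\\"')
        lines.append(f'  n{i} [label="{text}"];')
    for i in range(len(steps) - 1):
        lines.append(f"  n{i} -> n{i + 1};")   # one edge per consecutive pair of steps
    lines.append("}")
    return "\n".join(lines)
```

A DOT file written this way can be rendered to PNG with the Graphviz command `dot -Tpng dataflow.dot -o dataflow.png`.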
CLI Usage
Fully Autonomous Scan
Auto-detect everything:
python3 raptor_codeql.py --repo /path/to/code
Specify Languages
Target specific languages:
python3 raptor_codeql.py \
--repo /path/to/code \
--languages java,python
Extended Security Suite
Use more comprehensive queries:
python3 raptor_codeql.py \
--repo /path/to/code \
--extended
Force Database Rebuild
Ignore cache:
python3 raptor_codeql.py \
--repo /path/to/code \
--force
Scan Only (No LLM Analysis)
Skip autonomous analysis phase:
python3 raptor_codeql.py \
--repo /path/to/code \
--scan-only
Custom CodeQL CLI Path
python3 raptor_codeql.py \
--repo /path/to/code \
--codeql-cli /custom/path/to/codeql
Autonomous Analysis
Two-Phase Workflow
Phase 1: Scanning
- Detect languages
- Detect build systems
- Create databases
- Execute security queries
- Generate SARIF output
Phase 2: Analysis
- LLM-powered finding analysis
- Dataflow path validation
- Exploitability scoring
- PoC generation
- Exploit compilation
Autonomous Analyzer
Deep analysis of findings:
from packages.codeql.autonomous_analyzer import AutonomousCodeQLAnalyzer
analyzer = AutonomousCodeQLAnalyzer(
    llm_client,
    exploit_validator,
    multi_turn_analyzer
)
analysis = analyzer.analyze_finding_autonomous(
    sarif_result,
    sarif_run,
    repo_path,
    out_dir
)
if analysis.exploitable:
    print(f"Exploitability score: {analysis.analysis.exploitability_score:.2f}")
    if analysis.exploit_code:
        print(f"Exploit generated: {len(analysis.exploit_code)} bytes")
Output Structure
out/codeql_project_20260304_123456/
├── codeql_report.json # Complete workflow results
├── java_results.sarif # Per-language SARIF
├── python_results.sarif
├── databases/
│ ├── db-java/ # CodeQL databases
│ └── db-python/
├── autonomous/
│ ├── finding_0000_analysis.json # LLM analysis per finding
│ ├── finding_0001_analysis.json
│ └── visualizations/
│ ├── dataflow_0000.png
│ └── dataflow_0000.dot
└── exploits/
├── exploit_0000.c # Generated exploits
└── exploit_0000_compiled
Workflow Results
CodeQLWorkflowResult
@dataclass
class CodeQLWorkflowResult:
    success: bool
    repo_path: str
    timestamp: str
    duration_seconds: float
    languages_detected: Dict[str, LanguageInfo]
    databases_created: Dict[str, DatabaseResult]
    analyses_completed: Dict[str, QueryResult]
    total_findings: int
    sarif_files: List[str]
    errors: List[str]
Accessing Results
from packages.codeql.agent import CodeQLAgent
agent = CodeQLAgent(repo_path)
result = agent.run_autonomous_analysis()
print(f"Languages: {len(result.languages_detected)}")
print(f"Findings: {result.total_findings}")
print(f"Duration: {result.duration_seconds:.1f}s")
for sarif in result.sarif_files:
    print(f" - {sarif}")
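Once the SARIF paths are known, a per-rule summary is a common next step. The sketch below relies only on the standard SARIF layout (`runs[].results[].ruleId`); the helper `count_by_rule` is hypothetical and not part of `CodeQLAgent`.

```python
import json
from collections import Counter

def count_by_rule(sarif_text: str) -> Counter:
    """Count findings per rule ID across all runs in one SARIF document."""
    sarif = json.loads(sarif_text)
    counts = Counter()
    for run in sarif.get("runs", []):
        for finding in run.get("results", []):
            counts[finding.get("ruleId", "<unknown>")] += 1
    return counts
```

Each file in `sarif_files` could be read with `Path(sarif).read_text()` and passed to this helper, and the resulting counters summed to get totals across languages.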
Best Practices
Cache databases: Database creation is expensive (5-30 minutes). Reuse RAPTOR's cached databases between runs, and pass --force to rebuild once the source code has changed.
Resource requirements: CodeQL analysis needs significant resources. Configure CODEQL_RAM_MB based on your system (minimum 4GB, recommended 8GB).
Build requirements: Compiled languages (Java, C/C++, C#) require their build tools to be installed. CodeQL traces compilation to understand code structure.
Troubleshooting
Database Creation Fails
# Check CodeQL CLI
codeql version
# Validate build command manually
cd /path/to/code
mvn clean compile -DskipTests
# Check logs
cat codeql_dbs/*/log/database-create.log
No Dataflow Paths
If queries return findings without dataflow:
- Ensure --format=sarif-latest is used
- Check the codeFlows field in SARIF output
- Some queries don’t produce dataflow (e.g., pattern-based)
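The `codeFlows` check can be scripted. The following sketch separates results that carry a dataflow path from those that do not, using the standard SARIF layout; the helper name `split_by_dataflow` is hypothetical.

```python
def split_by_dataflow(sarif: dict):
    """Return (rule IDs of results with codeFlows, rule IDs of results without)."""
    with_flow, without_flow = [], []
    for run in sarif.get("runs", []):
        for finding in run.get("results", []):
            # A missing or empty codeFlows list means no dataflow path was reported.
            target = with_flow if finding.get("codeFlows") else without_flow
            target.append(finding.get("ruleId", "<unknown>"))
    return with_flow, without_flow
```

If every result lands in the second list, the output format is the first thing to check; if only some do, those rules are probably pattern-based.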
Out of Memory
Increase CodeQL RAM allocation:
# In core/config.py:
CODEQL_RAM_MB = 16384 # 16GB