I actually blame the SAST clients for this, since they are the ones asking (and paying for) the wrong question:
"How can you 'vendor xyz' scan my million lines of code application"
Instead they should be asking:
"When you scan my code, can you connect the traces?"
'Connecting the traces' means that you are able to scan parts of the application separately and then connect them at a later stage.
And this is not just needed for scalability purposes, large code-bases will have tons of air-gaps created by interface-driven/WebServices/Message-Queues architectures, with usually the 'formula' that connects these layers existing in: Xml/Config files, Code Attributes, Live Binding, Reflection mappings, etc...
So without understanding these mappings, scanning a large code base means that there will be massive gaps in scanning that code.
In fact, one of the ways the commercial scanning engines are able to scan big code-bases is by starting to be more 'aggressive' in how they handle traces and what type of analysis they do (usually refereed by 'dropping traces in the floor'). And in the cases where the scanning engines do find large sets of findings or traces (lets say 10,000 findings with 50+ entries), their GUIs are absolutely not able to handle them (try to load 100,000 or 1M traces in those GUI :) ). Ironically, if they are not able to find that amount of traces, they are probably not having enough code coverage :). See If you not blowing up the database, you're not testing the whole app for a similar DAST analogy.
The need to scan in a modular way is very important from a scalability and from an accuracy point of view.
For example the approach that I took when creating O2's SAST engine was to create files that contain all relevant code, and then only scan then, see O2 .NET SAST Engine: MethodStream and CodeStrams for a WebService Method for what this looks like.
To see a practical example of what I mean by 'Trace Connection' or 'Trace Joining', look at this screenshot taken from this video: O2 Video - Demo Script - HacmeBank Full PoC
In this trace you will see 2 very important 'trace connections':
- Url to Entry point - the URL of the vulnerability was mapped to the method that contains the Source of Tainted data
- WebServices invocation - A webservice call that was mapped from the Invoke (on the Web Tier) to the WebMethod (on the WebServices layer).
Unfortunately (for the current SAST vendors who are still trying to create the 'one click scan engine'), this means that we will need to be able to customise and adapt the scanning engine/rules. We also will need to create specialized tools/scripts per framework, which in essence will describe its behaviour.
Note that this is not easy to do, we will need very powerful APIs that exposes the SAST engine capabilities/rules/data. For example, in the past I had to build entire O2 modules just to handle these type of activities.
Like the video below shows, this is like a move from Billions to Trillions (and you can't build a bridge to it). I also like the concept that 'Nature uses Layered Complexity', which is exactly what we need to do on SAST (i.e. we need to rules for every layer, and scan its behaviour separately)
Trillions from MAYAnMAYA on Vimeo.