Monday 14 January 2013

Creating a Server-Side Google Analytics data submitter (in VisualStudio using C# REPL)

Following on from the post 'Sending messages to TeamCity and UnitTest to check if the Google Analytics file has changed', we now have ga.js hosted in TeamMentor’s code base (where it handles the original client-side request). The next step is to add the ability to submit server-side data to Google Analytics (GA).

To do that we need to replicate the client-side (JavaScript-based) request that is sent to the GA servers.

GA works by having the ga.js file issue a request for a GIF whose query string contains the current user’s data (see the references at the end of this post for more details):

image

Here is the raw GET request submitted:
http://www.google-analytics.com/__utm.gif?utmwv=5.3.8&utms=2&utmn=1976204625&utmhn=teammentor-33-ci.azurewebsites.net&utmcs=ISO-8859-1&utmsr=1440x852&utmvp=1012x516&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=11.5%20r31&utmdt=Login_Page&utmhid=535686790&utmr=-&utmp=%2FteamMentor&utmac=UA-XXXXXX-X&utmcc=__utma%3D82202515.932366965.1357920813.1357920813.1357957478.2%3B%2B__utmz%3D82202515.1357920813.1.1.utmcsr%3Dlocalhost%3A3187%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2FteamMentoradasd%3B&utmu=q~
Fiddler’s WebForms makes it easier to read:

image
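For readers without Fiddler at hand, roughly the same name/value view can be produced with a few lines of C#. This is only a sketch: the request below is a shortened copy of the one captured above, and the variable names (rawRequest, fields) are ours, not taken from the original script.

```csharp
using System;
using System.Collections.Generic;

// Shortened copy of the captured request (only a few of its parameters are kept).
var rawRequest = "http://www.google-analytics.com/__utm.gif?utmwv=5.3.8&utmn=1976204625" +
                 "&utmhn=teammentor-33-ci.azurewebsites.net&utmdt=Login_Page" +
                 "&utmp=%2FteamMentor&utmac=UA-XXXXXX-X";

// Split the query string into URL-decoded name/value pairs.
var fields = new Dictionary<string, string>();
var query = new Uri(rawRequest).Query.TrimStart('?');
foreach (var pair in query.Split('&'))
{
    var parts = pair.Split(new[] { '=' }, 2);
    fields[parts[0]] = Uri.UnescapeDataString(parts.Length > 1 ? parts[1] : "");
}

foreach (var field in fields)
    Console.WriteLine("{0,-6} = {1}", field.Key, field.Value);
```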

So to replicate this event (and add an entry to GA), all we need to do is send a GET request, which can easily be done using the O2 C# REPL:

image

Here it is, better formatted:

image
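Since the screenshots don't survive as text, here is a minimal sketch of what replaying that GET request could look like. It uses the standard HttpClient rather than the O2 helper methods from the screenshots, and SendGet and sendForReal are hypothetical names introduced for this example. The actual send is switched off by default so the snippet runs without network access (and we can't promise the endpoint still answers the way it did when this was written).

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// The raw request captured above (UA-XXXXXX-X is the placeholder account id).
var url = "http://www.google-analytics.com/__utm.gif?utmwv=5.3.8&utms=2&utmn=1976204625" +
          "&utmhn=teammentor-33-ci.azurewebsites.net&utmcs=ISO-8859-1&utmsr=1440x852" +
          "&utmvp=1012x516&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=11.5%20r31" +
          "&utmdt=Login_Page&utmhid=535686790&utmr=-&utmp=%2FteamMentor&utmac=UA-XXXXXX-X" +
          "&utmcc=__utma%3D82202515.932366965.1357920813.1357920813.1357957478.2%3B%2B" +
          "__utmz%3D82202515.1357920813.1.1.utmcsr%3Dlocalhost%3A3187%7Cutmccn%3D(referral)" +
          "%7Cutmcmd%3Dreferral%7Cutmcct%3D%2FteamMentoradasd%3B&utmu=q~";

// Hypothetical helper: issues the GET and returns the HTTP status code.
async Task<int> SendGet(string requestUrl)
{
    using var client = new HttpClient();
    using var response = await client.GetAsync(requestUrl);
    return (int)response.StatusCode;
}

// Flip to true to actually contact the server.
var sendForReal = false;
if (sendForReal)
    Console.WriteLine("Status: " + await SendGet(url));
else
    Console.WriteLine("Would request: " + url);
```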

Refactoring the values as variables:

image

Here is the same request with some more refactoring and a description of each variable (from here):

image
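As a rough text version of that refactoring, the sketch below builds the query string from named values. The values come from the captured request; the comments give the meaning these fields are commonly documented to have, and the fields/query/url names are ours. The utmcc cookie field is left out here because it is dealt with separately further down.

```csharp
using System;
using System.Linq;

// Name/value pairs from the captured request, in the order they should appear.
(string Name, string Value)[] fields =
{
    ("utmwv", "5.3.8"),                                      // tracking code version
    ("utmn",  new Random().Next(int.MaxValue).ToString()),   // random number (avoids cached responses)
    ("utmhn", "teammentor-33-ci.azurewebsites.net"),         // host name
    ("utmdt", "Login_Page"),                                 // page title
    ("utmp",  "/teamMentor"),                                // current page
    ("utmac", "UA-XXXXXX-X"),                                // GA account id (placeholder)
};

// URL-encode each value and join everything into one GET request.
var query = string.Join("&", fields.Select(f => f.Name + "=" + Uri.EscapeDataString(f.Value)));
var url = "http://www.google-analytics.com/__utm.gif?" + query;
Console.WriteLine(url);
```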

The most interesting fields are the utmdt (Page Title) and utmp (Current Page), which if we change like this:

image

Will show (in real-time) like this at GA's dashboard:

image

These Active Page and Page Title fields can be used to hold interesting state data:

image

Note how in the screenshot below, the first batch has a 'fixed Page value and unique Title value', and the 2nd batch has 'unique Page value and fixed Title value'.

image

After a bit of experimentation, here are the four fields that, at a minimum, need to be submitted:

image

One of the required fields is utmcc, which is a UrlEncoded cookie value (not to be confused with utmac, which holds the GA Account id). This is what it looks like after decoding it:

image

Here are the values (which have a kinda weird separation)

image
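The decoding and splitting can be reproduced in text form with something like the sketch below. The utmcc value is copied from the raw request above; the splitting rule (cookies separated by ";+", with a trailing ";") is simply what the decoded string looks like, not something taken from GA documentation.

```csharp
using System;
using System.Linq;

// utmcc value copied from the raw request.
var utmcc = "__utma%3D82202515.932366965.1357920813.1357920813.1357957478.2%3B%2B" +
            "__utmz%3D82202515.1357920813.1.1.utmcsr%3Dlocalhost%3A3187%7Cutmccn%3D(referral)" +
            "%7Cutmcmd%3Dreferral%7Cutmcct%3D%2FteamMentoradasd%3B";

// URL-decode the whole value first.
var decoded = Uri.UnescapeDataString(utmcc);
Console.WriteLine(decoded);

// Split on ';', drop the empty trailing entry and strip the leading '+' separator.
var cookies = decoded.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
                     .Select(c => c.TrimStart('+'))
                     .ToArray();
foreach (var cookie in cookies)
    Console.WriteLine(cookie);
```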

Only the __utma needs to have a value:

image

In fact only this value is needed:

image

Next, let’s deconstruct the __utma value based on the info from:

image

Which looks like this:

image

Here is a version with no hard-coded values:

image

Note that randomNumber and the timestamp are used to create the cookie value (i.e. utmcc), which is basically the same as a SessionId (from GA's point of view). This means that if we make this value random (like above), we will get a new session per request (see below):

image
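To make the cookie construction concrete, here is a small sketch that builds a __utma value from a random number and a timestamp. It assumes the commonly described six-part layout (domainHash.visitorId.firstVisit.previousVisit.currentVisit.sessionCount), reuses the domain hash from the captured request, and BuildUtma is a hypothetical helper name rather than anything from the original script.

```csharp
using System;

// Hypothetical helper: the same timestamp is used for first, previous and current visit,
// and the session count is fixed at 1.
string BuildUtma(int domainHash, int randomNumber, long timeStamp)
{
    return $"{domainHash}.{randomNumber}.{timeStamp}.{timeStamp}.{timeStamp}.1";
}

var randomNumber = new Random().Next(1000000000);            // stands in for the visitor id
var timeStamp = DateTimeOffset.UtcNow.ToUnixTimeSeconds();   // Unix time in seconds

// A fresh random number per request means GA sees every request as a new session.
var utma = BuildUtma(82202515, randomNumber, timeStamp);
var utmcc = Uri.EscapeDataString("__utma=" + utma + ";");
Console.WriteLine(utmcc);
```

Reusing the same randomNumber and timeStamp across requests should, by the same logic, keep them in one session, although we have not verified that here.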

Here is the code required to make a valid request (uncomment 2nd part to add Campaign data)

image

Note that the randomNumber and timeStamp values can be anything (it looks like they are only used to create a unique cookie value that identifies the session at GA):

image

Here is the smallest version of the code required to send data to GA:

image

Finally, here is a refactored version with Lambda methods to register the pages:

image

The script shown above creates data like this (below) when executed a couple of times (note how easy it was to register 40 events):

image
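For reference, a lambda-based registration along those lines could be sketched as follows. Everything here is illustrative: the lambda names (newCookie, buildUrl, registerPage), the domain hash of 1 and the chosen set of fields are assumptions, and the URLs are collected in a list instead of being sent so the snippet can be run safely.

```csharp
using System;
using System.Collections.Generic;

var account = "UA-XXXXXX-X";       // placeholder GA account id
var random = new Random();
var sent = new List<string>();     // collects the request URLs instead of sending them

// New cookie per call, so each request would count as its own session.
Func<string> newCookie = () =>
{
    var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
    return Uri.EscapeDataString($"__utma=1.{random.Next(1000000000)}.{now}.{now}.{now}.1;");
};

// Builds the GET request for a given page and title.
Func<string, string, string> buildUrl = (page, title) =>
    "http://www.google-analytics.com/__utm.gif?utmwv=5.3.8" +
    "&utmdt=" + Uri.EscapeDataString(title) +
    "&utmp=" + Uri.EscapeDataString(page) +
    "&utmac=" + account +
    "&utmcc=" + newCookie();

Action<string, string> registerPage = (page, title) => sent.Add(buildUrl(page, title));

// Two batches of 20: fixed page with unique titles, then unique pages with a fixed title.
for (var i = 0; i < 20; i++)
{
    registerPage("/fixedPage", "Title " + i);
    registerPage("/page/" + i, "Fixed Title");
}
Console.WriteLine(sent.Count + " requests prepared");
```

To turn this into real traffic, registerPage would issue a GET for each URL instead of adding it to the list.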

VisualStudio/Dev Note: this entire script was developed inside VisualStudio using the C# REPL environment and a browser window (which created a very powerful, fast and effective development environment):

image

Security note:
  • there is no authentication or authorization (from GA's point of view)
  • GA doesn't allow data to be removed from its database
  • the key piece of information required to send bogus data for any site tracked by GA is the GA Account id (which is publicly exposed by any site that uses GA)
  • what prevents the creation of thousands or millions of fake GA log entries?
  • what is Google's mitigation against this?
  • why has this not been exploited? (or has it?)

GIST:

See this gist (https://gist.github.com/4525269) for multiple versions of the code shown in this post.

References: