GA file download proxy php script

This page documents a file download proxy I set up for Google Analytics (GA). The Apache mod_rewrite rules below forward a direct file request internally to a PHP page, which serves the GA javascript so that the request gets reported to Google Analytics.

The PHP code was mostly copied from other examples on the internet that deal with the similar problem of tracking/analyzing RSS or bookmark links (direct file links).

The drawback I ran into is that, as far as I can tell, there is currently no way either to spoof the proxy as the original requesting IP or to forward the requesting IP to the GA javascript (see my question posted at http://groups.google.com/group/analytics-help-tracking/browse_thread/thread/910dab671c556b3d ). This limits the proxy to the '_setVar' javascript method, which forwards the requester's IP as a 'segment' variable only. The IP is not used in the usual way that allows for visitor tracking, mapping, etc.

So until GA provides a way to forward requesting IPs (they could do this by adding more 'set' methods to their javascript for the proxy), I'll continue to need custom analysis for directly linked files or services.

On a side note, the mod_rewrite rules for forwarding service requests were very difficult to figure out because of the query string handling. I ended up with some unintended recursive calls, and the only way I could find to avoid them was to make some substitutions in the rewritten query string to stop the recursion.

Note that the internal proxy server IP is part of the rewrite conditions. This prevents recursion when the PHP proxy requests the same file from within the server.

Jeremy

Changes to httpd.conf (reload httpd after changing). Change the path to tracker.php as needed. Only a few file types/suffixes are listed, but others would follow the same pattern. Note that tracker.php can help determine whether a file type should be treated as an attachment or not.

    # .ppt downloads: skip requests coming from the proxy server itself
    RewriteCond %{HTTP_HOST} ^.*
    RewriteCond %{REMOTE_ADDR} !^129\.252\.37\.88$
    RewriteCond %{REQUEST_URI} ^(.*)\.ppt$
    RewriteRule ^(.*)\.ppt$ http://carocoops.org/obskml/scripts/tracker.php?url=$1&filetype=ppt [L]

    # .pdf downloads: same pattern as above
    RewriteCond %{HTTP_HOST} ^.*
    RewriteCond %{REMOTE_ADDR} !^129\.252\.37\.88$
    RewriteCond %{REQUEST_URI} ^(.*)\.pdf$
    RewriteRule ^(.*)\.pdf$ http://carocoops.org/obskml/scripts/tracker.php?url=$1&filetype=pdf [L]

    # getmap service requests: the rewritten URL carries q_mark in place of '?'
    # and service_1 in place of 'getmap', so it no longer matches this rule
    RewriteCond %{HTTP_HOST} ^.*
    RewriteCond %{REMOTE_ADDR} !^129\.252\.37\.
    RewriteCond %{REMOTE_ADDR} !^127\.0\.0\.1$
    RewriteCond %{QUERY_STRING} ^(.*)getmap(.*)$
    RewriteRule ^(.*)$ http://carocoops.org/obskml/scripts/tracker.php?$1q_mark%1service_1%2 [L]
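The getmap rule is the hardest one to read, so here is a minimal Python sketch of what it appears to do, handy for checking the substitutions by hand. It is not part of the server setup. The function names are hypothetical, the internal-address check mirrors the intent of the REMOTE_ADDR conditions rather than their exact patterns, and `restore_request` assumes that tracker.php reverses the q_mark/service_1 substitutions, which this page does not confirm.

```python
import re

TRACKER = "http://carocoops.org/obskml/scripts/tracker.php"

# Mirrors the intent of the two REMOTE_ADDR conditions (internal subnet and localhost).
INTERNAL_ADDR = re.compile(r"^(?:129\.252\.37\.|127\.0\.0\.1$)")

# Same pattern as the QUERY_STRING condition; groups 1 and 2 play the role of %1 and %2.
GETMAP_QUERY = re.compile(r"^(.*)getmap(.*)$")


def rewrite_service_request(remote_addr, path, query):
    """Return the tracker URL the getmap rule would produce, or None if it does not apply."""
    if INTERNAL_ADDR.match(remote_addr):
        return None  # request comes from the proxy itself, leave it alone
    found = GETMAP_QUERY.match(query)
    if found is None:
        return None  # not a getmap request
    # '?' becomes q_mark and 'getmap' becomes service_1 in the forwarded string
    return f"{TRACKER}?{path}q_mark{found.group(1)}service_1{found.group(2)}"


def restore_request(encoded):
    """Hypothetical inverse of the substitutions (assumed to happen inside tracker.php)."""
    return encoded.replace("q_mark", "?", 1).replace("service_1", "getmap", 1)
```

Because the forwarded query string no longer contains the text getmap, the QUERY_STRING condition fails on the rewritten URL, which is what stops the rule from firing a second time.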

php proxy tracker (tracker.php) http://trac.secoora.org/datamgmt/browser/docs/usc/docs/various/tracker.php

Update September 9, 2008: I figured out that I can apply a regular expression exclude filter, like the one below, in the GA settings on the filter field 'User Defined', which I'm currently populating with the forwarded user IP address. This should keep my results from being bloated by internal file-processing requests or by search bot IP addresses (or address ranges) that should be excluded from the report results.

'user defined' field set in header.php

<!-- Google Analytics traffic tracking script -->
<?php
$var_referer=isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : ''; //referer url (not sent with every request)

$http_user_agent=isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : ''; //agent header may be missing too
$http_user_agent=preg_replace('/\s+/','_',$http_user_agent); //replace whitespace with underscores
$http_user_agent=preg_replace('/;/','_',$http_user_agent); //replace semicolons - problematic for javascript ?

$var_uservar='ip='.$_SERVER['REMOTE_ADDR'].'&agent='.$http_user_agent; //enter your own user defined variable
?>

<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-XXXXXXX-X");
pageTracker._trackPageview();
pageTracker._setVar("<?php echo addslashes($var_uservar); ?>");
</script>

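Exclude regular expression (regex) like the one below. Note that the ^ and $ anchors only match when the field holds nothing but the IP address; with the 'ip=...&agent=...' string built in header.php above, the pattern needs to be used without the anchors.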

^129\.252\.37\.88$
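A quick way to check a filter pattern before saving it in GA is to run it against sample field values. The sketch below uses Python's `re.search` as a stand-in for GA's filter matching, which is an assumption, and the sample values are made up to follow the format that header.php produces.

```python
import re

ANCHORED = r"^129\.252\.37\.88$"   # the pattern as written above
UNANCHORED = r"129\.252\.37\.88"   # the same address without the ^ and $ anchors


def matches(pattern, field_value):
    """True if the pattern is found anywhere in the field value."""
    return re.search(pattern, field_value) is not None


# Hypothetical 'user defined' values: IP only, and IP plus agent
ip_only = "129.252.37.88"
ip_and_agent = "ip=129.252.37.88&agent=Mozilla/4.0_(compatible_)"

print(matches(ANCHORED, ip_only))         # field holds only the IP
print(matches(ANCHORED, ip_and_agent))    # anchors cannot span the extra text
print(matches(UNANCHORED, ip_and_agent))  # partial match finds the IP
```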

GA stats usage

Once you're able to log into the account, you'll want to limit your date range to after, say, September 25. Handling downloaded file statistics with GA has been a learning curve. First the download proxy script was developed, documented at http://trac.secoora.org/datamgmt/wiki/JCothranGAProxy . The basics of the PHP proxy script at this point are that it logs the requesting IP and agent (information about the browser process) in the GA 'user defined' variable or string. This allows me to use GA to segment the page requests (go to a particular page report and choose Dimension->User Defined Value) and see the associated IPs listed, as in the example below.

Dimension: User Defined Value

    No.  Visits  Pages/Visit  Avg. Time on Site  % New Visits  Bounce Rate  User Defined Value
    1.   80      1.00         00:00:00           0.00%         100.00%      ip=198.151.13.10&agent=Mozilla/5.0_(Windows__U__Windows_NT_5.1__en-US__rv:1.9.0.1)_Gecko/2008070208_Firefox/3.0.1
    2.   45      1.00         00:00:00           0.00%         100.00%      ip=198.151.13.10&agent=Mozilla/4.0_(compatible_)
    3.   9       1.00         00:00:00           0.00%         100.00%      ip=198.151.13.10&agent=Mozilla/5.0_(Windows__U__Windows_NT_5.1__en-US__rv:1.9.0.2)_Gecko/2008091620_Firefox/3.0.2
    4.   2       1.00         00:00:00           0.00%         100.00%      ip=198.151.12.8&agent=Mozilla/5.0_(Windows__U__Windows_NT_5.1__en-US__rv:1.9.0.1)_Gecko/2008070208_Firefox/3.0.1

You can also use the 'Find page containing' text search at the bottom of each page of results to further subset the results. Having the IP and agent allows me to use the GA filters below in combination with the 'user defined' field to exclude internal network IPs from USC (nautilus, sumwalt 239), UNC (cromwell, cormp, etc.) or web spiders (bots). I've also recently added a 'search and replace' filter on the WMS BBOX request parameter to help group multiple requests for the same layer that differ only in their bounding boxes.

Anything held in PHP's $_SERVER variable ( http://us.php.net/manual/en/reserved.variables.server.php ) can be appended to the 'user defined' string for filtering later. I did have to substitute the space and semicolon characters in the 'user defined' string to get it to work.
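For working with these strings outside of GA, the substitutions can be mirrored in a few lines of Python. This is only a sketch: the function names are hypothetical, and the parser assumes the exact 'ip=...&agent=...' layout used on this page.

```python
import re


def build_user_defined(remote_addr, user_agent):
    """Build the 'user defined' string with the same substitutions as header.php."""
    agent = re.sub(r"\s+", "_", user_agent)  # whitespace runs -> underscore
    agent = agent.replace(";", "_")          # semicolons -> underscore
    return f"ip={remote_addr}&agent={agent}"


def parse_user_defined(value):
    """Split a 'user defined' string back into (ip, agent), or None if it does not fit."""
    head, separator, agent = value.partition("&agent=")
    if not separator or not head.startswith("ip="):
        return None
    return head[len("ip="):], agent


sample = build_user_defined(
    "198.151.13.10",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1",
)
print(sample)
print(parse_user_defined(sample))
```

The substitutions are lossy (a space and a semicolon both become an underscore), so the original agent header cannot be reconstructed exactly from the stored string.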

#GA Filters

#Exclude - User Defined
129\.252\.37\.88|129\.252\.37\.86|129\.252\.139
bot|Slurp|Validator|compass
152\.2\.92\.48|152\.20\.240\.9

#Search and Replace - Request URI
BBOX=.*$
BBOX
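The effect of these filters can be tried out locally before changing the GA profile. The sketch below is an approximation under stated assumptions: it treats each exclude line as a partial-match, case-sensitive regex (GA's actual behavior depends on the filter settings), the function names are hypothetical, and the sample field values and URIs are invented for illustration.

```python
import re

# The three 'Exclude - User Defined' patterns listed above
EXCLUDE_PATTERNS = [
    r"129\.252\.37\.88|129\.252\.37\.86|129\.252\.139",
    r"bot|Slurp|Validator|compass",
    r"152\.2\.92\.48|152\.20\.240\.9",
]


def is_excluded(user_defined):
    """True if any exclude pattern is found anywhere in the 'user defined' value."""
    return any(re.search(pattern, user_defined) for pattern in EXCLUDE_PATTERNS)


def normalize_request_uri(uri):
    """Apply the search-and-replace filter: drop everything from 'BBOX=' to the end."""
    return re.sub(r"BBOX=.*$", "BBOX", uri)


print(is_excluded("ip=129.252.37.88&agent=Mozilla/4.0_(compatible_)"))
print(is_excluded("ip=198.151.13.10&agent=Mozilla/4.0_(compatible_)"))
print(normalize_request_uri("/cgi-bin/wms?LAYERS=sst&BBOX=-82,30,-75,36"))
```

Since the replacement runs to the end of the URI, any parameters that follow BBOX in the request are dropped as well, so two requests are grouped together only if everything before BBOX is identical.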

I've been using the website http://whatismyipaddress.com/staticpages/index.php/lookup-ip to get an idea of the identities/locations behind the requesting IPs. Unfortunately, doing that manually for each page request is a slow, tedious process. I see that GA lets you download individual page stats, but I wish I had an easy way of dumping/downloading all the GA tracked/filtered data back to my scripts for some additional automated summary and IP lookup tasks.
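If the per-page stats can be exported, the kind of summary I have in mind could look like the sketch below. It is a rough illustration only: it assumes the export can be turned into (user defined value, visits) pairs, which is not something GA is shown to provide here, and the function names are hypothetical. The reverse lookup uses the standard library's `socket.gethostbyaddr` and needs network access, so it is defined but not called.

```python
import re
import socket
from collections import Counter

# Pulls the IP out of an 'ip=...&agent=...' string
IP_FIELD = re.compile(r"ip=(\d{1,3}(?:\.\d{1,3}){3})")


def visits_by_ip(rows):
    """Sum visits per requesting IP from (user_defined, visits) pairs."""
    totals = Counter()
    for user_defined, visits in rows:
        found = IP_FIELD.search(user_defined)
        if found:
            totals[found.group(1)] += visits
    return totals


def reverse_lookup(ip):
    """Return the host name for an IP, or None if the lookup fails (needs network access)."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


# Rows taken from the 'User Defined Value' report shown earlier on this page
rows = [
    ("ip=198.151.13.10&agent=Mozilla/5.0_(Windows__U__Windows_NT_5.1__en-US__rv:1.9.0.1)_Gecko/2008070208_Firefox/3.0.1", 80),
    ("ip=198.151.13.10&agent=Mozilla/4.0_(compatible_)", 45),
    ("ip=198.151.13.10&agent=Mozilla/5.0_(Windows__U__Windows_NT_5.1__en-US__rv:1.9.0.2)_Gecko/2008091620_Firefox/3.0.2", 9),
    ("ip=198.151.12.8&agent=Mozilla/5.0_(Windows__U__Windows_NT_5.1__en-US__rv:1.9.0.1)_Gecko/2008070208_Firefox/3.0.1", 2),
]

for ip, visits in visits_by_ip(rows).most_common():
    print(ip, visits)
```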