Taking Surftrackr data directly from the logfile

Mar 14th 2008

Grant (Surftrackr user) emailed to ask how to take data directly from the logfile, without the need to upload it or use the live-logging feature. Funny how the most obvious ideas are often the most elusive: I've made the change and the updated package is available from the download site.

Setting up

The data acquisition is handled by the logfiles.py script in the media directory. There is a new option in settings.py to activate it:

DATA_FROM_LOG = False

Set that to True to tell logfiles.py you want it to read data directly from the Squid or DansGuardian logfile. The logfiles.py script will still search for any uploaded files (in case you still need to use that option) but will also go through the log specified in LOGFILE. Make sure you include the full path and name of the file:

LOGFILE = '/var/log/squid/access.log'
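
If you want to sanity-check those two settings before relying on cron, something along these lines might help. This is only a sketch and not part of the Surftrackr package: the function name check_logfile_setting is made up for illustration, and it simply takes the values of DATA_FROM_LOG and LOGFILE as arguments.

```python
import os


def check_logfile_setting(data_from_log, logfile):
    """Return a list of problems found; an empty list means the settings look usable.

    Hypothetical helper: pass in the DATA_FROM_LOG and LOGFILE values
    from settings.py. It does not import Surftrackr itself.
    """
    problems = []
    if not data_from_log:
        # Direct reading is switched off, so LOGFILE is not used.
        return problems
    if not os.path.isabs(logfile):
        problems.append("LOGFILE should be a full path: %r" % logfile)
    elif not os.path.isfile(logfile):
        problems.append("LOGFILE does not exist: %r" % logfile)
    elif not os.access(logfile, os.R_OK):
        problems.append("LOGFILE is not readable by this user: %r" % logfile)
    return problems
```

Run it as the same user that will own the cron entry, for example print(check_logfile_setting(True, '/var/log/squid/access.log')), since the answer depends on who is asking.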

You will need to set up a cron entry to run logfiles.py at appropriate intervals. How often will depend on how busy your proxy server is, so you might need to adjust to your requirements. Here's the kind of thing you should put into your crontab:

*/15 * * * * cd /usr/local/surftrackr/media && /usr/bin/python logfiles.py

That will run the command every 15 minutes (type "man 5 crontab" for more info).

Security considerations

Generally, it's not a good idea to run a script like logfiles.py as the root user, since it might have 'blemishes' which could cause security problems. You're taking data from a source (the logfile) which in turn takes its data from an untrusted source: your users' web surfing habits. It's not impossible that a carefully-crafted web request in your logs could cause logfiles.py to choke and malfunction in unpredictable ways. Any such problems will be magnified significantly if logfiles.py is running as root.
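
To make the point concrete, here is a rough sketch of two defensive habits: refusing to start as root, and throwing away any log line that doesn't look right instead of trusting it. This is not the code in logfiles.py. The function names are hypothetical, and the parser assumes Squid's default native access.log layout (ten whitespace-separated fields); DansGuardian logs are laid out differently.

```python
import os
import sys


def refuse_root():
    """Exit early if the script has been started with root privileges."""
    if hasattr(os, "geteuid") and os.geteuid() == 0:
        sys.exit("Refusing to run as root; use an unprivileged account.")


def parse_squid_line(line):
    """Return a dict for a well-formed native-format line, or None.

    Assumed field order: time elapsed client action/status bytes
    method URL ident hierarchy/peer content-type.
    """
    fields = line.split()
    if len(fields) < 10:
        return None  # too short to be a complete entry
    try:
        timestamp = float(fields[0])
        elapsed_ms = int(fields[1])
        size = int(fields[4])
    except ValueError:
        return None  # numeric fields contained something unexpected
    action, sep, status = fields[3].partition("/")
    if not sep:
        return None  # expected something like TCP_MISS/200
    return {
        "timestamp": timestamp,
        "elapsed_ms": elapsed_ms,
        "client": fields[2],
        "action": action,
        "status": status,
        "size": size,
        "method": fields[5],
        "url": fields[6],
    }
```

A script using these would call refuse_root() once at startup, then skip any line for which parse_squid_line() returns None. The URL is still untrusted text after parsing, so it should only ever reach the database through parameterised queries or an ORM.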

Unfortunately, the squid logfiles on many Unix systems will be inaccessible to ordinary users. Here's a directory listing of /var/log from my Linux box:

-rw-r----- 1 mysql  adm        20 2008-03-06 06:25 mysql.log.7.gz
drwxr-sr-x 2 news   news     4096 2008-01-10 11:04 news
-rw-r--r-- 1 root   root        0 2008-05-12 12:55 pycentral.log
drwxr-x--- 2 squid  squid    4096 2008-03-14 06:25 squid
-rw-r----- 1 syslog adm     20977 2008-03-14 16:30 syslog

You'll notice that the permissions on the squid directory only allow the 'squid' user and members of the 'squid' group to access the directory's contents; everyone else is locked out. You could open it up to all users with a command like this:

chmod 755 squid

But then you'd need to change the access.log permissions as well, and they might be reset when logrotate archives the log and creates a new one - a process which can happen quite regularly on a carefully maintained Unix system.

The correct answer is to create a user with permission to read the Squid logfiles, and set up a cron entry for that user, as shown above. From the directory listing shown, it's clear that a user in the 'squid' group will be able to read the squid directory, so creating a user in that group would be an ideal solution. Doing this requires a bit of 'local knowledge' on your part, since it varies from one Unix system to another. On my CentOS system, I can use this command:

useradd -g squid surftrackr

That creates a surftrackr user in the squid group. Then I can type "su - surftrackr" to 'become' that user, and set up the cron entry as described above.

If you're using file uploading in addition to taking data directly from the logfile, you'll also need to ensure that the 'media/logs' directory is writable by your Apache user (which needs to write files into that directory). Additionally, logs and logs-processed need to be writable by the user logfiles.py will be run as, since that user needs to be able to scan the logs directory for files, then move them into logs-processed when they have been loaded into the Surftrackr database. Doing this is left as an exercise for the reader, but email me if you're not sure.
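
Once you think the permissions are right, a quick check along these lines can confirm it. Again, this is just a sketch with a made-up function name (unwritable_dirs); it assumes the two directories sit directly under the media directory and it reports on whichever user runs it, so run it as the account that will own the cron entry.

```python
import os


def unwritable_dirs(media_dir, names=("logs", "logs-processed")):
    """Return the subdirectories of media_dir the current user cannot fully use.

    Scanning a directory and moving files in or out of it needs
    read, write and execute permission on the directory itself.
    """
    needed = os.R_OK | os.W_OK | os.X_OK
    bad = []
    for name in names:
        path = os.path.join(media_dir, name)
        if not (os.path.isdir(path) and os.access(path, needed)):
            bad.append(path)
    return bad
```

For the install path used in the crontab example, print(unwritable_dirs('/usr/local/surftrackr/media')) should print an empty list when everything is in order.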

Simon Burns
14 March, 2008