How to install Django and Surftrackr

Jan 27th 2008

At first glance, it would appear complicated to install a Surftrackr app, but this is very far from the truth. Squidalyser, Surftrackr's predecessor, was written in Perl and required the installation of several modules, including DBD/DBI for database access, and the often difficult-to-install GD modules to produce the graphics. Most of the Squidalyser support emails I received were about installing, not using, Squidalyser. Compared to this, Surftrackr's installation requirements are, in fact, fairly modest, and shouldn't prove too difficult for most systems administrators. As a bonus, setting up Django on your system will allow you to use many other Django apps, and (if you're a developer) start using Python and Django to develop your own projects.

Introduction

Update: Full story & pics available (in PDF format). This is a walk-through of the installation process, showing (in words and pictures) how to take a Linux box from its newly-installed state to a fully running Surftrackr installation.

2nd Update: Yes, you can install Surftrackr under Microsoft Windows. Full instructions are available for download in PDF format. 

This document is split into several parts, detailing how to install Django and set up Surftrackr. If you find I've missed anything, email me with the details and I'll fix it.

  1. Requirements.
  2. Installing mod_python for Apache.
  3. Installing Google Charts, MySQLdb and pyparsing.
  4. Installing Django.
  5. Downloading and unpacking Surftrackr.
  6. Configuring Apache and .htaccess.
  7. Setting up the database.
  8. Editing settings.py.
  9. Editing logfiles.py.
  10. Sync'ing the database.
  11. Running logfiles.py to enter log data into the DB.
  12. Setting up cron.

Requirements

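To install Django and Surftrackr, you will need the following software on your system: Apache with mod_python, Python, MySQL, Django, and the Python modules MySQLdb, pygooglechart, pyparsing and python-dateutil. The sections below cover installing most of these.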

Installing mod_python for Apache

There are basically two ways to do this: the easy way, and the slightly harder way. The easy way involves using your package manager: yum, apt-get, synaptic or equivalent on Linux, or the port utility provided with MacPorts on Mac OS. The basic command-line variants work like this:

The search command is recommended because the package name might differ slightly on your platform, eg lib-mod_python or mod_python25. You should run the install command as the root user or prepend sudo to it. If you're lucky, mod_python (or a package with a name similar to that) will be downloaded, installed and activated in your Apache config. All dependencies will also be taken care of, right down to installing Apache if required.

If you're unlucky, you'll need to do it manually. In fact, it's not too difficult: my httpd.conf contains just this line:

LoadModule python_module /usr/lib/apache2/modules/mod_python.so

However, it might be a bit more involved than that, so consult the official mod_python site for the details.

Installing Google Charts, MySQLdb and pyparsing

While you've got apt-get (or equivalent) warmed up and ready to go, see if it will provide the two required Python modules, MySQLdb and the Google Charts module (pygooglechart), for you.

Since the Python module for Google Charts is still fairly new, you might not find it in your software repository. My Ubuntu system failed to find it, and it's usually pretty good about these things. Check the developer's website for installation instructions. These basically come down to "easy_install pygooglechart" or (if you don't have setuptools for Python installed) download the code, unpack it and enter "python setup.py install" from the directory which is created when you unpack.

Later versions of Surftrackr also use the pyparsing module, which can be installed using "easy_install pyparsing". If that doesn't work, see the note on the developers' website for a more specific command which should work.

Update, 18 March 2008: You will also need to install python-dateutil, available via "easy_install python-dateutil". 
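
Once the modules are installed, a quick check from Python will tell you whether each one can actually be imported. This is only a sketch: the import names in REQUIRED are the usual ones for these packages (python-dateutil imports as 'dateutil'), and the helper name missing_modules is my own invention for illustration, not part of Surftrackr.

```python
def missing_modules(names):
    """Return the names from `names` which cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for the packages mentioned in this section.
REQUIRED = ["MySQLdb", "pygooglechart", "pyparsing", "dateutil", "django"]

if __name__ == "__main__":
    for name in missing_modules(REQUIRED):
        print("Missing module: %s" % name)
```

If it prints nothing, every module was found. Anything it does print needs installing before you go further.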

Installing Django

Installing Django is fairly easy, so just follow the instructions on their website. Please install the latest version of the software from Subversion, since it will offer the latest features and bug-fixes and, in my experience, has been stable. Also, according to reports I've had, Surftrackr will probably not work with a non-development version of Django.

Downloading and unpacking Surftrackr

Surftrackr can be downloaded from the downloads website. On a Unix/Linux system, here's how you would unpack each archive:

I would recommend assigning ownership of the surftrackr directory to the user your Apache daemon runs as. Type 'ps aux' and note the output:

www 28522 0.0 5.8 46716 30000 ? S 06:25 0:03 /usr/sbin/apache2 -k start
www 28523 0.0 5.6 45944 29132 ? S 06:25 0:03 /usr/sbin/apache2 -k start
www 28524 0.0 5.8 46904 30112 ? S 06:25 0:04 /usr/sbin/apache2 -k start
www 28525 0.0 5.9 47572 30920 ? S 06:25 0:06 /usr/sbin/apache2 -k start

Your output will vary slightly and so, possibly, will the name of your Apache process. It might be called httpd, httpd2 or apache. Anyway, the important column is the first one, which tells me my Apache daemon runs as the www user. I would therefore use this command to assign ownership of the surftrackr directory to that user:

chown -R www: surftrackr
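
If you'd rather not pick the user out of the 'ps aux' output by eye, the same check can be sketched in Python. The function name apache_users and the list of process names are assumptions for illustration (the names are the ones mentioned above). The parent Apache process normally runs as root, so root is skipped.

```python
# Process names Apache might run under; extend to suit your system.
APACHE_NAMES = ("apache2", "httpd", "httpd2", "apache")


def apache_users(ps_output):
    """Return the set of non-root users running Apache, given `ps aux` text."""
    users = set()
    for line in ps_output.splitlines():
        # ps aux prints ten columns and then the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        user = fields[0]
        binary = fields[10].split()[0].rsplit("/", 1)[-1]
        if binary in APACHE_NAMES and user != "root":
            users.add(user)
    return users
```

Feed it the text printed by 'ps aux'; with the output shown above it returns a set containing just 'www'.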

If you don't want the web user to own all of the surftrackr directory, it's vital it has read-write access to 'surftrackr/media/logs' and 'surftrackr/media/logs-processed':

chown www: surftrackr/media/logs
chown www: surftrackr/media/logs-processed
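
To confirm the permissions took effect, something like the following sketch can be run as the web user (for example via sudo -u www). The function name unwritable_dirs is hypothetical, and os.access only reports on the user running the script, so running it as root or as yourself tells you nothing about the www user.

```python
import os


def unwritable_dirs(paths):
    """Return the paths which are not directories the current user can read and write."""
    needed = os.R_OK | os.W_OK | os.X_OK
    return [p for p in paths
            if not (os.path.isdir(p) and os.access(p, needed))]


if __name__ == "__main__":
    # Paths are relative to the directory which contains surftrackr.
    for path in unwritable_dirs(["surftrackr/media/logs",
                                 "surftrackr/media/logs-processed"]):
        print("No read-write access: %s" % path)
```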

Configuring Apache and .htaccess

Apache 

Find a place where you'd like to keep your Django apps. On my system, this is /data, but if you don't want to create a non-standard directory, try /usr/local/django or /home/django. I'm going to describe my httpd.conf settings, and how to configure .htaccess to enable your Django directory. I prefer using .htaccess since it allows non-root users to change the configuration (which may or may not be a good thing; you decide) and doesn't require restarting Apache whenever any changes are made.

My httpd.conf looks like this:

<VirtualHost *>
    ServerAdmin surftrackr@gmail.com
    DocumentRoot "/data/surftrackr"

    ServerName demo.surftrackr.org
    ServerAlias demo.surftrackr.net
    ServerAlias demo.surftrackr.com

    ErrorLog /var/log/apache2/surftrackr-error_log
    CustomLog /var/log/apache2/surftrackr-access_log common

    <Directory "/data/surftrackr">
        AllowOverride All
        Options None
        Order allow,deny
        Allow from all
    </Directory>

    <Directory "/data/surftrackr/media">
        Options -Indexes
        LimitRequestBody 5242880
    </Directory>
</VirtualHost>

You'll note I'm using a VirtualHost directive, since my server also runs other sites like this blog, the developer website for Surftrackr and the downloads site. You won't need this if you're running just one site (Surftrackr) on your server, but enabling virtual hosts is not a bad idea anyway. It gives you the option to expand the number of sites, and create additional Surftrackr installations, maybe one for each department in your organisation. More information about virtual hosts can be found on the Apache website.

The section for /data/surftrackr/media is not strictly necessary, but prevents people getting direct access to your media directory, which will be set up when you unpack the Surftrackr archive. The LimitRequestBody directive caps the size of a request at 5242880 bytes (5 x 1024 x 1024, or 5 MB), which prevents very large files being uploaded to Surftrackr. On a production site you might want to allow such uploads, so omit the directive if you have (or expect to have) some whopping squid logfiles to upload.

The key clause for /data/surftrackr is AllowOverride All, which allows the .htaccess file in the Surftrackr directory to alter the configuration as required: make sure you include this directive.

.htaccess

The .htaccess file is located in the Surftrackr directory, and has these directives:

SetHandler python-program
PythonHandler django.core.handlers.modpython
SetEnv DJANGO_SETTINGS_MODULE surftrackr.settings
PythonDebug On
PythonPath "['/data'] + sys.path"
Order deny,allow
Allow from all

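This needs an explanation:

SetHandler python-program
Tells Apache to hand every request for this directory to mod_python.
PythonHandler django.core.handlers.modpython
Names Django's mod_python handler as the code which deals with those requests.
SetEnv DJANGO_SETTINGS_MODULE surftrackr.settings
Tells Django which settings module to load, in this case settings.py in the surftrackr directory.
PythonDebug On
Shows Python errors in the browser, which is useful while you are setting things up.
PythonPath "['/data'] + sys.path"
Adds the directory which contains the surftrackr directory to Python's module search path, so that surftrackr.settings can be imported.
Order deny,allow and Allow from all
Standard Apache access control, allowing all clients to reach the site.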

You only need to alter the PythonPath line (and maybe SetEnv but there are good reasons to avoid doing this) to suit your system setup. If you unpacked Surftrackr into /usr/local/surftrackr, your .htaccess would look like this:

SetHandler python-program
PythonHandler django.core.handlers.modpython
SetEnv DJANGO_SETTINGS_MODULE surftrackr.settings
PythonDebug On
PythonPath "['/usr/local'] + sys.path"
Order deny,allow
Allow from all
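
Note that the PythonPath entry is the parent of the Surftrackr directory, not the directory itself. As a small sketch of that rule (python_path_directive is a made-up helper name, purely for illustration):

```python
import posixpath


def python_path_directive(install_dir):
    """Build the PythonPath line for a Surftrackr directory such as /usr/local/surftrackr."""
    # The entry must be the directory which CONTAINS surftrackr.
    parent = posixpath.dirname(install_dir.rstrip("/"))
    return "PythonPath \"[%r] + sys.path\"" % parent


if __name__ == "__main__":
    print(python_path_directive("/usr/local/surftrackr"))
```

Passing '/data/surftrackr' gives the line from my own .htaccess, and '/usr/local/surftrackr' gives the line in the second example.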

That's enough to get Surftrackr, mod_python and Django playing nicely together. However, the surftrackr directory includes a media directory, which serves static files like images, style-sheets and JavaScript. This directory should not, therefore, be handled by Django and so it has its own .htaccess file which contains this one line:

SetHandler None

That disables Django in the media directory and all its sub-directories. You don't need to change it.

Setting up the database 

Django supports several databases, including MySQL and PostgreSQL, but stick to MySQL since Surftrackr uses some custom queries (for performance) and these are almost certainly only usable on MySQL. The SQL file you use to prime the database is also specific to MySQL (it's a dump of my database) and PostgreSQL will choke on it.

Developers' Note: If you're bold you might ignore this advice and go with PostgreSQL anyway. You'll need to modify the relatively simple queries in 'surftrackr/log/views.py', which are found in the routines 'count_websites_for_user', 'get_sitenames_for_user', 'count_words_in_lexicon' and 'count_weblogs_for_filetype'. When you've got it working, dump your PostgreSQL database and send it to me, along with the modifications to the queries! I'll include them in the subsequent release of Surftrackr.

Create a new database

You'll need to substitute your own usernames and passwords here, but here's how to set up a blank database for MySQL:

mysql -u root -p
[Enter your root MySQL password]
create database surftrackr default character set 'utf8';
grant all privileges on surftrackr.* to surftrackr@localhost identified by 'Eefoh3au';
flush privileges;
exit

Editing settings.py

The settings.py file in the Surftrackr directory is central to the application. You will need to edit the following items according to your setup:

DEMO_MODE
This is a setting I added to make managing the demo website easier. It disables some features like flagging users, etc, and might be useful if you want to let your users tinker with the app without the chance they'll break it. Set it to False for production environments.
ADMINS
You don't need to set this but you can, if you want, add a list of site admin names and email addresses. This is not the same as creating an admin user (which you'll do in the next step) so leaving this item blank will not affect your Surftrackr installation.
DATABASE_NAME
The database name you used when setting up MySQL, above. Use 'surftrackr' unless you have a compelling reason not to.
DATABASE_USER
The database user-name.
DATABASE_PASSWORD
The database user's password.
TIME_ZONE
Your time-zone, such as 'America/Chicago' or 'Europe/Madrid'.
LANGUAGE_CODE
Your language. Note that Surftrackr does not yet have any localisation options, although this setting might affect the Django-specific parts of the app.
MEDIA_ROOT
If you installed Surftrackr to (for example) '/usr/local/surftrackr', set this to '/usr/local/surftrackr/media'.
MEDIA_URL
Chances are, this is your base website address plus the '/media' path. So if you're running Surftrackr on your LAN on IP address 192.168.200.150, set this parameter to 'http://192.168.200.150/media'. The Django developers included this option since some intensely high-traffic sites might serve media files from a separate server to the app itself.
ADMIN_MEDIA_PREFIX
Set this to '/media/'.
TEMPLATE_DIRS
Set this to the base Surftrackr installation path + '/templates', eg '/usr/local/surftrackr/templates'.
SECRET_KEY
This is some random gibberish which you should change to a different string for each site you run, at least where you value security. Might not be so important in a LAN/office environment, but altering it is probably a good idea. Do not change the length, just over-write some of the characters.
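
Put together, the edited items might look something like this for an installation in /usr/local/surftrackr. Treat it as a sketch rather than a copy of the real file: INSTALL_DIR is a helper variable I've made up for the example, the password and SECRET_KEY are placeholders, and 'en-us' is just an example language code.

```python
# Assumption: Surftrackr was unpacked here. INSTALL_DIR is not a Django
# setting, just a convenience for this example.
INSTALL_DIR = '/usr/local/surftrackr'

DEMO_MODE = False            # False for production environments
ADMINS = ()                  # optional; may be left empty

DATABASE_NAME = 'surftrackr'
DATABASE_USER = 'surftrackr'
DATABASE_PASSWORD = 'change-me'      # placeholder: use your own password

TIME_ZONE = 'Europe/Madrid'
LANGUAGE_CODE = 'en-us'              # example value

MEDIA_ROOT = INSTALL_DIR + '/media'
MEDIA_URL = 'http://192.168.200.150/media'
ADMIN_MEDIA_PREFIX = '/media/'
TEMPLATE_DIRS = (INSTALL_DIR + '/templates',)

# Placeholder only: edit the value shipped in settings.py in place,
# keeping its length, rather than pasting this string.
SECRET_KEY = 'edit-the-shipped-value-in-place'
```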

Editing logfiles.py

In the surftrackr/media directory is a file called logfiles.py which processes uploaded logfiles. Edit it and change the value on the first line ('shebang' line) to match your python binary location. Also adjust the line sys.path.append('/data/') to reflect the location of your surftrackr directory, eg sys.path.append('/usr/local/'). If you are going to use any of the additional utilities in the 'surftrackr/utils' directory, edit each one in turn and, again, set sys.path.append to your local directory name.
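
For an installation in /usr/local/surftrackr, the top of the edited file might look like the sketch below. The /usr/bin/python path is an assumption (it happens to be where python lives on my system), and the print line is only there to show one way of finding your interpreter's path; it is not part of logfiles.py.

```python
#!/usr/bin/python
# The line above is the 'shebang' line: point it at your own python binary.
import sys

# Prints the path of the interpreter running this script, which is one way
# to find a value for the shebang line.
print(sys.executable)

# The directory which CONTAINS the surftrackr directory (example value).
sys.path.append('/usr/local/')
```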

Sync'ing the database

From the surftrackr directory, type ./manage.py syncdb and note any errors which occur. Check the settings.py file and the MySQL username/password if there are any problems. If you've set up the site correctly, you should be guided through creating an admin username and password with which you can log into the Django admin interface.

Then prime the database like this:

mysql surftrackr -u surftrackr -p < surftrackr/utils/surftrackr.sql
[Enter the surftrackr user's password] 

That sets up the required information like mime-types, top-level Internet domains, etc.

Running logfiles.py to enter log data into the DB

Log into the admin interface via your web browser. It will be located at http://yoursite/admin and you use the username and password you set in the previous step (Sync'ing the database).

In the Preferences section, click on General Settings and then the settings object:

[Screenshot: The general settings object]

Click on the hyper-link ('14 days') and you'll see a screen like this:

[Screenshot: Changing the settings]

You can use the Browse... button to upload a squid or DansGuardian logfile. It will be uploaded to a sub-directory of 'surftrackr/media/logs', but only if you changed the ownership of that directory according to the instructions above, in step 5. Click on Save when you have selected the logfile you want to upload.

Note that the logfile should either be in CLF (httpd) format, or squid native format:

Common Log Format
127.0.0.1 - - [22/Dec/2007:00:52:31 +0000] "GET http://database.clamav.net/daily.cvd HTTP/1.1" 200 353821 TCP_CLIENT_REFRESH_MISS:DIRECT
Squid Native Format
1198949569.523    169 127.0.0.1 TCP_MISS/200 2813 GET http://www.google.co.uk/ - DIRECT/208.69.34.230 text/html

DansGuardian allows you to set the format of the log in dansguardian.conf: use squid format.
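
If you're not sure which format a logfile is in, a rough check like the one below will tell you before you upload it. This is not how Surftrackr itself parses logs; the regular expressions and the detect_log_format name are my own illustration, based only on the two sample lines above.

```python
import re

# Squid native: epoch.millis, elapsed, client, RESULT/status, bytes, method, URL
SQUID_NATIVE = re.compile(
    r'^\d+\.\d+\s+\d+\s+\S+\s+[A-Z_]+/\d{3}\s+\d+\s+\S+\s+\S+')

# CLF: host ident user [timestamp] "request" status bytes
COMMON_LOG = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?:\d+|-)')


def detect_log_format(line):
    """Return 'squid', 'clf' or None for a single logfile line."""
    if SQUID_NATIVE.match(line):
        return 'squid'
    if COMMON_LOG.match(line):
        return 'clf'
    return None
```

Run it against the first line or two of the file; a result of None suggests the log is in some other format and needs converting first.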

When the logfile has been uploaded, change to the 'surftrackr/media' directory (at the Unix command line) and start the log-processing:

python logfiles.py

This could take a while, depending on the performance of your system. If everything goes well (no errors) you can start using Surftrackr via your web browser. Errors can be tracked by going to the Messages menu item from the Surftrackr front-page:

[Screenshot: The 'Messages' screen]

Items in red are show-stoppers, and I'd appreciate it if you could send me a screen-shot or cut-n-paste of the message. Items in orange are warnings, so the line will be skipped but processing of other lines will continue. Green items are just for information. Once you see "Finished processing ..." on the screen, you're ready to use Surftrackr.

Setting up cron

To run the logfiles.py script regularly, set up a crontab entry for an appropriate user:

*/5 * * * * cd /data/surftrackr/media && /usr/bin/python logfiles.py

That will run logfiles.py every five minutes by cd'ing into the surftrackr/media directory and running python with logfiles.py as an argument. Adjust paths to suit your situation.

Note that you don't have to worry about processes "over-lapping": if processing of the uploaded file(s) hasn't finished by the time the next process is kicked off from cron, the file won't be processed a second time. Surftrackr has a simple file-locking mechanism which should prevent this from happening. Even if it did happen, Surftrackr is written in such a way that any lines already stored in the database won't be added again. The same goes for users and workstations: those which are known already won't be added again.
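
For the curious, here is a sketch of one common way to do this kind of file-locking in Python. It is not Surftrackr's actual code, and the function names and the lock-file path are assumptions for illustration. It relies on os.open with O_CREAT and O_EXCL, which fails if the file already exists.

```python
import errno
import os


def acquire_lock(path):
    """Try to create the lock file; return True on success, False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as err:
        if err.errno == errno.EEXIST:
            return False      # another run is still in progress
        raise
    # Record our process id, which helps when tracking down a stale lock.
    os.write(fd, str(os.getpid()).encode("ascii"))
    os.close(fd)
    return True


def release_lock(path):
    """Remove the lock file once processing has finished."""
    os.remove(path)
```

A script using this would call acquire_lock at the start, exit quietly if it returns False, and call release_lock when it has finished, so a run started by cron while another is still busy simply does nothing.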

Errors? Omissions? Got a suggestion? Let me know!

If you find any errors or omissions, please contact me and let me know. I'm also very interested in your impressions of Surftrackr, and your suggestions for improving it.

Simon Burns, 27 Jan 2008.