11/22/2002 - DESPERATELY SEEKING WEB LOG FILE STANDARDS
Traffick Online Zine - Cory Kleinschmidt
As any webmaster or search engine marketer knows, you can't measure the success
of your web site or of your online marketing campaigns without knowing your
site statistics. And the only way to know your stats is to dig deep into the
bowels of your server's log files. But once you get in, you might not make it back!
Every page viewed on your site, every visit to your site, every referring URL,
and hundreds of other bits of information, is stored in these labyrinthine text
files that can grow to be hundreds of megabytes in size. It's nearly impossible
to decipher these log files yourself, which is why software companies have created
versatile -- and pricey -- programs to extract the useful nuggets of information
contained therein.
Because many web visitors come from big ISPs that use proxy servers rather than fixed IP addresses to provide connectivity and route web traffic, IP-based visitor data can be misleading. Many computers may share the same IP address if, for example, they all arrived through the same aol.com or earthlink.com proxy server, which can support hundreds of dynamic dial-up connections. Referrer data suffers from the same ambiguity: the earliest visit within a two-hour time frame from a specific IP address may not be the one that actually generated the customer order.
Here's a basic
definition of log files (specifically, the extended log file format), courtesy
of the World Wide Web Consortium (W3C), the web standards group:
"An extended log file contains a sequence of lines containing ASCII
characters terminated by either the sequence LF or CRLF. Log file generators
should follow the line termination convention for the platform on which they
are executed. Analyzers should accept either form. Each line may contain either
a directive or an entry.
Entries consist of a sequence of fields relating to a single HTTP transaction.
Fields are separated by whitespace, the use of tab characters for this purpose
is encouraged. If a field is unused in a particular entry dash "-"
marks the omitted field."
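To make that definition concrete, here's a rough sketch in Python of what a parser for this format might look like. The sample log lines and field names below are invented for illustration; they are not taken from any real server or vendor's product:

```python
# A minimal sketch of parsing a W3C extended log file, following the
# definition quoted above: '#' lines are directives, everything else
# is an entry of whitespace-separated fields, with "-" marking an
# omitted field. The sample data is invented.
sample_log = """#Version: 1.0
#Fields: date time cs-method cs-uri sc-status
2002-11-22 10:15:01 GET /index.html 200
2002-11-22 10:15:02 GET /logo.gif 200
2002-11-22 10:15:09 GET /missing.html 404
"""

def parse_extended_log(text):
    """Yield one dict per entry, keyed by the names in the #Fields directive."""
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):                 # directive, not an entry
            if line.startswith("#Fields:"):
                fields = line.split(":", 1)[1].split()
            continue
        values = line.split()                    # fields separated by whitespace
        # Per the definition, a bare "-" marks an omitted field.
        yield dict(zip(fields, (None if v == "-" else v for v in values)))

entries = list(parse_extended_log(sample_log))
print(len(entries))              # 3
print(entries[2]["sc-status"])   # 404
```

A real analyzer's parser is far more elaborate, of course, but this is the skeleton every one of them builds on.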
WebTrends is the industry
standard log file software, and is used by thousands of companies around the
world. Some hosting companies offer WebTrends reports to every site hosted on
their servers free of charge, or perhaps for a small fee. There are also companies
such as HitBox that offer free ASP (application service provider) hosted services
that track your site stats in exchange for placing ads on your site.
Any hosting company worth its salt will give you access to your raw log files,
which you can download and then analyze on your own using software from vendors
like 123LogAnalyzer, SurfStats,
and Sawmill.
These log file analyzers are usually good at giving you a general picture of
the overall health of your site, but analyze your log files with more than one
tool and you'll see that the world of site stats is a murky one, filled with
competing standards, conflicting definitions of basic terminology, and few
easy ways of understanding what the numbers mean.
One log file tool may report 100,000 page views for your site in a month's
time, and another may report just 80,000. I talked with several of these vendors,
and they all gave a litany of possible reasons for the discrepancy in page views:
some log file software counts failed pages as page views, some use different
definitions of what constitutes a page view or user session, or maybe the other
program's parser (the part of the software that scans the log file entries)
isn't up to snuff. But none of the companies would admit that their software
could do a better job of reporting numbers.
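To see how much the counting policy alone can move the totals, consider this small sketch. The requests and the two policies are invented stand-ins for the vendors' differing definitions, not any actual product's rules:

```python
# Two "page view" policies applied to the same requests. Whether a
# failed (404) request for a page counts is exactly the kind of
# definitional choice that makes two analyzers disagree.
requests = [
    {"uri": "/index.html", "status": 200},
    {"uri": "/style.css",  "status": 200},   # not a page, never counted
    {"uri": "/about.html", "status": 404},   # failed page request
    {"uri": "/about.html", "status": 200},
]

PAGE_SUFFIXES = (".html", ".htm")

def pageviews_lenient(reqs):
    """Count every request for a page URI, failed or not."""
    return sum(1 for r in reqs if r["uri"].endswith(PAGE_SUFFIXES))

def pageviews_strict(reqs):
    """Count only successful (2xx) page requests."""
    return sum(1 for r in reqs
               if r["uri"].endswith(PAGE_SUFFIXES) and 200 <= r["status"] < 300)

print(pageviews_lenient(requests))  # 3
print(pageviews_strict(requests))   # 2
```

Scale that one-request difference across a month of traffic and you get the 100,000-versus-80,000 gap described above.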
A referrer is the URL of the page the user was viewing when the browser requested something from your server. For a visit from elsewhere, that's the last web page the user was on before coming to your site, so this information tells you which pages link to you. The vast majority of requests, however, carry your own URLs as referrers, since most HTML pages contain links to other objects such as graphics files. If one of your HTML pages contains links to 10 graphic images, then each request for that page will produce 10 more hits with the referrer specified as the URL of your own HTML page. The data has gaps, too: if a user types in your URL directly, the browser sends no referrer at all, and the referrer field was not supported in all browser types and versions, although recent browsers do support it.
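Here's a rough sketch of what separating external referrers from your own pages looks like in practice. The log lines, host name, and regular expression are illustrative assumptions based on Apache's common "combined" log format, not any vendor's actual parser:

```python
import re
from urllib.parse import urlsplit

# Sketch: pull the quoted referrer out of Apache combined-format log
# lines, then keep only external referrers -- skipping "-" (no referrer
# sent) and requests referred by our own pages. Sample data is invented.
OWN_HOST = "www.example.com"

LINE_RE = re.compile(r'"(?P<request>[^"]*)" \d{3} \S+ '
                     r'"(?P<referrer>[^"]*)"')

lines = [
    '1.2.3.4 - - [22/Nov/2002:10:15:01 -0600] "GET /index.html HTTP/1.0" '
    '200 5120 "http://www.google.com/search?q=log+files" "Mozilla/4.0"',
    '1.2.3.4 - - [22/Nov/2002:10:15:02 -0600] "GET /logo.gif HTTP/1.0" '
    '200 900 "http://www.example.com/index.html" "Mozilla/4.0"',
]

external = []
for line in lines:
    m = LINE_RE.search(line)
    if not m:
        continue
    ref = m.group("referrer")
    # "-" means the browser sent no referrer (typed-in URL, old browser);
    # our own host as referrer is just a page requesting its images.
    if ref in ("-", "") or urlsplit(ref).hostname == OWN_HOST:
        continue
    external.append(ref)

print(external)   # ['http://www.google.com/search?q=log+files']
```

The image request referred by the site's own index page is filtered out; only the search-engine referral survives, which is the part a marketer actually cares about.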
So, why does it seem so impossible for different log file programs to report
numbers consistently? Here are some of the reasons:
1. There is no standard log file format.
Log file formats come in many different flavors. There are different formats
based on Microsoft's Internet Information Server (IIS); there's one for the
free Apache web server; and there are different formats for proxy servers
(which act as Internet access gateways for networks).
2. There is no standard method for interpreting and parsing log files.
Many log file analyzers (OpenWebScope, for example) report the useless "hits"
figure under the more-useful-sounding label "page views." There are also
different definitions of what constitutes a visitor session. Some programs
count a visitor who goes inactive for 10 minutes or more and then comes back
as a new visitor. Obviously, users shouldn't be counted twice, but when you
deal with dynamic IP addresses, it's hard to know whether the user at IP
address 64.217.243.22 is the same person who was there 10 minutes ago.
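A quick sketch shows how the timeout choice alone changes the count. The hit times for the single IP address are made up, and the 10- and 30-minute cutoffs stand in for two analyzers' differing session definitions:

```python
# How the session-timeout rule changes visitor counts for one IP address.
# Times are minutes since midnight; the list is assumed non-empty.
hits = [0, 2, 14, 15, 40]   # hit times for IP 64.217.243.22

def count_sessions(times, timeout):
    """A new session starts whenever the gap since the last hit exceeds timeout."""
    sessions = 1
    for prev, cur in zip(times, times[1:]):
        if cur - prev > timeout:
            sessions += 1
    return sessions

print(count_sessions(hits, timeout=10))  # 3 sessions under a 10-minute rule
print(count_sessions(hits, timeout=30))  # 1 session under a 30-minute rule
```

Same log, same visitor, and one tool reports three times as many "visits" as the other.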
3. There is no standard way to track and measure success.
Unless you contract with a specialized software company to track banner ad campaigns,
PPC campaigns, and other sales promotions, you will have a difficult time calculating
ROI, tracking referrers, and so on. Log files do record the pass-through URL
parameters appended to links, but the data is so generic as to be almost useless.