For the past few weeks we have been working with a customer to evaluate the traffic capacity of their Web infrastructure – they had been experiencing unusual performance issues. We uncovered some interesting results that highlight the need to continually analyze your Web traffic logs and not just rely on Google Analytics or other live visitor traffic products. Although they're great for understanding how visitors interact with your site, traffic analytics should not be the sole data point for infrastructure capacity planning – or evaluating your site visitor experience.

During our initial meeting with the customer they shared Google Analytics data indicating they were receiving 80,000 to 100,000 daily page views. Based on their deployment, I knew they had plenty of capacity to handle this load. The following week we collected more data and, when we analyzed the results, we were a bit surprised by our findings.

Working with their IT organization, we validated that their Web servers were configured to record the correct data. They were using IIS, so we made sure the servers were logging the following: Date; Time; Client IP Address; User Name; Server Port; Method; URI Stem; URI Query; Protocol Status; Win32 Status; Bytes Sent (*); Bytes Received (*); Time Taken (*); User Agent; Referrer (*). The properties marked (*) are not enabled by default in IIS.
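To make the logging setup a little more concrete, here is a minimal Python sketch of reading a log in the W3C extended format that IIS writes. It relies on the `#Fields:` directive in the log to name the columns. The function name `parse_w3c_log`, the sample lines, the date and the IP addresses are all invented for illustration; they are not taken from the customer's logs.

```python
def parse_w3c_log(lines):
    """Yield one dict per request from W3C extended log lines."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The directive lists the column names for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            continue  # other directives (#Software, #Version, #Date)
        values = line.split(" ")
        if len(values) != len(fields):
            continue  # skip malformed rows rather than guess
        yield dict(zip(fields, values))


# Hypothetical sample data; IIS replaces spaces inside a value with "+".
sample = [
    "#Software: Microsoft Internet Information Services 7.5",
    "#Fields: date time c-ip cs-username s-port cs-method cs-uri-stem "
    "cs-uri-query sc-status sc-win32-status sc-bytes cs-bytes time-taken "
    "cs(User-Agent) cs(Referer)",
    "2011-10-28 14:02:11 192.0.2.10 - 80 GET /default.aspx - 200 0 18234 "
    "412 350 Mozilla/5.0+(Windows+NT+6.1) -",
    "2011-10-28 14:02:12 192.0.2.77 - 80 GET /news/item.aspx id=7 200 0 "
    "9120 380 1240 gsa-crawler+(Enterprise) -",
]

records = list(parse_w3c_log(sample))
for record in records:
    print(record["cs-uri-stem"], record["time-taken"], record["cs(User-Agent)"])
```

If one of the (*) properties is not enabled, its column simply will not appear in the `#Fields:` line, so a quick look at that directive is an easy way to confirm the logging configuration.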

The Web server logs are like fingerprints that tell you what is happening on your Website. Unlike Google Analytics, which only counts clients that run its JavaScript tracking code, the Web server logs record every request. In a later post, I will write about some of the tools we use to analyze the log files here at Systems Alliance. We know log files provide total traffic data like hits and page views, but they also provide context for understanding response performance. The “time taken” property is especially valuable now that most Websites are powered by an application platform such as PHP, ASP.NET, Java or ColdFusion rather than serving static HTML files. It can tell you how your application server is performing and gives you a better understanding of your visitors’ experience on your site.
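As a rough illustration of how the “time taken” property can be put to work, the sketch below averages it per URL and lists the slowest ones. It assumes each request has already been parsed into a dict keyed by the W3C field names `cs-uri-stem` and `time-taken` (which IIS records in milliseconds). The function name `slowest_urls` and the sample requests are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def slowest_urls(records, top=3):
    """Return (url, average time-taken in ms) pairs, slowest first."""
    times_by_url = defaultdict(list)
    for record in records:
        times_by_url[record["cs-uri-stem"]].append(int(record["time-taken"]))
    averages = {url: mean(times) for url, times in times_by_url.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical parsed requests, for illustration only.
requests = [
    {"cs-uri-stem": "/search.aspx", "time-taken": "1500"},
    {"cs-uri-stem": "/search.aspx", "time-taken": "2500"},
    {"cs-uri-stem": "/default.aspx", "time-taken": "300"},
    {"cs-uri-stem": "/logo.png", "time-taken": "15"},
]

for url, average_ms in slowest_urls(requests):
    print(f"{url}: {average_ms:.0f} ms on average")
```

An average hides outliers, so in practice a percentile or a simple count of requests over some threshold may be a better signal for long-running requests.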

So, we analyzed our customer's traffic for six days and compiled the high-level data in the chart below. The “Total Hits” column is all requests, including pages, applications, images, documents, multimedia, etc. The “Page and App URL” column is requests for pages and the specific applications that power their Website. The “Visitor Page Views” column is requests made by a browser.

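| Date      | Total Hits | Page & App URL | Visitor Page Views | Visitor % |
|-----------|------------|----------------|--------------------|-----------|
| Friday    | 1,976,340  | 193,717        | 75,395             | 38.9%     |
| Saturday  | 1,069,750  | 156,091        | 37,339             | 23.9%     |
| Sunday    | 1,222,118  | 186,099        | 41,558             | 22.3%     |
| Monday    | 2,296,985  | 208,269        | 90,464             | 43.4%     |
| Tuesday   | 2,286,560  | 209,440        | 89,060             | 42.5%     |
| Wednesday | 2,272,165  | 205,576        | 87,645             | 42.6%     |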
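For readers who want to check the arithmetic, the “Visitor %” figures appear to be the visitor page views divided by the page and application URL requests for the same day. The short Python sketch below recomputes them from the numbers in the chart; the function name `visitor_share` is just an illustrative choice.

```python
# (day, total hits, page & app URL requests, visitor page views) from the chart
rows = [
    ("Friday", 1_976_340, 193_717, 75_395),
    ("Saturday", 1_069_750, 156_091, 37_339),
    ("Sunday", 1_222_118, 186_099, 41_558),
    ("Monday", 2_296_985, 208_269, 90_464),
    ("Tuesday", 2_286_560, 209_440, 89_060),
    ("Wednesday", 2_272_165, 205_576, 87_645),
]


def visitor_share(page_and_app_requests, visitor_page_views):
    """Percentage of page and app requests that came from visitors' browsers."""
    return round(100 * visitor_page_views / page_and_app_requests, 1)


for day, total_hits, page_and_app, visitor_views in rows:
    share = visitor_share(page_and_app, visitor_views)
    print(f"{day:<10} {share:>5}% visitors, {100 - share:.1f}% everything else")
```

Put another way, on every one of the six days more than half of the page and application requests did not come from a visitor's browser.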

We immediately focused on the huge difference between the “Page and App URL” column and the “Visitor Page Views” column. After reviewing the details of the data, we realized the site was more popular with search engine crawlers than with site visitors. Further analysis revealed that the customer’s internal Google Search Appliance (GSA) was generating the majority of the non-visitor traffic! We’re now helping them evaluate their GSA configuration to understand whether the GSA is caught in some sort of loop or whether it really needs to crawl the entire site so regularly.
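One simple way to separate crawlers from visitors is to look at the User Agent recorded for each request. The Python sketch below is only an illustration of that idea, not the tooling we used: the marker list is an assumption and far from complete, the names `classify` and `traffic_breakdown` are hypothetical, and the sample user agents are invented.

```python
from collections import Counter

# Assumed substrings that identify some common crawlers (not exhaustive).
CRAWLER_MARKERS = ("gsa-crawler", "googlebot", "bingbot", "slurp")


def classify(user_agent):
    """Return the matching crawler marker, or "visitor" if none matches."""
    lowered = user_agent.lower()
    for marker in CRAWLER_MARKERS:
        if marker in lowered:
            return marker
    return "visitor"


def traffic_breakdown(user_agents):
    """Count requests per traffic source."""
    return Counter(classify(agent) for agent in user_agents)


# Hypothetical user agents as they might appear in an IIS log.
sample_agents = [
    "Mozilla/5.0+(Windows+NT+6.1)",
    "gsa-crawler+(Enterprise)",
    "gsa-crawler+(Enterprise)",
    "Mozilla/5.0+(compatible;+Googlebot/2.1)",
]

for source, count in traffic_breakdown(sample_agents).most_common():
    print(source, count)
```

User agents can be spoofed or left blank, so a breakdown like this is a starting point; checking the client IP addresses against known crawler hosts is a sensible follow-up.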

The lesson here is clear. Evaluating visitor traffic without considering the raw server log files may not present a completely accurate view of site traffic (marketers take note!). Beyond helping you understand the capacity of your Web server and network infrastructure, this data can expose problems that are negatively affecting the user experience, such as long-running requests.

Happy Halloween! Don’t ignore those “creepy” crawlers…