This is Professor Michael Rappa from North Carolina State University in Raleigh, North Carolina. And I’m here to speak with you about my course, Managing the Digital Enterprise.
One important consequence of the emergence of the digital enterprise is that everything that occurs is recorded in server logs and databases. The kinds of information that are collected, and how that information can be used, are the topic of my conversation today. So imagine, if you will, as you go through the course of your various activities, let’s say on the web with various web sites, that with each click of the mouse you are, in a sense, leaving a digital trail across the constellation of servers that you’re touching upon. And to be sure, when you think about web sites like Amazon, for example, and the large number of people who use those sites each day, each click of that mouse is creating a data stream.
And that data stream ends up being quite an enormous amount of data. And what kind of data are being collected? And given the vast amount, how can it be usefully analyzed to improve upon what we do as a digital enterprise, to understand our customers better, to personalize the experience for them? There are a whole host of potential opportunities, as well as important questions about to what extent users or customers understand the nature of this process, and to what extent they have control over the data that’s being collected about them in their interactions with these enterprises. And also obviously questions of presumptions of privacy and the security of that information. And those are issues that we’ll touch upon in later conversations in the course.
What I want to focus on today is simply what types of data are collected, and to what use that data is put in organizations today.
So starting with the types of data that are collected, there are really two major distinctions that can be made. One is a passive data collection process. And what I mean by this is data that’s collected in the background by a server as a user is clicking their way through those resources.
And so if the server is properly configured, it will collect what are called access logs, recording what files are being accessed by users. It will record error logs, and make notations about when problems occur in not being able to fulfill certain requests. And it might also be configured to collect what’s called a cookie log. And cookies are a term that pops up from time to time.
It’s important that one has a clear understanding of what a cookie is. A cookie is simply a small string of data that’s passed between a web site and a user’s browser, whether that’s Internet Explorer, or Firefox, or Safari, or whatever the user might be using. When a user comes to your web site, the first thing a web site will do, if it’s configured to do so, is to, in a sense, communicate with that browser and ask it to return a cookie, this string of data that was set by the server. If the person has never visited the site before, then it passes a cookie to the browser. If the person has visited the site before, then that cookie is returned to the server.
And in doing so, it does a few things. And perhaps most importantly, not only does it recognize the individual as someone who’s returning from a previous visit, but also it may enable the information that’s being collected from that point forward to be tied to a specific person. It might also do some other things like help automatically log the person in if a log in is required. And to the extent that it recognizes a user, it will allow any kind of preferences or personalization features to register.
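As a rough illustration of that exchange, here is a minimal Python sketch of the server side. It assumes a cookie named visitor_id and a helper called handle_request, both invented for this example; a real site would also set an expiry, and possibly other attributes, on the cookie.

```python
from http.cookies import SimpleCookie
import uuid


def handle_request(cookie_header):
    """Return (visitor_id, is_returning, set_cookie_value_or_None).

    cookie_header is the raw Cookie header sent by the browser, or None.
    The cookie name "visitor_id" is an assumption made for this sketch.
    """
    jar = SimpleCookie()
    if cookie_header:
        jar.load(cookie_header)

    if "visitor_id" in jar:
        # The browser returned a cookie set on an earlier visit.
        return jar["visitor_id"].value, True, None

    # First visit: create an identifier and ask the browser to store it.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out["visitor_id"] = visitor_id
    out["visitor_id"]["path"] = "/"
    return visitor_id, False, out["visitor_id"].OutputString()
```

On a first call with no cookie, the function hands back a value for a Set-Cookie header. When the browser later sends that value back, the same identifier is recognized as a returning visitor.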
Now in addition to this passive data that’s being collected in the background, there are also opportunities to actively collect data about users. And so when a web site presents a user with survey forms for demographic data, or when a user makes a purchase and payment, or provides shipping information, or when they make some kind of personalization to the web site, all of that is an active process by the user, in which they’re making a decision at various points in time to share data about themselves.
There is both this passive process in the background, and then the active process in which the user is sharing data. And to the extent that an enterprise is configured to do so, it may marry these two forms of data together. So it may enable us to analyze both the passive and the active data, and to give us a sense of the user, a profile of the user, in terms of their demographics and their usage habits. The passive and the active data may also be joined together in situations where a user must authenticate themselves on a server.
So when you go to a web site, even a web site where you’re not purchasing anything but where they’re simply asking you to log in as a unique user, that log in enables the passive data to be tied not only to yourself and whatever other demographic information has been collected about you, but also to your entire past history of usage, whenever else you have used that web site. And so you can imagine then, if you go to a web site and you’re logged in as a user, every one of those log ins over a period of time is tied together by that authentication.
And so you can imagine, if you thought for a second about your other interactions, your other commercial or non-commercial interactions with businesses and organizations, that to the extent you’re now engaging with those organizations in a digital environment, you’re leaving a kind of record of each one of those interactions. And so over time, that becomes a fairly substantial amount of data about the extent and nature of your interaction with that organization.
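To make the idea of marrying passive and active data concrete, here is a small Python sketch that joins log records to registration data on the authenticated user name. The record layout (keys user, time, path) and the function name build_profiles are assumptions for illustration only.

```python
def build_profiles(log_records, registrations):
    """Join passive log records to actively supplied data by user name.

    log_records: iterable of dicts with hypothetical keys "user", "time", "path".
    registrations: dict mapping user name -> demographic data the user supplied.
    """
    profiles = {}
    for rec in log_records:
        user = rec.get("user")
        if not user or user == "-":
            # Unauthenticated request: it cannot be tied to a specific person.
            continue
        profile = profiles.setdefault(
            user,
            {"demographics": registrations.get(user, {}), "requests": []},
        )
        profile["requests"].append((rec["time"], rec["path"]))
    return profiles
```

Each profile then carries both what the user told the site and the full history of requests made while logged in.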
And this is really what’s fundamentally important and different about the digital enterprise that you need to recognize, that there is an unprecedented amount of data that’s being recorded, and being recorded and stored in databases pretty much indefinitely about the nature of your interaction.
Let’s start with the passive information that’s being collected by the server in the so-called access logs. First and foremost is the remote host, which is simply the IP number or domain name of the server from which the visitor is arriving. And so if you are a cable modem user, or if you’re working in a corporate, business, or university environment, at each place that you visit you’re leaving a stamp of the server that you’re originating from. And so, for example, an N.C. State student visiting some distant location on the web is going to leave, on that distant server, a remote host of ncsu.edu along with whatever other specific information identifies the computer location that they’re using.
Second, if an authentication is required, then the user name which you authenticate with is then stamped in the server log. Also the specific date and time of this access is going to be recorded, and then what specific information is being requested. So when you go to some place on the web and you pull up a web page, you’re making a request for a specific file. Actually, you’re typically making a request for several files, that is the page itself, and then any of the images or other resources that are associated with that page.
If you deconstructed a typical web page, what you would find is that it’s a collection of files, both text (the HTML file, or whatever other language it’s programmed in) and also the associated images that go along with that page. And so when you make a request for a page, you can actually be making a request for many files. And then also recorded is how the server responded to that request, so some kind of a status code. If you requested a resource that was there, then a status code records that basically it served up the page okay. If the resource that you requested wasn’t there on the server, then it records an error code. The server may also be configured to record the amount of content that was transferred during that request, so how many bytes of information were transferred.
Perhaps one of the more interesting pieces of information that gets exchanged and placed into the access log is called the “referrer.” A referrer is the URL, or the address, of the web page that had the link to the page that you are requesting. So for example, if you’re on a particular web page and you click on a link to go to some other destination, another page or another web site, then the log at that distant location is going to receive the address or URL of the web page that you clicked through to arrive there. This is called the “referral,” or “referrer.” And that is logged in the access file as part of the request.
If you take a moment to think about that, an access log has some pretty interesting information, in terms of where users may be coming from in order to arrive at a particular web site. And so as you might imagine, that information is some of the kinds of data that people would want to analyze very closely. But also to the extent that the referring site may be a search engine, there may be other information embedded in that referral as well.
So for example, suppose you go to a search engine like Google, and you search on a particular word or phrase, and the results are returned to you from Google. To the extent that you click on one of those resulting links, the site that you arrive at is going to receive a referral that not only identifies Google as the web site that you came from, but also has embedded in that URL the actual search terms that you used to find that link. So whatever you typed into the bar at Google’s site, or Yahoo, or whatever search engine it is, that information can also be forwarded as part of that referral link when you click through the results of your search.
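As a sketch of how search terms might be pulled out of a referring URL, the Python below reads the query string and looks for a search parameter. The parameter names tried here (q, p, query) are assumptions, and whether a given search engine actually includes the terms in the URL it passes along is not guaranteed.

```python
from urllib.parse import urlsplit, parse_qs


def search_terms(referrer, params=("q", "p", "query")):
    """Return (referring_host, search_terms_or_None) for a referring URL.

    The parameter names in params are guesses; each search engine
    chooses its own, so adjust them for the sites you care about.
    """
    parts = urlsplit(referrer)
    query = parse_qs(parts.query)
    for name in params:
        if name in query:
            # parse_qs decodes "+" and percent-escapes back into plain text.
            return parts.hostname, query[name][0]
    return parts.hostname, None
```

For a referring address such as http://www.google.com/search?q=digital+enterprise&hl=en, this would report the host www.google.com and the phrase "digital enterprise".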
What this means is that when one looks at the access logs for a particular server, you’re going to see not only the sites that may have sent your users to your location, but also specific terms and phrases that those users may have used in a search engine in order to locate your enterprise. The server may also be recording information about how much time it took to process the request, what kind of browser, operating system, and version of the operating system the user was using to access your site, as well as any information about a cookie that’s being passed between the browser and the server.
All of this passive information that’s being collected may have some value to it. Users may be more or less identifiable through this process, depending on how much information is getting passed between the servers. And really for much of the past decade, people have been trying to come up with ways to take this raw data and turn it into something meaningful, so that web site operators can not only understand their users better, what they’re interested in and why they come to a web site, but also optimize the overall architecture of a site. It also matters for the technical operations, to be sure that whatever the demand may be among the users, from an operating point of view we’re able to meet that demand.
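To show what one of these log entries can look like in practice, here is a Python sketch that parses a single line in the widely used "combined" log layout. The talk does not name a particular format, so the layout, the sample line, and the function name parse_line are assumptions; the processing time and cookie fields mentioned above are not part of this layout and would need extra server configuration.

```python
import re
from datetime import datetime

# Fields, in order: remote host, identity, authenticated user, timestamp,
# request line, status code, bytes sent, referrer, browser (user agent).
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # A "-" in the size field means no content was transferred.
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    return rec


# Made-up example entry, for illustration only.
sample = (
    'host42.ncsu.edu - jsmith [19/Sep/2005:10:15:32 -0400] '
    '"GET /index.html HTTP/1.1" 200 5120 '
    '"http://www.example.com/links.html" "Mozilla/5.0 (Windows NT 5.1)"'
)
record = parse_line(sample)
```

Once each line is broken into named fields like this, the measures discussed in the rest of the talk become straightforward counting exercises over the records.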
Now the methods for analyzing passive data, and especially for connecting passive data with the actively collected data, are becoming more and more sophisticated over time. But just taking the basic passive data, there are a number of measures and terms that have come into use. And one of the problems that we’re presented with today is that there are no standardized definitions for many of these measures. And so a fair bit of confusion occurs when people talk about overall usage metrics for web sites.
But let me go through just some of the terminology that no doubt you’re hearing about, and what it might actually mean. So people frequently talk about users, how many users they have on the web site. If you require authentication to use a web site, then you might have a fairly good measure of the number of unique users that you have on a site.
But without that specific kind of authentication, one has to understand that not all users, that is, not all accesses or requests being made to web sites, actually come from humans. There’s a lot of automated processing that goes on around the web. And so a fair amount of requests, maybe anywhere from 10 to 20% of the requests that are being made on a particular site, are automated requests by search engine bots and crawlers.
When one talks about how many visits are being made to their web site, then one wants to account for each visit as being some actual user or customer who’s come to the web site. And to the extent that you can identify that user through authentication, or a cookie, or some other specific detail, you can count the unique users or unique visitors who come to the site.
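One simple, and admittedly crude, way to estimate the automated share is to look for telltale words in the browser identification string that each request carries. The marker list below is an assumption for illustration; real crawlers do not always identify themselves, so this will undercount.

```python
# Substrings that commonly appear in crawler identification strings.
# This list is illustrative, not exhaustive.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_automated(user_agent):
    """Guess whether a request came from an automated client."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def automated_share(user_agents):
    """Fraction of requests whose user agent looks automated."""
    if not user_agents:
        return 0.0
    flagged = sum(1 for ua in user_agents if is_automated(ua))
    return flagged / len(user_agents)
```

Requests flagged this way would typically be set aside before counting visits or visitors.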
Now sometimes the word “visit” is also used interchangeably with the term “session.” And so we might want to be able to enumerate how many sessions we logged on a particular day, or week, or month, how the number of sessions is changing over time, to give us a sense of just what our growth and usage looks like.
Another thing that tends to be measured is so-called page views. And this means every time a request is made for a specific page by a visitor. Depending on what kind of site it is, you may be more or less interested in the number of page views, and in how many page views are made during a particular visit or session by a user. If you’re serving up content, like a news site, then maximizing the number of page views may be an important goal for you, because each page view has revenue-generating advertising on it. And so part of what you’re trying to do is to get users to become more engaged in the site.
But it’s important to recognize that not every site is about maximizing page views. Too many page views during a particular session might be indicative of a user who’s lost or confused, or unable to solve a problem, or trying to find specific information that they’re looking for. So if you’re a retail site like Amazon, minimizing the number of page views, or a low number of page views during the session may, in fact, be the goal, because it means the person came. They found what they were looking for. And they hopefully checked out, and purchased the item, and ended the session.
In that regard, if one thinks about a session, a user coming to a web site and then clicking through a series of pages, one might look at that as a kind of time-ordered series, and call it a click stream, or a click path, that gives you a view of what a user does over a discrete period of time. And understanding the efficiency of that click stream, and which click streams end up in important results like making a purchase, becomes a potentially important goal for a web site operator. So a click stream, and the session that it encompasses, covers a discrete period of time. And it’s important and useful to know what kind of time interval our users are experiencing on the site, what the outcomes of that session are, and what outcomes we want for our users.
Another term that frequently comes up in conversations about user metrics and analytics is the term “hit.” You know, how many “hits” does my web site have? And that can be a point of confusion for people. Technically speaking, when we refer to hits, we’re really referring to each file that’s transferred upon a user’s request. And so as I said earlier, when you go to a particular destination, a particular web page, frequently you’re pulling up not only the page, being the text itself, that document, but also all of the associated images.
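As an illustration of turning page requests into sessions and click paths, the Python sketch below groups each visitor’s requests in time order and starts a new session after a gap of inactivity. The 30-minute gap is a common convention rather than anything stated in this talk, and the input layout (visitor, timestamp, path) is assumed.

```python
from datetime import datetime, timedelta


def sessionize(page_views, timeout=timedelta(minutes=30)):
    """Group page views into sessions, one click path per session.

    page_views: iterable of (visitor, timestamp, path) tuples.
    A gap longer than timeout between two requests from the same
    visitor starts a new session (the 30-minute default is an assumption).
    """
    sessions = []
    latest = {}  # visitor -> that visitor's most recent session
    for visitor, ts, path in sorted(page_views, key=lambda r: (r[0], r[1])):
        current = latest.get(visitor)
        if current is None or ts - current["end"] > timeout:
            current = {"visitor": visitor, "start": ts, "end": ts, "path": []}
            sessions.append(current)
            latest[visitor] = current
        current["end"] = ts
        current["path"].append(path)
    return sessions
```

Each session records its start, its end, and the ordered list of pages, so the duration and the outcome of a click path (for example, whether it ends at a checkout page) can be read off directly.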
And so, for example, when you come to digitalenterprise.org, you come to the home page, you’re requesting one file. But along with that file comes about two dozen images that are associated with the page. And so when someone visits the home page for digitalenterprise.org, the server log is actually recording about 25 hits at that time. And so you might ask well how useful is that information? It’s just really mostly images that are being served up. And the answer is not necessarily very interesting, unless you’re trying to understand server load issues, and some of the technical things going on in the background. Then you might want to know how many hits you’re serving up.
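The difference between hits and page views can be sketched in a few lines of Python. Here every requested file counts as a hit, while only requests that look like pages count as page views; deciding what "looks like a page" by file extension is a simplifying assumption.

```python
import os

# Extensions treated as pages in this sketch; "" covers paths like "/".
PAGE_EXTENSIONS = {"", ".html", ".htm"}


def count_hits_and_page_views(paths):
    """Return (hits, page_views) for a list of requested paths."""
    hits = 0
    page_views = 0
    for path in paths:
        hits += 1  # every file transferred is a hit
        without_query = path.split("?", 1)[0]
        ext = os.path.splitext(without_query)[1].lower()
        if ext in PAGE_EXTENSIONS:
            page_views += 1
    return hits, page_views
```

A home page request that pulls in two dozen images would come out as 25 hits but a single page view, which is the distinction being drawn above.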
So when we take all this information together, it has more or less meaningfulness, depending on what you try to do with it. And there have been a number of efforts, both commercial and non-commercial, to take basic server log data, and sort of slice it and dice it in certain ways to try to give it meaning. And I think we’ve come a long way in that process, but my general feeling is a word of caution in this regard. To the extent that you’re not using the kinds of sophisticated tools that help us segment out actual individual users, one can draw some misleading conclusions from server logs.
And so I think just to be careful, when you hear people talking about page views, and hits, and visits, and sessions, and so forth, that it’s more often than not that a fair amount of cloudiness is drawn into the conversation. Because this data, although it’s extremely precise, in terms of what the computer’s doing, the conclusions that we draw from the data, the analysis that we perform on this data, can cause a certain degree of fogginess in our conclusions.
So if we think about some of this data from a business point of view, obviously we’re not interested in hits at all. Page views, and visits, and identifying actual users become a more important goal. And to go further, we can try to identify those customers who perhaps are our best customers, in terms of their overall volume of business, or who are our loyal customers, the people who return with some degree of frequency, and what their value is to our business. As we move up that pyramid, in terms of getting more precise data about the customers or users who are critical to our business, then of course, there’s more value there.
And to the extent that we do this, we might be able to measure at each point in time, for example, how many visitors become customers, and how many customers become a loyal part of our customer base. And also, what kind of turnover do we have at various stages? What is the percentage of visitors who abandon the web site before becoming customers? What percentage of customers do we lose over time? What percentage of our loyal customer base turns over, over a period of time? All of those things can now be looked at more carefully, because of the data that are being collected over our web site.
Here are some of the more important kinds of things that we want to know about our users. How much reach do we have in the marketplace? How many people do we reach? What kind of acquisition rate do we have, in terms of engaging users in various kinds of processes, like signing up for a newsletter, taking a survey, joining a discussion list, filling out some kind of form, downloading a document or a demo? How many people, or how many visitors, do we convert into actual customers? What percentage of those customers do we retain over time? And then what kind of loyalty do we experience among repeat customers? That is, those people who visit frequently, and often recommend our site to other individuals.
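These stage-to-stage questions reduce to simple ratios once the counts are in hand. The Python sketch below is one possible way to compute them; the function name and the exact definitions are assumptions, since, as noted earlier, these measures are not standardized.

```python
def funnel_rates(visitors, customers, loyal_customers):
    """Rates between stages: visitor -> customer -> loyal customer.

    All three arguments are counts for the same time period.
    """
    def rate(part, whole):
        return part / whole if whole else 0.0

    return {
        # share of visitors who became customers
        "conversion": rate(customers, visitors),
        # share of visitors who left without becoming customers
        "abandonment": rate(visitors - customers, visitors),
        # share of customers who became part of the loyal base
        "loyalty": rate(loyal_customers, customers),
    }
```

With, say, 1,000 visitors, 50 customers, and 10 loyal customers, the conversion rate would be 5 percent, abandonment 95 percent, and loyalty 20 percent.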
To the extent that we can, we might want to measure certain kinds of turnover variables, like abandonment: how many people put things in a shopping cart, but don’t make a purchase? What kind of attrition do we have among our customers? So how many cease to return over time? What kind of churn do we have in that loyal customer base, in terms of an attrition rate? We might measure recency. So how long has it been since a customer last visited? And then we might try to do something to get them to come back, like offer some sort of promotion. What’s the frequency rate? So how often do customers visit or purchase in a given time period? What’s the monetary value of customers, in terms of how much they spend over time? What kind of duration do they have on the web site? What kind of yield do we have, in terms of the percentage of customers who make it from one level to another in this hierarchy?
Some of the more advanced kinds of metrics that we might look into are things like acquisition cost, as we saw with AdWords. You know, what’s the cost of an ad relative to the click-through rate? What’s the cost per conversion? Again, that’s the cost that we’re spending on the ad, relative to the amount of sales that we have, or some kind of conversion experience that we want the user to go through. We might look at user behavior calculations. “Stickiness” is a term you might have heard before, in terms of the frequency of return and duration for a given user group. Or how slippery a site might be, in terms of having low “stickiness.” We might look at velocity, in terms of the amount of time it takes for a user to go through that kind of customer life cycle that we talked about.
All of this becomes much more interesting from the point of view of an organization that is taking the passive data, tying it to an actual individual, in terms of their authentication, and then adding into that all of the kinds of actively collected data that they may be drawing from a user. In a sense, you’re getting a relatively rich profile of your users. And that may help you understand your customers better, and hopefully improve and optimize the product or service that you’re delivering, given whatever the customer needs and demands are.
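Recency, frequency, and monetary value can be computed per customer from a list of purchases. The Python sketch below assumes a simple input of (customer, purchase date, amount) tuples; that layout and the function name rfm are invented for the example.

```python
from datetime import date


def rfm(purchases, today):
    """Recency, frequency, and monetary value per customer.

    purchases: iterable of (customer, purchase_date, amount) tuples.
    today: the date from which recency is measured.
    """
    summary = {}
    for customer, day, amount in purchases:
        s = summary.setdefault(
            customer, {"last": day, "frequency": 0, "monetary": 0.0}
        )
        s["last"] = max(s["last"], day)  # most recent purchase so far
        s["frequency"] += 1              # number of purchases
        s["monetary"] += amount          # total spent
    return {
        customer: {
            "recency_days": (today - s["last"]).days,
            "frequency": s["frequency"],
            "monetary": s["monetary"],
        }
        for customer, s in summary.items()
    }
```

A customer with a large recency value is one who has not been back in a while, which is where a promotion of the kind mentioned above might be aimed.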
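The advertising measures mentioned here are also simple ratios. The Python sketch below shows one way to compute them from four numbers; the function name and the choice to return None when a ratio is undefined are assumptions made for this example.

```python
def ad_metrics(ad_cost, impressions, clicks, conversions):
    """Basic cost and response ratios for one ad over one period.

    ad_cost: total amount spent on the ad.
    impressions: number of times the ad was shown.
    clicks: number of times it was clicked.
    conversions: number of clicks that led to the desired outcome.
    """
    return {
        "click_through_rate": clicks / impressions if impressions else 0.0,
        "cost_per_click": ad_cost / clicks if clicks else None,
        "cost_per_conversion": ad_cost / conversions if conversions else None,
    }
```

For an ad that cost 200 dollars, was shown 10,000 times, was clicked 250 times, and produced 10 sales, the click-through rate would be 2.5 percent and the cost per conversion 20 dollars.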
So let me sum up here for a moment. Obviously, there’s a great wealth of information and data about customers and users that can be collected as people interact with you in the digital world. There are commercial and non-commercial tools out there that help you analyze this data, as well as for very large enterprises, home-grown tools which enable you to understand what it is that your customers are looking for, how they interact with you, and how you might be able to optimize that interaction.
It’s important to recognize that this can be an enormous amount of data, much more data than we’ve ever had before. And so it’s not easy to analyze this information. And to the extent that we don’t take care in the analysis, we can obviously draw faulty conclusions about what’s going on. So it’s really imperative to understand the nature of the data that are being collected, how that data is being aggregated into certain kinds of statistics, and really what kind of meaning can be drawn from that data. All of that is still an enormous challenge.
And even some of the best commercial tools out there really only represent a kind of first generation of the analytics that we might want in thinking about our customers. And although the managers of digital enterprises recognize that there is a lot of value to be gained from this data, we still have a long way to go, I think, before we can be confident that the data are being analyzed carefully, and leading to meaningful results. Clearly the most sophisticated digital enterprises are moving along quite quickly to do precisely that. And so the Amazons, and eBays, and Googles of the world are doing everything they can to understand and draw meaning from this data.
As we move into the future, I think we’re going to be faced with a different kind of challenge, that being one of too much data, too much information, a fairly high noise-to-signal ratio. And so building analytics that help us really understand the nature of our customers, and how to serve them better, is still going to be a challenge as we move forward.
This is Professor Michael Rappa. Until next time, wishing you all the best.
(Music)
[End of Audio]
Unedited transcript of audio podcast produced on September 19, 2005.
Audio source file: http://digitalenterprise.org/podcasts/analytics.mp3
Michael Rappa is the Alan T. Dickson Distinguished University Professor of Technology Management at North Carolina State University.
For more information, please visit: digitalenterprise.org
Copyright 2006 Michael Rappa. All rights reserved. Please do not reproduce, distribute or quote without written permission of the author.