
Analyzing Browser History

I recently participated in First Five, a Tumblr where guests list the first five websites they visit daily (my five here). Like recent contributor Luke Robert Mason, I find the concept foreign. As a poster child for consumption via aggregation, apps, and streams, I do not pull up my bookmarks in the morning as though unfolding the daily newspaper. Rather than compensating for this by listing my favorite five (as many contributors seem to), I decided to give a strict, data-driven answer to the question: across a sample of the first five URLs I type into my browser each morning, which are the most common? Although my content consumption is split heavily between apps on my mobile device and desktop and browser-based apps on my laptop, I chose, for the sake of time and feasibility, to focus on my browser history. I hypothesized that the data would show a few major content sources, mostly browser-based channels (such as Twitter and Prismatic), followed by a long tail of heterogeneous content they directed me to.

I use Chrome, so the clear route was to analyze the SQLite database where Chrome stores its history. On a Mac, this is located at ~/Library/Application Support/Google/Chrome/Default/History. Had I prior experience analyzing SQLite with Python, I could have written something that worked with this file directly. This not being the case, I exported a CSV of results from the following query:

 

SELECT datetime(((visits.visit_time/1000000)-11644473600), 'unixepoch'), urls.url
FROM urls, visits WHERE urls.id = visits.url;
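
For anyone who would rather skip the manual export, the same query can be run from Python's built-in sqlite3 module. What follows is only a sketch under a few assumptions: the function name export_history and the output filename are made up for illustration, the ORDER BY clause is an addition, and the database is copied first on the assumption that a running Chrome may hold a lock on the live file.

```python
import csv
import shutil
import sqlite3
import tempfile
from pathlib import Path

# Path as given in the post (macOS, default Chrome profile).
HISTORY = Path.home() / "Library/Application Support/Google/Chrome/Default/History"

# Same join as the query above; ORDER BY is an addition so rows come out
# in chronological order.
QUERY = """
SELECT datetime((visits.visit_time / 1000000) - 11644473600, 'unixepoch'),
       urls.url
FROM urls, visits
WHERE urls.id = visits.url
ORDER BY visits.visit_time
"""


def export_history(db_path, csv_path):
    """Write one (timestamp, url) row per visit to csv_path; return the row count."""
    with tempfile.TemporaryDirectory() as tmp:
        # Query a copy, assuming the live file may be locked by a running Chrome.
        copy = Path(tmp) / "History"
        shutil.copy2(db_path, copy)
        conn = sqlite3.connect(str(copy))
        try:
            rows = conn.execute(QUERY).fetchall()
        finally:
            conn.close()
    with open(csv_path, "w", newline="") as f:
        # QUOTE_ALL matches the quoted format of the exported rows.
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)
    return len(rows)
```

Calling export_history(HISTORY, "history.csv") would then produce the CSV in one step.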

 

I cleared my history near the end of May, so this yielded about three months (a paltry 7.9MB) of data in the following format:

 

"2012-09-01 20:03:15","http://example.com/therest/ofthe/url"

 

I wrote a few lines of Python that look at each row of the CSV and add each day’s first five unique hostnames to a dictionary. At the end, each hostname is counted, and the results are printed to stdout.
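
A minimal sketch of that kind of script (not the original code) could look like the following. The function names are illustrative, and the 6am start of the "morning" is an assumption borrowed from the time window mentioned later in this post. Note also that datetime(..., 'unixepoch') in SQLite returns UTC, so the hour filter is only meaningful if the timestamps have been converted to local time, for instance by adding the 'localtime' modifier to the query.

```python
import csv
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse

FIRST_N = 5        # hostnames kept per day
MORNING_START = 6  # assumed hour at which a "day" begins


def first_hosts_by_day(rows, n=FIRST_N, start_hour=MORNING_START):
    """Map each date to its first n unique hostnames seen at or after start_hour.

    rows: iterable of (timestamp, url) pairs in chronological order.
    """
    days = {}
    for stamp, url in rows:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        host = urlparse(url).hostname
        if host is None or when.hour < start_hour:
            continue
        hosts = days.setdefault(when.date(), [])
        if len(hosts) < n and host not in hosts:
            hosts.append(host)
    return days


def count_hosts(days):
    """Count on how many days each hostname made the first-n list."""
    return Counter(host for hosts in days.values() for host in hosts)


def main(csv_path):
    with open(csv_path, newline="") as f:
        # Timestamp strings in this format sort chronologically.
        rows = sorted(r for r in csv.reader(f) if len(r) == 2)
    for host, count in count_hosts(first_hosts_by_day(rows)).most_common():
        print(count, host)
```

Running main("history.csv") prints one line per hostname, most frequent first.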
 
The resultant data needed to be tidied up a bit: there were analogous hostnames, such as drive.google.com and docs.google.com, which could be consolidated. Also, I use Twitter via a desktop client, not Twitter.com. T.co, the hostname of Twitter's URL shortener, scored very highly, but rather than trace these visits back to their original URLs, I opted to count them as Twitter.com visits. Interestingly, Netflix ranked highly with a 15% share. I don't watch Netflix in the morning; it registered because Netflix is often the last website open in my browser at night. When I wake my computer in the morning, the open tab refreshes, creating an entry in the database between 6am and 11am. I chose to remove this from the results.
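
The tidying can be expressed as a small alias table plus an exclusion set. This is a sketch rather than the code actually used: the hostname pairs come from the paragraph above, but which name serves as the canonical one and the exact Netflix hostname ("www.netflix.com") are assumptions.

```python
from collections import Counter

# Pairs taken from the post; the choice of canonical hostname is illustrative.
ALIASES = {
    "drive.google.com": "docs.google.com",
    "t.co": "twitter.com",
}
# Assumed hostname for the Netflix tab that refreshes on wake.
EXCLUDED = {"www.netflix.com"}


def tidy(counts, aliases=ALIASES, excluded=EXCLUDED):
    """Merge aliased hostnames and drop excluded ones."""
    merged = Counter()
    for host, n in counts.items():
        host = aliases.get(host, host)
        if host not in excluded:
            merged[host] += n
    return merged


def shares(counts):
    """Return each hostname's fraction of the total, largest first."""
    total = sum(counts.values())
    if not total:
        return {}
    return {host: n / total for host, n in counts.most_common()}
```

Because the excluded hosts are dropped before the shares are computed, the remaining percentages are relative to the cleaned total, not the raw one.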

[Figure: browser data]

The data seems to at least partly support my hypothesis: Twitter, email (mainly listservs and News.me), and Prismatic, all of them aggregators, come first, followed by a long tail of diverse content sources. A clear next step would be to analyze the from_visit field of visits in the long tail, to see whether the referring visits do trace back to the top aggregators. All in all, the exercise does seem to illustrate stream-based browsing habits, and the idea that more and more content is fluid, less and less tied to specific websites as vessels.
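
As a starting point for that next step, here is a hedged sketch of following from_visit links back to the visit that started a chain. It assumes that from_visit holds the id of the referring row in visits and is 0 when there is no referrer; the function name root_host and the hop limit are invented for illustration.

```python
import sqlite3
from urllib.parse import urlparse


def root_host(conn, visit_id, max_hops=50):
    """Follow from_visit links back from visit_id.

    Returns the hostname of the earliest visit reached, or None if
    visit_id does not exist.
    """
    seen = set()
    current = visit_id
    host = None
    # Stop on a missing referrer (0), a cycle, or after max_hops steps.
    while current and current not in seen and len(seen) < max_hops:
        seen.add(current)
        row = conn.execute(
            "SELECT urls.url, visits.from_visit FROM visits "
            "JOIN urls ON urls.id = visits.url WHERE visits.id = ?",
            (current,),
        ).fetchone()
        if row is None:
            break  # referring visit is no longer in the database
        url, current = row
        host = urlparse(url).hostname
    return host
```

Applying root_host to each long-tail visit and counting the results would show how many of those visits began at one of the top aggregators.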