I have realized that, like most users of Wikipedia, I do a lot of Wikipedia page-hopping [1]. Wikipedia is sort of addictive that way. You start reading about a piece of Flamenco music and twenty minutes later find yourself staring at the page about ETA, a Basque nationalist organization. So I decided to figure out exactly how I get lost in the huge web of interconnected articles. I use Chromium, which stores its history in a SQLite3 database file. I wrote a small Ruby script that reads the history, splits it into chunks of pages accessed per day, and keeps only the Wikipedia links.
This is basically what I had to do:
- Query the db for the last visit time and URL of each page.
  - Chromium (and Google Chrome) stores timestamps of page visits in a not-so-obvious format: the number of microseconds elapsed since January 1, 1601 (UTC).
- Split the URLs into chunks accessed per day. This involves calculating the number of microseconds in a day and grouping the URLs by which day their timestamp falls in. Ruby's Enumerable#group_by, which is available on arrays, is really handy here.
- Filter the URLs, keeping only those that contain "wikipedia".
  - There is a caveat here: redirects to Wikipedia from both Google and Facebook also contain the string "wikipedia" in their URLs. These need to be filtered out.
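The timestamp conversion and per-day grouping described above can be sketched roughly like this. It is a minimal illustration rather than part of the script itself: the names `EPOCH_OFFSET_SECONDS`, `chromium_time_to_time` and `group_by_day` are made up for the example, and the sample visits are synthetic.

```ruby
# Chromium counts microseconds from 1601-01-01 00:00:00 UTC, while Ruby's
# Time.at counts from 1970-01-01 00:00:00 UTC. The gap between the two
# epochs is 11,644,473,600 seconds.
EPOCH_OFFSET_SECONDS = 11_644_473_600
MICROSECONDS_PER_SECOND = 1_000_000
MICROSECONDS_PER_DAY = 24 * 60 * 60 * MICROSECONDS_PER_SECOND

# Convert a Chromium timestamp (integer microseconds) to a UTC Time.
def chromium_time_to_time(chromium_us)
  unix_us = chromium_us - EPOCH_OFFSET_SECONDS * MICROSECONDS_PER_SECOND
  seconds, microseconds = unix_us.divmod(MICROSECONDS_PER_SECOND)
  Time.at(seconds, microseconds).utc
end

# Bucket visits by whole days since the Chromium epoch (UTC day boundaries).
def group_by_day(visits)
  visits.group_by { |v| v[:last_visit_time] / MICROSECONDS_PER_DAY }.values
end

# Synthetic sample data: two visits on one day, one on the following day.
day_start = 150_000 * MICROSECONDS_PER_DAY
visits = [
  { :last_visit_time => day_start,                            :url => "http://en.wikipedia.org/wiki/Armenia" },
  { :last_visit_time => day_start + 5 * 60 * 1_000_000,       :url => "http://en.wikipedia.org/wiki/Hayk" },
  { :last_visit_time => day_start + MICROSECONDS_PER_DAY + 1, :url => "http://en.wikipedia.org/wiki/OLAP" }
]

group_by_day(visits).each do |day|
  puts chromium_time_to_time(day.first[:last_visit_time]).strftime("%Y-%m-%d")
  day.each { |v| puts "  #{v[:url]}" }
end
```

Because the division is done on the raw timestamp, the day boundaries fall at midnight UTC, not at midnight in the local time zone.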
The analysis of my Wikipedia history showed me some interesting things. For example, when I was reading Michael J. Arlen's Passage to Ararat, I spent a lot of time on Wikipedia, hopping between pages about Armenian history and culture. This is what the list of Wikipedia pages from that day looks like:
http://en.wikipedia.org/wiki/TRS-80
http://en.wikipedia.org/wiki/Aunt_Sally
http://en.wikipedia.org/wiki/Pai
http://en.wikipedia.org/wiki/Pai_(surname)
http://en.wikipedia.org/wiki/Gowd_Saraswat_Brahmins
http://en.wikipedia.org/wiki/Girish_Karnad
http://en.wikipedia.org/wiki/Konkani_people
http://en.wikipedia.org/wiki/Roots_(book)
http://en.wikipedia.org/wiki/Mountains_of_Ararat
http://en.wikipedia.org/wiki/Armenian_Highland
http://en.wikipedia.org/wiki/Searches_for_Noah%27s_Ark
http://en.wikipedia.org/wiki/Tiberian_Hebrew
http://en.wikipedia.org/wiki/Mount_Judi
http://en.wikipedia.org/wiki/Islamic_view_of_Noah
http://en.wikipedia.org/wiki/Greater_Armenia_(political_concept)
http://en.wikipedia.org/wiki/Coat_of_Arms_of_Armenia
http://en.wikipedia.org/wiki/Turkish_War_of_Independence
http://en.wikipedia.org/wiki/Kuva-yi_Milliye
http://en.wikipedia.org/wiki/Turkish-Armenian_War
http://en.wikipedia.org/wiki/Anatolia
http://en.wikipedia.org/wiki/Bursa
http://en.wikipedia.org/wiki/Armenia
http://en.wikipedia.org/wiki/Hayk
http://en.wikipedia.org/wiki/Ecbatana
http://en.wikipedia.org/wiki/Goat_meat
http://en.wikipedia.org/wiki/Kid
http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=xenophon
http://en.wikipedia.org/wiki/Xenophon
http://en.wikipedia.org/wiki/International_Mother_Language_Day
http://en.wikipedia.org/wiki/Debian
http://en.wikipedia.org/wiki/Language_Movement_Day
When I was reading about data warehousing, this is how the hopping happened:
http://en.wikipedia.org/wiki/ROLAP
http://en.wikipedia.org/wiki/Dimension_(data_warehouse)
http://en.wikipedia.org/wiki/Extract,_transform,_load
http://en.wikipedia.org/wiki/Mondrian_OLAP_server
http://en.wikipedia.org/wiki/OLAP
http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers
http://en.wikipedia.org/wiki/Pentaho
http://en.wikipedia.org/wiki/Multidimensional_Expressions
http://en.wikipedia.org/wiki/Decision_science
http://en.wikipedia.org/wiki/Star_schema
http://en.wikipedia.org/wiki/Snowflake_schema
http://en.wikipedia.org/wiki/Sarkar
http://en.wikipedia.org/wiki/Fact_table
http://en.wikipedia.org/wiki/OLTP
http://en.wikipedia.org/wiki/Ralph_Kimball
http://en.wikipedia.org/wiki/Bill_Inmon
http://en.wikipedia.org/wiki/Decision_support
http://en.wikipedia.org/wiki/Heart_of_Midlothian_F.C.
http://en.wikipedia.org/wiki/The_Heart_of_Midlothian
I am still trying to make more sense of the links I clicked and the articles I read while I was page-hopping.
The Ruby script that parses Chromium history and figures out the Wikipedia links is below:
#!/usr/bin/env ruby
# Ruby script to parse Chromium (or Google Chrome) history to identify Wikipedia pages read per day.
# usage: ./wikipedia_history.rb <directory containing the Chromium History db>
# The Chromium history db can usually be found under ~/.config/chromium/Default

require 'rubygems'
require 'sqlite3'

# Number of microseconds in a day; Chromium timestamps are in microseconds.
US_IN_A_DAY = 24 * 60 * 60 * 1000000
SITE = "wikipedia"

module ChromiumHP
  # Reads the visit time and URL of every entry in the history db.
  class DbConnection
    def initialize db_name
      @db_name = db_name
    end

    def urls_history
      db = SQLite3::Database.new @db_name
      urls = db.execute("SELECT last_visit_time, url from urls ORDER BY last_visit_time;").map do |t, u|
        {:last_visit_time => t, :url => u}
      end
      db.close
      urls
    end
  end

  # Splits the history into chunks covering the given number of days.
  class Parser
    def initialize db_name
      @db_name = db_name
    end

    def chunks days
      @history ||= get_history
      parts = @history.group_by do |h|
        h[:last_visit_time] / (days * US_IN_A_DAY)
      end
      parts.map { |k, group| group }
    end

    private

    def get_history
      DbConnection.new(@db_name).urls_history
    end
  end

  # Keeps only Wikipedia URLs, drops empty chunks and sorts chunks by size.
  class Analyzer
    def self.graph chunks
      chunks.map do |c|
        c.find_all do |entry|
          url = entry[:url]
          # Google and Facebook redirect URLs also contain "wikipedia".
          url.include?(SITE) &&
            !url.include?("facebook") &&
            !url.include?("google")
        end
      end.find_all do |c|
        !c.empty?
      end.sort_by do |c|
        c.length
      end.map do |c|
        c.map do |entry|
          entry[:url]
        end
      end
    end
  end
end

history_loc = ARGV.first
abort "Error: Pass the chromium history location as parameter" if history_loc.nil?

daily_chunks = ChromiumHP::Parser.new(history_loc + "/History").chunks(1)
ChromiumHP::Analyzer.graph(daily_chunks).each do |entries|
  puts entries
  puts ""
end
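The script filters by substring and then excludes Google and Facebook by name. One possible alternative, sketched below, is to parse each URL and check its host, so that a redirect URL which merely mentions Wikipedia in its query string is rejected without listing the redirecting sites. This is not what the script above does; the method name `wikipedia_url?` and the redirect URL in the example are illustrative assumptions.

```ruby
require 'uri'

# Returns true only when the URL's host is wikipedia.org or one of its
# subdomains (for example en.wikipedia.org). Strings that cannot be
# parsed as URIs are treated as non-Wikipedia.
def wikipedia_url?(url)
  host = URI.parse(url).host
  return false if host.nil?
  host = host.downcase
  host == "wikipedia.org" || host.end_with?(".wikipedia.org")
rescue URI::InvalidURIError
  false
end

urls = [
  "http://en.wikipedia.org/wiki/Star_schema",
  # Hypothetical redirect URL: "wikipedia" appears only in the query string.
  "http://www.google.com/url?q=http://en.wikipedia.org/wiki/Star_schema"
]

puts urls.select { |u| wikipedia_url?(u) }
```

The trade-off is that URLs the URI parser rejects, such as ones with unescaped spaces, are silently dropped, whereas the substring check would still match them.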
If you have questions or comments about this blog post, you can get in touch with me on Twitter @sdqali.