mkrzmr.github.io

Michael Kurzmeier, 3rd-year PhD candidate, Digital Humanities @ Maynooth University


Research blog

This blog keeps track of my research on defaced web pages.

**06/02/2020** Start importing blog entries

13/02/2020

The first round of scoping material is well underway. From what I can say now, about 3% of the total material is in scope. With a dataset of approx. 15,000 pages at the moment, this means about 450 defaced web sites are in scope. This low share is a consequence of the initial inverse (from old to new) crawl: as defacements are usually submitted in bulk, there are always huge chunks of non-qualifying defacements. Next step is to meet with my supervisor to discuss a more focused approach.

04/03/2020

The first round of scoping is completed and the first tests on data extraction have been run. I am using the Natural Language Toolkit (NLTK) for Python to clean and analyze the pages. I am considering moving to SQL to manage the data more efficiently.
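
To make the SQL idea a bit more concrete, here is a minimal sketch of how scoped pages could be stored and queried with Python's built-in `sqlite3` module. The table name, the column names and the sample rows are all assumptions for illustration; they are not the actual dataset or a final schema.

```python
import sqlite3

# In-memory database for illustration only; a real setup would use a file path.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: all table and column names are assumptions.
conn.execute(
    """
    CREATE TABLE pages (
        id         INTEGER PRIMARY KEY,
        url        TEXT NOT NULL,
        submitted  TEXT,                        -- submission date as ISO 8601 text
        in_scope   INTEGER NOT NULL DEFAULT 0,  -- 1 if the page qualifies
        clean_text TEXT                         -- text after cleaning
    )
    """
)

# Made-up sample rows standing in for crawled defacements.
sample_pages = [
    ("http://example.org/a", "2001-05-01", 1, "sample message text"),
    ("http://example.org/b", "2001-05-01", 0, ""),
    ("http://example.org/c", "2001-05-02", 0, ""),
    ("http://example.org/d", "2001-05-03", 1, "another sample message"),
]
conn.executemany(
    "INSERT INTO pages (url, submitted, in_scope, clean_text) VALUES (?, ?, ?, ?)",
    sample_pages,
)
conn.commit()


def scope_summary(connection):
    """Return (total pages, pages in scope, share in scope)."""
    total, in_scope = connection.execute(
        "SELECT COUNT(*), COALESCE(SUM(in_scope), 0) FROM pages"
    ).fetchone()
    share = in_scope / total if total else 0.0
    return total, in_scope, share


total, in_scope, share = scope_summary(conn)
print(f"{in_scope} of {total} pages in scope ({share:.0%})")
```

Keeping the scoping decision as a column would let the in-scope share be recomputed with one query as more pages are crawled, instead of being counted by hand.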