This blog keeps track of my research of defaced web pages.
** 06/02/2020 ** Start importing blog entries
The first round of scoping material is well underway. From what I can say now, about 3% of the total material is in scope. Looking at a dataset of approx. 15000 pages at the moment, this means about 450 defaced web sites to be in scope. This is a consequence of the initial inverse (from old to new) crawl. As defacements are usually submitted in bulks, there are always huge chucks of non-qualifying defacements found. To meet with supervisor to discuss a more focused approach.
THe first round of scoping is completed and first tests on data exptraction have been run. I am using the Natural Language Toolkit for Python to clean and analyze the pages. There is the idea of moving on to SQL to maange the data more efficiently.