Jul 6, 2015

My internship and the Cookie Audit tool

I’m currently in the fifth week of my eight-week internship with the UWP tech team. So far it’s been a really enjoyable experience.

I’ve just finished my third year studying Informatics at the University, and my main area of interest is Natural Language Processing – making computers understand and use language. I was even surprised to find that I enjoy the 9-5 working experience. A structured day makes me more productive, but when the day is over I can stop thinking about work. At University, on the other hand, there is always something gnawing at you – a reading or exercise that you should have done, or a piece of coursework that needs to be started.

My goal with this internship has been to gain experience, improve my skills, and learn something new. With every success and failure over the past five weeks, I believe that I’ve already achieved a great deal. I’ve had the opportunity to work with different technologies and to understand how to use them together, gaining a lot of experience in the process. It has been extremely helpful to have a concrete goal in sight, as this has forced me to identify and select the most appropriate technology for the job, rather than shaping the objective around a technology I already wanted to use. Over the next few weeks I’ll be preparing to hand my work over to the team for further use and development. This will be an interesting experience, as most of my previous work has been either for myself or for a marker, and therefore did not need to be maintainable.

Most of my eight weeks with the University Website Programme will have been spent on a single project – the Cookie Audit tool. The purpose of this tool is to find pages on the University website that set privacy-invasive cookies. These are cookies that can be used to track your browsing habits. The EU and the UK have introduced legislation restricting their usage, and the UWP has been working towards complying with it on the University websites. Widgets have been developed to supply cookie-free versions of common services, such as embedding Google Maps, YouTube, or Twitter in your page.

More on cookies

University Website Programme widgets
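To give a flavour of what “privacy invasive” means in practice, here is a minimal sketch of how a cookie name might be checked against known tracking cookies. The pattern list is purely illustrative – the real tool’s list of trackers would be maintained separately.

```python
from fnmatch import fnmatch

# Illustrative examples of well-known tracking cookie names
# (Google Analytics and similar); not the tool's actual list.
TRACKING_PATTERNS = ["_ga", "_gid", "__utm*", "NID"]

def is_tracking_cookie(name):
    """Return True if a cookie name matches a known tracking pattern."""
    return any(fnmatch(name, pattern) for pattern in TRACKING_PATTERNS)
```

A cookie like `__utmz` would be flagged, while an ordinary session cookie would not.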

In order to make things easier, the Cookie Audit tool was developed. There is an existing version of this tool that has been out of use while the University websites are migrated to the new content management system (CMS). It was built a number of years ago and was better suited to the old CMS. It also seemed quite arduous to operate, requiring several steps to set up and then several more to get the correct output.

What is a web crawler?

My job is to create the next generation of cookie auditor. I set myself two non-functional goals when creating my crawler. Firstly, it should be simple to run: there should be a fast and obvious way to specify what website to crawl. Secondly, it should output the data in a format that is easy to use, without having to do something like converting between file types.

To build my web crawler I used Scrapy, a Python framework for building web crawlers. It has a wide array of functionality and deals with the overhead of crawling, such as fetching web pages. It also lets you easily specify what to extract from a web page and what to do with the data.

Scrapy

Python, the programming language

Scrapy allowed me to get stuck into the development process right away, making it easy to implement the basics while still allowing enough freedom to do the more complicated things. There are some things it can’t do so well, like render JavaScript. It is possible, but rendering slows the process down considerably, which seemed like a bad idea when crawling thousands of pages.

That being said, I still believe I will be able to achieve my non-functional goals. I have managed to save all of the data to a database, which allows for powerful queries and easy access. It doesn’t perfectly align with my goal, as we often want the data in an Excel sheet once we’re done, but it is a compromise worth making. I am confident, however, that my crawler will be both intuitive and easy to run. There are many ways of running the crawler, and my dream is to have a simple web interface where you only have to fill in a few fields and then press go.
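The kind of query a database makes easy can be sketched with Python’s built-in sqlite3 module. The table layout, column names, and example rows here are made up for illustration – the real schema may differ.

```python
import sqlite3

# An in-memory database with a hypothetical audit table:
# one row per (page URL, cookie name) the crawler found.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cookies (url TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO cookies VALUES (?, ?)",
    [
        ("https://www.ed.ac.uk/maps", "_ga"),
        ("https://www.ed.ac.uk/maps", "NID"),
        ("https://www.ed.ac.uk/news", "_ga"),
    ],
)

# One query answers "how many cookies does each page set?" --
# something a flat spreadsheet would need manual filtering for.
rows = conn.execute(
    "SELECT url, COUNT(*) FROM cookies GROUP BY url ORDER BY url"
).fetchall()
```

Exporting such query results to a spreadsheet for the team is then a much smaller step than reshaping raw crawl output by hand.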
