
Ilse Ras reports on her research on British newspapers

My role in the project is to collect and analyse a corpus of British newspaper articles, published between 2000 and 2016, on the topic of human trafficking.
A corpus is nothing more than a collection of texts, and corpus linguistics is the field that uses corpora (the plural of corpus) to draw conclusions about language use.

Collecting corpora is not always a straightforward endeavour, in particular when the topic under investigation is sometimes misunderstood by the public, the media, and even legislators. For instance, human trafficking may also be known as ‘modern slavery’, and may encompass crimes such as organ harvesting, forced labour, and domestic servitude. Furthermore, there is an ongoing debate about whether sex work should always be considered a form of exploitation and trafficking. Finally, it may not always be clear whether someone has been trafficked (i.e. moved using coercion or deception for the purposes of exploitation) or smuggled (i.e. voluntarily but irregularly moved across borders), in particular as those who volunteer, or more accurately pay, to be smuggled across borders are also often deceived and exploited, both along the way and at the end point.

I used the Lexis Nexis database, mainly for practical reasons: it holds a great number of British newspaper articles, I’ve used it before, and its output is compatible with a Python script that Chris Norton (University of Leeds) wrote for me a few years ago, which separates the downloaded articles and organises them into a useful folder structure.
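In outline, such a script might look something like the sketch below. The delimiter, file layout, and folder scheme are assumptions made purely for illustration; they are not a reproduction of Chris’s actual script.

import re
from pathlib import Path

# Hypothetical article delimiter; the real script's parsing rules may differ.
DOC_DELIMITER = re.compile(r"^\s*\d+ of \d+ DOCUMENTS\s*$", re.MULTILINE)

def split_bulk_download(bulk_file: str, out_dir: str) -> None:
    """Split a bulk download into one file per article, filed into a
    folder per (assumed) source newspaper."""
    text = Path(bulk_file).read_text(encoding="utf-8", errors="ignore")
    articles = [a.strip() for a in DOC_DELIMITER.split(text) if a.strip()]
    for i, article in enumerate(articles, start=1):
        # Assume the first non-empty line of each article names the newspaper.
        first_line = next((l for l in article.splitlines() if l.strip()), "unknown")
        folder = Path(out_dir) / re.sub(r"\W+", "_", first_line.strip())
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f"article_{i:05d}.txt").write_text(article, encoding="utf-8")

# Usage: split_bulk_download("lexisnexis_download.txt", "corpus")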

One way of collecting articles is to go into Lexis Nexis, input some very broad but relevant search terms, and manually select all articles that actually discuss the topic under investigation. This is what I did for my PhD research. However, this is an extremely time-consuming method, as it requires wading through several million articles in order to select maybe 50–90 thousand.
I only have three months in which to complete my part of the project, and it would be a waste of time to spend all three months just collecting data.
But that’s not necessary. Gabrielatos, then at the University of Lancaster, published a paper in 2007 outlining the data collection method used to create a corpus for the RASIM project (more information here: http://ucrel.lancs.ac.uk/projects/rasim/). Gabrielatos’ (2007) method entails collecting a sample corpus using two ‘core’ search terms; generating a list of possible additional search terms from this sample corpus; testing these possible additional search terms; and then using those that pass the test (as well as the core search terms) to collect the full corpus. It’s a far less time-consuming method, and although there is a slightly increased risk of collecting articles that aren’t strictly relevant, it is also more systematic, and therefore more replicable, than manually selecting articles.

So that’s what I did. I first created three sample corpora:
1. One at the start of the period (1/1/00-30/9/00)
2. One at the end of the period (1/1/16-30/9/16)
3. One in the middle of the period (1/1/08-30/9/08)
I used a handful of core search terms, rather than two. These included ‘slavery’, ‘forced labour’, ‘sexual exploitation’, and ‘human trafficking’.

Of these search terms, ‘slavery’ produced the highest number of articles that did not also mention any of the other search terms, presumably because historical slavery is often still considered distinct from modern slavery (which is in itself worthy of examination). This search term therefore set the threshold against which all other potential search terms were tested.
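As a rough illustration of the counting behind this threshold test, assuming the sample corpus is held as a list of plain-text articles and matching is naive lower-cased substring search (the actual pass/fail criterion is the one set out in Gabrielatos 2007, not this simplification):

def unique_hits(articles, term, other_terms):
    """Count articles that mention `term` but none of `other_terms`.
    Naive lower-cased substring matching, purely for illustration."""
    term = term.lower()
    others = [t.lower() for t in other_terms]
    return sum(
        1 for text in (a.lower() for a in articles)
        if term in text and not any(t in text for t in others)
    )

core_terms = ["slavery", "forced labour", "sexual exploitation", "human trafficking"]
sample_articles: list[str] = []  # load the sample corpus here

# 'slavery' sets the threshold: its count of articles mentioning no other core term.
threshold = unique_hits(sample_articles, "slavery",
                        [t for t in core_terms if t != "slavery"])

# Each candidate term is then compared against this threshold; whether it
# 'passes' follows the criterion described in Gabrielatos (2007).
for candidate in ["servitude", "sweatshop"]:  # hypothetical candidate terms
    print(candidate, unique_hits(sample_articles, candidate, core_terms), threshold)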

I used these three sample corpora to create key word lists. A key word list shows which words are used much more often in the sample corpus than in another corpus, the reference corpus. I used rank 60 as the cut-off point for selecting additional search terms from these key word lists, so only the top 60 words of every list were tested against the threshold set by ‘slavery’. I also asked my co-investigators to send me lists of words that they thought could be useful search terms, and tested those, too.
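The key word calculation itself can be sketched as follows: each word receives a log-likelihood (keyness) score comparing its frequency in the sample corpus with its frequency in the reference corpus, and the 60 highest-scoring words are kept. The tool and reference corpus actually used are not named above, so this is only an illustrative sketch, not the project’s exact procedure.

import math
from collections import Counter

def keyword_list(study_tokens, reference_tokens, top_n=60):
    """Rank words by log-likelihood keyness (study corpus vs reference corpus).
    A bare-bones sketch: real keyword tools add frequency floors,
    significance cut-offs, and dispersion checks."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    n_study, n_ref = sum(study.values()), sum(ref.values())
    scores = {}
    for word, a in study.items():
        b = ref.get(word, 0)
        # Expected frequencies under the hypothesis of equal relative use.
        e1 = n_study * (a + b) / (n_study + n_ref)
        e2 = n_ref * (a + b) / (n_study + n_ref)
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        # Keep only positive key words (over-used in the study corpus).
        if a * n_ref > b * n_study:
            scores[word] = g2
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage: keyword_list(sample_corpus_tokens, reference_corpus_tokens)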

Eventually, I used the core search terms and the additional search terms that passed the threshold test to collect the full corpus, which currently consists of slightly over 80 thousand articles. Chris’s Python script initially only recognised articles published by seven of the major British newspapers, which is what it was originally intended to do, so I adapted it to recognise the other news sources that we wanted to include in this project.
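One simple way to make such a script recognise additional news sources is to widen a set of accepted source names; the sketch below illustrates the idea, with titles that are examples rather than the project’s actual configuration.

# Illustrative only: the titles below are examples, not the project's
# actual source list or the script's real configuration.
RECOGNISED_SOURCES = {
    "The Guardian", "The Times", "The Daily Telegraph", "Daily Mail",
    "The Sun", "Daily Mirror", "The Independent",
    # Titles added when widening the set (examples):
    "Daily Express", "Daily Star", "The Observer",
}

def is_recognised(source_line: str) -> bool:
    """Return True if an article's source line matches a recognised newspaper."""
    line = source_line.lower()
    return any(name.lower() in line for name in RECOGNISED_SOURCES)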

The next step is to actually conduct analyses of this corpus, and I will be back to update you on that in December.
- Ilse Ras

References:
Gabrielatos, C. 2007. Selecting query terms to build a specialised corpus from a restricted-access database. ICAME Journal 31, pp. 5-43. Available at: http://clu.uni.no/icame/ij31/ij31-page5-44.pdf
