Quantitative analysis and tidy data
As Ted Underwood points out in “Where to Start with Text Mining”: “Quantitative analysis starts to make things easier only when we start working on a scale where it’s impossible for a human reader to hold everything in memory.” Distant reading requires large amounts of data, and it can in turn support qualitative close reading. Text mining is a useful tool when the amount of data is far too great for us to grasp with our brain (or, as Underwood calls it, the ‘wrinkled protein sponge’). I have so far focused mainly on qualitative close reading, not only because it has suited the material I work with, but also because large quantities of data seem daunting. Underwood makes a good point about how much context qualitative close reading requires, and quantitative mining at a larger scale can help supply that context.
As Hadley Wickham underscores, most of the time is spent on preparing the data for analysis (1). Wickham goes on to point out that data sets often ‘break the rules’ of tidy data and are very rarely ready to be analyzed. I instantly saved his list, which breaks down the most common faults with data sets:
‘• Column headers are values, not variable names.
• Multiple variables are stored in one column.
• Variables are stored in both rows and columns.
• Multiple types of observational units are stored in the same table.
• A single observational unit is stored in multiple tables.’ (Wickham, 6).
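To make the first fault on the list concrete for myself, here is a minimal sketch in plain Python. It is not Wickham's own code (his paper works in R), and the table, the `melt` helper, the column names and every figure are invented for illustration. The messy table uses years as column headers, and the helper turns each of those columns into its own row so that "year" becomes a variable.

```python
# Hypothetical "messy" table: the year columns are values, not variable names.
# Regions and sales figures are invented for illustration only.
messy = [
    {"region": "North", "1950": 12, "1960": 30},
    {"region": "South", "1950": 7, "1960": 18},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id column is repeated on every output row
            tidy_rows.append(
                {
                    id_column: row[id_column],
                    variable_name: column,
                    value_name: value,
                }
            )
    return tidy_rows


# One row per (region, year) observation, with year as a proper variable.
tidy = melt(messy, "region", "year", "sales")
for row in tidy:
    print(row)
```

Each observation now sits on its own row, which is the shape that makes filtering or grouping by year straightforward.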
I have used Excel and other tools to analyze data sets before, but have always found it difficult to structure the data so as to get the desired outcome. It has always involved some meddling that I forgot afterwards and could not replicate (which goes back to the previous weeks' readings: write down the process, both to be able to replicate it and to remember what to avoid). My focus has primarily been on qualitative analysis, where large data sets have been more of a nuisance. I agreed with the discussion in class, which raised the question of what these large data sets could actually be used for. As Pamela Fletcher and Anne Helmreich, with David Israel and Seth Erickson, argue in their project Local/Global: Mapping Nineteenth-Century London's Art Market: ‘some questions cannot be answered—or even posed—without using larger data sets’. For me and my research, using large data sets is not only about finding and presenting answers, but also about discovering other questions to work on.
At first, it was difficult to think about how I could work with large sets of data myself. One of the projects I have been working on concerns the Canadian influence on Inuit art practice from the 1930s until today, specifically in Cape Dorset. Qualitative work has been done, such as interviews and fieldwork with practicing artists, but there are still many more archival sources that would be incredibly interesting to study further. This would entail archives with global sales records, newspaper articles and governmental records: a vast amount of data to go through. Being able to search for keywords without going through all of the information myself would be a great advantage, and might help in discovering new questions and interesting perspectives to pursue through a more qualitative approach.
Some of the tools we went through during this week's class would be useful for upcoming projects, such as _Voyant_, a platform that enables keyword search and comparison of texts. One example would be a content analysis in which keywords are analyzed in relation to time: when certain keywords were used more, when less, and so on. A more qualitative discourse analysis could then build on the distant reading done with Voyant or a similar tool. The Google tool Ngram Viewer works in a similar way: it lets you see the usage of phrases in a corpus of literature, and you can also focus on specific periods of time. Tools like these make it easy to get a broad grasp of word usage, and I could see myself using them in a first step of analysis. It is also important to keep in mind what Underwood points out: these kinds of tools may give you the impression that you don’t need to do any programming of your own, because so many tools are already out there. However, the available tools mostly show the scope of what is possible; for your own projects, you will most likely need to program in order to focus your methodological approach effectively.
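As a rough idea of what that own programming might look like, here is a minimal sketch of counting keywords per year in plain Python. It does not reproduce how Voyant or the Ngram Viewer work internally. The mini-corpus, the `keyword_counts_by_year` function and the chosen keywords are all invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini-corpus of (year, text) pairs, invented for illustration.
documents = [
    (1958, "The cooperative sold prints. Prints were popular."),
    (1958, "A new print studio opened."),
    (1975, "Sculpture sales grew while prints held steady."),
]


def keyword_counts_by_year(docs, keywords):
    """Count exact, case-insensitive keyword matches for each year."""
    wanted = {keyword.lower() for keyword in keywords}
    counts = defaultdict(Counter)
    for year, text in docs:
        # Very simple tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[year].update(token for token in tokens if token in wanted)
    return counts


counts = keyword_counts_by_year(documents, ["prints", "sculpture"])
for year in sorted(counts):
    print(year, dict(counts[year]))
```

The matching here is exact, so "print" is not counted as "prints". A real project would need to decide how to handle word forms, spelling variants and OCR errors before drawing any conclusions from the counts.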
Hadley Wickham, “Tidy Data,” Journal of Statistical Software, Submitted. http://vita.had.co.nz/papers/tidy-data.pdf.
Pamela Fletcher and Anne Helmreich, with David Israel and Seth Erickson, “Local/Global: Mapping Nineteenth-Century London’s Art Market,” Nineteenth Century Art Worldwide 11:3 (Autumn 2012). http://www.19thc-artworldwide.org/index.php/autumn12/fletcher-helmreich-mapping-the-london-art-market
Ted Underwood, “Where to Start with Text Mining,” The Stone and the Shell. http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/