The Digital Collections Catalyst project (2021)
for the State Library of Queensland
What we search for reveals something of ourselves: our interests, our fears, our curiosity, or simply what we have forgotten. And it's not just what we search for, as how we search reveals something of ourselves as well. So when viewed in aggregate, what do the - searches (and counting) of the State Library of Queensland catalogue reveal about the events, topics, and concerns that have been on the minds of Queenslanders over the past decade or so? Do we ask more questions in winter? Do we make more typos in January? Did we swear more in 2020? Are we becoming more anxious? What aspects are constant, and what aspects are more prone to change?
This project allows you to generate topographic maps from the words and phrases that appeared in the searches made by people using the library catalogue since April 2012 (when the current tracking data began). And by looking at these maps - with their peaks, their valleys, their plateaus, and their plains - we get an imperfect glimpse of the collective interests and concerns of those using the catalogue, and the ways in which they've changed over the months, seasons, and years.
Read a more detailed explanation about the project below, or head straight to the
So in an age when there are more and more ways to access more and more information, what do people turn to the State Library for, and how do they go about finding it?
There are many ways to search the catalogue, and what follows is by no means an exhaustive list (all of the examples are actual searches). Do we just use a single word ('cats', 'brisbane'), or the author ('Melissa Lucashenko', 'murakami'), or the title ('moby dick', 'The Yield'), wildcard characters ('monz mudflat*', 'nurs*'), or a Boolean search ('(preserves OR jam OR chutney OR pickles)'), or both ('(surg* OR operat*)'), or do we use an ISBN ('9780471732082'), or a catalogue number ('518774350002062'), or a Dewey Decimal number [2] ('616.075'), or do we phrase it as a question ('does mozart make babies smarter', 'should citrus be stored in fridge'), or put it in quotes ('"one of the soldiers"', '"expo 88"')? Is it specific ('sleep paralysis and alien abduction'), or general ('weird shit')? Are we looking for books ('book about growing roses', 'little red book'), ebooks ('kids ebooks', 'tiny house ebooks'), journals ('Neue Grafik', 'Australian Journal of Political Science'), photos ('Badu Island photos', 'corley photographs'), illustrations or drawings ('botanical illustration', 'Drawing of migrants disembarking from a ship, ca. 1885'), musical scores or sheet music ('Chopin Sheet music', 'psycho complete score hitchcock'), recipes ('paleo recipes', 'GOOD HOUSEKEEPER'S PICTURE RECIPE BOOK'), maps ('flood maps', 'SG56-06'), letters ('Ernest Henry letters', 'WW1 letters'), manuscripts ('Harriet Barlow Manuscript', 'illuminated manuscripts'), films ('storm boy', 'Wake In Fright'), streaming ('Kanopy catalogue', 'streaming movies'), audio ('Gunggari Language Audio Cassettes', 'pavarotti audio'), newspapers or gazettes ('kingaroy newspaper', 'Queensland Police Gazette'), people ('Elvis'), places ('Cribb Island'), things ('zines'), or even last year's excellent digital catalyst project ('mapping future brisbane')? Do we search in lowercase ('biofuels'), or in all caps ('PLANTS OF CENTRAL QUEENSLAND')?
Do we make typos ('euthenasia', 'databae'), or hit enter too soon ('a','everything is f', or the many blank searches), or accidentally paste in a URL ('google.com used cars under $700', 'www.ato.gov.au'), or an email address ('*****@hotmail.com'), or search in another language ('해를 품은 달', 'حذاء رياضي'), or make a database hacking attempt ('"' '1'='1 OR": 1')? Are we uncertain ('is this working?'), frustrated ('lynda login F**K SAKE!'), or excited ('sloths!', 'minneeeeeeeeeeeecccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccrrrrrrrrrrrrrrrrraaaaaaaaaaafffftttttttttttttttt'), do we have questions about religion ('Orthodox Jews and IVF', 'Hindu belief on euthanasia', 'Christianity and sexting'), or are we just after something better ('better sleep', 'better public transport', 'BETTER BEEKEEPING')? Do we use emoticons (':)', '*-*'), or emojis (🐱), or is it gibberish ('sdfdfdfddfdfdfdfdffff', 'uuiji6ytdttttt') - possibly due to cats on keyboards - or is it meant for the library website, instead of the catalogue ('easter opening hours', 'can you eat at the library?'), or, um, is it perhaps meant for another browser window ('you porn', 'stupid ass nae nae baby', 'stop driving by my sisters house the neighbours hate you')?
However we search, our searches reveal more than just the topics we are interested in. The words we use, and how we choose to phrase and formulate the request, can also reveal our vocabulary, our familiarity with the library's systems, and at times, our emotions. Some searches are concise, some are verbose, some are flippant, some are polite. Some show people clearly wrestling with issues - health, dating, children, loneliness, anxiety, politics, religion, ethics - while others show boredom, frustration, and inattention. Regardless, the catalogue will dutifully try to provide a response to whatever was entered.
When looked at as a whole, the searches show the breadth of people's interests, hopes, fears, and moods, and as the data is updated at the end of each day, think of these maps as a glimpse into the forever-evolving, forever-shifting, collective preoccupations of Queenslanders over time.
2. Interestingly, the majority of the Dewey searches are between 610 and 620, which is the range for Medicine and Health.
How it works.
Enter a search term (or terms), or select from a few predefined categories, in the 'Search' section below, and you'll be shown a topographic map generated from the relative monthly frequency with which those terms appeared in searches [3]. The map covers the period from early 2012, when the current dataset began, and the data is updated daily at around 3am. The map can be viewed in either 2D or 3D, and both the map and data can be downloaded (although the map and csv data are aggregated by month, the json data has a daily breakdown).
Each map also has a few stats, as well as the top searches. Depending on the number of matches for a query, there are also 'top words' and 'sentiment' sections. The top words show what other words appeared when people were searching, and give you more of an idea of the aspects of a subject people were more interested in. As the linguist John Firth said, "you shall know a word by the company it keeps". The sentiment section provides a (crude) look at the sentiment of the search words that were used, whether positive, negative, or neutral (more info in the sections below).
But first a few caveats: maps are never completely accurate, and by necessity, are always an oversimplification of what they are trying to represent. Nor are they without the biases of their creator(s), as choices are made about what to include, what to exclude, what to highlight, and what to downplay. But their benefit is that they can provide a quick overview of some of the key features of an area at a glance, in this case the searches made by people using the library catalogue.
Also, the data that the library captures is completely anonymous. No personal details are recorded, and where email addresses have been accidentally used by people as a search term, the name has been removed (but the service provider kept). Apart from that, the data is unfiltered and left exactly as it was entered. As a result, depending on what is searched for, the results may contain words, phrases, or concepts that people may find offensive.
And a final note: although it can be tempting to draw conclusions from any patterns and potential relationships that seem to appear in the maps, without further study we can't be sure whether those relationships actually exist or just appear to (the old 'correlation is not causation'). This is especially true for terms that occur less frequently, where the sample size is small, and for any terms that appear to have peaks in April or May 2012, as the data for those months is incomplete***.
3. The counts are based on the number of searches where the term appeared, rather than the number of occurrences of the term, so 'Palm Island 1930' and 'palm island centenary historical palm island images' both just count as 1 occurrence, despite the fact that 'palm island' occurs twice in the latter.
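To make the difference between the two ways of counting concrete, here is a minimal Python sketch using the two searches from the note above. The function names are my own for illustration, and the real counts come from a database query rather than code like this.

```python
def count_matching_searches(searches, term):
    """Count the searches that contain `term` at least once (case-insensitive)."""
    needle = term.lower()
    return sum(1 for search in searches if needle in search.lower())


def count_occurrences(searches, term):
    """Count every occurrence of `term`, including repeats within one search."""
    needle = term.lower()
    return sum(search.lower().count(needle) for search in searches)


searches = [
    "Palm Island 1930",
    "palm island centenary historical palm island images",
]

# The maps use the first kind of count: one per search, however often
# the term is repeated inside that search.
per_search = count_matching_searches(searches, "palm island")
per_occurrence = count_occurrences(searches, "palm island")
print(per_search, per_occurrence)
```

Counting per search stops one long, repetitive query from outweighing several short ones.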
A quick overview
On average there's around - searches per day. The most searches were in 2016 & 2017, followed closely by 2020 (note: tracking didn't start properly until June 2012, hence the comparatively low number that year). The average number of words per search is around three. Hover over the graphs for more detail.
The graphs below show the relative percentage of searches by when they occurred in 2021.
Not surprisingly, searches were more common between 9am and 4pm, with a peak at around 2-3pm.
Searches peaked on Wednesday, with the fewest on Saturday, which had around half as many.
The peak was in May, with over twice as many searches as the lowest month (December). The season with the most searches was Autumn, and the one with the fewest was Spring (note that the month data above is based on the average per day for each month).
How to read
A topographic map shows the features of a given area. In the physical world, this would include features such as mountains, rivers, lakes, and oceans, as well as potentially the types of vegetation or land cover such as forests or swamps, and structures such as buildings, bridges, or towns. To show elevation, or in this case, how common a search term was at that point in time, contour lines are used. They enclose areas of the same height, and each contour line indicates the same increase or decrease in height.
Contour lines that are close together indicate a rapid change in height, while contour lines that are farther apart indicate a gentler slope. The direction of the slope can usually be determined by the numbers on the contour lines (as well as colour), but these maps just use colour. If the colour is getting lighter, then the height is increasing; if it is getting darker, then it is decreasing. Peaks are indicated by black dots, and the highest point overall by a pink one.
The maps below show a few examples of some of the typical features. Note that the map can be viewed in either 2D or 3D. Each map has a scale, showing the minimum value - a dark green - and the maximum value, indicated by white. In order to best show the differences within each search term the scale varies.
Note: clicking on the marker icon will show you the topographic map for that word or phrase, clicking on the magnifying glass icon will take you to the search results for that word or phrase in the library catalogue.
- SHOW TOPOGRAPHY
- SEARCH CATALOGUE
The map below shows the search results in 3D for six different searches. Note that in order to best show the different feature types, the elevation/vertical scale varies between maps (in reality, searches containing 'newspaper' are over 120 times more common than searches containing 'XXXX').
The first recorded search for this term was in May 2012, but it didn't become popular until October & December 2020.
An example of a search term with ongoing interest. First search in the catalogue was 'Covid-19 economy' at around 7am on the 9th of March, 2020.
An example of a term with an annual peak, in this case around the date of the show in early August. Note the recent drop, which could be because the show has been cancelled for the past two years.
An example of a term that is always popular, with multiple searches per day.
This shows a gentle increase from left to right from a low level. Note the darker colour, which shows this is lower.
This shows a gully in April-May 2013 & April 2014 and the saddle between March & May 2015.
This shows a steep increase up to the highest point on the map (as indicated by the pink dot).
How to search.
There are a number of different ways to explore the - searches that people have made over the years as they've searched the State Library of Queensland catalogue. These are described below.
exact: The simplest type of search. Will only match exactly what's in the search field. So a search for 'Brisbane' will only return searches that were also just 'Brisbane' (though as the default search is case-insensitive, it would also return 'brisbane', 'BRISBANE', and 'bRISBANE' unless the 'case-sensitive' option had been selected).
contains: This will return any search that contains what's in the search field, so a search for 'Brisbane', would return any search where the word 'Brisbane' appeared.
contains (and): This can be used to narrow your search, as by entering a comma separated list of words, it will only return searches where all the words appeared. For example, a search for 'bacon, eggs' would only return searches where both 'bacon' and 'eggs' appeared.
contains (or): This can be used to widen your search, as by entering a comma separated list of words, it will return any searches where at least one of the words appeared. For example, a search for 'bacon, eggs' would return any searches where either 'bacon' or 'eggs' appeared.
contains (not): This can be used to narrow your search, as by entering two words separated by a comma, it will only return searches that have the first word, but not the second. For example, a search for 'bacon, Francis' would return any searches where 'bacon' appeared, as long as 'Francis' did not appear.
starts with: Will match any search that starts with the characters in the search field.
ends with: Will match any search that ends with the characters in the search field.
regex: This allows you to use regular expressions to search the database. By default, the field in the database is converted to lowercase, so there is no need to account for case in your expression. To do a case-sensitive search, make sure that 'case-sensitive' is selected. Note that the regex is MySQL's REGEXP, so spaces are written using the POSIX character class [:space:] (inside a bracket expression, i.e. [[:space:]]) rather than \s etc. Read more about the requirements here.
There are also two other options that can help you get the search results you're after:
case-sensitive: By default the case of letters is ignored (or more accurately, they're all treated as lowercase), so 'Brisbane', 'BRISBANE', and 'brisbane' will all return searches where the word 'Brisbane' - in any combination of uppercase and lowercase letters - appears. If case matters, then make sure that this option is selected, and it will only return searches that match the case as it appears.
whole word: By default word boundaries are ignored, so 'cat' will also return searches that include 'cats', but it will also include those with words like 'catholic' and 'education', so if you want to ensure that other words aren't accidentally matched, make sure that this option is selected. (Note that this is more likely to be an issue with shorter words, so in order to automatically get plurals, it is better left unselected unless required.)
Note that it's worth checking 'Top Searches' to make sure that you're getting the results you expect, especially to determine whether the 'whole word' option should be selected.
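As a rough illustration of how the search types and the two options described above fit together, here is a minimal Python sketch. It is not the project's code (the real matching happens in MySQL), the function and mode names are hypothetical, and Python's `re` syntax differs from MySQL's REGEXP, so the regex branch only approximates the behaviour.

```python
import re


def _has(text, term, whole_word):
    """True if `term` occurs in `text`, optionally only as a whole word."""
    if whole_word:
        return re.search(r"\b" + re.escape(term) + r"\b", text) is not None
    return term in text


def matches(search, query, mode="contains", case_sensitive=False, whole_word=False):
    """Return True if a logged search matches the query under the given mode."""
    text = search if case_sensitive else search.lower()
    if mode == "regex":
        # Only the stored search is lowercased; the pattern is used as written.
        return re.search(query, text) is not None
    q = query if case_sensitive else query.lower()
    if mode == "exact":
        return text == q
    if mode == "contains":
        return _has(text, q, whole_word)
    if mode == "starts_with":
        return text.startswith(q)
    if mode == "ends_with":
        return text.endswith(q)
    # The remaining modes take a comma separated list of words.
    terms = [t.strip() for t in q.split(",") if t.strip()]
    if mode == "contains_and":
        return all(_has(text, t, whole_word) for t in terms)
    if mode == "contains_or":
        return any(_has(text, t, whole_word) for t in terms)
    if mode == "contains_not":
        if len(terms) != 2:
            raise ValueError("contains_not expects two comma separated terms")
        wanted, unwanted = terms
        return _has(text, wanted, whole_word) and not _has(text, unwanted, whole_word)
    raise ValueError(f"unknown mode: {mode}")


print(matches("catholic education", "cat"))                 # substring match
print(matches("catholic education", "cat", whole_word=True))
print(matches("Francis Bacon paintings", "bacon, Francis", mode="contains_not"))
```

The first call matches because 'cat' sits inside both 'catholic' and 'education', which is exactly the situation the 'whole word' option is there to avoid.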
Select from the options below to explore the - searches of the library catalogue. The results can then be downloaded as either csv or json (and the generated map as either an svg or a png). Note that the scale varies from search to search, and 'RATIO' refers to the number of times the term appeared every 100,000 searches in that month, while 'RATE' refers to the number of times the term appeared per day in that month. Maps can be viewed in either 2D or 3D (and the 3D one can be rotated). As a rough guide, terms that appear once per day work out to be around - times, once per week around -, and once per month around - or so. Note that depending on the search, some results may take a while to load (especially those marked with a '#' in the dropdown menu).
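To show how 'RATIO' and 'RATE' relate to the raw counts, here is a small Python sketch of the two definitions given above. The function names and the example numbers are made up for illustration, and it assumes the month has a full set of days (which won't hold for the current month, or for the patchy months in 2012).

```python
import calendar


def ratio_per_100k(term_count, total_searches):
    """Times the term appeared for every 100,000 searches made in that month."""
    if total_searches == 0:
        return 0.0
    return term_count * 100_000 / total_searches


def rate_per_day(term_count, year, month):
    """Times the term appeared per day in that month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return term_count / days_in_month


# Hypothetical month: a term matched 62 searches out of 124,000 in May 2021.
ratio = ratio_per_100k(62, 124_000)
rate = rate_per_day(62, 2021, 5)
print(ratio, rate)
```

The ratio is handy for comparing months with very different totals, while the rate is easier to picture as a number of searches on a typical day.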
* Aboriginal languages in Queensland list here. As the page notes "Aboriginal and Torres Strait Islander languages are oral languages that have only been written since European settlement; there may be several variations in spelling" and here the code only looks for the languages as spelt (case-insensitive) as they appear on the page
** Torres Strait Islander language list from here.
*** April & May 2012 are unreliable as they only recorded 156 and 734 unique searches respectively (the average for the rest of the year was around 50,000 unique searches per month). In February & March 2022 the power was off at the library at times due to the floods, during which the online catalogue was unavailable, so total searches for those months are lower (during the affected period - February 27th to March 8th - there were around 300 searches per day on average, which is about 10% the daily totals of the weeks before).
**** stopwords are common words such as 'a', 'the', 'is' etc that are often removed in natural language processing in order to better highlight relevant information. The stop words used in this project were those from the python-based Natural Language Toolkit (NLTK) - except with the pronouns taken out of the list - so the final list was as follows: "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now". I also looked at using lemmatisation for this section, but given the sheer number of words in some of the queries, it was taking too long (over 10 times longer in many cases), so I left the words as is. After that I played around with converting plural forms of words to singular, but as most of these converters are regex based, and given the sheer number of words in the corpus, errors were inevitable and writing exceptions unwieldy (e.g. a rule based converter would change 'Torres' to 'Torre' or 'Torr'). So again, I figured it was best to leave the words as they were.
**** VADER Sentiment Analysis: Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014. Note that this project only uses the dictionary component, so context and word order are not factored in. In the original study the scores ranged from -4 (most negative) to 4 (most positive), whereas here I have scaled them to be between -5 and 5, for no real reason other than the fact that I think it looks a wee bit neater.
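The two notes above (stop words for the 'top words', and the dictionary-only sentiment) can be sketched roughly as follows. This is a simplified illustration rather than the project's code: the stop word set is only a subset of the list above, the lexicon scores are made-up values and not entries from VADER, leaving the query's own words out of the top words is my assumption, and so is scoring each word on its own.

```python
import re
from collections import Counter

# A small subset of the stop word list given in the note above.
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to", "for", "in", "how"}

# Hypothetical scores on a -4..4 scale, made up for illustration
# (not copied from the VADER lexicon).
LEXICON = {"better": 2.0, "love": 3.0, "hate": -3.0, "worst": -4.0}


def tokenize(search):
    """Lowercase a search and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", search.lower())


def top_words(searches, term, n=10):
    """Most common other words in the searches that contain `term`."""
    query_words = set(tokenize(term))
    counts = Counter()
    for search in searches:
        if term.lower() not in search.lower():
            continue
        for word in tokenize(search):
            if word not in STOPWORDS and word not in query_words:
                counts[word] += 1
    return counts.most_common(n)


def scaled_sentiment(word):
    """Look a word up in the lexicon and rescale from -4..4 to -5..5."""
    return LEXICON.get(word.lower(), 0.0) * 5 / 4


def sentiment_label(score):
    """Bucket a score as positive, negative, or neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


searches = ["better sleep", "how to sleep better", "sleep paralysis and alien abduction"]
print(top_words(searches, "sleep", n=3))
print(scaled_sentiment("better"), sentiment_label(scaled_sentiment("better")))
```

Because only the dictionary is used, a search like 'not better' would still score as positive, which is part of why the sentiment section is described as crude.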
While there are many ways to search - by title, author, keyword, catalogue number, Boolean, etc - a handful of people choose to politely phrase their search in the form of a question (currently -, or around -% of searches). This section looks at the searches that start with words such as 'who', 'what', 'when', 'where', 'why' etc (sometimes referred to as 'interrogative words'). As a proportion of all searches, people tended to ask the fewest questions in summer, and the most in winter. To see more about these types of searches and their topography, search using the 'starts with' option in the 'Search' section above.
Note: this only returns searches that had at least three words, and some of the results may be for books where the title of the book is a question, as there is not an easy way to exclude them. Also, given that 'Will' is also a name, to exclude searches for a person from the list, 'will' has the added criterion that the search has to end with a question mark.
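The rules in the note above can be sketched in a few lines of Python. The exact list of question words used by the project isn't given, so the tuple below is an assumption based on the words mentioned in this section (plus 'how'), and the function name is hypothetical.

```python
# Assumed list: the five words named above, plus 'how' and the special case 'will'.
QUESTION_WORDS = ("who", "what", "when", "where", "why", "how", "will")


def is_question(search):
    """Apply the rules from the note: a question word first, three or more words."""
    stripped = search.strip()
    words = stripped.lower().split()
    if len(words) < 3:
        return False
    first = words[0]
    if first not in QUESTION_WORDS:
        return False
    if first == "will":
        # 'Will' is also a name, so it only counts with a trailing question mark.
        return stripped.endswith("?")
    return True


for example in ["why is the sky blue", "Will Smith biography", "will it rain tomorrow?"]:
    print(example, "->", is_question(example))
```

As the note says, a book title that happens to be a question would still pass these checks.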
Some of the questions asked
The trees, not the forest.
When looking at data, especially large data sets, the focus often tends to be on what is at the top - the biggest, the latest, the trends - and sometimes the information contained in the less frequent elements can be missed. Given that almost 40% of searches only ever appeared once, and a further 6% or so only twice, the searches exhibit the classic long tail distribution:
[Chart: number of times a search appeared, April 2012 - Dec 2021]
Therefore, in order to briefly have a look at a few of the individual trees, rather than the overall forest, the searches below are a selection of those that have only appeared a single time.
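For anyone wanting to look at the long tail in the downloaded data, here is a minimal Python sketch of how the 'appeared once' and 'appeared twice' shares could be worked out. It assumes the percentages are shares of all searches (rather than of distinct search strings) and that searches are grouped by their exact text. Both are my assumptions, and the function names are hypothetical.

```python
from collections import Counter


def share_by_times_seen(searches):
    """Map 'times seen' to the share of all searches whose text was seen that often."""
    if not searches:
        return {}
    per_string = Counter(searches)
    total = len(searches)
    shares = Counter()
    for times_seen in per_string.values():
        shares[times_seen] += times_seen / total
    return dict(shares)


def seen_once(searches):
    """The individual trees: search strings that appeared a single time."""
    per_string = Counter(searches)
    return [text for text, count in per_string.items() if count == 1]


searches = ["cats", "cats", "cats", "Elvis", "Elvis", "zines"]
shares = share_by_times_seen(searches)
singles = seen_once(searches)
print(shares, singles)
```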
A few recent examples
(there were - searches that only appeared once)
Notes, & a few
As in the physical world, certain events leave their marks on the landscape. The steep cliffs formed by searches related to COVID (the first search was for 'Covid-19' at around 7am, on Monday the 9th of March, 2020) are now part of the data, and only time will tell the final shape and form that those peaks will take. The gentle decline of DVDs (and the subsequent rise of streaming), the 2011 floods (and the catalogue going offline during the 2022 floods), the annual peaks for Anzac Day, the Ekka, and those (presumably) related to school projects on Ancient Egypt, Greece, and Rome, the dwindling interest in Lonely Planet travel guides, the increased popularity of the em dash, all are etched into the data, and are now permanent features of the landscape.
A librarian once said "we cannot see the role libraries play in fighting inequality, polarization and loneliness from a spreadsheet", to which I'd add, "or a data visualisation", but I hope this highlights the huge range of questions - both literal and implied - that people have turned to the library for, in order to seek some answers. There's obviously heaps more that could have been done with this dataset, but I hope people have found at least some aspects of this interesting, and so um, feel free to now go and search the catalogue (and thus contribute some more data points to the maps), get some tips on some of the different ways to search, or keep exploring the maps using some of the topics in the footer below.
- Search data is taken from Google Analytics and based on 'unique searches', where duplicate searches within a single session or visit are excluded. i.e. If someone searches for 'Townsville' several times during their visit, it will only count as one search for 'Townsville' (but it is case sensitive so 'Townsville' and 'TOWNSVILLE' will count as two separate searches). More info here.
- Search data is currently stored in Google Analytics with some of the search criteria as a prefix. So a general search for 'Townsville' is stored as 'any,contains,Townsville', a search for a title that begins with 'MOBY' is stored as 'title,begins_with,MOBY' etc. For this project, the prefixes have been stripped.
- In order to speed up how quickly the data is returned, the handful of searches that were over 768 characters were truncated (0.002% as of the end of 2021). Truncated searches end with '[...]'.
- Those who searched for the Euro symbol (€) might have noticed a bunch of results where 'â€™' appeared instead of an apostrophe (e.g. 'Donâ€™t call me Ishmael'). This is an issue with the original data, and is due to a character encoding mismatch, which can occur when the search text is copied and pasted from another program. Another common one that appears in the data is '~2F' for '/'.
- For a while there was going to be another section looking at the ratio of unique words to total searches for a term, to see if more complex topics had more unique words used when people were searching for them, but I sort of ran out of time.
- HTML tags were stripped from the searches when they appeared.
- As noted earlier, although there is data for April, May, & June 2012, the first month with a full month of data is July, so data from earlier than that should be considered unreliable. Also, in February & March 2022 the power was off at the library at times due to the floods, during which the online catalogue was unavailable, so total searches for those months are lower (during the affected period - February 27th to March 8th - there were around 300 searches per day on average, which is about 10% of the daily totals of the weeks before).
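A few of the clean-up steps mentioned in the notes above (stripping the prefix, removing HTML tags, truncating long searches) can be sketched in Python as below, along with a demonstration of how the 'â€™' garbling comes about. This is an illustration under assumptions rather than the project's code: the prefix pattern is guessed from the two examples given, the helper names are hypothetical, and whether the 768 character limit includes the '[...]' marker is not stated.

```python
import re

MAX_LENGTH = 768
TRUNCATION_MARK = "[...]"

# Assumed shape of the stored prefix, e.g. 'any,contains,' or 'title,begins_with,'.
PREFIX = re.compile(r"^[a-z_]+,[a-z_]+,")
HTML_TAG = re.compile(r"<[^>]+>")


def clean_search(raw):
    """Strip the criteria prefix and any HTML tags, then truncate long searches."""
    text = PREFIX.sub("", raw, count=1)
    text = HTML_TAG.sub("", text)
    if len(text) > MAX_LENGTH:
        text = text[:MAX_LENGTH] + TRUNCATION_MARK
    return text


def garble(text):
    """Reproduce the mismatch: UTF-8 bytes read back as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def ungarble(text):
    """Undo it (only works if every character survives the round trip)."""
    return text.encode("cp1252").decode("utf-8")


print(clean_search("any,contains,Townsville"))
print(clean_search("title,begins_with,MOBY"))
print(garble("Don\u2019t call me Ishmael"))
```

The curly apostrophe is three bytes in UTF-8, and each of those bytes is read as a separate character in Windows-1252, which is where the Euro symbol in the results comes from.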