This tutorial will walk you through the steps needed to do simple data analysis for library collection and circulation data.
What you need:
- Excel (I used )
- A dataset (.csv, .tsv or .xls format). I’d recommend walking through the steps with a small sample dataset before applying them to your target dataset. I’ll be using the dataset MROBKJFI.xls for this tutorial.Your target dataset should include at least the following fields:
- title
- author
- date of publication
- call number
- total loans
- patience and fortitude!
What you will learn:
- How not to be scared of spreadsheets
- SUM
- LEFT
- Text to Columns
- Pivot Table
- Conditional Formatting
- IF
- Fill down shortcut
- Paste values only
- SUMIF, COUNTIF, AVERAGEIF
Step 1: Make sure you understand your data (at least sort of)
Ask yourself some questions before starting to do anything with your data. Where did this data come from? What column headings does it have? Do you understand what the column headings and file name means?
Step 2: Make your data easier to understand and share with better metadata
First of all, change the name of your file to something that is easier to understand, and which contains important information about your file. File naming conventions are important for good data management: read more about them in this excellent LibGuide from Purdue University Libraries. Use the following naming convention for your file:
YYYYMMDD_CourseNumberSemesterDate_ProjectName_DescriptionofFile_YourLastName_versionnumber.xls.
So the new file name for my file is:
20151105_LIS655Fall2015_Assignment4_MonroePLChildrensFictionDataAnalysis_Masemann_v01.xls.
Yes, it’s a mouthful, but using good naming conventions will save you lots of trouble and heartache later, especially when working in collaborative teams.
While you’re at it, make sure you know where you saved your file (since you probably downloaded it from somewhere and it’s easy to lose track when you do that). I’m saving mine to the desktop.
Now, we’re going to write a short readme file for this data. A readme file is an unstructured form of metadata. It’s written in plain text so it’s easily preserved and shared, and it contains the important information users might want to know about your data. Readme files are useful for other people using your data, but they can also be useful for you, because you might forget exactly where your data came from and what it contains.
So, open Notepad and write a short description of your data, including
- where your data came from
- what date the data is from
- what format the data is in
- anything else you think would be helpful.
You might want to leave this step to the end once you are more familiar with your data.But don’t forget to do it! Metadata is important!
Step 3: Clean up your Data
Make your data much easier to work with by formatting your spreadsheet as a table:
Select your entire worksheet by clicking in the top left corner. From the Home tab, click on Format as Table, and select your preferred table style. Click OK for any dialog boxes that pop up. It may take a few moments for the formatting to take effect, especially if your table is large.
Then, scroll through your spreadsheet and take a look at any data that looks unusual. You might notice that the data in the barcode field (if you have one) looks odd: try making that column wider.
Take a look at the Pub Date field. This field is pulled from the MARC field 260 subfield c. Catalogers often use non-numerical characters such as square brackets, question marks and ‘c’ (for copyright) when populating this field. Before we can run calculations on this date data, we need to strip out non-numeric characters. To do this:
Select the Pub Date column.
Click “Find and Select” (on the Home tab, at the top right).
In the Find what field enter the non-numeric character you would like to get rid of.
Leave the Replace with field blank.
Important: To replace question marks, do not just enter ? in the Find What field. The ? is the wildcard operator in Excel, and if you do this, you will erase all the data in the column. You will need to add a tilda (~) before the ?, like this:
Some ideas for characters to strip out: [ ] , . c, ©, Â, etc.
Once you have cleaned up your Pub Date column, it’s time to check to see whether the columns in your worksheet are formatted with the correct data format. Excel will process date data differently than it processes text data and numerical data. In order to perform calculations with dates, however, it’s easier to format them as numbers. And there are lots of numbers in catalog data that actually behave like text. For example, you would never calculate the sum of 2 Dewey Decimal numbers, but you would want to sort them. To format the Pub Date column as numerical data:
Select the Pub Date column.
Right click, choose “Format cells.”
Select “Number” from the category list, and select 0 decimal places.
Click OK.
So, go through each column in your worksheet and format it. Columns containing numerical circulation data (e.g. total number of checkouts) should be formatted as numbers. All other columns should be formatted as General.
Once you have formatted your Pub Date column as numerical data, all the numbers should be aligned to the right. If any are still aligned to the left, you need to check them and do the following:
- keep stripping out non-numerical characters
- make sure the cells each contain only one year – if 2 years are listed, keep the most recent year (which usually indicates the date of a reprint or new edition, which is a better indicator of the age of the book for collection management purposes).
For more tips on cleaning up Excel data, see Top Ten Ways to Clean Your Data.
Step 4: Add some additional columns to make your data more powerful, and delete some columns (cautiously!) to make it more manageable.
You have a lot of data to work with already, but adding some additional columns will provide you will make it even more powerful.
In addition to analyzing publication year, it’s also helpful to analyze the age of each item. To do this, you will need to create a new column and (cue drumroll) use a formula.
FORMULAS AND FUNCTIONS SHOULD NOT BE SCARY. THEY ARE YOUR FRIENDS BECAUSE THEY MAKE YOUR LIFE EASIER AND YOUR WORK FASTER.
If you have never used a formula in Excel, take a look at this guide from the University of Michigan Libraries.
To create the “Age of Item” column, first add the heading “Age of Item” to the column next to “Pub Date.” If you need to add a column in this position, use the “Insert” button on the Home tab.
Then position your cursor in the first empty cell in this column, and type
=2015-
then use your mouse to click and select the cell containing the pub date (which should be right next to the cell you are working in). The selected cell should change color and get a flashing border.
Press enter.
The “Age of Item” column should now be populated with the age in years of each item.
One major problem with circulation data is transferring data from one ILS (integrated library system) to another when you upgrade your system. So you end up with a situation where data such as number of checkouts is stored in two or more places: you can have one field in your current ILS which contains the number of checkouts since that system was implemented, and another field that contains the legacy circulation data from the old system. For the purposes of this exercise, it’s not important to compare circulation from before and after the implementation of an ILS (although this could be an interesting question for another day). So let’s combine the columns that contain data from both sources. While we’re doing this, let’s also combine the number of checkouts with the number of renewals, since we’re not interested in comparing checkouts to renewals: we only want to know about the total circulation of the item.
In this example,
Circulation information from the current ILS is contained in the column Koha CKO
Legacy circulation information is contained in the column Past Use
Renewal information is contained in the column Koha Renews.
First, change the heading of the column next to “Age of Item” to “Total checkouts”.
Then, place your cursor in the first empty cell of that column.
Type the following:
=SUM(
Then, click on the first cell in the Koha CKO columnn
Type a colon :
Click on the first cell in the Koha Renews column.
This should select the first cells in each of the Koha CKO, Past Use, and Koha Renews columns.
Type a ) to finish the formula, and press enter. Your new column should be populated with the total of all three columns.
You will be working with the Total Checkouts column for the rest of the exercise, but don’t delete the columns you used to calculate it!
Finally, we want to create another column that gives us a broad classification for the materials. Right now, we have a column containing call numbers (in either Library of Congress or Dewey Decimal format), but it’s difficult to use complete call numbers for analysis because we can’t summarize them easily. It’s pretty pointless to say “We have one book with call number 025.26407 and one book with call number 025.3445” but it’s quite useful to be able to say “We have 300 books in Dewey class 025 and 600 books in Dewey class 027, but the ones in Dewey class 025 circulate at three times the rate of the ones in class 027.”
Luckily, both Dewey call numbers and LC call numbers are designed in such a way that they can be compared more easily if you truncate them. It’s easiest to see how this works if you just dive in and do it. So,
First, change the heading of the column next to “Total checkouts” to “LC Class” or “Dewey Class” depending on your data.
Then, place your cursor in the first empty cell of that column.
Type the following:
=LEFT(
Then, click on the first cell in the call number column. You should have something that looks like this (if Column G is the column containing the call numbers):
=LEFT(G2,
The number after the comma in this formula is the number of characters that will remain at the left of the data once it is truncated.
So, if you use the formula =LEFT(G2,2), this is what will happen to the following call numbers:
025.23 –> 02
PN5677 –> PN
You can see that for Dewey call numbers, it’s more useful to truncate them to 3 characters, and for LC call numbers, 2 characters is better. If you were doing more detailed analysis, you can even truncate Dewey numbers to 4 characters and get useful data. But for this exercise, let’s stick with 3.
So, choose your number of characters, close your formula with a closing parenthesis, and press enter. Your new column will be populated.