
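[Internship Report] A Preliminary Analysis of Shanghai Library Lending Data Using R (Chinese and English Versions)

A Preliminary Analysis of Shanghai Library Lending Data Using R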

许朱怡 (Zhuyi Xu)

*See English version below

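In an age of information explosion, are traditional libraries under threat? From what I observed at Shanghai Library, readers still fill the building. Can this momentum last, or are there hidden concerns? Fortunately, library operations and management are becoming more digital and more connected to the internet, and they have accumulated a large amount of data. So can a brief statistical analysis of readers and their borrowing turn up anything interesting? This was my focus during my internship at Shanghai Library, where I made a first attempt on a sample of the data, with R as the tool for analysis and presentation.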

 

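  • Distribution of readers' ages

The library exists to serve its readers, so I chose a profile of the readers as my first subject. A histogram shows a distribution at a glance; grouping ages into 5-year bands gives the result in Figure 1. Readers are mainly between 20 and 45 years old.

Figure 1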

 

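Breaking the ages down further, as in Figure 2, shows that readers are most concentrated at ages 25 to 27. There are other interesting patterns too. The number of readers dips at ages 14 to 15, roughly the last year of junior high and the first year of senior high, when students are probably busy preparing for the senior high school entrance exam and adjusting to high school life. At university age the number of readers starts to rise noticeably, and the fastest growth comes in the years after graduation. This is probably because people who have left the university library need a new way to find reading material, and because those who have just started work have a lot to catch up on. There is another increase at around age 60, perhaps because many people have time to pick up books again after retiring. One simple chart captures a whole life.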

 

Figure 2

 

So which are more numerous, men or women? According to the statistics, female readers account for 41.4% and male readers for 58.6%. See Figure 3.

Figure 3

 

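By age, the distributions for the two genders have broadly similar structures, with slight differences. See Figure 4.

Figure 4

A direct comparison in Figure 5 shows that before age 25 there are slightly more female readers than male readers, while after 25 male readers clearly outnumber female readers, and the gap widens further above age 50. Are these women spending more of their time in front of the television? Recommending suitable good books to them might well attract more readers.

Figure 5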

 

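In statistics, the box plot (boxplot) is a common way to present data that, compared with a bar chart, brings out different perspectives and details. In Figure 6, the thick line in the middle of each box is the median, and the left and right edges of the box are the quartiles. The two lines outside the box reach out to 1.5 times the interquartile range, and the circles beyond them are unusual cases, extending out to the minimum and maximum.

Figure 6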

 

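  • Readers' preferred formats: books, magazines, newspapers

Next, consider the formats readers prefer. Of the three, books naturally account for the largest share and are the most read. Magazines also attract a considerable number of readers, and newspapers have a certain audience too, especially among older readers. See Figure 7.

Figure 7

Figure 7 shows that quite a few readers are very active, with more than a hundred loans. Further analysis shows, however, that 75% of readers who borrowed books borrowed no more than 5; the corresponding figure is 4 for magazines and 2 for newspapers, as shown in Figure 8. How to get more readers to become active is a question for further study.

Figure 8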

 

 

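  • Popular books

Next comes the other main character, the books themselves: which ones are readers especially interested in? Books can be classified in different ways; common schemes include the shelving categories that bookstores use for recommendations and the professional Chinese Library Classification (中图分类法). I explore this below using sample lending data from 2014.

In R, summary is a very convenient function for statistical summaries. With it I could run calculations over hundreds of thousands of records and get results immediately. Under the commonly used book classification, the most popular categories in 2014 were, in order: Economy, Literature, Chinese Literature, Computer and Internet, History and Geography, Language and Culture, Light Industry, Philosophy, Pharmacy & Health, and Economics Administration. Readers in each age group have their own characteristic choices. Putting these together gives the reading preferences by age shown in Figure 9.

Figure 9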

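The following can be observed:

  1. The most popular category in 2014 was Economy, concentrated in the 20 to 60 age range. Interest drops markedly among readers over 60.
  2. Literature and Chinese Literature are the only categories that appear in the top ten for every age group.
  3. Interest in Pharmacy & Health grows steadily with age.
  4. The teenage group, which I care about most, has a popular list quite distinct from the other groups; see Figure 10. Psychology ranks second, and I have indeed read quite a few psychology books myself. The presence of Biography, Sociology and similar categories suggests that we are looking for role models and learning to fit into society, doesn't it?

Figure 10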

 

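  • Summary

These are some of the results and takeaways from my internship. Being able to apply the statistics I had learned to a real setting that I also find very interesting made me very happy. Learning and practicing R let me present the results of the analysis very directly. I found R to be extremely capable, with many more features still to explore. I would like to keep studying in this area; using technology to support research in the humanities is a wonderful thing.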

 

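  • Acknowledgements

I am very grateful to the System and Network Center of Shanghai Library for this valuable internship opportunity. I also thank Professor Annie Qu (瞿培勇), a statistician at UIUC, for her guidance during the research.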

 

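  • References

Introductory Statistics with R (《R语言统计入门》), Posts & Telecom Press (人民邮电出版社)

Data Analysis: R Programming Language Practices (《数据分析:R语言实战》), Publishing House of Electronics Industry (电子工业出版社)

R Forum at pinggu.org (人大经济论坛 - R语言论坛), http://bbs.pinggu.org/forum-69-1.html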

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

Preliminary Research on Shanghai Library Lending Statistics with the R Programming Language

Internship Summary Report

Zhuyi Xu

 

In this new era of information explosion, people rely on their electronic devices more than ever, and I wondered how that affects traditional libraries. During my internship at Shanghai Library this summer, I found that readers still crowded the reading rooms every day, and there was even a long line waiting for the library to open early in the morning. But a question remains: are there interesting phenomena and facts that we might not know about? Fortunately, the library is managed and supported by advanced IT systems that have accumulated a large quantity of data, so a statistical analysis of readers and their lending records is possible and can unveil some interesting facts. This was the scope of my internship, for which I performed some preliminary analysis on a sample of the data, using R as the software tool.

 

Distribution of Readers' Ages

Readers are the people the library serves, so I took the basic facts about readers as my first research subject. A histogram is used to show the distribution of readers' ages. I divided the ages into 5-year groups, and the results are shown in Fig. 1. The main reader groups are clearly between the ages of 20 and 45.

Fig. 1

 

In the next step, I studied the distribution on a finer scale by grouping readers by single year of age. Fig. 2 indicates that readers are most concentrated in the range of 25 to 27 years old. The chart also shows some other interesting facts. For one thing, the number of readers goes through a small dip at ages 14 to 15. It is possible that teenagers of that age are busy preparing for the high school entrance exam and getting used to high school life. The number of readers then rises dramatically starting from about 20 years old and reaches its peak before about 30, which is probably due to readers' need to find a substitute for the college library and the urge to learn new things for their new jobs and possibly a new way of living. There is also a small plateau around 60 years old, which might be the result of more people picking up reading again after retirement. A simple chart reflects many different stages and chapters of people's lives, which I found quite interesting.

 

 

Fig. 2
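The two histograms above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the real reader records are not part of this report, so `reader_age` is simulated, and the 0 to 100 age range is an assumed choice.

```r
# Simulated ages stand in for the real reader records (an assumption).
set.seed(1)
reader_age <- pmin(pmax(round(rnorm(5000, mean = 33, sd = 13)), 1), 99)

# Bin edges every 5 years: (0,5], (5,10], ..., (95,100].
breaks_5yr <- seq(0, 100, by = 5)

# plot = FALSE returns the counts without drawing; drop it to see the chart.
h5 <- hist(reader_age, breaks = breaks_5yr, plot = FALSE)

# The finer view is the same call with 1-year bins.
h1 <- hist(reader_age, breaks = seq(0, 100, by = 1), plot = FALSE)

# Label each 5-year bin (e.g. "21-25") and list the three busiest bins.
names(h5$counts) <- paste(head(breaks_5yr, -1) + 1, tail(breaks_5yr, -1), sep = "-")
head(sort(h5$counts, decreasing = TRUE), 3)
```

Changing only the `breaks` argument is what moves between the coarse and the fine view, which makes it easy to check whether a feature such as the dip at 14 to 15 survives a different bin width.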

 

According to the statistics in Fig. 3, female readers make up about 41.4% of the total and male readers about 58.6%.

 

Fig. 3

 

 

 

In addition, I plotted the age distribution separately for each gender and compared the two, which turn out to have quite similar structures. More details are provided in Fig. 4.

 

Fig. 4

Furthermore, comparing the age distributions of the two genders side by side, as shown in Fig. 5, I observed that there are slightly more female readers than male readers among those younger than 25; among older readers, however, males clearly outnumber females, and the gap widens further among readers aged 50 and above. This phenomenon reminds me of my own grandparents: my grandmother likes to stay in front of the television all day, while my grandfather makes a cup of tea and reads for a while every day.

 

Fig. 5
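The gender share and the age-by-gender comparison above amount to a one-way and a two-way frequency table. The sketch below assumes a hypothetical data frame `readers` with `age` and `gender` columns; the data are simulated, with only the 41.4% / 58.6% split taken from the text.

```r
# Hypothetical reader table; the real records are not available here.
set.seed(2)
n <- 4000
readers <- data.frame(
  age    = sample(10:80, n, replace = TRUE),
  gender = sample(c("F", "M"), n, replace = TRUE, prob = c(0.414, 0.586))
)

# Overall share of each gender, in percent.
gender_share <- round(100 * prop.table(table(readers$gender)), 1)

# Cross-tabulate 5-year age groups against gender.
readers$age_group <- cut(readers$age, breaks = seq(0, 100, by = 5))
by_age <- table(readers$age_group, readers$gender)

# Difference (male minus female) per age group: positive where men outnumber women.
gap <- by_age[, "M"] - by_age[, "F"]
```

Passing `t(by_age)` to `barplot()` with `beside = TRUE` would draw one pair of bars per age group, which is one way to get a side-by-side chart of this kind.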

 

In statistics, the boxplot is another frequently used visualization tool; compared with a histogram, it shows the data from different angles and in different detail. In Fig. 6, for example, the thick line in the middle of each box is the median and the two sides of the box are the quartiles, so the box covers the middle 50% of the data. The whiskers extend out to at most 1.5 x IQR from the box, and the small circles beyond them are extreme values, drawn all the way out to the minimum and maximum.

Fig. 6
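The pieces of a boxplot described above can be recomputed by hand and compared with what R reports. This is a minimal sketch on simulated ages (an assumption, with two extreme values added on purpose); `boxplot.stats()` returns the same numbers that `boxplot()` draws with its default setting of 1.5.

```r
set.seed(3)
age <- c(round(rnorm(500, mean = 35, sd = 10)), 88, 92)  # two deliberately extreme values

s <- boxplot.stats(age)           # default coef = 1.5
lower_hinge <- s$stats[2]         # left edge of the box
med         <- s$stats[3]         # thick line inside the box
upper_hinge <- s$stats[4]         # right edge of the box
box_width   <- upper_hinge - lower_hinge  # hinge spread, approximately the IQR

# Fences sit 1.5 box widths beyond the box edges.
upper_fence <- upper_hinge + 1.5 * box_width
lower_fence <- lower_hinge - 1.5 * box_width

# Whiskers stop at the most extreme observations still inside the fences.
upper_whisker <- max(age[age <= upper_fence])
lower_whisker <- min(age[age >= lower_fence])

# Observations beyond the fences are the ones drawn as circles.
circles <- age[age > upper_fence | age < lower_fence]
```

Note that the whisker ends on an actual data point, so it is usually a little shorter than the full 1.5 x IQR distance.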

 

Readers' Preferences among Three Reading Categories: Books vs Magazines vs Newspapers

Next, I investigated readers' choices among three types of material: books, magazines and newspapers. Books are of course the majority and are borrowed the most. Magazines also attract a significant number of readers, and newspapers have a certain popularity too, especially among older readers. See Fig. 7 below.

Fig. 7

In Fig. 7, it is noticeable that many readers are very active, with more than a hundred loans. But the boxplot in Fig. 8 shows that 75% of readers who borrowed books borrowed no more than 5; the corresponding figures for magazines and newspapers are 4 and 2. More research could be done to find out how to encourage more readers to borrow more books, magazines or newspapers.

 

Fig. 8
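The "75% of readers borrowed no more than N" figures are third quartiles of loans per reader, computed separately for each type of material. The sketch below uses hypothetical, simulated loan counts (the `loans` data frame and its distribution are assumptions), so the numbers it produces are not the library's.

```r
set.seed(4)
# Hypothetical per-reader loan counts: skewed, so a few readers borrow a lot.
loans <- data.frame(
  type  = rep(c("book", "magazine", "newspaper"), times = c(3000, 1500, 500)),
  count = c(rgeom(3000, 0.25), rgeom(1500, 0.30), rgeom(500, 0.45)) + 1
)

# Third quartile of loans per reader, by material type.
q75 <- tapply(loans$count, loans$type, quantile, probs = 0.75)

# Share of readers at or below their type's third quartile (about 75% or more).
share_below <- tapply(loans$count, loans$type,
                      function(x) mean(x <= quantile(x, 0.75)))
```

With data this skewed, the third quartile says more about the typical reader than the mean does, since a handful of very active readers pull the mean upward.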

 

 

Popular Books

Next, I analyzed the other main subject, the books themselves, to discover what kinds of books readers seek out most. There are different ways to classify books: for example, one scheme is used to decide the display shelf in bookstores, and another is the more professional Chinese Library Classification (CLC). I explore this further using sample lending data from 2014.

 

In R, summary is a very handy function that can summarize hundreds of thousands of records and return the results within seconds. According to the standard book classification, the most popular book types in 2014 were, in order: Economy, Literature, Chinese Literature, Computer and Internet, History and Geography, Language, Light Industry, Philosophy, Pharmacy & Health, and Economics Administration. Readers in different age groups have their own preferences, and a closer look at these is summarized in Fig. 9.

 

Fig. 9
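A ranking like the one above, overall and per age group, can be sketched as follows. Everything here is illustrative: the `loans` data frame is simulated, only six category names are used, and the age group boundaries are assumed rather than taken from the report.

```r
set.seed(5)
categories <- c("Economy", "Literature", "Chinese Literature",
                "Computer and Internet", "History and Geography", "Language")

# Hypothetical loan records: one row per loan, sampled with unequal weights.
loans <- data.frame(
  category = factor(sample(categories, 20000, replace = TRUE,
                           prob = c(6, 5, 4, 3, 2, 1))),
  age      = sample(10:80, 20000, replace = TRUE)
)

# summary() on a factor counts the records at each level; sort to rank them.
ranking <- sort(summary(loans$category), decreasing = TRUE)

# Repeat per age group: split by group, count, keep the top 3 category names.
loans$age_group <- cut(loans$age, breaks = c(0, 20, 40, 60, 100))
top_by_age <- lapply(split(loans$category, loans$age_group),
                     function(x) names(head(sort(table(x), decreasing = TRUE), 3)))
```

Storing the category as a factor matters here: on a plain character column, `summary()` reports only the length and type rather than per-category counts.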

The findings are as follows:

  • People between 20 and 60 years old were most interested in Economy, which topped their reading list in 2014. Readers older than 60, however, show markedly less interest in it.
  • Literature and Chinese Literature are the only two types of books that are in the top 10 across all age groups.
  • Readers' interest in Pharmacy & Health grows as they get older.
  • The young readers' group, the one I am most interested in, is between 1 and 20 years old, and its book selection is quite distinct from that of the other age groups. Fig. 10 provides more details. Note that Psychology is ranked No. 2 for this group. This is not surprising to me, since I have also read many Psychology books. Biography and Sociology are also in the top 10 for this age group, which may suggest that young people like me are looking for role models as they grow up and are trying to learn how to fit into society. Does that make sense?

 

Fig. 10

 

 

Summary

This is a summary of my preliminary research during my summer internship in 2015. During the internship, I was able to take the basic statistics I had taught myself and apply it to real-world scenarios that also interest me a lot. It was an amazing experience. Through learning and practicing R, I found it to be a handy and powerful tool for data processing and visualization. There is more to learn, and I would love to keep refining my R skills and to conduct humanities research with the help of technology. Wow, I really enjoyed the learning experience!

 

Acknowledgement

I am very grateful to the Shanghai Library System and Networking Center for this internship and learning opportunity. I also thank Professor Annie Qu of UIUC for providing statistical guidance during my research.

 

References

  • Peter Dalgaard, Introductory Statistics with R, Second Edition, Posts & Telecom Press
  • Li Shi Yu et al., Data Analysis: R Programming Language Practices, Publishing House of Electronics Industry
  • R Forum at pinggu.org, http://bbs.pinggu.org/forum-69-1.html

 

 
