My research focuses on improving the utility of social media with Natural Language Processing (NLP). In particular, I have worked on Twitter text normalisation and geolocation prediction. My research applies a divide-and-conquer paradigm to the huge volume of social media data.

On the "conquer the noise" side, lexical normalisation converts non-standard words in social media text to their canonical forms, e.g., 4eva ("forever"). The normalised data is then more accessible to existing NLP tools and downstream applications. On the "divide the data" side, geolocation prediction typically takes a Twitter user's tweets (incl. metadata) as input and outputs the most probable location from a discrete set of pre-defined locations, such as metropolitan cities. This enables partitioning the data by location, which makes location-based applications feasible (e.g., local event detection, regional sentiment analysis) and avoids processing massive amounts of irrelevant data.
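As a toy illustration only, the simplest form of lexical normalisation can be sketched as a lexicon lookup; the token-to-form mappings below are assumed examples for demonstration, not the lexicon or method from my actual research:

```python
# Illustrative lexicon of non-standard tokens and their canonical forms.
# These entries are assumed examples, not a real normalisation lexicon.
NORM_LEXICON = {
    "4eva": "forever",
    "u": "you",
    "2moro": "tomorrow",
}

def normalise(tweet: str) -> str:
    """Replace known non-standard tokens with their canonical forms.

    Tokens not found in the lexicon are passed through unchanged.
    """
    return " ".join(NORM_LEXICON.get(tok.lower(), tok) for tok in tweet.split())

print(normalise("see u 2moro"))  # -> "see you tomorrow"
```

Real systems go well beyond a static lookup (handling context, unseen variants, and candidate ranking), but the sketch shows the basic input-output contract of the task.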
From time to time, I also code side data science projects. These projects are like Unix utilities, each dedicated to solving a particular information need. I found them useful when I bought my property and car.
Cars (in Australia)