Is ChatGPT a Good Sentiment Analyzer? A Preliminary Study [arXiv:2304.04339]
In this repo, we release the test sets we used for evaluation in our paper.
Recently, ChatGPT has drawn great attention from both the research community and the public. However, despite its huge success, we still know little about its capability boundaries, i.e., where it does well and where it fails. We are particularly curious about how ChatGPT performs on sentiment analysis tasks: can it really understand the opinions, sentiments, and emotions contained in text?
To answer this question, we conduct a preliminary evaluation on 5 representative sentiment analysis tasks and 18 benchmark datasets, covering four different settings: standard evaluation, polarity shift evaluation, open-domain evaluation, and sentiment inference evaluation. We compare ChatGPT with fine-tuned BERT-based models and the corresponding SOTA models on each task for reference.
Through rigorous evaluation, our findings are as follows:
- ChatGPT exhibits impressive zero-shot performance in sentiment classification tasks and can rival fine-tuned BERT, although it falls slightly behind the domain-specific fully-supervised SOTA models.
- ChatGPT appears to be less accurate on sentiment information extraction tasks such as E2E-ABSA. Upon observation, we find that ChatGPT is often able to generate reasonable answers, even though they may not strictly match the textual expression. From this point of view, the exact matching evaluation in information extraction is not very fair for ChatGPT. In our human evaluation, ChatGPT can still perform well in these tasks.
- Few-shot prompting (i.e., equipping with a few demonstration examples in the input) can significantly improve performance across various tasks, datasets, and domains, even surpassing fine-tuned BERT in some cases but still being inferior to SOTA models.
- When coping with the polarity shift phenomenon (e.g., negation and speculation), a challenging problem in sentiment analysis, ChatGPT can make more accurate predictions than fine-tuned BERT.
- Compared with the conventional practice of training domain-specific models, which typically perform poorly when generalized to unseen domains, ChatGPT demonstrates strong open-domain sentiment analysis ability in general, although its performance remains quite limited in a few specific domains.
- ChatGPT exhibits impressive sentiment inference ability, achieving performance comparable to the fully-supervised SOTA models we set up on the emotion cause extraction (ECE) task or the emotion-cause pair extraction (ECPE) task.
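
To make the point about exact matching in the E2E-ABSA finding more concrete, here is a minimal sketch of a strict exact-match F1 over (aspect term, polarity) pairs. It is an illustration only, not the evaluation script used in the paper: the function names, the lowercasing normalization, and the example sentence pairs are all assumptions, and benchmark scores are usually aggregated over a whole test set rather than computed per sentence.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: one (aspect term, sentiment polarity) pair.
Pair = Tuple[str, str]


def normalize(pairs: Iterable[Pair]) -> Set[Pair]:
    """Lowercase and strip both fields so only surface wording matters."""
    return {(aspect.strip().lower(), polarity.strip().lower())
            for aspect, polarity in pairs}


def exact_match_f1(gold: Iterable[Pair], pred: Iterable[Pair]) -> float:
    """F1 where a predicted pair counts only if it equals a gold pair exactly."""
    gold_set, pred_set = normalize(gold), normalize(pred)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the prediction "battery" is reasonable, but it does not
# match the gold span "battery life", so it is scored as an error.
gold = [("battery life", "positive"), ("screen", "negative")]
pred = [("battery", "positive"), ("screen", "negative")]
score = exact_match_f1(gold, pred)
print(f"exact-match F1: {score:.2f}")
```

Under this strict criterion, a prediction that a human reader would likely accept is still penalized, which is why a human evaluation can paint a different picture from the automatic scores.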
In summary, compared to training a specialized sentiment analysis system for each domain or dataset, ChatGPT can already serve as a universal and well-behaved sentiment analyzer.
If you find this work helpful, please cite our paper as follows:
@inproceedings{wang2024is,
title={Is Chat{GPT} a Good Sentiment Analyzer?},
author={Zengzhi Wang and Qiming Xie and Yi Feng and Zixiang Ding and Zinong Yang and Rui Xia},
booktitle={First Conference on Language Modeling},
year={2024},
url={https://openreview.net/forum?id=mUlLf50Y6H}
}
If you have any questions related to this work, you can open an issue with details or feel free to email Zengzhi ([email protected]) or Qiming ([email protected]).
Human Evaluation (still in zero-shot)
We choose the ECE and ECPE tasks as the testbed.
Note that, for both ECE and ECPE, the right part is the English translation of the left part.