TY - GEN
T1 - Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research
AU - Tari, Henry
AU - Khan, M. Danial
AU - Rutten, Justus
AU - Othman, Darian
AU - Bertaglia, Thales
AU - Kaushal, Rishabh
AU - Iamnitchi, Adriana
N1 - Publisher Copyright: © 2024 ACM.
PY - 2024/9/10
Y1 - 2024/9/10
N2 - Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. However, access to these datasets is often restricted due to costs and platform regulations. As such, acquiring datasets that span multiple platforms, which are crucial for a comprehensive understanding of the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real datasets. We employ ChatGPT to generate synthetic data from a real dataset consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings suggest that using large language models to generate synthetic multi-platform social media data is promising. However, further enhancements are necessary to improve the fidelity of the outputs.
KW - LLMs
KW - Social Media Research
KW - Synthetic Data
UR - http://www.scopus.com/inward/record.url?scp=85204874536&partnerID=8YFLogxK
U2 - 10.1145/3648188.3675153
DO - 10.1145/3648188.3675153
M3 - Conference contribution
AN - SCOPUS:85204874536
T3 - HT 2024: Creative Intelligence - 35th ACM Conference on Hypertext and Social Media
SP - 337
EP - 343
BT - HT 2024
PB - Association for Computing Machinery
T2 - 35th ACM Conference on Hypertext and Social Media, HT 2024
Y2 - 10 September 2024 through 13 September 2024
ER -