Public AI Training Datasets Are Rife With Licensing Errors
Large language models feed on big data from publicly available training sets, but most of those sets are of doubtful legal status. The scope of the problem has been demonstrated by the newly launched Data Provenance Initiative, which brings together a multi-institutional team of machine-learning and legal experts led by researchers…