Are there any public test data sets of name corner cases? Given the popular "falsehoods programmers believe" lists, someone could create a public data set of unsanitized name inputs, expected decompositions, and expected round-trip results. I think genealogy organizations have published de facto standards for name formatting.
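
To make the idea concrete, here is a rough sketch of what one record in such a data set might look like, plus a tiny harness that runs a parser/formatter pair against it. Everything here is an assumption for illustration: the `NameCase` fields, the `given`/`family` keys, the two sample entries, and the deliberately naive parser are all hypothetical and not taken from any existing standard or data set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameCase:
    """One hypothetical test record: input, expected parts, expected output."""
    raw: str         # unsanitized input, exactly as a user might type it
    parts: dict      # expected decomposition (key names are an assumption)
    round_trip: str  # expected string after parse -> format


def check_case(case, parse, fmt):
    """Return True if the parser/formatter under test satisfies one case."""
    parts = parse(case.raw)
    return parts == case.parts and fmt(parts) == case.round_trip


# Illustrative entries only; a real data set would need sourced expectations.
CASES = [
    NameCase("  Ludwig van Beethoven ",
             {"given": "Ludwig", "family": "van Beethoven"},
             "Ludwig van Beethoven"),
    NameCase("Madonna", {"given": "Madonna"}, "Madonna"),
]


def naive_parse(raw):
    """Deliberately naive: treats the last whitespace-separated token as family."""
    tokens = raw.split()
    if not tokens:
        return {}
    if len(tokens) == 1:
        return {"given": tokens[0]}
    return {"given": " ".join(tokens[:-1]), "family": tokens[-1]}


def naive_format(parts):
    """Join whichever of the assumed keys are present, given name first."""
    return " ".join(parts[k] for k in ("given", "family") if k in parts)


results = [check_case(c, naive_parse, naive_format) for c in CASES]
```

The naive parser handles the mononym but fails the first case, because it puts "van" in the given name. Surfacing that kind of mismatch is what a shared data set would be for.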