I currently TA for an undergraduate Issues in Computer Science class, and during one of the lectures, the professor talked about a computer's ability to create art and music—which, one might suppose, are very "human" skills. Can a computer compose music? Sure...but it won't sound quite right. Right? In class, the professor tried this with two composers: Bach and Joplin. For each of these composers, she had a computer program generate ("compose") a musical piece after it had identified patterns ("learned") from a lot of music from that composer. She then recorded a pianist playing the computer-composed piece, so a piece generated by a Bach-learned program would sound a little something like Bach, and a Joplin-learned one would sound quite a bit like your typical Joplin rag.
I won't bore you with any more details (there are a few more in the actual test itself; you should take it, it takes 7 minutes tops), but the results were interesting. During class, the professor juxtaposed each computer-composed piece with an actual human-composed piece, and then asked the students to pick the real one. The students' performance was terrible: it was clear they couldn't tell the difference between a great like Bach and a computer imitation of Bach. But then I began to wonder: as I took the test, I was able to pick out each of the real compositions. Was this because of my musical background? I decided to give the same test to a larger audience, this time recording their self-reported musical experience.
If you're not interested in the individual results and discussion, the takeaway is that, even among self-reported musicians, only just over 40% of individuals were able to correctly identify the human composer. Yes. Only 40%. In other words, more often than not, people think computers are more "human" composers than Bach or Chopin or Joplin. This is a terrible result! (There's a pretty good description of why this is bad in this RadioLab episode, at about 16:26.) I'll go into it more in the Analysis section below.
Population Distribution
For those of you interested in the questions and population distribution, I've been updating the values here. For your information, the quiz consisted of 6 questions: three background, and three with the paired audio samples. For the first three questions, here are the number of respondents (total of 247) in each category:
- What is your major field of work? The responses were free-form and varied from math to computers to culinary arts to music to motorcycle repair. I haven't looked at any of these responses past collecting answers.
- Do you play any instruments? This was multiple choice, divided into 4 categories of increasing "musicality." Here are the number of respondents that self-selected each category:
- No: 28
- Yes, but I've picked it up on my own without lessons: 27
- Yes, and I've had official lessons for at least one year: 131
- Yes, I would consider myself a professional musician: 61
- What is your technical musical background? This was also multiple choice with 4 categories, and had the following number of respondents:
- None: 38
- I know some musical theory, but I've picked it up on my own: 42
- I've taken a class or lessons in musical theory: 124
- I'm a professional musician or majored in music theory: 43
Categorical Results by Response
And here is the accuracy on the audio questions, broken down by response to each of the two multiple-choice background questions (category 1 being the least musical, 4 the most; the free-form major question isn't broken down).

Second question (instrument experience), accuracy by response:
1 - 27.4%
2 - 40.7%
3 - 42.0%
4 - 50.3%
Third question (music theory background), accuracy by response:
1 - 32.5%
2 - 42.9%
3 - 40.9%
4 - 54.3%
Average total accuracy:
42.2%
The overall accuracy for all three composers was 42.2%. In other words, only 42 times out of 100 were people able to correctly identify the human composer. There probably aren't quite enough samples to show statistical significance, and the survey process wasn't exactly rigorous (I didn't prevent anyone from cheating; they could take the survey as many times as they wanted), but there does seem to be a general trend: accuracy increases with musical experience.
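As a rough sanity check on the significance question, here's a sketch of a two-sided normal-approximation test of the pooled accuracy against the 50% chance level. The function name and the pooling are mine, and there's a big caveat: the three answers from one respondent aren't independent, so treating 247 × 3 ≈ 741 answers as separate trials overstates the effective sample size. Treat this as an upper bound on confidence, not real science.

```python
import math

def proportion_z_test(successes, trials, p0=0.5):
    """Two-sided z-test of a binomial proportion against p0 (normal approximation)."""
    p_hat = successes / trials
    se = math.sqrt(p0 * (1 - p0) / trials)  # standard error under the null
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value

# 247 respondents x 3 questions = 741 answers; 42.2% accuracy is about 313 correct
z, p = proportion_z_test(313, 741)
```

With 313 of 741 correct, z comes out around -4, which would be highly significant under the (false) independence assumption; the per-respondent correlation is exactly why a proper analysis would need the raw data.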
This is what I expected.
Almost.
On closer inspection, all of this is actually quite disconcerting. With the sole exception of the self-described professional musicians, people think that the computer composer actually sounds more human. How could this be possible? Well, I can think of a couple of possibilities.
First, it's possible that there was a hidden variable I wasn't controlling for. I had my roommate take the quiz first, and he got them all right. When I asked him how, he said he just picked the better recording each time. Oops. When I designed the quiz, I already had the mp3 files for the computer compositions, but I was lazy with the real ones and just ripped the audio off YouTube clips. I went back and purchased the music off iTunes so the recordings would sound more similar. When I sent the survey out to a larger audience, there were more issues: it turns out iTunes adds album artwork, which some browsers displayed while the music played. Oops again. This was a little harder to fix, but with some online conversion and VLC magic, I finally cleared the files of all their extraneous information. Perhaps there's something more I'm not controlling for; if this were real science, I'd want the same performer to play each piece, recorded with the same device. Good thing this isn't real science, right?
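For what it's worth, clearing the extraneous metadata can also be done programmatically rather than through conversion tools. Here's a minimal sketch in pure Python (the function names are mine) that strips the ID3v2 tag at the front of an mp3, which is where embedded album artwork lives, and the legacy ID3v1 tag at the end. Real files can have quirks this ignores, such as tags appended at the end of the file in ID3v2.4:

```python
def syncsafe_to_int(b: bytes) -> int:
    """Decode an ID3v2 'syncsafe' integer (7 useful bits per byte)."""
    n = 0
    for byte in b:
        n = (n << 7) | (byte & 0x7F)
    return n

def strip_id3(data: bytes) -> bytes:
    """Return mp3 bytes with ID3v2 (front) and ID3v1 (back) tags removed."""
    # ID3v2: 10-byte header = "ID3" + version (2 bytes) + flags (1) + syncsafe size (4)
    if data[:3] == b"ID3" and len(data) >= 10:
        size = 10 + syncsafe_to_int(data[6:10])
        if data[5] & 0x10:  # footer flag: 10 extra bytes after the tag body
            size += 10
        data = data[size:]
    # ID3v1: fixed 128-byte block at the very end, starting with "TAG"
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]
    return data
```

The audio frames themselves are untouched, so the stripped file still plays; only the labeling (and artwork) disappears.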
But there's another possibility that seems more likely. When the computer "composed" a piece in the style of Bach, the professor wanted it to sound as much like Bach as possible, so it deliberately reused bits and pieces of actual works. But when Bach really wrote his pieces, his goal was to create something new and fresh, something that didn't sound like anything he'd written before; something that people would clearly identify as Bach, or would at least recognize as his when they heard it. All of his similarities to previous works were subconscious: influences from his schooling, the music he enjoyed listening to, even some of the music he had previously composed.
Conclusion
If these numbers hold up, humans think that a computer trying to imitate a Bach piece actually sounds more like Bach than the composer himself. Maybe this is bad (computers will eventually replace composers and artists and singers and all the other "human" fields), but maybe, instead, it identifies an aspect of talented human creators: the ability to create something truly unique.
Note 1: As people take this test, I'll update the results on this page. Hopefully I'll keep all the numbers consistent so everything still makes sense. Also, if you want some more statistics, I can try to get them to you; the data is pretty well anonymized, so I could probably just give it to you directly.
Note 2: The first numbers were published on 23 Nov 2013 with 65 respondents and 42.6% accuracy. The results were updated on 26 Nov 2013 with 99 respondents and 40.7% accuracy. They were updated again on 11 Dec 2013 with 201 respondents and 42.8% accuracy. They were updated on 12 Sep 2014 with 247 respondents and 42.2% accuracy.
I just took the test and got them all right. This surprised me; I was only really confident in the Joplin. I think a listener's ability to identify the composer probably has a lot to do with what kind of music they listen to.
I used to listen to a lot of jazz and ragtime piano, so because the roboJoplin sounded a bit stale and didn't have anything clever (like the little out-of-tune slips the second one had), it was a sure thing.
Most of the Chopin I've listened to was nocturnes, so initially the first recording sounded right, but because the second one changed moods so much while maintaining a consistent composition, I liked it more, and ended up choosing it.
The Bach... well, I don't like this type of Bach. I think I might have chosen the first one just because it was first. I liked the first one more than the second, but I don't know why. Maybe because it was first. They were both pretty dull. I think it was the closest call because Bach liked to form mathematical patterns in his music, and would base pieces around those patterns rather than around an emotion the way Chopin did, or around entertainment the way Joplin did. I dunno, maybe I'm just getting over-speculative though.
But yeah, I think keeping track of the musical taste of the listener is important. Musical training is kind of similar, but not all musicians enjoy classical or ragtime. I would also advocate asking about genres, were it not for the time involved and for how thinly it would spread the results.
I'd look at http://www.drbunsen.org/coffee-experiments/ for casual experiment design
Thanks for your comments. There are a ton of other questions I could have asked: age might have a lot to do with it (young people tend to listen to less classical music than older people), and I would have liked to ask whether respondents had heard the piece before (I recognized the Bach piece, so that one was easy for me). Musical taste might have been interesting as well, but that's so varied it would have required free-form input and wouldn't have been easily categorized.
Mostly, I wanted to issue a quick survey to my friends, and I knew most people wouldn't take a survey if they saw tons of questions, so I wanted to keep it simple. I do think there are several interesting things in this survey that could be explored further if/when I ever do something like this again.
And thanks for explaining how you could tell which ones were real. That's what was most interesting to me: for the ones who got them right, what was the deciding factor? So thanks for your input.