md`### Error diagnoses (left is the model output, right is "correct"; note the output is garbage for failed matches)
${errors[0].index}: Minor: whitespace ignored by model (& matching)
${diff(errors[0].output.out, errors[0].target)}
${errors[1].index}: Incorrect end line given by the model (one line early) -> match fail
${diff(errors[1].output.out, errors[1].target)}
${errors[2].index}: Incorrect starting and ending lines given by the model (one line early) -> match fail
${diff(errors[2].output.out, errors[2].target)}
${errors[3].index}: Incorrect occurrence index given by the model
${diff(errors[3].output.out, errors[3].target)}
${errors[4].index}: Incorrect end line given by the model (one line early) -> match fail; also, the model's output was missing a comment present in the source code
${diff(errors[4].output.out, errors[4].target)}
${errors[5].index}: Gave the old snippet as the new one -> match fail
${diff(errors[5].output.out, errors[5].target)}
${errors[6].index}: Minor: whitespace ignored by model (& matching)
${diff(errors[6].output.out, errors[6].target)}
${errors[7].index}: Incorrect end line given by the model (one line early) -> match fail
${diff(errors[7].output.out, errors[7].target)}
${errors[8].index}: Incorrect occurrence index given by the model [reveals a bug in our counting method]
${diff(errors[8].output.out, errors[8].target)}
${errors[9].index}: Incorrect end line given by the model (one line early) -> match fail
${diff(errors[9].output.out, errors[9].target)}
${errors[10].index}: Incorrect end line given by the model (one line early) -> match fail
${diff(errors[10].output.out, errors[10].target)}
### Notes
* In every case but one (where the model gave text from the input file rather than the updated file), the model's output for the analogous snippet was correct.
* 10/11 of the errors were due to whitespace or line number identification issues.
* Of the 6 line number errors, 5 were due to misidentification of the end of a block (guessing it was 1 line earlier)
* 1 of the line number errors was an early start.
* The model's "nth occurrence" counting, done as part of a larger question, seems unreliable
### Adjustments
* (6) If there's no match, expanding the section by one line on either side solves all of the line number errors in this test (sketched below)
* (2) Ignore the 2 whitespace errors; we don't care about these for now
* (2) For incorrect output, we can try emphasizing that the text must come from the UPDATED file (without line numbers) and must include any comments. (Add to 1: the text MUST be in UPDATED, and must also include any comments that were in UPDATED.)
* (2) nthOccurrence: show the model (the same or a separate one) the options with some or all of their surrounding context, and ask it which one to highlight (also sketched below)
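
A minimal sketch of the (6) retry, assuming the matcher receives the model's claimed start/end lines along with its snippet (function names here are illustrative, not part of our pipeline):

    // Try to locate the snippet inside the model's claimed line range first, then inside
    // a range widened by one line on either side (covers the "end one line early" failures).
    function locateSnippet(fileLines, start, end, snippetLines) {
      for (const pad of [0, 1]) {
        const s = Math.max(1, start - pad);
        const e = Math.min(fileLines.length, end + pad);
        const hit = findRun(fileLines.slice(s - 1, e), snippetLines);
        if (hit >= 0) return { start: s + hit, end: s + hit + snippetLines.length - 1 };
      }
      return null; // still no match: fall back to fuzzy search or a second model request
    }

    // Index of the first contiguous run of lines in haystack equal to needle, or -1.
    // Lines are trimmed before comparing, which also absorbs the whitespace-only mismatches.
    function findRun(haystack, needle) {
      for (let i = 0; i + needle.length <= haystack.length; i++) {
        if (needle.every((line, j) => line.trim() === haystack[i + j].trim())) return i;
      }
      return -1;
    }
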
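
A sketch of the nthOccurrence follow-up request; askModel is a hypothetical helper that sends one prompt and returns the reply text, and the prompt wording is only illustrative:

    // List every occurrence of the needle with two lines of context on each side,
    // then ask a model (the same or a separate one) which occurrence to highlight.
    async function chooseOccurrence(fileLines, needle, askModel) {
      const NL = String.fromCharCode(10); // newline
      const options = [];
      fileLines.forEach((line, i) => {
        if (line.includes(needle)) {
          options.push({ line: i + 1, context: fileLines.slice(Math.max(0, i - 2), i + 3) });
        }
      });
      if (options.length <= 1) return options[0] || null;
      const prompt = [
        "The snippet appears " + options.length + " times in the file.",
        "Reply with only the number of the occurrence that should be highlighted.",
      ]
        .concat(options.map((o, n) => "(" + (n + 1) + ") line " + o.line + ":" + NL + o.context.join(NL)))
        .join(NL + NL);
      const reply = await askModel(prompt);
      return options[parseInt(reply, 10) - 1] || null;
    }
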
### Expected improvement
We should be able to easily solve or ignore about 8 of the 11 errors, which puts accuracy at around 96%. The remaining 3 involve:
1. matching text from a long section, where the model's output may be missing pieces of the actual code
2. matching the nth occurrence of a snippet within a line

Both of these issues should be solvable with fuzzy searching or, more simply, with another request to the model.
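
A sketch of the fuzzy-search option, scoring candidate windows by how many of the snippet's lines they contain (the 0.8 threshold is a guess to be tuned against the error set):

    // Slide a window over the file and keep the best-scoring range; letting the window
    // be slightly larger than the snippet tolerates missing pieces (e.g. a dropped
    // comment) in the model's output.
    function fuzzyFind(fileLines, snippetLines) {
      const target = snippetLines.map((l) => l.trim()).filter((l) => l.length > 0);
      if (target.length === 0) return null;
      let best = { score: 0, start: -1, end: -1 };
      for (let size = target.length; size <= target.length + 2; size++) {
        for (let s = 0; s + size <= fileLines.length; s++) {
          const candidate = fileLines.slice(s, s + size).map((l) => l.trim());
          const hits = target.filter((l) => candidate.includes(l)).length;
          const score = hits / target.length;
          if (score > best.score) best = { score: score, start: s + 1, end: s + size };
        }
      }
      return best.score >= 0.8 ? best : null;
    }
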
### Overall
We did not find evidence on this dataset that the model is incapable of understanding the semantics of code. That said, the structures exercised by the benchmark are randomly generated and intended to represent ordinary refactorings, so they may not pose the most serious possible challenges for retagging.
We did notice that the model tends to make mistakes around matching closing braces when referencing line numbers, and it may need tooling to help it "see" matched brackets the way a human does. In our experience it tended to choose the earlier brace, possibly because our prompt asks it to BE CAREFUL TO INCLUDE NOTHING EXTRA. When producing code, it did not have this problem.
`