Edited
Mar 1, 2024
results.length
results.filter(r => r.output.out == r.target).length
results.reduce((acc, r) => acc + r.time, 0)/results.length // milliseconds
md`### Error diagnoses (left is the model output, right is "correct"; note the output is garbage for failed matches)
${errors[0].index}: Minor: whitespace ignored by model (& matching)
${diff(errors[0].output.out, errors[0].target)}

${errors[1].index}: Incorrect end line given by the model (one early) -> match fail
${diff(errors[1].output.out, errors[1].target)}

${errors[2].index}: Incorrect starting + ending line given by the model (one early) -> match fail
${diff(errors[2].output.out, errors[2].target)}

${errors[3].index}: Incorrect occurrence index given by the model
${diff(errors[3].output.out, errors[3].target)}

${errors[4].index}: Incorrect end line given by the model (one early) -> match fail; also, model output was missing a comment in the source code
${diff(errors[4].output.out, errors[4].target)}

${errors[5].index}: Gave old snippet as new -> match fail
${diff(errors[5].output.out, errors[5].target)}

${errors[6].index}: Minor: whitespace ignored by model (& matching)
${diff(errors[6].output.out, errors[6].target)}

${errors[7].index}: Incorrect end line given by the model (one early) -> match fail
${diff(errors[7].output.out, errors[7].target)}

${errors[8].index}: Incorrect occurrence index given by the model [reveals a bug in our counting method]
${diff(errors[8].output.out, errors[8].target)}

${errors[9].index}: Incorrect end line given by the model (one early) -> match fail
${diff(errors[9].output.out, errors[9].target)}

${errors[10].index}: Incorrect end line given by the model (one early) -> match fail
${diff(errors[10].output.out, errors[10].target)}

### Notes
* In every case but 1 (where it gave output from the input file), the model's output for the analogous snippet was correct.
* 10/11 of the errors were due to whitespace, line number, or occurrence index identification issues.
* Of the 6 line number errors, 5 were due to misidentification of the end of a block (guessing it was 1 line earlier)
* 1 of the line number errors was an early start.
* The model's "nth occurrence" counting, when asked for as part of a larger question, seems unreliable.

### Adjustments
* (6) If there's no match, expand the section by one line on either side and retry; this solves all line number errors in this test.
* (2) Ignore the 2 whitespace errors; we don't care about these for now.
* (2) For incorrect output, emphasize in the prompt that the text must come from the UPDATED file (without line numbers) and must include any comments. (Add to 1: "The text MUST be in UPDATED, and must also include any comments that were in UPDATED.")
* (2) nthOccurrence: show a model the candidate occurrences with some or all of their context, and ask it to choose which one to highlight.

### Expected improvement
We should be able to easily solve or ignore about 8/11 errors. This gives us an accuracy around 96%. The other 3 involve:

1. matching text from a long section that may have missing pieces to the actual code
2. matching nth occurrence in a line

Both of these issues should be solvable with fuzzy searching or more simply with another request to the model.

### Overall
We did not find evidence on this dataset that the model is incapable of understanding the semantics of code. However, the structures exercised by the benchmark are random and intended to represent normal refactorings, so they may not represent the most serious possible challenges for retagging.

We did notice that the model tends to make mistakes around matching closing braces when referencing line numbers, and may need tools to help it "see" matched brackets as a human would. In our experience, it tended to choose earlier braces, possibly due to our prompt asking it to BE CAREFUL TO INCLUDE NOTHING EXTRA. When producing code, it did not have this problem.
`
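One way to help with the closing-brace mistakes noted above would be a small bracket matcher that reports where a block really ends. The sketch below is only an illustration: `findClosingBraceLine` is a hypothetical helper, not part of this notebook or its imports, and it assumes braces inside strings, comments, and regex literals can be ignored.

```javascript
// Hypothetical helper: return the 1-based line number of the "}" that closes
// the first "{" found on or after `startLine`, or null if there is none.
// Assumption: braces inside strings, comments, and regex literals are NOT
// skipped, so this is only a rough aid.
function findClosingBraceLine(code, startLine) {
  const lines = code.split("\n");
  let depth = 0;
  let opened = false;
  for (let i = Math.max(0, startLine - 1); i < lines.length; i++) {
    for (const ch of lines[i]) {
      if (ch === "{") {
        depth++;
        opened = true;
      } else if (ch === "}" && opened) {
        depth--;
        if (depth === 0) return i + 1;
      }
    }
  }
  return null; // no opening brace found, or it is never closed
}

// Made-up sample code for illustration.
const sample = [
  "function f(x) {",
  "  if (x) {",
  "    return 1;",
  "  }",
  "  return 0;",
  "}",
].join("\n");

console.log(findClosingBraceLine(sample, 1)); // outer block
console.log(findClosingBraceLine(sample, 2)); // inner block
```

A result like this could be compared with the end line the model reports, or included in the prompt as a hint.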
results.map(r => r.output.gptRetaggingJSON[4]).filter(v => v !== 1)
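For the nthOccurrence adjustment, the candidates could be enumerated in code and shown with context, so that the choice is made among numbered options instead of by counting. This is a minimal sketch: `listOccurrences` is a hypothetical helper, and it assumes occurrences are counted over the whole text, which may differ from how the notebook's own counting works.

```javascript
// Hypothetical helper: list every occurrence of `snippet` in `code` with a
// little surrounding context, numbered from 1.
// Assumption: occurrences are counted over the whole text, overlaps included.
function listOccurrences(code, snippet, contextChars = 20) {
  const found = [];
  if (snippet.length === 0) return found;
  let from = 0;
  while (true) {
    const at = code.indexOf(snippet, from);
    if (at === -1) break;
    found.push({
      nthOccurrence: found.length + 1,
      line: code.slice(0, at).split("\n").length, // 1-based line number
      context: code.slice(Math.max(0, at - contextChars), at + snippet.length + contextChars),
    });
    from = at + 1; // advancing by one character also finds overlapping matches
  }
  return found;
}

// Made-up sample code for illustration.
const code = "let x = 1;\nx = x + x;\nreturn x;";
const options = listOccurrences(code, "x", 4);
console.log(options);
```

Each entry could then be rendered as one option in a follow-up request.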
md`Talk about:
* handling weird cases
* "please respect expression boundaries"
* statements with syntactic meaning
* constraining phrases
* semantic selector in discussion
`
import {train, benchmark, excluded} from "4ac2191d97f52603"
import {retagUpdate, prompt_breakdown10, prompt_breakdown11, computeUpdatedCodeWithSnippetRetagged} from "ac71f9212cd0fff6"
// Resolve with 'done' after `ms` milliseconds.
async function wait(ms) {
  return new Promise((resolve) => setTimeout(() => resolve('done'), ms))
}
wait(1000)
results = {
  // Reuse cached results when available instead of re-querying the model.
  if (resultsCache) return resultsCache
  const results = []
  let i = 1
  for (const t of benchmark) {
    console.log(i)
    // The model sees the updated code with the delimiters stripped out.
    const updatedCodeWithoutDelimiters = t.updatedCodeWithSnippetDelimited.replaceAll(t.delimiter, '')
    const t0 = performance.now()
    const out = await retagUpdate(t.codeWithSnippetDelimited, updatedCodeWithoutDelimiters, t.delimiter)
    const time = performance.now() - t0 // milliseconds
    results.push({
      input: {codeWithSnippetDelimited: t.codeWithSnippetDelimited, updatedCodeWithoutDelimiters, delimiter: t.delimiter},
      output: out,
      target: t.updatedCodeWithSnippetDelimited,
      other: t,
      time,
      index: t.index
    })
    await wait(5000) // pause between requests
    i = i + 1
  }
  return results
}
// excludedResults = {
// //if (resultsCache) {return resultsCache}
// const results = []
// let i = 1
// for (const t of excluded.slice(7,10)) {
// console.log(i)
// const t0 = performance.now()
// const out = await retagUpdate(t.codeWithSnippetDelimited, t.updatedCodeWithSnippetDelimited.replaceAll(t.delimiter, ''), t.delimiter)
// const t1 = performance.now()
// const time = t1 - t0
// results.push({input: {codeWithSnippetDelimited: t.codeWithSnippetDelimited, updatedCodeWithoutDelimiters: t.updatedCodeWithSnippetDelimited.replaceAll(t.delimiter, ''), delimiter: t.delimiter}, output: out, target: t.updatedCodeWithSnippetDelimited, other: t, time, index: t.index})
// await wait(5000)
// i = i + 1
// }
// return results
// }
copy(JSON.stringify(excludedResults))
md`${excludedResults.map(e =>
md`${e.index}
${diff(e.output.out, e.input.codeWithSnippetDelimited)}`
)}`
results
copy(JSON.stringify(results[0].input))
results[0]
results
resultsCache = FileAttachment("results@3.json").json()
80/90
looserResults = results.map(r => ({
  ...r,
  loose: computeUpdatedCodeWithSnippetRetagged({
    code: r.input.updatedCodeWithoutDelimiters,
    snippet: r.output.gptRetaggingJSON[1],
    lineStart: r.output.gptRetaggingJSON[2],
    lineEnd: r.output.gptRetaggingJSON[3],
    nthOccurrence: r.output.gptRetaggingJSON[4],
    delimiterStart: r.input.delimiter,
    delimiterEnd: r.input.delimiter
  })
}))
copy(JSON.stringify(looserResults))
looserErrors = looserResults.filter(r => r.loose != r.target)
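The "expand by one line on either side" adjustment from the notes can be sketched as a retry loop around the range check. `locateWithSlack` below is a hypothetical stand-in written for illustration; it is not the imported `computeUpdatedCodeWithSnippetRetagged`, whose internals are not shown in this notebook.

```javascript
// Hypothetical helper: look for `snippet` inside lines lineStart..lineEnd
// (1-based, inclusive). If it is not there, widen the range by one line on
// either side, up to `slack` extra lines, and try again.
function locateWithSlack(code, snippet, lineStart, lineEnd, slack = 1) {
  const lines = code.split("\n");
  for (let extra = 0; extra <= slack; extra++) {
    const start = Math.max(1, lineStart - extra);
    const end = Math.min(lines.length, lineEnd + extra);
    const region = lines.slice(start - 1, end).join("\n");
    if (region.includes(snippet)) return { lineStart: start, lineEnd: end, extra };
  }
  return null; // still no match after widening
}

// Made-up example: the reported end line is one early, as in most of the
// line number errors above.
const updated = "function f() {\n  return 1;\n}\nf();";
const snippet = "function f() {\n  return 1;\n}";
const located = locateWithSlack(updated, snippet, 1, 2);
console.log(located);
```

Returning `extra` makes it possible to count how often the fallback was needed, which keeps the looser matching visible in the results.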
errors = results.filter(r => r.output.out != r.target)
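To set aside the whitespace-only mismatches when counting errors, the strict comparison could be swapped for a whitespace-insensitive one. This is a sketch under the assumption that whitespace never matters for this benchmark; `equalIgnoringWhitespace`, the sample rows, and the `@@` delimiter are made up for illustration.

```javascript
// Hypothetical helper: compare two strings after removing all whitespace.
// Assumption: whitespace differences never matter here, which is not true in
// general (for example inside string literals).
function equalIgnoringWhitespace(a, b) {
  const strip = (s) => s.replace(/\s+/g, "");
  return strip(a) === strip(b);
}

// Made-up rows in the same shape as `results` (output.out vs. target).
const sampleResults = [
  { index: 1, output: { out: "a = @@b +  c@@" }, target: "a = @@b + c@@" },
  { index: 2, output: { out: "a = @@b@@ + c" }, target: "a = @@b + c@@" },
];

const strictErrors = sampleResults.filter((r) => r.output.out != r.target);
const nonWhitespaceErrors = sampleResults.filter(
  (r) => !equalIgnoringWhitespace(r.output.out, r.target)
);
console.log(strictErrors.length, nonWhitespaceErrors.length);
```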
errors[0].output
r = errors[4]
computeUpdatedCodeWithSnippetRetagged({
  code: r.input.updatedCodeWithoutDelimiters,
  snippet: r.output.gptRetaggingJSON[1],
  lineStart: r.output.gptRetaggingJSON[2],
  lineEnd: r.output.gptRetaggingJSON[3],
  nthOccurrence: r.output.gptRetaggingJSON[4],
  delimiterStart: r.input.delimiter,
  delimiterEnd: r.input.delimiter
})
benchmark[0]
errors[2].output
results1 = FileAttachment("results.json").json()
errors2 = results1.filter(r => r.output != r.target)
variance = results1.filter(r => r.output != results.find(r1 => r1.index === r.other.index).output.out).map(r => r.other.index)
errors[0]
copy(prompt_breakdown11({
  codeWithSnippetDelimited: errors[4].input.codeWithSnippetDelimited,
  updatedCodeWithSnippetDelimited: errors[4].input.updatedCodeWithoutDelimiters,
  delimiter: errors[4].input.delimiter
}))
errors.filter(r => r.index == 30)[0].output
md`${looserErrors.map(e =>
md`${e.index}
${diff(e.loose, e.target)}`
)}`
md`${errors.map(e =>
md`${e.index}
${diff(e.output.out, e.target)}`
)}`
import {copy, asyncCopy} from "@ryanseddon/copy"
import {diff} from "@jobleonard/diff"