md`### Error diagnoses (left is the model output, right is "correct"; note the output is garbage for failed matches)
${errors[0].index}: Minor: whitespace ignored by model (& matching)
${diff(errors[0].output.out, errors[0].target)}
${errors[1].index}: Incorrect end line given by the model (one line early) -> match fail
${diff(errors[1].output.out, errors[1].target)}
${errors[2].index}: Incorrect starting and ending lines given by the model (one line early) -> match fail
${diff(errors[2].output.out, errors[2].target)}
${errors[3].index}: Incorrect occurrence index given by the model
${diff(errors[3].output.out, errors[3].target)}
${errors[4].index}: Incorrect end line given by the model (one line early) -> match fail; also, the model's output was missing a comment present in the source code
${diff(errors[4].output.out, errors[4].target)}
${errors[5].index}: Gave the old snippet as the new one -> match fail
${diff(errors[5].output.out, errors[5].target)}
${errors[6].index}: Minor: whitespace ignored by model (& matching)
${diff(errors[6].output.out, errors[6].target)}
${errors[7].index}: Incorrect end line given by the model (one line early) -> match fail
${diff(errors[7].output.out, errors[7].target)}
${errors[8].index}: Incorrect occurrence index given by the model [reveals a bug in our counting method]
${diff(errors[8].output.out, errors[8].target)}
${errors[9].index}: Incorrect end line given by the model (one line early) -> match fail
${diff(errors[9].output.out, errors[9].target)}
${errors[10].index}: Incorrect end line given by the model (one line early) -> match fail
${diff(errors[10].output.out, errors[10].target)}
### Notes
* In every case but one (where the model gave text from the input file rather than the updated file), the model's output for the analogous snippet was correct.
* 10/11 of the errors were due to whitespace or line number identification issues.
* Of the 6 line number errors, 5 were due to misidentification of the end of a block (guessing it was 1 line earlier)
* 1 of the line number errors was an early start.
* The model's "nth occurrence" counting, done as part of a larger question, seems unreliable
### Adjustments
* (6) If there's no match, expanding the section by one line on either side solves all of the line number errors in this test (sketched below)
* (2) Ignore the 2 whitespace errors; we don't care about these for now
* (2) For incorrect output, we can try emphasizing that the text must come from the UPDATED file (without line numbers) and must include any comments. (Add to 1: the text MUST be in UPDATED, and must also include any comments that were in UPDATED.)
* (2) nthOccurrence: show the model (the same or a separate one) the options with some or all of their surrounding context, and ask it which one to highlight (also sketched below)
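
A minimal sketch of the (6) retry, assuming the matcher receives the model's claimed start/end lines along with its snippet (function names here are illustrative, not part of our pipeline):

    // Try to locate the snippet inside the model's claimed line range first, then inside
    // a range widened by one line on either side (covers the "end one line early" failures).
    function locateSnippet(fileLines, start, end, snippetLines) {
      for (const pad of [0, 1]) {
        const s = Math.max(1, start - pad);
        const e = Math.min(fileLines.length, end + pad);
        const hit = findRun(fileLines.slice(s - 1, e), snippetLines);
        if (hit >= 0) return { start: s + hit, end: s + hit + snippetLines.length - 1 };
      }
      return null; // still no match: fall back to fuzzy search or a second model request
    }

    // Index of the first contiguous run of lines in haystack equal to needle, or -1.
    // Lines are trimmed before comparing, which also absorbs the whitespace-only mismatches.
    function findRun(haystack, needle) {
      for (let i = 0; i + needle.length <= haystack.length; i++) {
        if (needle.every((line, j) => line.trim() === haystack[i + j].trim())) return i;
      }
      return -1;
    }
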
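
A sketch of the nthOccurrence follow-up request; askModel is a hypothetical helper that sends one prompt and returns the reply text, and the prompt wording is only illustrative:

    // List every occurrence of the needle with two lines of context on each side,
    // then ask a model (the same or a separate one) which occurrence to highlight.
    async function chooseOccurrence(fileLines, needle, askModel) {
      const NL = String.fromCharCode(10); // newline
      const options = [];
      fileLines.forEach((line, i) => {
        if (line.includes(needle)) {
          options.push({ line: i + 1, context: fileLines.slice(Math.max(0, i - 2), i + 3) });
        }
      });
      if (options.length <= 1) return options[0] || null;
      const prompt = [
        "The snippet appears " + options.length + " times in the file.",
        "Reply with only the number of the occurrence that should be highlighted.",
      ]
        .concat(options.map((o, n) => "(" + (n + 1) + ") line " + o.line + ":" + NL + o.context.join(NL)))
        .join(NL + NL);
      const reply = await askModel(prompt);
      return options[parseInt(reply, 10) - 1] || null;
    }
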
### Expected improvement
We should be able to easily solve or ignore about 8 of the 11 errors, which puts accuracy at around 96%. The remaining 3 involve:
1. matching text from a long section, where the model's output may be missing pieces of the actual code
2. matching the nth occurrence of a snippet within a line

Both of these issues should be solvable with fuzzy searching or, more simply, with another request to the model.
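
A sketch of the fuzzy-search option, scoring candidate windows by how many of the snippet's lines they contain (the 0.8 threshold is a guess to be tuned against the error set):

    // Slide a window over the file and keep the best-scoring range; letting the window
    // be slightly larger than the snippet tolerates missing pieces (e.g. a dropped
    // comment) in the model's output.
    function fuzzyFind(fileLines, snippetLines) {
      const target = snippetLines.map((l) => l.trim()).filter((l) => l.length > 0);
      if (target.length === 0) return null;
      let best = { score: 0, start: -1, end: -1 };
      for (let size = target.length; size <= target.length + 2; size++) {
        for (let s = 0; s + size <= fileLines.length; s++) {
          const candidate = fileLines.slice(s, s + size).map((l) => l.trim());
          const hits = target.filter((l) => candidate.includes(l)).length;
          const score = hits / target.length;
          if (score > best.score) best = { score: score, start: s + 1, end: s + size };
        }
      }
      return best.score >= 0.8 ? best : null;
    }
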
### Overall
We did not find evidence on this dataset that the model is incapable of understanding the semantics of code. That said, the structures exercised by the benchmark are randomly generated and intended to represent ordinary refactorings, so they may not pose the most serious possible challenges for retagging.
We did notice that the model tends to make mistakes around matching closing braces when referencing line numbers, and it may need tooling to help it "see" matched brackets the way a human does. In our experience it tended to choose the earlier brace, possibly because our prompt asks it to BE CAREFUL TO INCLUDE NOTHING EXTRA. When producing code, it did not have this problem.
`