extract_references.py
from evaluator import *

DESCRIPTION = "Test if the model can extract paper titles from a block of text."
TAGS = ['code', 'python']
question = '''Extract a list of the titles of the papers from the following list of references.
Start your response with:
```json
[title_1, title_2, ...]
```
Here's the block of text:
A Suffix Arrays [45] SHOKRI, R., STRONATI, M., SONG, C., AND
A suffix of length k of a string x are the last k characters (or, SHMATIKOV, V. Membership inference attacks against
tokens) of this string, i.e,. x[−k:] machine learning models. In IEEE Symposium on
. If we want to know: “was Security and Privacy (2017).
0 100 200 300 [46] SOLDAINI, L. AI2 Dolma: 3 trillion token open corpus
length of k-gram for language model pretraining, 2023.
104 [47] SOMEPALLI, G., SINGLA, V., GOLDBLUM, M., GEIPING, J., AND GOLDSTEIN, T. Diffusion art or digital
105 forgery? Investigating data replication in diffusion models. In CVPR (2023).
106 [48] SOUTHWOOD, T. R. E., AND HENDERSON, P. A. Ecological methods. John Wiley & Sons, 2009.
# generated kgrams [49] TOUVRON, H., LAVRIL, T., IZACARD, G., MARTINET, X., LACHAUX, M.-A., LACROIX, T., ROZIÈRE, B., GOYAL,
in training data N., HAMBRO, E., AZHAR, F., RODRIGUEZ, A., JOULIN, A., GRAVE, E., AND LAMPLE,
Figure 14: The suffix length threshold k significantly impacts G. LLaMA: Open and Efficient Foundation Language
the rate of data determined to be memorized. We set k = 50. Models, 2023.
x [50] TOUVRON, H., MARTIN, L., STONE, K., ALBERT, P.,
′ ALMAHAIRI, A., BABAEI, Y., BASHLYKOV, N., BATRA, S., BHARGAVA, P., BHOSALE, S., ET AL. LLaMA
[−k:] 2: Open foundation and fine-tuned chat models. arXiv
in x”, then we would have to do an O(n) search checking preprint arXiv:2307.09288 (2023).
all suffixes of x. This linear scan is expensive if x is large, [51] TTI. Introducing Falcon 180b.
as it is in training large language models, often terabytes in [52] YEOM, S., GIACOMELLI, I., FREDRIKSON, M., AND
size. Instead, a suffix array will enable us to do this search JHA, S. Privacy risk in machine learning: Analyzing
efficiently in O(logn) time. the connection to overfitting. In IEEE CSF (2018).
A suffix array s over a dataset X, denoted as s(X) is a [53] ZELTERMAN, D. Smooth nonparametric estimation of
data structure that indexes all suffixes of this string in a the quantile function. Journal of statistical planning
lexicographically-sorted ordering. This sorting, as we will and inference 26, 3 (1990), 339–352.
see, is important as it enables efficient binary searches for a [54] ZHANG, S., ROLLER, S., GOYAL, N., ARTETXE, M.,
particular substring/suffix. CHEN, M., CHEN, S., DEWAN, C., DIAB, M., LI, X.,
In the simplest form, we can consider the suffix array of a LIN, X. V., MIHAYLOV, T., OTT, M., SHLEIFER, S.,
word, e.g., x =“banana”. The following is the set of all suffixes SHUSTER, K., SIMIG, D., KOURA, P. S., SRIDHAR,
as obtained by traversing the string backwards and keeping only A., WANG, T., AND ZETTLEMOYER, L. Opt: Open
unique suffixes, in this case, all suffixes: {“a”, “na”, pre-trained transformer language models, 2022.
“ana”, “nana”, “ anana”, “banana”}, which are represented by [55] ZIEGLER, A. Github Copilot research recitation, 2021.
the indices s = {5,4,3,2,1,0}. In this form, we still require [56] ZOU, A., WANG, Z., KOLTER, J. Z., AND FREDRIKSON, M. Universal and transferable adversarial
an O(n) search as there is no ordering. However, a suffix array attacks on aligned language models. arXiv preprint
will store these suffixes in a lexicographically sorted ordering. arXiv:2307.15043 (2023).
'''
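# Aside, purely illustrative and not part of the test: the non-reference text
# interleaved above is a PDF extraction of a passage describing suffix arrays.
# A minimal sketch of the idea it explains (sort every suffix of a string so
# substring membership becomes an O(log n) binary search rather than an O(n)
# scan); the names here are hypothetical:
def _suffix_array_demo(x="banana"):
    import bisect
    # Suffix array: start indices of x, sorted by the suffix each one begins.
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    # Materializing every suffix costs O(n^2) space; fine for a demo, but a
    # real implementation compares suffixes in place via their indices.
    suffixes = [x[i:] for i in sa]

    def contains(q):
        # Binary-search for the first suffix >= q, then check it starts with q.
        i = bisect.bisect_left(suffixes, q)
        return i < len(suffixes) and suffixes[i].startswith(q)

    assert sa == [5, 3, 1, 0, 4, 2]  # "a", "ana", "anana", "banana", "na", "nana"
    assert contains("nan") and not contains("ab")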
answer = set([
    "membership inference attacks against machine learning models",
    "ai2 dolma: 3 trillion token open corpus for language model pretraining",
    "diffusion art or digital forgery? investigating data replication in diffusion models",
    "ecological methods",
    "llama: open and efficient foundation language models",
    "llama 2: open foundation and fine-tuned chat models",
    "introducing falcon 180b",
    "privacy risk in machine learning: analyzing the connection to overfitting",
    "smooth nonparametric estimation of the quantile function",
    "opt: open pre-trained transformer language models",
    "github copilot research recitation",
    "universal and transferable adversarial attacks on aligned language models",
])
def check_ok(dat):
    """Return True iff the reply's fenced JSON list matches the expected titles."""
    import json
    # Normalize the fence label so the reply splits cleanly on ``` either way.
    dat = dat.replace("```json", "```")
    # Keep only the contents of the first fenced block.
    dat = dat.split("```")[1]
    # Compare case-insensitively and with all periods stripped.
    dat = dat.lower().replace(".", "")
    return set(json.loads(dat)) == answer
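# A quick hand-check of the parser above, a minimal sketch with hypothetical
# input rather than part of the benchmark: check_ok should accept any reply
# whose first fenced block is a JSON list matching `answer` once lowercased
# and stripped of periods.
def _check_ok_demo():
    import json
    reply = "Here are the titles:\n```json\n" + json.dumps(sorted(answer)) + "\n```"
    assert check_ok(reply)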
# Pipeline: send the question to the model, then grade the reply with check_ok.
TestExtractRef = question >> LLMRun() >> PyFunc(check_ok)
if __name__ == "__main__":
    print(run_test(TestExtractRef))
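# Usage (assuming this file lives inside the benchmark repo so that
# `evaluator` and `run_test` are importable):
#
#     python extract_references.py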