- cross-posted to:
- hackernews@lemmy.smeargle.fans
- hackernews@derp.foo
30T tokens, 20.5T in English, allegedly high quality, can’t wait to see people start putting it to use!
Related github: https://github.com/togethercomputer/RedPajama-Data
I think the implication is more that the dataset becomes even more useful if you don’t jam the whole thing into your training run, but instead filter it further down to a reasonable number of tokens, around 5T, and train on that subset instead
I could be wrong, because they do explicitly say deduplicating, but it’s phrased oddly either way
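Just to sketch what that "filter down to ~5T and train on that" idea could look like in practice. This is purely illustrative: the config name, field names, and quality threshold below are my guesses, not the dataset's actual schema, and the dedup here is a naive exact-match hash rather than whatever Together actually used.

```python
# Rough sketch: stream the dataset, drop exact duplicates, apply a quality gate,
# and stop once a target token budget (~5T) is reached.
# Dataset path is real (Hugging Face hub), but "sample", "raw_content", and
# "quality_score" are illustrative assumptions, not confirmed field names.
import hashlib

from datasets import load_dataset


def filtered_subset(target_tokens: int = 5_000_000_000_000):
    # Stream so the full 30T-token corpus is never materialized locally.
    ds = load_dataset(
        "togethercomputer/RedPajama-Data-V2",
        name="sample",          # hypothetical config name
        split="train",
        streaming=True,
    )

    seen_hashes = set()         # exact-duplicate filter via content hashing
    kept_tokens = 0

    for doc in ds:
        text = doc.get("raw_content", "")    # field name is an assumption
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue                         # skip exact duplicates
        seen_hashes.add(digest)

        # Hypothetical quality gate: keep only documents whose score clears
        # some threshold; the real dataset ships a bunch of quality signals
        # you could combine here instead.
        if doc.get("quality_score", 0.0) < 0.5:
            continue

        n_tokens = len(text.split())         # crude whitespace token count
        kept_tokens += n_tokens
        yield text

        if kept_tokens >= target_tokens:     # stop once the budget is hit
            break
```

In reality you'd probably want fuzzy dedup (MinHash or similar) and a smarter token count, but the shape of the pipeline is the same: filter and dedup first, then train on the surviving subset.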