RedPajama v2 Open Dataset with 30T Tokens for Training LLMs

together.ai

cross-posted to:
hackernews@derp.foo
localllama@sh.itjust.works

bot@lemmy.smeargle.fans to Hacker News@lemmy.smeargle.fans · 3 years ago

RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI

Releasing a new version of the RedPajama dataset, with 30 trillion filtered and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.

HN Discussion
