To make artificial intelligence systems more powerful, tech companies need online data to feed the technology. Here’s what to know.
Online data has long been a valuable commodity. For years, Meta and Google have used data to target their online advertising. Netflix and Spotify have used it to recommend more movies and music. Political candidates have turned to data to learn which groups of voters to train their sights on.
Over the last 18 months, it has become increasingly clear that digital data is also crucial in the development of artificial intelligence. Here’s what to know.
The more data, the better.
The success of A.I. depends on data: models become more accurate and more humanlike the more of it they are trained on.
In the same way that a student learns by reading more books, essays and other information, large language models — the systems that are the basis of chatbots — also become more accurate and more powerful if they are fed more data.
Some large language models, such as OpenAI’s GPT-3, released in 2020, were trained on hundreds of billions of “tokens,” which are essentially words or pieces of words. More recent large language models were trained on more than three trillion tokens.
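To get a feel for what a "token" is, here is a minimal sketch that splits a sentence into tokens using OpenAI's open-source tiktoken library. It is an illustration only, not the exact pipeline OpenAI used to train its models; "r50k_base" is the publicly documented encoding associated with GPT-3-era models.

# pip install tiktoken
import tiktoken

# "r50k_base" is the publicly documented encoding for GPT-3-era models.
enc = tiktoken.get_encoding("r50k_base")

text = "Large language models learn from enormous amounts of text."
token_ids = enc.encode(text)

print(len(token_ids), "tokens")                 # number of tokens in the sentence
print([enc.decode([t]) for t in token_ids])     # the individual pieces
# Common words usually map to a single token; rarer words are split into pieces.

Counting tokens this way, rather than words, is how figures like "hundreds of billions of tokens" are measured.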
The Data Inside GPT-3
OpenAI’s groundbreaking A.I. model was trained on billions of web pages, along with books and Wikipedia articles, collected from across the internet. OpenAI has not shared the data it used to train its more recent models.
Common Crawl (410 billion tokens): Text from web pages collected since 2007.
WebText2 (19 billion tokens): Web pages linked from Reddit that received three or more upvotes, an indication of approval from users.
Books 1 (12 billion tokens) and Books 2 (55 billion tokens): OpenAI has not explained the contents of these datasets. They are believed to contain text from millions of published books.
Wikipedia (3 billion tokens): English-language Wikipedia pages.
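Adding up the figures above gives a rough sense of the overall size of GPT-3's training corpus. The short sketch below simply totals the counts from the chart; the dataset names and numbers are taken from the breakdown above.

# Token counts (in billions) for GPT-3's training datasets, as listed above.
gpt3_datasets = {
    "Common Crawl": 410,
    "WebText2": 19,
    "Books 1": 12,
    "Books 2": 55,
    "Wikipedia": 3,
}

total = sum(gpt3_datasets.values())
print(f"Total: about {total} billion tokens")  # about 499 billion tokens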