Hangzhou DeepSeek AI Fundamental Technology Research Co., Ltd., a DeepSeek affiliate, today filed a patent for a new web data collection system designed to improve collection efficiency and data quality. The patent outlines a method for discovering more webpage links while minimizing the traffic load placed on websites. It assesses downloaded content to predict the quality of links that have not yet been discovered, prioritizing high-value data and reducing redundant downloads. Efficient web data collection is crucial for training large language models (LLMs), which power AI systems like ChatGPT. Existing techniques struggle with incomplete link retrieval, excessive downloads that can crash websites, and poor filtering of low-quality data. DeepSeek’s proposed system aims to solve these issues by optimizing data allocation and maintaining metadata accuracy. [iThome, in Chinese]
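The patent text is not reproduced here, so the following is only a rough sketch of the general idea described above, not DeepSeek's method. It assumes that a link's predicted quality is simply inherited from a quality score of the page where it was found, and that traffic impact is limited by a fixed per-host download cap. The `Frontier` class, the `per_host_budget` parameter, and the example URLs and scores are all hypothetical.

```python
import heapq
from urllib.parse import urlparse


class Frontier:
    """Priority queue of not-yet-downloaded links, ranked by predicted quality.

    Illustrative only: the scoring rule and the per-host cap are assumptions,
    not details taken from the patent.
    """

    def __init__(self, per_host_budget):
        self.per_host_budget = per_host_budget  # max downloads per host (hypothetical knob)
        self._heap = []        # entries: (-score, insertion_order, url)
        self._seen = set()     # URLs already queued, to avoid redundant downloads
        self._host_count = {}  # downloads handed out per host so far
        self._order = 0        # tie-breaker so equal scores pop in insertion order

    def add_links(self, links, parent_quality):
        """Queue new links, using the parent page's quality as the predicted score."""
        for url in links:
            if url in self._seen:
                continue  # already queued once; skip the duplicate
            self._seen.add(url)
            # heapq is a min-heap, so negate the score to pop the best link first.
            heapq.heappush(self._heap, (-parent_quality, self._order, url))
            self._order += 1

    def next_url(self):
        """Return the best-scoring URL whose host still has budget, or None."""
        while self._heap:
            _, _, url = heapq.heappop(self._heap)
            host = urlparse(url).netloc
            used = self._host_count.get(host, 0)
            if used >= self.per_host_budget:
                continue  # this host has reached its traffic cap; drop the link
            self._host_count[host] = used + 1
            return url
        return None


# Hypothetical usage: links found on a high-quality page are fetched first.
frontier = Frontier(per_host_budget=1)
frontier.add_links(["https://a.example/1", "https://a.example/2"], parent_quality=0.9)
frontier.add_links(["https://b.example/1", "https://a.example/1"], parent_quality=0.4)
```

In this sketch the duplicate `https://a.example/1` is queued only once, and the second link on `a.example` is skipped once that host's cap is reached. A real system would presumably use a learned quality predictor and a time-based rate limit rather than these fixed stand-ins.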