MAIB Saturday Evening Lecture: GPT, AGI, and Extractive Summarization as Feature Selection
Posted: April 1, 2023, 02:38
https://ai2healthcare.github.io/
Title: MAIB-class-011: A Path toward AGI: Extractive Summarization as Feature Selection
Date: 10:00 pm US Eastern time, 04/01/2023
Date: 10:00 am Beijing time, 04/02/2023
Zoom ID: 933 1613 9423
Zoom PWD: 416262
Zoom: https://uwmadison.zoom.us/meeting/regis ... lnGn06TP2E
Momiao Xiong, Ph.D., Professor in the Department of Biostatistics and Data Science, University of Texas School of Public Health. Dr. Xiong graduated from the Department of Statistics at the University of Georgia in 1993. From 1993 to 1995, he was a postdoctoral fellow at the University of Southern California, working with Michael Waterman.
Research Interests: Causal Inference, Artificial Intelligence, Manifold Learning, Statistical Genetics, and Bioinformatics.
Background
• 1. The pursuit of artificial general intelligence (AGI) is the pursuit of stronger generalization: the stronger the generalization ability, the higher the level of intelligence.
• 2. Compression is generalization. The best lossless compression of a dataset is also the best generalization to data outside that dataset.
• 3. GPT's training task of predicting the next token is equivalent to lossless compression of the training data. GPT is currently the best lossless compressor of data and therefore exhibits the strongest intelligence. Compression is generalization, generalization is intelligence, and large models embody both.
Summarization as a New Paradigm for Data Reduction
• Extractive Summarization Approach to Feature Selection
• Abstractive Summarization Approach to Dimension Reduction
• Protein and DNA Language Models Are Extremely Important to Genetics, Population Genetics, Molecular Biology, Clinical Practice, and Drug Development.
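To make the extractive-summarization-as-feature-selection idea concrete, here is a minimal sketch; it is an illustrative analogy, not the speaker's actual method. Just as an extractive summarizer keeps the most salient sentences of a document verbatim, a feature selector keeps the most informative original columns of a data matrix. The sketch assumes scikit-learn and scores features by mutual information with the label.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

# Load a small tabular dataset: rows are samples, columns are candidate features.
X, y = load_breast_cancer(return_X_y=True)

# Score every feature by its mutual information with the label, the way an
# extractive summarizer scores sentences by salience.
scores = mutual_info_classif(X, y, random_state=0)

# "Extract" the top-k original features unchanged, mirroring extractive
# (keep originals) rather than abstractive (rewrite) summarization.
k = 5
selected = np.argsort(scores)[::-1][:k]
print("Selected feature indices:", selected)
X_reduced = X[:, selected]
print("Reduced data shape:", X_reduced.shape)
```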
As an AI model, GPT does indeed exhibit strong generalization. Generalization refers to a model's ability to learn general patterns from its training data and apply them to new, previously unseen data; it is one of the key capabilities required for artificial general intelligence (AGI).
Compression is indeed a form of generalization. Losslessly compressing a dataset requires extracting the regularities and patterns within it, and therefore yields a better understanding of the data. In this sense, lossless compression can be regarded as a way of generalizing over the data.
GPT's main training task is to predict the next word (token) in a given text sequence. This task can be viewed as a process of losslessly compressing the dataset: by learning the regularities and patterns of language well enough to predict the next token, GPT comes to capture the meaning and structure of the text. For this reason, GPT exhibits strong intelligence.
In summary, compression can indeed be viewed as a form of generalization, and GPT, as an effective lossless compressor of data, exhibits strong intelligence.
The pursuit of AGI is focused on achieving stronger generalization: the stronger the generalization, the higher the intelligence. Compression is equivalent to generalization, in that the best lossless compression of a dataset is also the best generalization to data outside the dataset. GPT's task of predicting the next token is equivalent to lossless compression of the training data, making it the best available data compressor and therefore the most intelligent model. Summarization offers a new paradigm for data reduction, with extractive summarization serving as a feature selection approach and abstractive summarization serving as a dimension reduction approach. Protein and DNA language models are critical in genetics, population genetics, molecular biology, clinical practice, and drug development.
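One way to see the prediction-compression link claimed above: with an entropy coder such as arithmetic coding, a symbol that a model predicts with probability p can be stored in roughly -log2(p) bits, so a model that predicts the next token better compresses the data into fewer bits. Below is a toy illustration (a character-frequency model over a made-up string, not GPT) of how better prediction shortens the code length.

```python
import math
from collections import Counter

text = "abracadabra abracadabra"

# Model 1: uniform probabilities over the alphabet (no learned regularities).
alphabet = sorted(set(text))
uniform_bits = len(text) * math.log2(len(alphabet))

# Model 2: character frequencies estimated from the text itself, a crude
# stand-in for a predictive model that has learned some of its structure.
counts = Counter(text)
total = len(text)
unigram_bits = sum(-math.log2(counts[ch] / total) for ch in text)

# The better predictor needs fewer bits: prediction quality = compression quality.
print(f"uniform model : {uniform_bits:.1f} bits")
print(f"unigram model : {unigram_bits:.1f} bits")
```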
Replicating the ChatGPT training process requires access to a large dataset of text, high-performance computing resources, and expertise in machine learning and natural language processing. The training process is also highly proprietary and specific to OpenAI’s technology stack, which may not be fully available to the public.
However, there are several open source deep learning frameworks and libraries available that can be used to build and train language models. Some popular options include TensorFlow, PyTorch, and Keras.
To replicate the ChatGPT training process, you would need to:
Acquire a large dataset of text. This could include web pages, news articles, books, and other sources of text. The quality and diversity of the data are critical to the success of the language model.
Preprocess the data to prepare it for training. This includes tokenizing the text, normalizing it, and encoding it in a format that can be used by the deep learning framework.
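As a sketch of this preprocessing step, the snippet below uses the Hugging Face transformers package and the pretrained GPT-2 byte-pair-encoding tokenizer; this is one common open-source choice, not necessarily OpenAI's own pipeline, and the sample sentence is just a placeholder for the corpus.

```python
import torch
from transformers import GPT2TokenizerFast

# Byte-pair-encoding tokenizer; downloads the "gpt2" vocabulary files on first use.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

raw_text = "Replicating a GPT-style model starts with turning text into token ids."
ids = tokenizer.encode(raw_text)               # text -> list of integer token ids
batch = torch.tensor([ids], dtype=torch.long)  # encode for the deep learning framework

print(batch.shape)
print(tokenizer.decode(ids[:8]))               # round-trip check on the first few tokens
```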
Choose a deep learning framework and set up a high-performance computing environment to train the model. This may involve using GPU-accelerated hardware, cloud computing resources, or a cluster of machines.
Build a language model architecture based on the transformer architecture used in ChatGPT. This involves designing the model architecture, including the number of layers, attention mechanisms, and other hyperparameters.
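The following is a deliberately tiny, hypothetical decoder-only language model in PyTorch, meant only to show where the layer count, attention heads, and other hyperparameters mentioned above live; it is far smaller and simpler than the architecture behind ChatGPT.

```python
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    """A toy decoder-only language model with GPT-like structure."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned position embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)       # next-token logits

    def forward(self, idx):
        # idx: (batch, seq_len) integer token ids
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier positions.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        x = self.blocks(x, mask=causal_mask)
        return self.lm_head(x)
```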
Train the model on the dataset using the chosen deep learning framework. This may involve using techniques such as gradient descent, backpropagation, and regularization to optimize the model.
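Continuing the TinyGPT sketch above, a bare-bones training loop might look like the following: next-token prediction trained with cross-entropy, backpropagation, and an AdamW optimizer (weight decay acting as regularization). The random batch is a placeholder; a real run would stream tokenized corpus data, add a learning-rate schedule, and checkpoint the model.

```python
import torch
import torch.nn.functional as F

vocab_size = 50257                      # GPT-2 BPE vocabulary size, matching the tokenizer above
model = TinyGPT(vocab_size)             # the toy architecture sketched earlier
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Placeholder batch of token ids; a real run streams batches from the tokenized corpus.
batch = torch.randint(0, vocab_size, (8, 128))

for step in range(100):
    inputs, targets = batch[:, :-1], batch[:, 1:]   # shift by one: predict the next token
    logits = model(inputs)                           # (batch, seq_len-1, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                  # backpropagation
    optimizer.step()                                 # gradient update (AdamW)
    if step % 20 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```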