douban (Douban multi-turn) includes training, development, and test sets for retrieval-based chatbots.
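MS MARCO is a new large-scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real, anonymized user queries. The context passages from which the answers are derived are extracted from real web documents using an advanced version of the Bing search engine.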
chatterbot is an open-source Chinese dialogue corpus. It contains 560 entries, already sorted by category.
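A dataset from Quora with duplicate/semantic-similarity labels. It consists of over 400,000 rows of potential duplicate question pairs. Each row contains the IDs of the two questions, the full text of each question, and a binary value indicating whether the row truly contains a duplicate pair.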
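The row layout described above (question IDs, full question text, binary duplicate flag) can be sketched as follows. This is a minimal illustration, not the official loader: the tab-separated layout, the column names (`qid1`, `qid2`, `question1`, `question2`, `is_duplicate`), and the two sample rows are assumptions made for the example.

```python
import csv
import io

# Hypothetical two-row sample; column names and contents are assumptions.
SAMPLE_TSV = (
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tHow do I learn Python?\tWhat is the best way to learn Python?\t1\n"
    "1\t3\t4\tHow do I learn Python?\tHow do I cook rice?\t0\n"
)

def read_pairs(text):
    """Parse tab-separated question pairs into dicts with a boolean label."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    pairs = []
    for row in reader:
        pairs.append({
            "qid1": int(row["qid1"]),
            "qid2": int(row["qid2"]),
            "question1": row["question1"],
            "question2": row["question2"],
            # The binary value "1" marks a true duplicate pair.
            "is_duplicate": row["is_duplicate"] == "1",
        })
    return pairs

pairs = read_pairs(SAMPLE_TSV)
duplicates = [p for p in pairs if p["is_duplicate"]]
print(len(pairs), len(duplicates))
```

Converting the flag to a boolean up front keeps later filtering (for example, selecting only true duplicates) free of string comparisons.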
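The Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset made up of questions drawn from Wikipedia articles, where the answer to every question is a segment of text from the corresponding passage. It contains more than 100,000 question-answer pairs on more than 500 articles.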
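Because every answer is a span of its passage, an answer can be stored as a start offset plus the answer text and recovered by slicing. The sketch below assumes SQuAD-style field names (`context`, `question`, `answers`, `text`, `answer_start`); the record itself is a made-up example, not taken from the dataset.

```python
# Hypothetical record shaped like a SQuAD-style entry (contents are invented).
record = {
    "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
    "question": "What does the Amazon rainforest cover?",
    "answers": [{"text": "much of the Amazon basin", "answer_start": 29}],
}

def extract_span(context, answer):
    """Slice the answer out of the passage using its character offset."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end]

def is_valid_span(context, answer):
    """Check that the stored offset really points at the stored answer text."""
    return extract_span(context, answer) == answer["text"]

answer = record["answers"][0]
print(extract_span(record["context"], answer))
print(is_valid_span(record["context"], answer))
```

A check like `is_valid_span` is a cheap sanity test to run after any preprocessing that might shift character offsets, such as whitespace normalization.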
Sentiment analysis is the task of classifying the polarity of a given text.
Question Answering is the task of answering questions (typically reading comprehension questions), but abstaining when presented with a question that ...
Language modeling is the task of predicting the next word or character in a document. * indicates models using dynamic evaluation; where, at test time,...
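As a toy illustration of next-word prediction, the sketch below counts word bigrams and predicts the most frequent follower of a given word. It is a deliberately simple stand-in, not one of the benchmarked models; the function names and the tiny corpus are assumptions for the example.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

def next_word_prob(counts, word, candidate):
    """Estimate P(candidate | word) from raw bigram counts."""
    total = sum(counts[word].values()) if word in counts else 0
    return counts[word][candidate] / total if total else 0.0

# Hypothetical toy corpus.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))
print(next_word_prob(model, "the", "cat"))
```

Here "the" is followed by "cat" twice and "mat" once, so the sketch predicts "cat" with an estimated probability of 2/3.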
Machine translation is the task of translating a sentence in a source language to a different target language.
Papers With Code highlights trending ML research and the code to implement it. 1452 leaderboards • 1323 tasks • 1318 datasets • 16864 papers with code....
Industrial-strength Natural Language Processing (NLP) with Python and Cython
A TensorFlow implementation of Baidu's DeepSpeech architecture
TensorFlow implementation of Deep Voice 3
Topic Modelling for Humans
Cleaned code for paper "Natural Language Inference over Interaction Space"
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram,...
News, full-text, and article metadata extraction in Python 3. Advanced docs: