User Tools

Site Tools


mai_nlp

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

mai_nlp [2022/09/23 23:38] – created jhagstrandmai_nlp [2023/01/12 11:16] (current) – removed jhagstrand
Line 1: Line 1:
-====== Mai NLP ====== 
- 
-Natural Language Processing (NLP) 
- 
-see wordnet \\ 
-a topic map maintained at princeton \\ 
-https://wordnet.princeton.edu/ \\ 
-(note. three pages on wordnet have been copied into the wordnet project and should be deleted from here.) 
- 
-===== Alice Zhao ===== 
-Data Science Instructor at Metis, Chicago, Illinois \\ 
-MS Northwestern University, Evanston, Ill 
- 
-two hour tutorial \\ 
-https://www.youtube.com/watch?v=xvqsFTUsOmc 
- 
-three products 
-  * sentiment analysis 
-  * topic modeling 
-  * text generation 
- 
-python package for Web Scraping 
-  * Requests, make HTTP requests 
-  * Beautiful Soup, parse HTML documents 
-  * Pickle, serialize python objects for later use 
-  * Pandas, data analysis, DataFrame = table 
- 
-Text data formats 
-  -  corpus, prep as Pandas DataFrame, two-column table: author, transcript 
-  -  Document-term matrix 
- 
-procedure 
-  * clean: remove punctuation, lowercase, remove numbers 
-  * tokenize: words 
-  * remove stop words (articles) 
-  * matricize: columns=words, rows=documents, cells=word counts 
- 
-two output formats:  
-  * corpus, original text 
-  * document-term matrix 
- 
-sentiment analysis 1:08:56 
-  * input: corpus 
-  * nltk: natural language toolkit 
-  * python libraries: TextBlob, built on top of nltk 
- 
-sentiment 
-  * from textblob import TextBlob 
-  * TextBlob("I love Naiyana").sentiment 
-  * output: Sentiment(polarity=0.5, subjectivity=0.6) 
-  * polarity: -1 to +1 
-  * subjectivity: 0 to +1, higher score means opinionated 
-  * TextBlob uses a sentiment lexicon labeled by Tom De Smedt 
- 
-word-net, compiled at Princeton, columns for each word:  
-  * word form: great 
-  * wordnet id: a-01123879 
-  * POS: JJ 
-  * Sense: very good 
-  * Polarity: 0.8 
-  * Subjectivity: 1.0 
- 
-example: movie review database 
- 
-topic modeling 1:22:52 
-  * input: document-term matrix 
-  * python libraries: 
-  * nltk, for pos tagging 
-  * gensim, built by Radim Rehurek specifically for topic modeling 
- 
-Latent Dirichlet Allocation (DLA) 
-  * latent = hidden 
-  * Dirichlet = a type of probability distribution 
- 
-goal: learn the topic mix in each document, and the word mix in each topic 
-  * input: document-term matrix, number of topics, number of iterations 
-  * output: the top words in each topic 
- 
-other techniques, also available in gensim: 
-  * Latent Semantic Indexing (LSI) 
-  * Non-Negative Matrix Factorization (NMF) 
- 
-id2word = dict((v,k) for k, v in cv.vocabulary_.items()) 
- 
-Part of speech tag set \\ 
-https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html 
- 
-text generation 1:44:50 \\ 
-input: corpus, include punctuation 
- 
-Markov Chains, the current word predicts the next word \\ 
-LSTM, the current string of words predicts the next word \\ 
- 
-===== Siraj, NLP  ===== 
- 
-https://www.youtube.com/watch?v=bDxFvr1gpSU 
- 
-History 
- 
-Feed forward networks \\ 
-a vanilla neural network like a multilayer perceptron with fully connected layers. A feed forward network treats all input features as unique and independent of one another, discrete. 
- 
-Convolutional networks \\ 
-An image processing adjacent pixels are related, and similar patterns repeated in the image are related. Proximity matters. 
- 
-Recurrent networks \\ 
-Process a string of words. Predict the end of the sentence given the beginning of the sentence. 
- 
-LSTM networks, A variant of RNN. 
- 
-Attention networks 
- 
-Can match a pronoun to its noun antecedent. 
- 
-Transformer \\ 
-Encoder 
- 
-===== Jesse Moore, Using BERT to Accelerate NLP ===== 
- 
-https://m.youtube.com/watch?v=4Z_TzZJ-v3o 
- 
-  * BERT Google, Bi-directional encoder representation from transformers 
-  * GPT2 OpenAI, for story-telling 
- 
-Any time you're trying to do something with text.   
-  * Classify it. 
-  * Make use of it 
-  * Translate 
-  * Sentence completion 
-  * Auto complete 
-  * Story telling 
- 
-F1 Score. A number from zero to 1. 1 is better. A way to evaluate classification problems. 
- 
-Transformers have become the basic building block of most state-of-the-art architectures in NLP, replacing gated recurrent neural network models such as the long short-term memory (LSTM) 
- 
-Both BERT and GPT-2 are based on transformers. 
- 
-===== nltk, natural language toolkit, python library ===== 
- 
-corpus, corpora 
- 
-  * text 
-  * concordance 
-  * common_contexts 
-  * dispersion_plot 
-  * generate 
-  * set 
-  * len 
-  * sorted 
- 
-===== notes ===== 
- 
-Sam, coach, conversation, info capture, python server, SQL, Ajax 
- 
-Python  
-  * Read config.ini 
-  * psycopg2 
-  * Calc level (to be used in rapgen) 
-  * Load grammar table. Not just grammar but infonetgrab 
-  * Interrogation 
- 
-Recognize Thai handwriting \\ 
-This could be an academic project that would result in a dictionary and corpus. Which academic institutions are working on NLP for Thai? 
- 
-Teach vocabulary, grammar, many subject domains simultaneously with principles of repitition, reinforcement, building gradually on previously mastered material. 
- 
-Let the teacher learn even while teaching. \\ 
-Let the teacher teach like a parent: continuously, while going about your day. 
- 
-Databit 
-  * Questions to illicit this databit 
-  * Answers valid values 
- 
-Network of databits 
-  * Which question to ask next? 
- 
-Chat server 
- 
-Ajax server 
- 
-Webserver 
- 
-Ajax Chat client 
- 
-GitHub frug Ajax chat \\ 
-Uses Ruby socket server \\ 
-Client Uses js-flash bridge or fall-back to Ajax polling 
- 
-Polling every 1sec \\ 
-Degrade to 5sec after disuse 
- 
-Requires Ajax server 
- 
-A2hosting \\ 
-Run webserver \\ 
-Run python scripts from Ajax post \\ 
-Access postgresql from python post \\ 
- 
-=== con === 
- 
-Gensen \\ 
-Gencon \\ 
-Thirst for knowledge about person \\ 
-Ask about friends \\ 
-Compare answers from multiple persons \\ 
-Database structure for personal data \\ 
- 
-Dialog   
-A. Gather info \\ 
-B. Drill student \\ 
- 
-Empathy, know the other's vocabulary, use it, help him expand it 
- 
-=== Hm === 
- 
-เข้าร่วม. Join \\ 
-เข้าสู่ระบบ. Login \\ 
- 
-Thai Corpus - use to calculate level 
- 
-Thai wordnet,  \\ 
-English wordnet: Princeton 
- 
-===== Thai National Corpus (TNC) ===== 
- 
-Chulalongkorn U. Bangkok \\ 
-http://www.arts.chula.ac.th/ling/tnc/works/ 
- 
-TNC Online, broken php \\ 
-http://www.arts.chula.ac.th/~ling/TNCII/corp.php 
- 
-Research paper about TNC: \\ 
-Aroonmanakun, Wirote & Tansiri, Kachen & Nittayanuparp, Pairit. (2009). Thai National Corpus. 153-158. 10.3115/1690299.1690321. \\  
-https://www.researchgate.net/publication/271429101_Thai_National_Corpus 
- 
-===== Resources ===== 
-Khun Wannaphong, Khon Kaen U., Using Thani, 2017-2020 \\ 
-http://thainlp.wannaphong.com/ 
- 
-Dictionary \\ 
-https://lexitron.nectec.or.th/2009_1/ 
- 
-Wordlist  \\ 
-https://www.expatden.com/thai/thai-frequency-lists-with-english-definitions/ 
  
mai_nlp.1663990716.txt.gz · Last modified: 2022/09/23 23:38 by jhagstrand

Except where otherwise noted, content on this wiki is licensed under the following license: Public Domain
Public Domain Donate Powered by PHP Valid HTML5 Valid CSS Driven by DokuWiki