Differences

This shows you the differences between two versions of the page.

--- mai_nlp [2022/09/23 23:38] – created jhagstrand
+++ mai_nlp [2023/01/12 11:16] (current) – removed jhagstrand
@@ Line 1: / Line 1: @@
-====== Mai NLP ======
-Natural Language Processing (NLP)
-See WordNet, \\
-a lexical database of English maintained at Princeton \\
-https://wordnet.princeton.edu/ \\
-(Note: three pages on WordNet have been copied into the wordnet project and should be deleted from here.)
-===== Alice Zhao =====
-Data Science Instructor at Metis, Chicago, Illinois \\
-MS Northwestern University, Evanston, Ill
-two hour tutorial \\
-https://www.youtube.com/watch?v=xvqsFTUsOmc
-three products
-  * sentiment analysis
-  * topic modeling
-  * text generation
-Python packages for web scraping
-  * Requests, make HTTP requests
-  * Beautiful Soup, parse HTML documents
-  * Pickle, serialize python objects for later use
-  * Pandas, data analysis, DataFrame = table
-Text data formats
-  -  corpus, prep as Pandas DataFrame, two-column table: author, transcript
-  -  Document-term matrix
-procedure
-  * clean: remove punctuation, lowercase, remove numbers
-  * tokenize: words
-  * remove stop words (e.g. articles)
-  * matricize: columns=words, rows=documents, cells=word counts
-two output formats:
-  * corpus, original text
-  * document-term matrix
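The clean, tokenize, stop-word, and matricize steps above can be sketched with the standard library alone. This is a minimal illustration, not the tutorial's code: the tutorial uses Pandas, and the `STOP_WORDS` set, the function names, and the two sample transcripts here are all assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real lists (e.g. nltk's) are much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "is"}

def clean(text):
    """Lowercase the text, then replace digits and punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    return re.sub(r"[^\w\s]", " ", text)

def tokenize(text):
    """Split cleaned text into words and drop stop words."""
    return [word for word in clean(text).split() if word not in STOP_WORDS]

def document_term_matrix(corpus):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in document i."""
    counts = [Counter(tokenize(doc)) for doc in corpus.values()]
    vocabulary = sorted(set().union(*counts))
    rows = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, rows

# Corpus in the two-column shape described above: author -> transcript.
corpus = {
    "author_a": "The cat sat. The cat ran 2 miles!",
    "author_b": "A dog sat and the dog barked.",
}
vocabulary, rows = document_term_matrix(corpus)
```

Each row is one document and each column one word, matching the columns=words, rows=documents, cells=word counts layout in the notes.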
-sentiment analysis 1:08:56
-  * input: corpus
-  * nltk: natural language toolkit
-  * python libraries: TextBlob, built on top of nltk
-sentiment
-  * from textblob import TextBlob
-  * TextBlob("I love Naiyana").sentiment
-  * output: Sentiment(polarity=0.5, subjectivity=0.6)
-  * polarity: -1 to +1
-  * subjectivity: 0 to +1, higher score means opinionated
-  * TextBlob uses a sentiment lexicon labeled by Tom De Smedt
-WordNet, compiled at Princeton, columns for each word:
-  * word form: great
-  * wordnet id: a-01123879
-  * POS: JJ
-  * Sense: very good
-  * Polarity: 0.8
-  * Subjectivity: 1.0
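A lexicon-based scorer can be approximated by averaging the scores of the words it recognizes. The sketch below is a toy under that assumption and is not TextBlob's actual implementation, which also handles negation, intensifiers, and multiple senses. The `LEXICON` dict and `sentiment` function are hypothetical names; the "love" and "great" values come from the notes above, and "terrible" is invented for illustration.

```python
# Toy lexicon: word -> (polarity, subjectivity).
LEXICON = {
    "love": (0.5, 0.6),       # matches the TextBlob output quoted above
    "great": (0.8, 1.0),      # values from the lexicon entry above
    "terrible": (-1.0, 1.0),  # invented value, for illustration only
}

def sentiment(text):
    """Average polarity and subjectivity over the words found in the lexicon."""
    hits = [LEXICON[word] for word in text.lower().split() if word in LEXICON]
    if not hits:
        return 0.0, 0.0  # no known words: treat the text as neutral
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity

polarity, subjectivity = sentiment("I love Naiyana")
```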
-example: movie review database
-topic modeling 1:22:52
-  * input: document-term matrix
-  * python libraries:
-  * nltk, for pos tagging
-  * gensim, built by Radim Rehurek specifically for topic modeling
-Latent Dirichlet Allocation (LDA)
-  * latent = hidden
-  * Dirichlet = a type of probability distribution
-goal: learn the topic mix in each document, and the word mix in each topic
-  * input: document-term matrix, number of topics, number of iterations
-  * output: the top words in each topic
-other techniques, also available in gensim:
-  * Latent Semantic Indexing (LSI)
-  * Non-Negative Matrix Factorization (NMF)
-id2word = dict((v,k) for k, v in cv.vocabulary_.items())
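The line above inverts a term-to-index mapping into the index-to-term mapping that a topic model needs to print words instead of column numbers. Here `cv` is presumably a fitted scikit-learn `CountVectorizer`, although the notes do not say so; the sketch below uses a plain dict with made-up terms as a stand-in so that it runs without that library.

```python
# Stand-in for cv.vocabulary_ (term -> column index); the terms are made up.
vocabulary_ = {"cat": 0, "dog": 1, "sat": 2}

# Same inversion as the line above, written as a dict comprehension.
id2word = {index: term for term, index in vocabulary_.items()}
```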
-Part of speech tag set \\
-https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
-text generation 1:44:50 \\
-input: corpus, include punctuation
-Markov Chains, the current word predicts the next word \\
-LSTM, the current string of words predicts the next word \\
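The Markov chain idea, where only the current word predicts the next one, fits in a few lines. This is a minimal sketch rather than the tutorial's code; the function names, the sample text, and the `seed` parameter are assumptions. Punctuation is left attached to the words, in line with the note that the input corpus should include it.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=None):
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

chain = build_chain("the cat sat. the dog sat. the cat ran.")
sentence = generate(chain, "the", 5, seed=0)
```

Repeated followers stay in the list, so a word that follows "the" twice is twice as likely to be chosen.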
-===== Siraj, NLP  =====
-https://www.youtube.com/watch?v=bDxFvr1gpSU
-History
-Feed forward networks \\
-A vanilla neural network, such as a multilayer perceptron with fully connected layers. A feed-forward network treats all input features as discrete: unique and independent of one another.
-Convolutional networks \\
-In image processing, adjacent pixels are related, and similar patterns repeated across the image are related. Proximity matters.
-Recurrent networks \\
-Process a string of words. Predict the end of the sentence given the beginning of the sentence.
-LSTM networks, a variant of RNNs.
-Attention networks
-Can match a pronoun to its noun antecedent.
-Transformer \\
-Encoder
-===== Jesse Moore, Using BERT to Accelerate NLP =====
-https://m.youtube.com/watch?v=4Z_TzZJ-v3o
-  * BERT, Google: Bidirectional Encoder Representations from Transformers
-  * GPT-2, OpenAI: for story-telling
-Any time you're trying to do something with text.
-  * Classify it.
-  * Make use of it
-  * Translate
-  * Sentence completion
-  * Auto complete
-  * Story telling
-F1 score. A number from 0 to 1; higher is better. A way to evaluate classification models.
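For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw counts for a binary classifier; the function name and the example counts are made up for illustration.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for a binary classifier."""
    if true_positives == 0:
        return 0.0  # no correct positive predictions, so F1 is 0 by convention
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# 8 correct positives, 2 false alarms, 2 misses.
score = f1_score(8, 2, 2)
```

Because it is a harmonic mean, the score stays low unless precision and recall are both high.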
-Transformers have become the basic building block of most state-of-the-art architectures in NLP, replacing gated recurrent neural network models such as the long short-term memory (LSTM).
-Both BERT and GPT-2 are based on transformers.
-===== nltk, natural language toolkit, python library =====
-corpus, corpora
-  * text
-  * concordance
-  * common_contexts
-  * dispersion_plot
-  * generate
-  * set
-  * len
-  * sorted
-===== notes =====
-Sam, coach, conversation, info capture, python server, SQL, Ajax
-Python
-  * Read config.ini
-  * psycopg2
-  * Calc level (to be used in rapgen)
-  * Load grammar table. Not just grammar but infonetgrab
-  * Interrogation
-Recognize Thai handwriting \\
-This could be an academic project that would result in a dictionary and corpus. Which academic institutions are working on NLP for Thai?
-Teach vocabulary, grammar, and many subject domains simultaneously, with principles of repetition, reinforcement, and building gradually on previously mastered material.
-Let the teacher learn even while teaching. \\
-Let the teacher teach like a parent: continuously, while going about your day.
-Databit
-  * Questions to elicit this databit
-  * Answers: valid values
-Network of databits
-  * Which question to ask next?
-Chat server
-Ajax server
-Webserver
-Ajax Chat client
-GitHub frug Ajax chat \\
-Uses Ruby socket server \\
-Client uses a JS-Flash bridge or falls back to Ajax polling
-Polling every 1sec \\
-Degrade to 5sec after disuse
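The polling back-off could look roughly like the sketch below. The 1 second and 5 second intervals come from the notes; the 60 second idle threshold is purely an assumption, since the notes do not say how long "disuse" is, and the names are hypothetical. It is written in Python for consistency with the rest of these notes, although the real client would be JavaScript.

```python
ACTIVE_INTERVAL = 1.0   # seconds between polls while the chat is in use
IDLE_INTERVAL = 5.0     # seconds between polls after disuse
IDLE_THRESHOLD = 60.0   # assumption: the notes do not define "disuse"

def polling_interval(seconds_since_last_activity):
    """Return how long the client should wait before its next poll."""
    if seconds_since_last_activity >= IDLE_THRESHOLD:
        return IDLE_INTERVAL
    return ACTIVE_INTERVAL
```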
-Requires Ajax server
-A2hosting \\
-Run webserver \\
-Run python scripts from Ajax post \\
-Access postgresql from python post \\
-=== con ===
-Gensen \\
-Gencon \\
-Thirst for knowledge about person \\
-Ask about friends \\
-Compare answers from multiple persons \\
-Database structure for personal data \\
-Dialog
-A. Gather info \\
-B. Drill student \\
-Empathy, know the other's vocabulary, use it, help him expand it
-=== Hm ===
-เข้าร่วม. Join \\
-เข้าสู่ระบบ. Login \\
-Thai Corpus - use to calculate level
-Thai wordnet,  \\
-English wordnet: Princeton
-===== Thai National Corpus (TNC) =====
-Chulalongkorn U. Bangkok \\
-http://www.arts.chula.ac.th/ling/tnc/works/
-TNC Online, broken php \\
-http://www.arts.chula.ac.th/~ling/TNCII/corp.php
-Research paper about TNC: \\
-Aroonmanakun, Wirote & Tansiri, Kachen & Nittayanuparp, Pairit. (2009). Thai National Corpus. 153-158. 10.3115/1690299.1690321. \\
-https://www.researchgate.net/publication/271429101_Thai_National_Corpus
-===== Resources =====
-Khun Wannaphong, Khon Kaen U., Using Thani, 2017-2020 \\
-http://thainlp.wannaphong.com/
-Dictionary \\
-https://lexitron.nectec.or.th/2009_1/
-Wordlist  \\
-https://www.expatden.com/thai/thai-frequency-lists-with-english-definitions/