mai_nlp
Differences
This shows you the differences between two versions of the page.
| mai_nlp [2022/09/23 23:38] – created jhagstrand | mai_nlp [2023/01/12 11:16] (current) – removed jhagstrand | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ====== Mai NLP ====== | ||
| - | |||
| - | Natural Language Processing (NLP) | ||
| - | |||
| - | see wordnet \\ | ||
| - | a lexical database maintained at Princeton \\ | ||
| - | https:// | ||
| - | (Note: three pages on wordnet have been copied into the wordnet project and should be deleted from here.) | ||
| - | |||
| - | ===== Alice Zhao ===== | ||
| - | Data Science Instructor at Metis, Chicago, Illinois \\ | ||
| - | MS Northwestern University, Evanston, Ill | ||
| - | |||
| - | two hour tutorial \\ | ||
| - | https:// | ||
| - | |||
| - | three products | ||
| - | * sentiment analysis | ||
| - | * topic modeling | ||
| - | * text generation | ||
| - | |||
| - | Python packages for web scraping | ||
| - | * Requests, make HTTP requests | ||
| - | * Beautiful Soup, parse HTML documents | ||
| - | * Pickle, serialize python objects for later use | ||
| - | * Pandas, data analysis, DataFrame = table | ||
| - | |||
| - | Text data formats | ||
| - | - corpus, prep as Pandas DataFrame, two-column table: author, transcript | ||
| - | - Document-term matrix | ||
| - | |||
| - | procedure | ||
| - | * clean: remove punctuation, | ||
| - | * tokenize: words | ||
| - | * remove stop words (e.g. articles) | ||
| - | * matricize: columns=words, | ||
| - | |||
| - | two output formats: | ||
| - | * corpus, original text | ||
| - | * document-term matrix | ||
| - | |||
| - | sentiment analysis 1:08:56 | ||
| - | * input: corpus | ||
| - | * nltk: natural language toolkit | ||
| - | * python libraries: TextBlob, built on top of nltk | ||
| - | |||
| - | sentiment | ||
| - | * from textblob import TextBlob | ||
| - | * TextBlob(" | ||
| - | * output: Sentiment(polarity=0.5, | ||
| - | * polarity: -1 to +1 | ||
| - | * subjectivity: | ||
| - | * TextBlob uses a sentiment lexicon labeled by Tom De Smedt | ||
| - | |||
| - | word-net, compiled at Princeton, columns for each word: | ||
| - | * word form: great | ||
| - | * wordnet id: a-01123879 | ||
| - | * POS: JJ | ||
| - | * Sense: very good | ||
| - | * Polarity: 0.8 | ||
| - | * Subjectivity: | ||
| - | |||
| - | example: movie review database | ||
| - | |||
| - | topic modeling 1:22:52 | ||
| - | * input: document-term matrix | ||
| - | * python libraries: | ||
| - | * nltk, for pos tagging | ||
| - | * gensim, built by Radim Rehurek specifically for topic modeling | ||
| - | |||
| - | Latent Dirichlet Allocation (LDA) | ||
| - | * latent = hidden | ||
| - | * Dirichlet = a type of probability distribution | ||
| - | |||
| - | goal: learn the topic mix in each document, and the word mix in each topic | ||
| - | * input: document-term matrix, number of topics, number of iterations | ||
| - | * output: the top words in each topic | ||
| - | |||
| - | other techniques, also available in gensim: | ||
| - | * Latent Semantic Indexing (LSI) | ||
| - | * Non-Negative Matrix Factorization (NMF) | ||
| - | |||
| - | id2word = dict((v,k) for k, v in cv.vocabulary_.items()) | ||
| - | |||
| - | Part of speech tag set \\ | ||
| - | https:// | ||
| - | |||
| - | text generation 1:44:50 \\ | ||
| - | input: corpus, include punctuation | ||
| - | |||
| - | Markov Chains, the current word predicts the next word \\ | ||
| - | LSTM, the current string of words predicts the next word \\ | ||
| - | |||
| - | ===== Siraj, NLP ===== | ||
| - | |||
| - | https:// | ||
| - | |||
| - | History | ||
| - | |||
| - | Feed forward networks \\ | ||
| - | A vanilla neural network, such as a multilayer perceptron with fully connected layers. A feed-forward network treats all input features as discrete: unique and independent of one another. | ||
| - | |||
| - | Convolutional networks \\ | ||
| - | In image processing, adjacent pixels are related, and similar patterns repeated across the image are related. Proximity matters. | ||
| - | |||
| - | Recurrent networks \\ | ||
| - | Process a string of words. Predict the end of the sentence given the beginning of the sentence. | ||
| - | |||
| - | LSTM networks, a variant of RNN. | ||
| - | |||
| - | Attention networks | ||
| - | |||
| - | Can match a pronoun to its noun antecedent. | ||
| - | |||
| - | Transformer \\ | ||
| - | Encoder | ||
| - | |||
| - | ===== Jesse Moore, Using BERT to Accelerate NLP ===== | ||
| - | |||
| - | https:// | ||
| - | |||
| - | * BERT (Google): Bidirectional Encoder Representations from Transformers | ||
| - | * GPT-2 (OpenAI), for story-telling | ||
| - | |||
| - | Any time you're trying to do something with text. | ||
| - | * Classify it. | ||
| - | * Make use of it | ||
| - | * Translate | ||
| - | * Sentence completion | ||
| - | * Auto complete | ||
| - | * Story telling | ||
| - | |||
| - | F1 score. A number from 0 to 1; higher is better. A way to evaluate classification models. | ||
| - | |||
| - | Transformers have become the basic building block of most state-of-the-art architectures in NLP, replacing gated recurrent neural network models such as the long short-term memory (LSTM). | ||
| - | |||
| - | Both BERT and GPT-2 are based on transformers. | ||
| - | |||
| - | ===== nltk, natural language toolkit, python library ===== | ||
| - | |||
| - | corpus, corpora | ||
| - | |||
| - | * text | ||
| - | * concordance | ||
| - | * common_contexts | ||
| - | * dispersion_plot | ||
| - | * generate | ||
| - | * set | ||
| - | * len | ||
| - | * sorted | ||
| - | |||
| - | ===== notes ===== | ||
| - | |||
| - | Sam, coach, conversation, | ||
| - | |||
| - | Python | ||
| - | * Read config.ini | ||
| - | * psycopg2 | ||
| - | * Calc level (to be used in rapgen) | ||
| - | * Load grammar table. Not just grammar but infonetgrab | ||
| - | * Interrogation | ||
| - | |||
| - | Recognize Thai handwriting \\ | ||
| - | This could be an academic project that would result in a dictionary and corpus. Which academic institutions are working on NLP for Thai? | ||
| - | |||
| - | Teach vocabulary, grammar, many subject domains simultaneously with principles of repetition, reinforcement, | ||
| - | |||
| - | Let the teacher learn even while teaching. \\ | ||
| - | Let the teacher teach like a parent: continuously, | ||
| - | |||
| - | Databit | ||
| - | * Questions to elicit this databit | ||
| - | * Answers: valid values | ||
| - | |||
| - | Network of databits | ||
| - | * Which question to ask next? | ||
| - | |||
| - | Chat server | ||
| - | |||
| - | Ajax server | ||
| - | |||
| - | Webserver | ||
| - | |||
| - | Ajax Chat client | ||
| - | |||
| - | GitHub frug Ajax chat \\ | ||
| - | Uses Ruby socket server \\ | ||
| - | Client uses js-flash bridge or falls back to Ajax polling | ||
| - | |||
| - | Polling every 1sec \\ | ||
| - | Degrade to 5sec after disuse | ||
| - | |||
| - | Requires Ajax server | ||
| - | |||
| - | A2hosting \\ | ||
| - | Run webserver \\ | ||
| - | Run python scripts from Ajax post \\ | ||
| - | Access postgresql from python post \\ | ||
| - | |||
| - | === con === | ||
| - | |||
| - | Gensen \\ | ||
| - | Gencon \\ | ||
| - | Thirst for knowledge about person \\ | ||
| - | Ask about friends \\ | ||
| - | Compare answers from multiple persons \\ | ||
| - | Database structure for personal data \\ | ||
| - | |||
| - | Dialog | ||
| - | A. Gather info \\ | ||
| - | B. Drill student \\ | ||
| - | |||
| - | Empathy, know the other' | ||
| - | |||
| - | === Hm === | ||
| - | |||
| - | เข้าร่วม. Join \\ | ||
| - | เข้าสู่ระบบ. Login \\ | ||
| - | |||
| - | Thai Corpus - use to calculate level | ||
| - | |||
| - | Thai wordnet, | ||
| - | English wordnet: Princeton | ||
| - | |||
| - | ===== Thai National Corpus (TNC) ===== | ||
| - | |||
| - | Chulalongkorn U. Bangkok \\ | ||
| - | http:// | ||
| - | |||
| - | TNC Online, broken php \\ | ||
| - | http:// | ||
| - | |||
| - | Research paper about TNC: \\ | ||
| - | Aroonmanakun, | ||
| - | https:// | ||
| - | |||
| - | ===== Resources ===== | ||
| - | Khun Wannaphong, Khon Kaen U., Using Thani, 2017-2020 \\ | ||
| - | http:// | ||
| - | |||
| - | Dictionary \\ | ||
| - | https:// | ||
| - | |||
| - | Wordlist | ||
| - | https:// | ||
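
Sketch for the "procedure" notes in the Alice Zhao section (clean, tokenize, remove stop words, matricize). This is a minimal stdlib-only illustration, not the tutorial's code, which builds the matrix with Pandas and a vectorizer. The toy corpus, the stop-word list, and the helper names `clean` and `tokenize` are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical toy corpus: author -> transcript (not from the tutorial).
corpus = {
    "ann": "The cat sat on the mat. The cat purred!",
    "bob": "A dog chased the cat; the dog barked.",
}

# Assumed stop-word list; real lists (e.g. the one shipped with nltk) are much longer.
STOP_WORDS = {"a", "an", "the", "on"}


def clean(text):
    """Lowercase the text and replace every non-letter, non-space character with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Split cleaned text into words and drop stop words."""
    return [word for word in text.split() if word not in STOP_WORDS]


# One word-frequency Counter per document.
counts = {author: Counter(tokenize(clean(doc))) for author, doc in corpus.items()}

# Document-term matrix: columns are the sorted vocabulary, rows are documents.
vocabulary = sorted(set().union(*counts.values()))
matrix = {author: [c[word] for word in vocabulary] for author, c in counts.items()}

print(vocabulary)
for author, row in matrix.items():
    print(author, row)
```

The corpus keeps the original text (needed for sentiment analysis and text generation), while the matrix keeps only word counts (the input for topic modeling).
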
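
Sketch for the text generation note "Markov Chains, the current word predicts the next word". A minimal first-order chain using only the standard library; the sample sentence and the function names `build_chain` and `generate` are assumptions for illustration, not taken from the tutorial.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, seed=None):
    """Walk the chain from `start`, choosing each next word at random."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word never had a successor
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Hypothetical sample text; punctuation would be kept as part of the tokens.
sample = "the cat sat on the mat and the cat slept"
chain = build_chain(sample)
print(generate(chain, "the", 8, seed=0))
```

Followers are stored with repeats, so a word that follows "the" twice is twice as likely to be chosen. An LSTM differs in that it conditions on the whole preceding string of words, not just the current one.
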
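
Sketch for the F1 score note in the Jesse Moore section. F1 is the harmonic mean of precision and recall for one class; the function name `f1_score` and the example labels below are assumptions for illustration (libraries such as scikit-learn provide their own implementation).

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1 score of the `positive` class for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions: F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(f1_score(y_true, y_pred))
```

Here precision and recall are both 2/3, so F1 is also 2/3. A perfect classifier scores 1, and one that never predicts the positive class correctly scores 0.
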
mai_nlp.1663990716.txt.gz · Last modified: 2022/09/23 23:38 by jhagstrand